CHEATSHEET · 01 RAG · master cheatsheet
Embedding model picks · 2026
- ·OpenAI text-embedding-3-large — default for hosted English. Matryoshka dims (256/512/1024/3072); sketch after this list.
- ·Voyage voyage-3-large — near the top of MTEB retrieval leaderboards; voyage-code-3 for code; voyage-multimodal-3 for images+text.
- ·Cohere embed-v4.0 — multilingual + multimodal, native Matryoshka.
- ·BGE-M3 — open SOTA, multilingual, dense + sparse + ColBERT-style multi-vector in ONE model. Self-host.
- ·Nomic-embed-text-v2-MoE — fully open weights + data + code, on-prem friendly.
- ·Snowflake Arctic Embed L 2.0 — strong open-source, 1024 dims.
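
A minimal sketch of the Matryoshka-dims call against the OpenAI SDK; the 1024-dim choice and the example inputs are assumptions, not recommendations.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I rotate an API key?", "Billing FAQ: refunds"],
    dimensions=1024,  # Matryoshka truncation: 256/512/1024 instead of the native 3072
)
vectors = [d.embedding for d in resp.data]
print(len(vectors), len(vectors[0]))  # 2 1024
```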
Chunking strategies
- ·Recursive (LangChain default) — character split with semantic separators. Safe baseline; sketch after this list.
- ·Semantic — split where embedding similarity drops (LlamaIndex SemanticSplitterNodeParser).
- ·Late chunking (Jina) — embed the whole doc first, then pool the token vectors into chunk embeddings.
- ·Contextual retrieval (Anthropic, Sept 2024) — prepend an LLM-generated situating context to each chunk before embedding; ~49% reduction in retrieval failures (67% combined with reranking).
- ·Parent-document / MultiVector — index small chunks, return the larger parent chunk at retrieval time.
- ·Markdown / code / table-aware (Unstructured.io, LlamaParse, Docling) — never split mid-table or mid-function.
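
A baseline recursive-splitter sketch with LangChain; chunk_size, overlap, and the file path are illustrative assumptions, tune per corpus.

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # characters, not tokens; size to your embedder's window
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],  # try the most semantic boundary first
)
chunks = splitter.split_text(open("handbook.md").read())  # path is an assumption
```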
Vector store picks · 2026
- ·pgvector + Postgres — default if you already run Postgres. Simple, ACID, hybrid search via tsvector; sketch after this list.
- ·Qdrant — fast Rust core, payload filters, native hybrid (Query API), gRPC. Sweet spot 1M-1B vectors.
- ·Weaviate — modules + multi-tenancy, GraphQL, BM25 + vector hybrid built in.
- ·Milvus / Zilliz Cloud — billion-scale, GPU index, RaBitQ quantization.
- ·Pinecone Serverless — managed, zero ops, pricing scales with usage. Hybrid (sparse+dense) supported.
- ·LanceDB — embedded (Lance columnar format); great for laptops + edge + multimodal.
- ·Vespa — enterprise hybrid + ranking; Yahoo-grade scale. Steeper curve.
- ·Turbopuffer / OpenSearch / Elastic / Mongo Atlas — when colocation with your existing search infra wins.
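
A pgvector sketch over psycopg; the DSN, table name, and 1024-dim column are assumptions, and query_embedding is assumed to come from your embedder upstream.

```python
# pip install psycopg
import psycopg

conn = psycopg.connect("dbname=rag", autocommit=True)  # DSN is an assumption
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(1024)   -- match your embedding model's dims
    )
""")

# query_embedding: list[float] from your embedder (assumed upstream)
qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
rows = conn.execute(
    "SELECT id, body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 8",  # <=> is cosine distance
    (qvec,),
).fetchall()
```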
Hybrid search & rewriting
- ·BM25 + dense vector + RRF (Reciprocal Rank Fusion, k≈60) — battle-tested baseline; sketch after this list.
- ·HyDE — embed a hypothetical answer; retrieve against it. Helps short / under-specified queries.
- ·Multi-query — LLM generates N paraphrases, union the top-K. Adds an LLM call plus N retrievals; skip or cache it on tight latency budgets.
- ·Decomposition / step-back — break a hard question into sub-questions, retrieve each, synthesise.
- ·Self-query — LLM extracts metadata filters from the query (date range, author, tag).
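
An RRF fusion sketch in plain Python, assuming the BM25 and dense searches each return a list of doc IDs ordered best-first.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:                       # each ranking: doc IDs, best first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_top_ids / dense_top_ids: doc-ID lists from your keyword and vector searches
fused = rrf([bm25_top_ids, dense_top_ids])[:50]    # hand the fused top-50 to the reranker
```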
Re-ranking
- ·Cohere rerank-v3.5 — multilingual SOTA, ~$2/1k searches, latency ~30-100ms.
- ·Jina reranker-v2-base-multilingual — open-weights, run on CPU.
- ·BGE-reranker-v2-m3 — open SOTA; wants a GPU for big batches.
- ·ColPali / ColBERT — late-interaction multi-vector; fits visually-rich docs (PDFs as images).
- ·Always trim 50 → 8 (or 100 → 12) before the LLM. Token spend drops, faithfulness rises. Sketch after this list.
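
A 50 → 8 trim sketch against Cohere's rerank endpoint; response field names can vary by SDK version, and user_query / candidate_texts are assumed to come from the hybrid step above.

```python
# pip install cohere
import cohere

co = cohere.ClientV2()  # or ClientV2(api_key="..."); key otherwise read from env
resp = co.rerank(
    model="rerank-v3.5",
    query=user_query,            # the raw user question
    documents=candidate_texts,   # the ~50 fused candidates from hybrid retrieval
    top_n=8,                     # trim hard before the LLM
)
kept = [candidate_texts[r.index] for r in resp.results]
```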
Production guardrails
- ·Version embeddings — (model_id, dim, chunker_version) on every row.
- ·Blue/green re-index. Never re-embed in place.
- ·Detect retrieval-time prompt injection (instructions embedded in retrieved docs).
- ·PII redaction at ingest AND retrieval-time policy filter.
- ·Cache embeddings + reranker hits keyed by (model, normalized_text); sketch after this list.
- ·Token + latency budget per query. Fail fast past the cap.
- ·Per-tenant scopes — every search query gets a tenant filter, no cross-tenant retrieval ever.
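
A cache sketch keyed by (model, normalized_text); the sqlite file and the normalization rule are assumptions, adapt both to your stack.

```python
import hashlib, json, sqlite3

cache = sqlite3.connect("embed_cache.db")   # path is an assumption
cache.execute("CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vec TEXT)")

def cache_key(model: str, text: str) -> str:
    normalized = " ".join(text.lower().split())   # whitespace + case folding; tune as needed
    return hashlib.sha256(f"{model}\x00{normalized}".encode()).hexdigest()

def get_or_embed(model: str, text: str, embed_fn):
    key = cache_key(model, text)
    row = cache.execute("SELECT vec FROM embeddings WHERE key = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])
    vec = embed_fn(text)                          # hit the provider only on a miss
    cache.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?)", (key, json.dumps(vec)))
    cache.commit()
    return vec
```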
CHEATSHEET · 02 Eval · what to measure and how
Retrieval metrics (evaluate retrieval before generation)
- ·Recall@K — does any relevant chunk appear in top-K? Ground truth from a labelled Q→chunk set; sketch after this list.
- ·MRR@K — reciprocal rank of first relevant chunk; rewards being EARLY.
- ·nDCG@K — graded relevance; for multi-relevance corpora.
- ·Context precision (Ragas) — fraction of retrieved chunks that are actually relevant.
- ·Context recall (Ragas) — does retrieved context cover the ground-truth answer?
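
Recall@K / MRR@K in plain Python; the eval_set shape (retrieved IDs paired with labelled relevant IDs) is an assumption.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant chunk appears in the top-K, else 0.0 (per query; average over the set)."""
    return float(any(doc in relevant_ids for doc in retrieved_ids[:k]))

def mrr_at_k(retrieved_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant chunk in the top-K, else 0.0."""
    for rank, doc in enumerate(retrieved_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# eval_set: list of (retrieved_ids, relevant_id_set) pairs from your labelled Q→chunk data
recall5 = sum(recall_at_k(r, rel, 5) for r, rel in eval_set) / len(eval_set)
```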
Generation metrics
- ·Faithfulness (Ragas / TruLens) — is every claim in the answer grounded in retrieved context? Judge sketch after this list.
- ·Answer relevance — does the answer address the question?
- ·Answer correctness — exact-match / judge-scored vs ground-truth answer.
- ·Hallucination rate — % of claims NOT supported by any chunk.
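
An illustrative LLM-as-judge groundedness check, not Ragas' or TruLens' exact prompt; the judge model and the scoring prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def faithfulness_judge(question, answer, contexts):
    prompt = (
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer: {answer}\n\n"
        "What fraction of the factual claims in the answer is supported by the context? "
        "Reply with only a decimal between 0 and 1."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",            # judge model is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(out.choices[0].message.content.strip())
```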
Operational metrics
- ·p50 / p95 / p99 latency for retrieval, rerank, generation — separately.
- ·$ per query — embedding + vector search + rerank + LLM tokens, summed; sketch after this list.
- ·Top-K hit rate — % of queries where the answer's chunk is in K. Watch for drift.
- ·Negative answer rate — when context is empty, do we say 'I don't know'? Should be ≥95%.
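
Back-of-envelope $/query; every rate below is a placeholder assumption, swap in your providers' current price sheet.

```python
def cost_per_query(embed_tokens, rerank_searches, prompt_tokens, completion_tokens):
    # all rates in USD are placeholder assumptions -- plug in your providers' pricing
    EMBED_PER_1K_TOK   = 0.00013
    RERANK_PER_SEARCH  = 0.002
    LLM_IN_PER_1K_TOK  = 0.003
    LLM_OUT_PER_1K_TOK = 0.015
    return (embed_tokens / 1000 * EMBED_PER_1K_TOK
            + rerank_searches * RERANK_PER_SEARCH
            + prompt_tokens / 1000 * LLM_IN_PER_1K_TOK
            + completion_tokens / 1000 * LLM_OUT_PER_1K_TOK)
```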
Eval frameworks
- ·Ragas — Python, ground-truth + judge-scored. Well-maintained as of 2026; snippet after this list.
- ·TruLens — feedback functions, RAG triad (groundedness/relevance/answer relevance).
- ·Arize Phoenix — OpenInference-native, deep traces + eval in one UI.
- ·DeepEval — pytest-style, easy CI gating.
- ·LangSmith — best with LangChain / LangGraph stacks.
- ·OpenAI Evals — JSONL specs, judge or programmatic graders.
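
A Ragas snippet following the older datasets-based pattern; the API has shifted across Ragas versions, so treat the names as assumptions and check your installed release.

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

ds = Dataset.from_dict({
    "question":     ["How do I rotate an API key?"],
    "answer":       ["Go to Settings > API keys and click Rotate."],
    "contexts":     [["Settings > API keys lets you rotate or revoke keys."]],
    "ground_truth": ["Rotate keys from Settings > API keys."],
})
scores = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
```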
What to gate in CI
- ·Recall@5 ≥ baseline · 0.98 (no relative regression bigger than 2%); pytest sketch after this list
- ·Faithfulness ≥ baseline · 0.97
- ·p95 latency ≤ baseline · 1.20
- ·$/query ≤ baseline · 1.20
- ·Negative-answer rate ≥ 0.95 on the negative-eval set
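
A pytest gate sketch for these thresholds; the JSON file names and metric keys are assumptions about what your eval job writes out.

```python
# test_rag_gates.py -- run in CI after the eval job writes current_metrics.json
import json
import pytest

BASELINE = json.load(open("baseline_metrics.json"))   # committed from the last good run
CURRENT  = json.load(open("current_metrics.json"))    # produced by this run's eval job

@pytest.mark.parametrize("metric, factor, direction", [
    ("recall_at_5",    0.98, "min"),   # no relative regression bigger than 2%
    ("faithfulness",   0.97, "min"),
    ("p95_latency_s",  1.20, "max"),
    ("cost_per_query", 1.20, "max"),
])
def test_against_baseline(metric, factor, direction):
    bound = BASELINE[metric] * factor
    if direction == "min":
        assert CURRENT[metric] >= bound, f"{metric} regressed: {CURRENT[metric]:.3f} < {bound:.3f}"
    else:
        assert CURRENT[metric] <= bound, f"{metric} over budget: {CURRENT[metric]:.3f} > {bound:.3f}"

def test_negative_answer_rate():
    assert CURRENT["negative_answer_rate"] >= 0.95
```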