RAGCourse

RAG, vector DBs & enterprise search

Lessons: 10 modules
Total: 105m full study
Quick: 7m trailer
Projects: 8 docker labs
CHEATSHEET · 01 RAG · master cheatsheet
Embedding model picks · 2026
  • OpenAI text-embedding-3-large — default for hosted English. Matryoshka dims (256/512/1024/3072) — sketch below.
  • Voyage voyage-3-large — top of MTEB; voyage-code-3 for code; voyage-multimodal for images+text.
  • Cohere embed-v4.0 — multilingual + multimodal, native Matryoshka.
  • BGE-M3 — open SOTA, multilingual, dense + sparse + ColBERT in ONE model. Self-host.
  • Nomic-embed-text-v2-MoE — fully open weights + data + code, on-prem friendly.
  • Snowflake Arctic Embed L 2.0 — strong open-source, 1024 dims.
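A minimal sketch of the Matryoshka-dims call above, using the OpenAI Python SDK. Assumes OPENAI_API_KEY is set in the environment; the model name and dimension sizes are the ones listed in the row above.

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def embed(texts: list[str], dim: int = 1024) -> list[list[float]]:
    """Embed a batch of texts, truncated server-side to `dim` dimensions."""
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dim,          # 256 / 512 / 1024 / 3072, per the cheat row above
    )
    return [d.embedding for d in resp.data]

vectors = embed(["What is retrieval-augmented generation?"], dim=512)
print(len(vectors[0]))           # -> 512
```

Smaller dims trade a little recall for cheaper storage and faster ANN search; re-run your retrieval evals before committing to a size.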
Chunking strategies
  • Recursive (LangChain default) — character split with semantic separators. Safe baseline (sketch below).
  • Semantic — split where embedding similarity drops (LlamaIndex SemanticSplitter).
  • Late chunking (Jina) — embed the whole doc first, then slice the token vectors.
  • Contextual retrieval (Anthropic, Sept 2024) — prepend a chunk-level context summary; ~49% reduction in retrieval failures.
  • Parent-document / MultiVector — index small chunks, return the parent paragraph at retrieval time.
  • Markdown / code / table-aware (Unstructured.io, LlamaParse, Docling) — never split mid-table or mid-function.
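The recursive baseline above, sketched with the langchain-text-splitters package. Chunk sizes are illustrative, not tuned, and "handbook.md" is a stand-in for whatever source you ingest.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,                                # characters, not tokens
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],      # try larger semantic breaks first
)

doc = open("handbook.md").read()                   # hypothetical source document
chunks = splitter.split_text(doc)
print(len(chunks), "chunks; first 80 chars:", chunks[0][:80])
```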
Vector store picks · 2026
  • pgvector + Postgres — default if you already run Postgres. Simple, ACID, hybrid search via tsvector (query sketch below).
  • Qdrant — fast Rust core, payload filters, native hybrid (Query API), gRPC. Sweet spot 1M-1B vectors.
  • Weaviate — modules + multi-tenancy, GraphQL, BM25 + vector hybrid built in.
  • Milvus / Zilliz Cloud — billion-scale, GPU index, RaBitQ quantization.
  • Pinecone Serverless — managed, zero ops, pricing scales with usage. Hybrid (sparse+dense) supported.
  • LanceDB — embedded (single-file Lance format); great for laptops + edge + multimodal.
  • Vespa — enterprise hybrid + ranking; Yahoo-grade scale. Steeper curve.
  • Turbopuffer / OpenSearch / Elastic / Mongo Atlas — when colocation with your existing search infra wins.
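A pgvector query sketch (psycopg 3), assuming a hypothetical chunks(id, tenant_id, content, embedding vector(1024)) table; the tenant filter anticipates the per-tenant guardrail further down.

```python
import psycopg

def top_k(conn: psycopg.Connection, query_vec: list[float], tenant: str, k: int = 8):
    """Cosine-distance top-K within one tenant's rows."""
    vec = "[" + ",".join(f"{x:g}" for x in query_vec) + "]"   # pgvector text literal
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS cosine_distance
            FROM chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, tenant, vec, k),
        )
        return cur.fetchall()
```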
Hybrid search & rewriting
  • BM25 + dense vector + RRF (Reciprocal Rank Fusion, k≈60) — battle-tested baseline (sketch below).
  • HyDE — embed a hypothetical answer; retrieve against it. Helps short / under-specified queries.
  • Multi-query — LLM generates N paraphrases, union the top-K. Costs an extra LLM call, so mind tight latency budgets.
  • Decomposition / step-back — break a hard question into sub-questions, retrieve each, synthesise.
  • Self-query — LLM extracts metadata filters from the query (date range, author, tag).
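RRF is simple enough to carry inline. Pure Python, no assumptions beyond "each retriever returns an ordered list of doc ids".

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:                        # e.g. [bm25_ids, dense_ids]
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fused = rrf([["a", "b", "c"], ["c", "a", "d"]])
print(fused[0])   # ('a', ...) — ranked well by both lists
```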
Re-ranking
  • Cohere rerank-v3.5 — multilingual SOTA, ~$2/1k searches, latency ~30-100ms.
  • Jina reranker-v2-base-multilingual — open weights, runs on CPU.
  • BGE-reranker-v2-m3 — open SOTA; wants a GPU for big batches.
  • ColPali / ColBERT — late-interaction multi-vector; fits visually rich docs (PDFs as images).
  • Always trim 50 → 8 (or 100 → 12) before the LLM (sketch below). Token spend drops, faithfulness rises.
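The 50 → 8 trim, sketched against Cohere's rerank endpoint with the model named above. Client setup and response field names follow the v2 Python SDK as best I recall; verify against the current docs before shipping.

```python
import cohere

co = cohere.ClientV2()   # API key read from the environment

def rerank_trim(query: str, docs: list[str], keep: int = 8) -> list[str]:
    """Score all candidates with the reranker, keep only the top `keep`."""
    resp = co.rerank(model="rerank-v3.5", query=query, documents=docs, top_n=keep)
    return [docs[r.index] for r in resp.results]

context = rerank_trim("refund window for EU orders", candidate_chunks)  # 50 in, 8 out
```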
Production guardrails
  • Version embeddings — (model_id, dim, chunker_version) on every row.
  • Blue/green re-index. Never re-embed in place.
  • Detect retrieval-time prompt injection (instructions embedded in retrieved docs).
  • PII redaction at ingest AND retrieval-time policy filter.
  • Cache embeddings + reranker hits keyed by (model, normalized_text) — sketch below.
  • Token + latency budget per query. Fail fast past the cap.
  • Per-tenant scopes — every search query gets a tenant filter, no cross-tenant retrieval ever.
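A sketch of the embedding cache keyed by (model, normalized_text) — SQLite here for brevity; swap in Redis or your KV of choice in production.

```python
import hashlib, json, sqlite3

db = sqlite3.connect("emb_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)")

def cache_key(model: str, text: str) -> str:
    normalized = " ".join(text.lower().split())          # cheap normalization
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cached_embed(model: str, text: str, embed_fn) -> list[float]:
    """Return a cached vector if we've embedded this (model, text) before."""
    key = cache_key(model, text)
    row = db.execute("SELECT vec FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])
    vec = embed_fn(text)                                  # cache miss: call the model
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, json.dumps(vec)))
    db.commit()
    return vec
```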
CHEATSHEET · 02 Eval · what to measure and how
Retrieval metrics (eval first)
  • Recall@K — does any relevant chunk appear in top-K? Ground truth from a labelled Q→chunk set (sketch below).
  • MRR@K — reciprocal rank of first relevant chunk; rewards being EARLY.
  • nDCG@K — graded relevance; for multi-relevance corpora.
  • Context precision (Ragas) — fraction of retrieved chunks that are actually relevant.
  • Context recall (Ragas) — does retrieved context cover the ground-truth answer?
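Recall@K and MRR@K are a few lines each once you have the labelled query → relevant-chunk-id set; average them over the eval set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant chunk id appears in the top-K, else 0.0."""
    return float(any(doc_id in relevant for doc_id in retrieved[:k]))

def mrr_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant chunk within the top-K."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# recall5 = sum(recall_at_k(r, rel, 5) for r, rel in eval_set) / len(eval_set)
```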
Generation metrics
  • Faithfulness (Ragas / TruLens) — is every claim in the answer grounded in retrieved context?
  • Answer relevance — does the answer address the question?
  • Answer correctness — exact-match / judge-scored vs ground-truth answer.
  • Hallucination rate — % of claims NOT supported by any chunk (sketch below).
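The hallucination-rate arithmetic, given per-claim judge labels (1 = supported by some retrieved chunk, 0 = not). The judging itself is an LLM's job; this is just the roll-up.

```python
def hallucination_rate(claim_supported: list[int]) -> float:
    """Fraction of claims with no supporting chunk; faithfulness is 1 - this."""
    if not claim_supported:
        return 0.0
    return 1.0 - sum(claim_supported) / len(claim_supported)

print(hallucination_rate([1, 1, 0, 1]))   # -> 0.25
```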
Operational metrics
  • p50 / p95 / p99 latency for retrieval, rerank, generation — separately.
  • $ per query — embedding + vector search + rerank + LLM tokens, summed (sketch below).
  • Top-K hit rate — % of queries where the answer's chunk is in the top K. Watch for drift.
  • Negative answer rate — when context is empty, do we say 'I don't know'? Should be ≥95%.
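A $/query roll-up sketch. All prices are placeholders, not quotes; vector-store cost is usually amortized infrastructure rather than per-call, so it is left out here.

```python
PRICES = {
    "embed_per_1k_tok": 0.00013,     # placeholder rates — substitute your own
    "rerank_per_search": 0.002,
    "llm_in_per_1k_tok": 0.003,
    "llm_out_per_1k_tok": 0.015,
}

def cost_per_query(embed_tok: int, llm_in_tok: int, llm_out_tok: int) -> float:
    """Sum embedding, rerank, and LLM token costs for one query."""
    return (
        embed_tok / 1000 * PRICES["embed_per_1k_tok"]
        + PRICES["rerank_per_search"]
        + llm_in_tok / 1000 * PRICES["llm_in_per_1k_tok"]
        + llm_out_tok / 1000 * PRICES["llm_out_per_1k_tok"]
    )

print(f"${cost_per_query(embed_tok=40, llm_in_tok=6000, llm_out_tok=400):.4f}")
```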
Eval frameworks
  • Ragas — Python, ground-truth + judge-scored. Well-maintained as of 2026 (usage sketch below).
  • TruLens — feedback functions, RAG triad (context relevance / groundedness / answer relevance).
  • Arize Phoenix — OpenInference-native, deep traces + eval in one UI.
  • DeepEval — pytest-style, easy CI gating.
  • LangSmith — best with LangChain / LangGraph stacks.
  • OpenAI Evals — JSONL specs, judge or programmatic graders.
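A Ragas usage sketch. Column names follow the classic Ragas schema; newer releases rename them (user_input / response / retrieved_contexts), so check your installed version, and note the judge metrics need an LLM key configured.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One-row illustrative eval set; in practice, load your labelled Q→chunk→answer data.
eval_ds = Dataset.from_dict({
    "question":     ["What is our refund window?"],
    "answer":       ["Refunds are accepted within 30 days of purchase."],
    "contexts":     [["Customers may return items within 30 days for a full refund."]],
    "ground_truth": ["30 days"],
})

scores = evaluate(
    eval_ds,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)
```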
What to gate in CI
  • Recall@5 ≥ baseline · 0.98 (no relative regression bigger than 2%) — pytest sketch below
  • Faithfulness ≥ baseline · 0.97
  • p95 latency ≤ baseline · 1.20
  • $/query ≤ baseline · 1.20
  • Negative-answer rate ≥ 0.95 on the negative-eval set
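The gates above as plain pytest asserts against a stored baseline. The baseline_metrics.json file and the eval_harness.run_eval helper are stand-ins for your own eval pipeline.

```python
import json
from eval_harness import run_eval   # hypothetical: runs the eval set, returns a metric dict

def test_rag_quality_gates():
    baseline = json.load(open("baseline_metrics.json"))
    current = run_eval()

    assert current["recall_at_5"]          >= baseline["recall_at_5"] * 0.98
    assert current["faithfulness"]         >= baseline["faithfulness"] * 0.97
    assert current["p95_latency_ms"]       <= baseline["p95_latency_ms"] * 1.20
    assert current["cost_per_query"]       <= baseline["cost_per_query"] * 1.20
    assert current["negative_answer_rate"] >= 0.95
```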