CHEATSHEET · 01 RAG · master cheatsheet
Embedding model picks · 2026
- ·OpenAI text-embedding-3-large — default for hosted English. Matryoshka dims (256/512/1024/3072); sketch after this list.
- ·Voyage voyage-3-large — near the top of MTEB retrieval leaderboards; voyage-code-3 for code; voyage-multimodal-3 for images+text.
- ·Cohere embed-v4.0 — multilingual + multimodal, native Matryoshka.
- ·BGE-M3 — open SOTA, multilingual, dense + sparse + ColBERT-style multi-vector in ONE model. Self-host.
- ·Nomic-embed-text-v2-MoE — fully open weights + data + code, on-prem friendly.
- ·Snowflake Arctic Embed L 2.0 — strong open-source, 1024 dims.
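
A minimal sketch of the Matryoshka-dims call against the OpenAI SDK; the 1024-dim choice and the example inputs are assumptions, not recommendations.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I rotate an API key?", "Billing FAQ: refunds"],
    dimensions=1024,  # Matryoshka truncation: 256/512/1024 instead of the native 3072
)
vectors = [d.embedding for d in resp.data]
print(len(vectors), len(vectors[0]))  # 2 1024
```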
Chunking strategies
- ·Recursive (LangChain default) — character split with semantic separators. Safe baseline; sketch after this list.
- ·Semantic — split where embedding similarity drops (LlamaIndex SemanticSplitterNodeParser).
- ·Late chunking (Jina) — embed the whole doc first, then pool the token vectors into chunk embeddings.
- ·Contextual retrieval (Anthropic, Sept 2024) — prepend an LLM-generated situating context to each chunk before embedding; ~49% reduction in retrieval failures (67% combined with reranking).
- ·Parent-document / MultiVector — index small chunks, return the larger parent chunk at retrieval time.
- ·Markdown / code / table-aware (Unstructured.io, LlamaParse, Docling) — never split mid-table or mid-function.
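
A baseline recursive-splitter sketch with LangChain; chunk_size, overlap, and the file path are illustrative assumptions, tune per corpus.

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # characters, not tokens; size to your embedder's window
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],  # try the most semantic boundary first
)
chunks = splitter.split_text(open("handbook.md").read())  # path is an assumption
```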
Vector store picks · 2026
- ·pgvector + Postgres — default if you already run Postgres. Simple, ACID, hybrid search via tsvector; sketch after this list.
- ·Qdrant — fast Rust core, payload filters, native hybrid (Query API), gRPC. Sweet spot 1M-1B vectors.
- ·Weaviate — modules + multi-tenancy, GraphQL, BM25 + vector hybrid built in.
- ·Milvus / Zilliz Cloud — billion-scale, GPU index, RaBitQ quantization.
- ·Pinecone Serverless — managed, zero ops, pricing scales with usage. Hybrid (sparse+dense) supported.
- ·LanceDB — embedded (Lance columnar format); great for laptops + edge + multimodal.
- ·Vespa — enterprise hybrid + ranking; Yahoo-grade scale. Steeper curve.
- ·Turbopuffer / OpenSearch / Elastic / Mongo Atlas — when colocation with your existing search infra wins.
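
A pgvector sketch over psycopg; the DSN, table name, and 1024-dim column are assumptions, and query_embedding is assumed to come from your embedder upstream.

```python
# pip install psycopg
import psycopg

conn = psycopg.connect("dbname=rag", autocommit=True)  # DSN is an assumption
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(1024)   -- match your embedding model's dims
    )
""")

# query_embedding: list[float] from your embedder (assumed upstream)
qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
rows = conn.execute(
    "SELECT id, body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 8",  # <=> is cosine distance
    (qvec,),
).fetchall()
```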
Hybrid search & rewriting
- ·BM25 + dense vector + RRF (Reciprocal Rank Fusion, k≈60) — battle-tested baseline; sketch after this list.
- ·HyDE — embed a hypothetical answer; retrieve against it. Helps short / under-specified queries.
- ·Multi-query — LLM generates N paraphrases, union the top-K. Adds an LLM call plus N retrievals; skip or cache it on tight latency budgets.
- ·Decomposition / step-back — break a hard question into sub-questions, retrieve each, synthesise.
- ·Self-query — LLM extracts metadata filters from the query (date range, author, tag).
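
An RRF fusion sketch in plain Python, assuming the BM25 and dense searches each return a list of doc IDs ordered best-first.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:                       # each ranking: doc IDs, best first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_top_ids / dense_top_ids: doc-ID lists from your keyword and vector searches
fused = rrf([bm25_top_ids, dense_top_ids])[:50]    # hand the fused top-50 to the reranker
```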
Re-ranking
- ·Cohere rerank-v3.5 — multilingual SOTA, ~$2/1k searches, latency ~30-100ms.
- ·Jina reranker-v2-base-multilingual — open-weights, run on CPU.
- ·BGE-reranker-v2-m3 — open SOTA; wants a GPU for big batches.
- ·ColPali / ColBERT — late-interaction multi-vector; fits visually-rich docs (PDFs as images).
- ·Always trim 50 → 8 (or 100 → 12) before the LLM. Token spend drops, faithfulness rises. Sketch after this list.
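
A 50 → 8 trim sketch against Cohere's rerank endpoint; response field names can vary by SDK version, and user_query / candidate_texts are assumed to come from the hybrid step above.

```python
# pip install cohere
import cohere

co = cohere.ClientV2()  # or ClientV2(api_key="..."); key otherwise read from env
resp = co.rerank(
    model="rerank-v3.5",
    query=user_query,            # the raw user question
    documents=candidate_texts,   # the ~50 fused candidates from hybrid retrieval
    top_n=8,                     # trim hard before the LLM
)
kept = [candidate_texts[r.index] for r in resp.results]
```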
Production guardrails
- ·Version embeddings — (model_id, dim, chunker_version) on every row.
- ·Blue/green re-index. Never re-embed in place.
- ·Detect retrieval-time prompt injection (instructions embedded in retrieved docs).
- ·PII redaction at ingest AND retrieval-time policy filter.
- ·Cache embeddings + reranker hits keyed by (model, normalized_text); sketch after this list.
- ·Token + latency budget per query. Fail fast past the cap.
- ·Per-tenant scopes — every search query gets a tenant filter, no cross-tenant retrieval ever.
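
A cache sketch keyed by (model, normalized_text); the sqlite file and the normalization rule are assumptions, adapt both to your stack.

```python
import hashlib, json, sqlite3

cache = sqlite3.connect("embed_cache.db")   # path is an assumption
cache.execute("CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vec TEXT)")

def cache_key(model: str, text: str) -> str:
    normalized = " ".join(text.lower().split())   # whitespace + case folding; tune as needed
    return hashlib.sha256(f"{model}\x00{normalized}".encode()).hexdigest()

def get_or_embed(model: str, text: str, embed_fn):
    key = cache_key(model, text)
    row = cache.execute("SELECT vec FROM embeddings WHERE key = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])
    vec = embed_fn(text)                          # hit the provider only on a miss
    cache.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?)", (key, json.dumps(vec)))
    cache.commit()
    return vec
```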
CHEATSHEET · 02 Eval · what to measure and how
Retrieval metrics (evaluate retrieval before generation)
- ·Recall@K — does any relevant chunk appear in top-K? Ground truth from a labelled Q→chunk set; sketch after this list.
- ·MRR@K — reciprocal rank of first relevant chunk; rewards being EARLY.
- ·nDCG@K — graded relevance; for multi-relevance corpora.
- ·Context precision (Ragas) — fraction of retrieved chunks that are actually relevant.
- ·Context recall (Ragas) — does retrieved context cover the ground-truth answer?
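
Recall@K / MRR@K in plain Python; the eval_set shape (retrieved IDs paired with labelled relevant IDs) is an assumption.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant chunk appears in the top-K, else 0.0 (per query; average over the set)."""
    return float(any(doc in relevant_ids for doc in retrieved_ids[:k]))

def mrr_at_k(retrieved_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant chunk in the top-K, else 0.0."""
    for rank, doc in enumerate(retrieved_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# eval_set: list of (retrieved_ids, relevant_id_set) pairs from your labelled Q→chunk data
recall5 = sum(recall_at_k(r, rel, 5) for r, rel in eval_set) / len(eval_set)
```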
Generation metrics
- ·Faithfulness (Ragas / TruLens) — is every claim in the answer grounded in retrieved context? Judge sketch after this list.
- ·Answer relevance — does the answer address the question?
- ·Answer correctness — exact-match / judge-scored vs ground-truth answer.
- ·Hallucination rate — % of claims NOT supported by any chunk.
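
An illustrative LLM-as-judge groundedness check, not Ragas' or TruLens' exact prompt; the judge model and the scoring prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def faithfulness_judge(question, answer, contexts):
    prompt = (
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer: {answer}\n\n"
        "What fraction of the factual claims in the answer is supported by the context? "
        "Reply with only a decimal between 0 and 1."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",            # judge model is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(out.choices[0].message.content.strip())
```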
Operational metrics
- ·p50 / p95 / p99 latency for retrieval, rerank, generation — separately.
- ·$ per query — embedding + vector search + rerank + LLM tokens, summed; sketch after this list.
- ·Top-K hit rate — % of queries where the answer's chunk is in K. Watch for drift.
- ·Negative answer rate — when context is empty, do we say 'I don't know'? Should be ≥95%.
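
Back-of-envelope $/query; every rate below is a placeholder assumption, swap in your providers' current price sheet.

```python
def cost_per_query(embed_tokens, rerank_searches, prompt_tokens, completion_tokens):
    # all rates in USD are placeholder assumptions -- plug in your providers' pricing
    EMBED_PER_1K_TOK   = 0.00013
    RERANK_PER_SEARCH  = 0.002
    LLM_IN_PER_1K_TOK  = 0.003
    LLM_OUT_PER_1K_TOK = 0.015
    return (embed_tokens / 1000 * EMBED_PER_1K_TOK
            + rerank_searches * RERANK_PER_SEARCH
            + prompt_tokens / 1000 * LLM_IN_PER_1K_TOK
            + completion_tokens / 1000 * LLM_OUT_PER_1K_TOK)
```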
Eval frameworks
- ·Ragas — Python, ground-truth + judge-scored. Well-maintained as of 2026; snippet after this list.
- ·TruLens — feedback functions, RAG triad (groundedness/relevance/answer relevance).
- ·Arize Phoenix — OpenInference-native, deep traces + eval in one UI.
- ·DeepEval — pytest-style, easy CI gating.
- ·LangSmith — best with LangChain / LangGraph stacks.
- ·OpenAI Evals — JSONL specs, judge or programmatic graders.
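
A Ragas snippet following the older datasets-based pattern; the API has shifted across Ragas versions, so treat the names as assumptions and check your installed release.

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

ds = Dataset.from_dict({
    "question":     ["How do I rotate an API key?"],
    "answer":       ["Go to Settings > API keys and click Rotate."],
    "contexts":     [["Settings > API keys lets you rotate or revoke keys."]],
    "ground_truth": ["Rotate keys from Settings > API keys."],
})
scores = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
```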
What to gate in CI
- ·Recall@5 ≥ baseline · 0.98 (no relative regression bigger than 2%); pytest sketch after this list
- ·Faithfulness ≥ baseline · 0.97
- ·p95 latency ≤ baseline · 1.20
- ·$/query ≤ baseline · 1.20
- ·Negative-answer rate ≥ 0.95 on the negative-eval set
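
A pytest gate sketch for these thresholds; the JSON file names and metric keys are assumptions about what your eval job writes out.

```python
# test_rag_gates.py -- run in CI after the eval job writes current_metrics.json
import json
import pytest

BASELINE = json.load(open("baseline_metrics.json"))   # committed from the last good run
CURRENT  = json.load(open("current_metrics.json"))    # produced by this run's eval job

@pytest.mark.parametrize("metric, factor, direction", [
    ("recall_at_5",    0.98, "min"),   # no relative regression bigger than 2%
    ("faithfulness",   0.97, "min"),
    ("p95_latency_s",  1.20, "max"),
    ("cost_per_query", 1.20, "max"),
])
def test_against_baseline(metric, factor, direction):
    bound = BASELINE[metric] * factor
    if direction == "min":
        assert CURRENT[metric] >= bound, f"{metric} regressed: {CURRENT[metric]:.3f} < {bound:.3f}"
    else:
        assert CURRENT[metric] <= bound, f"{metric} over budget: {CURRENT[metric]:.3f} > {bound:.3f}"

def test_negative_answer_rate():
    assert CURRENT["negative_answer_rate"] >= 0.95
```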