
RAG, vector DBs & enterprise search


A scannable trailer of the 10-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
RAG · 7 MIN PREVIEW

The pipeline IS the product.

Vector-DB use grew 377% YoY (Databricks · State of Data + AI 2026). 70% of enterprise AI features ship as RAG — not fine-tuning. The model is a commodity; the retrieval pipeline is your moat. This trailer shows what production RAG looks like in 2026 — and what trips up most teams.

CONCEPTBLOCK · 02

RAG in one paragraph

RAG = Retrieve relevant chunks from your data → Augment the prompt with them → Generate an answer grounded only in those chunks. Done well it beats fine-tuning at most enterprise tasks. Done badly it returns confident nonsense at scale. The quality of your RAG is decided in the *pipeline*, not the LLM: chunking, embedding, indexing, hybrid search, query rewriting, re-ranking, eval, observability. Every link upgrades or erodes the next one.
TIP: Eval retrieval BEFORE generation. If your top-K doesn't contain the answer, no model on earth will save you.
WATCH OUT: Long-context windows (Gemini 2M, GPT-5 1M) didn't kill RAG — they made it cheaper to be sloppy. The economics still favour retrieval at scale.
GOTCHA: Re-indexing without versioning is the single most common silent regression in enterprise RAG. Always tag (model, dim, chunker).
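A minimal sketch of that version tag, assuming the pgvector `docs` table from the code block below; the column names and chunker tag are illustrative, not prescribed by the course:

PYTHON
import os, psycopg

EMBED_VERSION = {
    "model_id": "text-embedding-3-large",  # which model produced the vectors
    "dim": 1024,                           # requested (Matryoshka) dimension
    "chunker_version": "recursive-v2",     # hypothetical chunker tag
}

with psycopg.connect(os.environ["PG_URL"]) as cn:
    # Tag every row at write time so a re-index can never silently mix vector spaces.
    cn.execute("""
        ALTER TABLE docs
            ADD COLUMN IF NOT EXISTS model_id text,
            ADD COLUMN IF NOT EXISTS dim int,
            ADD COLUMN IF NOT EXISTS chunker_version text
    """)
    # At query time, filter on the exact tag your query embedder matches,
    # so results are guaranteed to come from one consistent embedding space.
    rows = cn.execute(
        "SELECT id FROM docs WHERE model_id = %(model_id)s AND dim = %(dim)s"
        " AND chunker_version = %(chunker_version)s",
        EMBED_VERSION).fetchall()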
DIAGRAMBLOCK · 03

The 2026 RAG pipeline

QUERY → REWRITE (HyDE)
   ├→ EMBED → VECTOR DB (ANN) → top 50 ┐
   └→ BM25 (kw) ──────────────→ top 50 ┴→ RRF (fused) → RERANK → top 8 → LLM
Modern RAG is hybrid: dense (vectors) + sparse (BM25) → fused (RRF) → cross-encoder rerank → LLM. Each stage trims what the next sees.
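The fusion stage is compact enough to sketch before the full pipeline code below. A minimal Reciprocal Rank Fusion over ranked id lists; k=60 is the conventional constant, and the function name and inputs are illustrative:

PYTHON
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # RRF: score(d) = sum over result lists of 1 / (k + rank of d in that list).
    scores: dict[str, float] = {}
    for ranked_ids in rankings:                       # one ranked id list per retriever
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse the two top-50 lists before reranking:
# fused = rrf([vector_ids, bm25_ids])[:50]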
CODEBLOCK · 04

30-line RAG with pgvector + reranker

PYTHON
import os, psycopg, openai, cohere
from psycopg.rows import dict_row

openai_c = openai.OpenAI()
cohere_c = cohere.ClientV2()

def embed(text: str) -> list[float]:
    r = openai_c.embeddings.create(
        model="text-embedding-3-large", input=text,
        dimensions=1024)  # Matryoshka: ask for fewer dims
    return r.data[0].embedding

def retrieve(query: str, k: int = 50) -> list[dict]:
    qv = embed(query)
    with psycopg.connect(os.environ["PG_URL"], row_factory=dict_row) as cn:
        # psycopg sends the list as a float array; pgvector casts arrays to vector
        return cn.execute(
            "SELECT id, text FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
            (qv, k)).fetchall()

def rerank(query: str, candidates: list[dict], top: int = 8) -> list[dict]:
    r = cohere_c.rerank(model="rerank-v3.5", query=query,
                        documents=[c["text"] for c in candidates], top_n=top)
    return [candidates[h.index] for h in r.results]

def rag(question: str) -> str:
    chunks = rerank(question, retrieve(question))
    # Prefix each chunk with its id so the model can actually cite chunk ids.
    ctx = "\n---\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return openai_c.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": "Answer ONLY from the context. Cite chunk ids."},
                  {"role": "user", "content": f"<context>\n{ctx}\n</context>\n\nQ: {question}"}]
    ).choices[0].message.content
The `dimensions` argument: text-embedding-3-large supports Matryoshka truncation; asking for 1024 instead of the default 3072 gives ~3× cheaper storage. The rerank call: Cohere rerank-v3.5 turns 50 candidates into 8 surgical ones — the highest-leverage five lines you'll add to a RAG pipeline.
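If you're serving an embedding model that lacks a `dimensions` parameter, the same Matryoshka trick can be applied client-side: truncate, then re-normalize. A sketch, with the caveat that this only preserves quality for Matryoshka-trained models like the text-embedding-3 family:

PYTHON
import math

def truncate_matryoshka(vec: list[float], dim: int = 1024) -> list[float]:
    # Keep the leading components, then re-normalize to unit length so
    # cosine / inner-product distances stay comparable.
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]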
CHEATSHEETBLOCK · 05

5 rules every 2026 RAG shipper knows

01 · Eval retrieval before generation. Recall@K is the leading indicator of answer quality (sketched below).
02 · Hybrid (BM25 + vector + RRF) beats pure vector on keyword-heavy queries — and most enterprise corpora are keyword-heavy.
03 · Chunk by semantic boundary, not character count. For docs, contextual retrieval (Anthropic, Sept 2024) cuts retrieval errors by ~49% at low cost.
04 · Re-rank top-50 down to top-8 with a cross-encoder before the LLM sees them. Cheapest answer-quality lift in the pipeline.
05 · Version your embeddings: (model_id, dim, chunker_version). Re-indexing without versioning is how you ship silent regressions.
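Rule 01 in code. A minimal Recall@K harness reusing retrieve() from the block above; the labels format (query → set of relevant chunk ids) and the 0.85 threshold are illustrative:

PYTHON
def recall_at_k(labels: dict[str, set[str]], k: int = 5) -> float:
    # labels maps each eval query to the chunk ids that contain the answer.
    hits = 0
    for query, relevant_ids in labels.items():
        got = {str(row["id"]) for row in retrieve(query, k=k)}
        hits += bool(got & relevant_ids)   # a hit if ANY relevant chunk made top-K
    return hits / len(labels)

# Gate retrieval before you ever look at generated answers:
# assert recall_at_k(labels, k=5) >= 0.85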
MINIGAME · RAPID-FIRE T/F BLOCK · 06

RAG quick check — true or false?

RAG eliminates hallucinations.
CLAIM 1/5
CONCEPTBLOCK · 07

What you'll ship in the full study

Ten lessons. Eight Docker projects. By the end you'll have:
  • A pgvector + OpenAI starter you can put behind any internal FAQ.
  • A chunker-bench that picks the right chunker for YOUR data with hard numbers.
  • A vectordb-bench across pgvector / Qdrant / Weaviate / LanceDB.
  • A hybrid-search-uplift container that proves BM25 + vector + RRF on Stack Overflow data.
  • A rerank-uplift that quantifies Cohere rerank-v3.5 lift on your corpus.
  • A pytest + Ragas eval harness that gates CI for any RAG change.
  • An Agentic RAG (corrective + self-RAG) built on LangGraph 1.0.
  • A fully air-gapped, on-prem stack (Ollama + nomic-embed + pgvector + Open WebUI) for regulated industries.
Every project ships with composeYaml, expectedStdout, and a 'lift to work' note explaining how to drop it into your team's repo.
INCLUDED: Free-tier students unlock Lesson 1 + this preview. Pro unlocks all 10 lessons + 8 Docker projects.
LESSON COMPLETEBLOCK · 08

That's the trailer.

NEXT: Lesson 1 · Embeddings — turning text into geometry
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Pick the right embedding model from MTEB + cost + privacy · Working

    Choose between text-embedding-3-large (Matryoshka), voyage-3-large, Cohere embed-v4, BGE-M3, Nomic-embed-text-v2-MoE based on language, modality, deployment and budget. Defend the pick with MTEB scores.

  • Choose & justify a chunking strategy by data shape · Production

    Recursive vs semantic vs late-chunking vs Anthropic contextual retrieval vs parent-document. Code-aware and table-aware splitters for source code and structured docs.

  • Choose & justify a vector store from 8 production options · Production

    pgvector, Qdrant, Weaviate, Milvus, Pinecone, LanceDB, Vespa, Mongo Atlas — pick by scale, ops model, hybrid support, payload filtering, multi-tenancy. Read VectorDBBench results.

  • Implement hybrid search with RRF on at least two stores · Production

    BM25 + dense vector + RRF (k≈60). Score fusion vs rank fusion. Native hybrid in Qdrant/Weaviate/Pinecone vs hand-rolled with pgvector + tsvector.

  • Add query rewriting (HyDE, multi-query, decomposition) · Working

    Choose the rewriting strategy from query shape — short/under-specified → HyDE; ambiguous → multi-query; complex compound → decomposition + step-back. Measure lift on Recall@K. (HyDE is sketched after this list.)

  • Re-rank with Cohere/Jina/BGE/ColPali · Production

    Cross-encoder rerank for surgical context. Open-source vs API. ColPali / ColBERT late-interaction for multi-vector retrieval. Trim 50 → 8 to cut token cost AND raise faithfulness.

  • Eval RAG with Ragas (faithfulness, answer relevance, context P/R) · Production

    Build a labelled query → chunk → answer dataset. Run Ragas + the TruLens RAG triad. Gate CI on Recall@5, Faithfulness, p95 latency, and $/query (the gate is sketched after this list).

  • Ship Agentic RAG (corrective + self-RAG) · Advanced

    LangGraph 1.0 state machine: retrieve → grade → rewrite/re-retrieve → generate → self-check → loop. Knows when to web-search, when to refuse, when to ask for clarification.

  • Index visually-rich PDFs with ColPali (no OCR) · Advanced

    Multimodal RAG over slides, scanned PDFs, screenshots using ColPali multi-vector embeddings — beats OCR + text retrieval on layout-heavy corpora.

  • Observe + budget RAG in Langfuse / Phoenix · Production

    Trace every stage of a RAG call (embed → search → rerank → LLM). Per-stage latency + token attribution, $/query dashboards, cache-hit rate, negative-answer-rate.

  • Embedding-model migration with versioning + blue/green index · Advanced

    Tag (model_id, dim, chunker_version) on every row. Run shadow-index re-embed, dual-read with feature flag, A/B eval before cutover. Never re-embed in place.

  • Defend against retrieval-time prompt injection + PII leakage · Advanced

    Detect instructions hidden in retrieved chunks (OWASP LLM01). PII redaction at ingest, retrieval-time policy filters per tenant, signed retrieval audit trail.
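Two of those skills in miniature. First, HyDE: embed a hypothetical answer instead of a short, under-specified query. This sketch reuses openai_c and embed() from the code block above; the prompt wording is illustrative:

PYTHON
def hyde_embed(query: str) -> list[float]:
    # Generate a plausible (possibly wrong) answer passage, then embed THAT.
    # Hypothetical documents sit closer to real documents in embedding space
    # than short queries do.
    fake_doc = openai_c.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Write a short passage that plausibly answers: {query}"}],
    ).choices[0].message.content
    return embed(fake_doc)  # search the vector index with this embedding instead

Second, the shape of the CI gate from the eval lesson, assuming you've already run your Ragas eval and dumped aggregate scores to a JSON file; the path, metric names, and thresholds are all placeholders:

PYTHON
import json, pytest

THRESHOLDS = {"recall_at_5": 0.85, "faithfulness": 0.90}  # illustrative floors

@pytest.mark.parametrize("metric,floor", THRESHOLDS.items())
def test_rag_quality_gate(metric, floor):
    scores = json.load(open("eval/scores.json"))  # hypothetical eval output
    assert scores[metric] >= floor, f"{metric} regressed below {floor}"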

Career & income delta

Career moves
  • Title yourself credibly as an 'AI search engineer' or 'RAG platform engineer' — the 2026 hiring channel for senior IC roles at $180-380K (LinkedIn job-posting growth: +213% YoY for 'RAG'-titled roles).
  • Lead an internal AI search platform — most series-B/C orgs are now staffing this team after their 'just call OpenAI' phase failed on enterprise data.
  • Pick up contracting at $200-400/hr fixing RAGs that retrieve but don't answer correctly. It's the most common 2026 inquiry on Toptal and Upwork's AI sections.
  • Ship the 'AI over our docs' feature your CEO has been demoing for 6 months — and own that line item on your perf review.
Income impact
  • $15-40K bump for senior ICs adding production RAG to their resume in 2026.
  • $30-100K bump moving from a generic backend role to an AI search / RAG team.
  • Freelance / consulting rates: $200-400/hr — 'we have a RAG that hallucinates' is the canonical inquiry.
  • Enterprise deals: closing one 6-figure ACV often requires the eval harness in Lesson 7 to pass procurement.
Market resilience
  • RAG is the #1 enterprise AI use case (Databricks · State of Data + AI 2026; vector-DB use grew 377% YoY). The skill survives the next foundation-model consolidation — orgs always need someone who can ground a model in their data.
  • Vector-DB skills are durable — the underlying techniques (HNSW, RRF, cross-encoder reranking) outlive any single vendor. pgvector + Qdrant + Weaviate cover ~70% of the market and are unlikely to all disappear.
  • Eval discipline carries forward to whatever the 2027 retrieval framework looks like.
  • On-prem / air-gapped RAG (Ollama + nomic-embed + pgvector) remains in demand for any regulated industry, no matter the model market.