Quick Intro · ~7 min · MMOD

Multimodal AI


A scannable trailer of the 8-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
MMOD · 7 MIN PREVIEW

Stop sending text to text-only models.

In 2026 the highest-leverage AI engineers are the ones shipping vision, audio, and video — not just text. Gemini 2.5 Pro reads hour-long videos. Pixtral extracts invoices without OCR. ColPali retrieves PDF pages without chunking. This trailer shows what 'multimodal' actually means in production today.

CONCEPTBLOCK · 02

What 'multimodal' really means in 2026

A multimodal model takes more than text as input or output: image, audio, video, sometimes action commands. The frontier today is **natively multimodal**, trained jointly across modalities (GPT-5, Gemini 2.5 Pro), versus **adapter-stitched** older designs (LLaVA-1.5-style: ViT + projector + LLM bolted together). The deployment surface that matters in 2026:

  • **Document AI**: invoice/receipt extraction with Pydantic-typed JSON, no separate OCR step.
  • **Multimodal RAG**: ColPali / late interaction; index PDF page images, skip OCR and chunking.
  • **Voice agents**: sub-300ms STT → LLM → TTS loops via Deepgram + ElevenLabs or OpenAI Realtime.
  • **Long-video Q&A**: Gemini 2.5 Pro's 1M-token context lets you ingest a full hour of video.
  • **Cross-modal search**: SigLIP 2 (Feb 2025, 109 languages) + Qdrant for billion-scale image retrieval.
  • **Air-gapped local**: Qwen2.5-VL-7B + whisper-large-v3-turbo + SigLIP 2 in Ollama.
TIP: If a colleague says 'we will just OCR the PDFs first,' you have a 2024 stack. The 2026 default is OCR-free: send the page image to a VLM, get back Pydantic-typed JSON.
WATCH OUT: GPT-5 doesn't 'watch' video frame by frame at 30 fps; it samples. Treat every closed multimodal model card carefully: modality support and frame rate are separate axes.
GOTCHA: Embedding PDF page images with a text-only embedding model (text-embedding-3-small fed raw byte buffers) ships in a depressingly large number of repos. Use SigLIP 2 or ColPali for image content; never push image bytes through a text embedder. A minimal sketch of the correct pattern follows.
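A sketch of that correct pattern, assuming the google/siglip2-base-patch16-224 checkpoint and that the transformers get_image_features call carries over from the SigLIP to the SigLIP 2 model classes; verify against your installed transformers version before shipping.

PYTHON
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT)

# Embed the page as an IMAGE, not as a byte buffer through a text embedder.
page = Image.open("page_001.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")

with torch.no_grad():
    vec = model.get_image_features(**inputs)      # (1, hidden_dim)
vec = vec / vec.norm(dim=-1, keepdim=True)        # normalize for cosine search

print(vec.shape)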
DIAGRAMBLOCK · 03

Anatomy of a VLM forward pass

[Diagram] IMAGE → (patches) → VISION TOWER → (features) → PROJECTOR → (image tokens) → LLM ← TEXT TOKENS
Vision tower (ViT or SigLIP) → projector (linear or Q-Former) → image tokens flow into the LLM alongside text tokens. Native-multimodal models train this jointly; older designs bolt it on after.
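To make that data flow concrete, a toy PyTorch sketch with invented dimensions: a conv patchify stands in for the vision tower and a single linear layer for the projector, as in LLaVA-style designs. Nothing here is a real model; it only shows how image tokens enter the text stream.

PYTHON
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy data flow: vision tower -> projector -> shared LLM token stream."""
    def __init__(self, vis_dim=768, llm_dim=4096, patch=14):
        super().__init__()
        # Stand-in vision tower: conv patchify, as in a ViT stem.
        self.patchify = nn.Conv2d(3, vis_dim, kernel_size=patch, stride=patch)
        # LLaVA-1.5-style projector: a single linear layer.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, image, text_embeds):
        feats = self.patchify(image)              # (B, vis_dim, H/p, W/p)
        feats = feats.flatten(2).transpose(1, 2)  # (B, n_patches, vis_dim)
        img_tokens = self.projector(feats)        # (B, n_patches, llm_dim)
        # Image tokens are prepended to the text tokens; the LLM attends
        # over the combined sequence.
        return torch.cat([img_tokens, text_embeds], dim=1)

vlm = ToyVLM()
img = torch.randn(1, 3, 336, 336)
txt = torch.randn(1, 12, 4096)                    # 12 fake text-token embeddings
print(vlm(img, txt).shape)                        # (1, 588, 4096): 576 + 12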
CODEBLOCK · 04

Three lines: a JPG to structured JSON

PYTHON
 1  from openai import OpenAI
 2  from pydantic import BaseModel
 3  client = OpenAI()
 4
 5  class Invoice(BaseModel):
 6      vendor: str
 7      total: float
 8      currency: str
 9      line_items: list[str]
10
11  img = open("receipt.jpg", "rb").read()
12  import base64; b64 = base64.b64encode(img).decode()
13  resp = client.beta.chat.completions.parse(
14      model="gpt-5",
15      response_format=Invoice,
16      messages=[{"role": "user", "content": [
17          {"type": "text", "text": "Extract the receipt fields."},
18          {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}]}],
19  )
20  print(resp.choices[0].message.parsed)
Lines 5-9: the Pydantic schema is the ENTIRE structured output spec. Lines 14-15: response_format=Invoice forces the model to return a parseable Invoice, no JSON-repair regex. Line 20: .parsed is already a typed Python object. This is the OCR-free document AI pattern. No parse endpoint on your provider? See the fallback sketch below.
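That fallback, sketched: validate with Pydantic yourself and retry with the validation errors fed back into the prompt. call_vlm is a hypothetical stand-in for any raw chat call that returns a JSON string; the loop shape is the point, not the client.

PYTHON
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str
    line_items: list[str]

def extract(call_vlm, image_b64: str, max_tries: int = 3) -> Invoice:
    """Validate, retry, reject: for providers without a parse endpoint."""
    prompt = "Extract the receipt fields as JSON matching the Invoice schema."
    for _ in range(max_tries):
        raw = call_vlm(prompt, image_b64)      # hypothetical raw chat call
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation errors back so the model can self-correct.
            prompt = f"Fix this JSON to match the Invoice schema. Errors: {err}"
    raise ValueError(f"No valid Invoice after {max_tries} attempts")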
CHEATSHEETBLOCK · 05

5 rules every 2026 multimodal shipper knows

01 · Native-multimodal beats adapter-stitched. Closed: Gemini 2.5 Pro / GPT-5. Open: Qwen3-VL.
02 · OCR-free is the default. Send page images, get Pydantic JSON. Florence-2 only for region-grounded reads.
03 · ColPali + Qdrant beats text RAG on PDFs with diagrams, tables, and screenshots.
04 · Voice loops live or die at 300 ms. Deepgram + ElevenLabs Flash + a slim agent; measure every hop (timing sketch after this list).
05 · Eval with the 5-bench scorecard: MMMU + MathVista + Video-MME + RePOPE + (DocVQA or MMLongBench).
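A minimal sketch of 'measure every hop'. The 150/75/75 ms budget split is an assumption loosely derived from the vendor latencies quoted above; stt, llm_first_token, and tts_first_byte are hypothetical wrappers around your Deepgram / GPT-5 / ElevenLabs clients.

PYTHON
import time
from contextlib import contextmanager

BUDGET_MS = {"stt": 150, "llm": 75, "tts": 75}   # assumed split of a 300 ms loop

@contextmanager
def hop(name: str, report: dict):
    t0 = time.perf_counter()
    yield
    report[name] = (time.perf_counter() - t0) * 1000

def run_turn(audio_chunk, stt, llm_first_token, tts_first_byte):
    """One voice-agent turn, timed hop by hop. All three callables are
    hypothetical wrappers around your STT / LLM / TTS clients."""
    report: dict[str, float] = {}
    with hop("stt", report):
        text = stt(audio_chunk)              # time to first transcript
    with hop("llm", report):
        token = llm_first_token(text)        # time-to-first-token
    with hop("tts", report):
        tts_first_byte(token)                # time-to-first-byte of audio
    for name, ms in report.items():
        flag = "OK" if ms <= BUDGET_MS[name] else "OVER BUDGET"
        print(f"{name}: {ms:.0f} ms ({flag})")
    return report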
MINIGAMEBLOCK · 06 · RAPID-FIRE T/F

Quick check — true or false?

GPT-5 watches video frame-by-frame at 30 fps.
CLAIM 1/5
CONCEPTBLOCK · 07

What you'll ship in the full study

Eight lessons. Eight Docker projects. By the end you'll have:

  • A 3-line VLM playground that turns a JPG into typed JSON.
  • A model-comparison harness across Pixtral, Qwen2.5-VL, and Claude Opus on the same image.
  • An invoice/receipt extractor with Pydantic schema validation, lifted into your AP automation flow.
  • A ColPali + Qdrant multimodal RAG that beats text RAG on diagrammed PDFs (late-interaction index sketched below).
  • A sub-300ms voice agent loop you can drop in front of a phone number tomorrow.
  • A long-video Q&A app over a 1-hour MP4, on Gemini 2.5 Pro.
  • A SigLIP 2 + Qdrant image search engine, multilingual, billion-scale-ready.
  • An eval harness running MMMU / MathVista / Video-MME / RePOPE in CI.
  • An air-gapped local multimodal stack (Ollama + Qwen2.5-VL + Whisper + SigLIP 2).

Every Docker project ships with composeYaml, expectedStdout, and a 'lift to work' note explaining how to drop it into your team's repo.
INCLUDED: All projects are designed to be lifted, not demoed. The compose files use ${ENV} interpolation; specialist prompts are in versioned .md files; eval reports are HTML you can paste into Notion.
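The ColPali project hinges on one Qdrant feature: multi-vector points scored with MaxSim. A minimal collection-setup sketch, assuming qdrant-client ≥ 1.10 and 128-dim ColPali patch embeddings; the embedding forward pass itself is model-specific and left out.

PYTHON
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One point = one PDF page = MANY 128-dim vectors (one per image patch).
# MAX_SIM scoring gives ColPali-style late interaction at query time.
client.create_collection(
    collection_name="pdf_pages",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM,
        ),
    ),
)

def index_page(page_id: int, page_vectors: np.ndarray, source: str):
    """page_vectors: (n_patches, 128) array from a ColPali/ColQwen2
    forward pass (hypothetical upstream step)."""
    client.upsert(
        collection_name="pdf_pages",
        points=[models.PointStruct(
            id=page_id,
            vector=page_vectors.tolist(),     # list of 128-dim vectors
            payload={"source": source, "page": page_id},
        )],
    )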
LESSONCOMPLETEBLOCK · 08

That's the trailer.

NEXT: Lesson 1 · The VLM mental model
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Read VLM model cards critically · Working

    Identify vision tower / projector / decoder; modality coverage; native vs adapter-stitched; context window and resolution limits — before opening the API.

  • Pick a multimodal model from a 4-axis matrix · Production

    Decide along modalities × context length × open/closed × cost. Match GPT-5 / Gemini 2.5 Pro / Claude Opus 4.7 / Qwen3-VL / Llama 4 Scout / Pixtral to the job.

  • Build OCR-free document extraction → Pydantic JSON · Production

    Send base64 page images to GPT-5/Pixtral/Qwen2.5-VL with response_format=PydanticModel. Validate, retry, reject. Drop into AP automation flows.

  • Ship multimodal RAG with ColPali / late interaction · Production

    ColQwen2 + Qdrant multi-vector index over PDF page images. Skip OCR and chunking entirely. Measured: ~1s search on 25K pages.

  • Engineer a sub-300ms voice loop · Production

    Deepgram Nova-3 STT (~150ms first word) + GPT-5 streaming + ElevenLabs Flash v2.5 (~75ms TTFB), with VAD-driven barge-in. Or OpenAI Realtime API for the unified path.

  • Process long video with native 1M-token VLMs · Working

    Gemini 2.5 Pro for hour-long video Q&A (Video-MME 84.8). Frame-sampling fallback with decord/PyAV when content exceeds context; a sketch follows this skills list.

  • Build cross-modal search at scale (SigLIP 2 + Qdrant) · Production

    SigLIP 2 NaFlex embeddings (109 langs) → Qdrant HNSW. Billion-scale-ready. Replaces CLIP for any new build.

  • Run a 5-benchmark VLM evaluation in CI · Production

    MMMU + MathVista + Video-MME + DocVQA + RePOPE on every model swap. lmms-eval orchestrates; HTML report goes to Notion.

  • Detect & defend against multimodal hallucinationAdvanced

    POPE/RePOPE/HALLUSIONBench probes; grounded-prompting + cite-the-region prompts; refuse-when-unsure system prompts; eval gates in CI.

  • Deploy a fully air-gapped local multimodal stack · Advanced

    Ollama + Qwen2.5-VL-7B + whisper-large-v3-turbo + SigLIP 2 — on-prem, GPU-budgeted. The deployment regulated industries actually buy.
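The frame-sampling fallback promised above, as a minimal decord sketch. Uniform sampling; the 32-frame count and the meeting.mp4 path are illustrative choices, and the sampled frames go to whatever VLM client you use.

PYTHON
import numpy as np
from decord import VideoReader, cpu

def sample_frames(path: str, n_frames: int = 32) -> np.ndarray:
    """Uniformly sample n_frames RGB frames from a video for VLM input."""
    vr = VideoReader(path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, n_frames).astype(int)
    return vr.get_batch(indices).asnumpy()   # (n_frames, H, W, 3), uint8

frames = sample_frames("meeting.mp4")        # hypothetical input file
print(frames.shape)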

Career & income delta

Career moves
  • Title yourself credibly as 'multimodal AI engineer' — one of the 2026 hot search terms on senior IC postings.
  • Lead a Document AI / IDP initiative — the highest-ROI multimodal use case in 2026 ($27B+ market by 2030).
  • Own the voice-agent platform at a B2B SaaS — most product roadmaps have it; few teams have shipped it.
  • Pick up contracting work at $200-400/hr replacing fragile OCR + regex pipelines with one VLM call.
  • Move from a generic backend role into an AI-platform team — multimodal experience is the differentiator.
Income impact
  • $20-50K bump for senior ICs adding production multimodal to their resume in 2026.
  • $50-150K bump moving from a generic backend role to an AI-platform / IDP / voice-agent team.
  • Freelance / consulting rates: $200-400/hr — 'we have 5,000 PDFs and need them queryable' is the most common 2026 inquiry.
  • Enterprise demos / sales-engineering: closing one 6-figure deal per quarter often requires a working multimodal RAG over the customer's corpus.
  • Document AI specialists in regulated industries (finance, legal, healthcare) command 20-40% premiums over generic AI engineers.
Market resilience
  • Multimodal architecture skills (vision tower / projector / decoder mental model) survive every model swap.
  • ColPali / late-interaction is becoming a commodity skill — but the engineers who ship it FIRST own the platform decisions.
  • Air-gapped on-prem multimodal stacks remain in demand for any regulated industry — Ollama + Qwen2.5-VL is durable.
  • Eval discipline (5-bench scorecard, hallucination probes) carries forward to whatever 2027 model arrives.
  • Voice-agent latency engineering — sub-300ms loops — remains a moat; most teams will struggle to ship it.