Quick Intro · ~7 min · MMOD

Multimodal AI


A scannable trailer of the 8-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
MMOD · 7 MIN PREVIEW

Stop sending text to text-only models.

In 2026 the highest-leverage AI engineers are the ones shipping vision, audio, and video — not just text. Gemini 2.5 Pro reads hour-long videos. Pixtral extracts invoices without OCR. ColPali retrieves PDF pages without chunking. This trailer shows what 'multimodal' actually means in production today.

CONCEPTBLOCK · 02

What 'multimodal' really means in 2026

A multimodal model takes more than text as input or output: image, audio, video, sometimes action commands. The frontier today is **natively multimodal**, trained jointly across modalities (GPT-5, Gemini 2.5 Pro), versus **adapter-stitched** older designs (LLaVA-1.5-style: ViT + projector + LLM bolted together). The deployment surface that matters in 2026:

  • **Document AI**: invoice/receipt extraction with Pydantic-typed JSON, no separate OCR step.
  • **Multimodal RAG**: ColPali / late interaction; index PDF page images, skip OCR and chunking.
  • **Voice agents**: sub-300ms STT → LLM → TTS loops via Deepgram + ElevenLabs or OpenAI Realtime.
  • **Long-video Q&A**: Gemini 2.5 Pro's 1M-token context lets you ingest a full hour of video.
  • **Cross-modal search**: SigLIP 2 (Feb 2025, 109 languages) + Qdrant for billion-scale image retrieval.
  • **Air-gapped local**: Qwen2.5-VL-7B + whisper-large-v3-turbo + SigLIP 2 in Ollama.
TIP: If a colleague says 'we will just OCR the PDFs first,' you have a 2024 stack. The 2026 default is OCR-free: send the page image to a VLM, get back Pydantic-typed JSON.
WATCH OUT: GPT-5 doesn't 'watch' video frame by frame at 30 fps; it samples. Treat every closed multimodal model card carefully: modality support and frame rate are separate axes.
GOTCHA: Embedding PDF page images with a text-only embedding model (text-embedding-3-small fed raw byte buffers) ships in a depressingly large number of repos. Use SigLIP 2 or ColPali for image content; never push image bytes through a text embedder. A minimal sketch of the correct pattern follows.
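A sketch of that correct pattern, assuming the google/siglip2-base-patch16-224 checkpoint and that the transformers get_image_features call carries over from the SigLIP to the SigLIP 2 model classes; verify against your installed transformers version before shipping.

PYTHON
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT)

# Embed the page as an IMAGE, not as a byte buffer through a text embedder.
page = Image.open("page_001.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")

with torch.no_grad():
    vec = model.get_image_features(**inputs)      # (1, hidden_dim)
vec = vec / vec.norm(dim=-1, keepdim=True)        # normalize for cosine search

print(vec.shape)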
DIAGRAMBLOCK · 03

Anatomy of a VLM forward pass

[Diagram] IMAGE → (patches) → VISION TOWER → (features) → PROJECTOR → (image tokens) → LLM ← TEXT TOKENS
Vision tower (ViT or SigLIP) → projector (linear or Q-Former) → image tokens flow into the LLM alongside text tokens. Native-multimodal models train this jointly; older designs bolt it on after.
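To make that data flow concrete, a toy PyTorch sketch with invented dimensions: a conv patchify stands in for the vision tower and a single linear layer for the projector, as in LLaVA-style designs. Nothing here is a real model; it only shows how image tokens enter the text stream.

PYTHON
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy data flow: vision tower -> projector -> shared LLM token stream."""
    def __init__(self, vis_dim=768, llm_dim=4096, patch=14):
        super().__init__()
        # Stand-in vision tower: conv patchify, as in a ViT stem.
        self.patchify = nn.Conv2d(3, vis_dim, kernel_size=patch, stride=patch)
        # LLaVA-1.5-style projector: a single linear layer.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, image, text_embeds):
        feats = self.patchify(image)              # (B, vis_dim, H/p, W/p)
        feats = feats.flatten(2).transpose(1, 2)  # (B, n_patches, vis_dim)
        img_tokens = self.projector(feats)        # (B, n_patches, llm_dim)
        # Image tokens are prepended to the text tokens; the LLM attends
        # over the combined sequence.
        return torch.cat([img_tokens, text_embeds], dim=1)

vlm = ToyVLM()
img = torch.randn(1, 3, 336, 336)
txt = torch.randn(1, 12, 4096)                    # 12 fake text-token embeddings
print(vlm(img, txt).shape)                        # (1, 588, 4096): 576 + 12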
CODEBLOCK · 04

Three lines: a JPG to structured JSON

PYTHON
 1  from openai import OpenAI
 2  from pydantic import BaseModel
 3  client = OpenAI()
 4
 5  class Invoice(BaseModel):
 6      vendor: str
 7      total: float
 8      currency: str
 9      line_items: list[str]
10
11  img = open("receipt.jpg", "rb").read()
12  import base64; b64 = base64.b64encode(img).decode()
13  resp = client.beta.chat.completions.parse(
14      model="gpt-5",
15      response_format=Invoice,
16      messages=[{"role": "user", "content": [
17          {"type": "text", "text": "Extract the receipt fields."},
18          {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}]}],
19  )
20  print(resp.choices[0].message.parsed)
Lines 5-9: the Pydantic schema is the ENTIRE structured output spec. Lines 14-15: response_format=Invoice forces the model to return a parseable Invoice, no JSON-repair regex. Line 20: .parsed is already a typed Python object. This is the OCR-free document AI pattern. No parse endpoint on your provider? See the fallback sketch below.
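That fallback, sketched: validate with Pydantic yourself and retry with the validation errors fed back into the prompt. call_vlm is a hypothetical stand-in for any raw chat call that returns a JSON string; the loop shape is the point, not the client.

PYTHON
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str
    line_items: list[str]

def extract(call_vlm, image_b64: str, max_tries: int = 3) -> Invoice:
    """Validate, retry, reject: for providers without a parse endpoint."""
    prompt = "Extract the receipt fields as JSON matching the Invoice schema."
    for _ in range(max_tries):
        raw = call_vlm(prompt, image_b64)      # hypothetical raw chat call
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation errors back so the model can self-correct.
            prompt = f"Fix this JSON to match the Invoice schema. Errors: {err}"
    raise ValueError(f"No valid Invoice after {max_tries} attempts")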
CHEATSHEETBLOCK · 05

5 rules every 2026 multimodal shipper knows

01 · Native-multimodal beats adapter-stitched. Closed: Gemini 2.5 Pro / GPT-5. Open: Qwen3-VL.
02 · OCR-free is the default. Send page images, get Pydantic JSON. Florence-2 only for region-grounded reads.
03 · ColPali + Qdrant beats text RAG on PDFs with diagrams, tables, and screenshots.
04 · Voice loops live or die at 300 ms. Deepgram + ElevenLabs Flash + a slim agent; measure every hop (timing sketch after this list).
05 · Eval with the 5-bench scorecard: MMMU + MathVista + Video-MME + RePOPE + (DocVQA or MMLongBench).
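A minimal sketch of 'measure every hop'. The 150/75/75 ms budget split is an assumption loosely derived from the vendor latencies quoted above; stt, llm_first_token, and tts_first_byte are hypothetical wrappers around your Deepgram / GPT-5 / ElevenLabs clients.

PYTHON
import time
from contextlib import contextmanager

BUDGET_MS = {"stt": 150, "llm": 75, "tts": 75}   # assumed split of a 300 ms loop

@contextmanager
def hop(name: str, report: dict):
    t0 = time.perf_counter()
    yield
    report[name] = (time.perf_counter() - t0) * 1000

def run_turn(audio_chunk, stt, llm_first_token, tts_first_byte):
    """One voice-agent turn, timed hop by hop. All three callables are
    hypothetical wrappers around your STT / LLM / TTS clients."""
    report: dict[str, float] = {}
    with hop("stt", report):
        text = stt(audio_chunk)              # time to first transcript
    with hop("llm", report):
        token = llm_first_token(text)        # time-to-first-token
    with hop("tts", report):
        tts_first_byte(token)                # time-to-first-byte of audio
    for name, ms in report.items():
        flag = "OK" if ms <= BUDGET_MS[name] else "OVER BUDGET"
        print(f"{name}: {ms:.0f} ms ({flag})")
    return report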
MINIGAMEBLOCK · 06 · RAPID-FIRE T/F

Quick check — true or false?

GPT-5 watches video frame-by-frame at 30 fps.
CLAIM 1/5
CONCEPTBLOCK · 07

What you'll ship in the full study

Eight lessons. Eight Docker projects. By the end you'll have:

  • A 3-line VLM playground that turns a JPG into typed JSON.
  • A model-comparison harness across Pixtral, Qwen2.5-VL, and Claude Opus on the same image.
  • An invoice/receipt extractor with Pydantic schema validation, lifted into your AP automation flow.
  • A ColPali + Qdrant multimodal RAG that beats text RAG on diagrammed PDFs (late-interaction index sketched below).
  • A sub-300ms voice agent loop you can drop in front of a phone number tomorrow.
  • A long-video Q&A app over a 1-hour MP4, on Gemini 2.5 Pro.
  • A SigLIP 2 + Qdrant image search engine, multilingual, billion-scale-ready.
  • An eval harness running MMMU / MathVista / Video-MME / RePOPE in CI.
  • An air-gapped local multimodal stack (Ollama + Qwen2.5-VL + Whisper + SigLIP 2).

Every Docker project ships with composeYaml, expectedStdout, and a 'lift to work' note explaining how to drop it into your team's repo.
INCLUDED: All projects are designed to be lifted, not demoed. The compose files use ${ENV} interpolation; specialist prompts are in versioned .md files; eval reports are HTML you can paste into Notion.
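The ColPali project hinges on one Qdrant feature: multi-vector points scored with MaxSim. A minimal collection-setup sketch, assuming qdrant-client ≥ 1.10 and 128-dim ColPali patch embeddings; the embedding forward pass itself is model-specific and left out.

PYTHON
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One point = one PDF page = MANY 128-dim vectors (one per image patch).
# MAX_SIM scoring gives ColPali-style late interaction at query time.
client.create_collection(
    collection_name="pdf_pages",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM,
        ),
    ),
)

def index_page(page_id: int, page_vectors: np.ndarray, source: str):
    """page_vectors: (n_patches, 128) array from a ColPali/ColQwen2
    forward pass (hypothetical upstream step)."""
    client.upsert(
        collection_name="pdf_pages",
        points=[models.PointStruct(
            id=page_id,
            vector=page_vectors.tolist(),     # list of 128-dim vectors
            payload={"source": source, "page": page_id},
        )],
    )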
LESSONCOMPLETEBLOCK · 08

That's the trailer.

NEXT: Lesson 1 · The VLM mental model
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Read VLM model cards critically · Working

    Identify vision tower / projector / decoder; modality coverage; native vs adapter-stitched; context window and resolution limits — before opening the API.

  • Pick a multimodal model from a 4-axis matrix · Production

    Decide along modalities × context length × open/closed × cost. Match GPT-5 / Gemini 2.5 Pro / Claude Opus 4.7 / Qwen3-VL / Llama 4 Scout / Pixtral to the job.

  • Build OCR-free document extraction → Pydantic JSON · Production

    Send base64 page images to GPT-5/Pixtral/Qwen2.5-VL with response_format=PydanticModel. Validate, retry, reject. Drop into AP automation flows.

  • Ship multimodal RAG with ColPali / late interaction · Production

    ColQwen2 + Qdrant multi-vector index over PDF page images. Skip OCR and chunking entirely. Measured: ~1s search on 25K pages.

  • Engineer a sub-300ms voice loop · Production

    Deepgram Nova-3 STT (~150ms first word) + GPT-5 streaming + ElevenLabs Flash v2.5 (~75ms TTFB), with VAD-driven barge-in. Or OpenAI Realtime API for the unified path.

  • Process long video with native 1M-token VLMs · Working

    Gemini 2.5 Pro for hour-long video Q&A (Video-MME 84.8). Frame-sampling fallback with decord/PyAV when content exceeds context; a sketch follows this skills list.

  • Build cross-modal search at scale (SigLIP 2 + Qdrant) · Production

    SigLIP 2 NaFlex embeddings (109 langs) → Qdrant HNSW. Billion-scale-ready. Replaces CLIP for any new build.

  • Run a 5-benchmark VLM evaluation in CI · Production

    MMMU + MathVista + Video-MME + DocVQA + RePOPE on every model swap. lmms-eval orchestrates; HTML report goes to Notion.

  • Detect & defend against multimodal hallucinationAdvanced

    POPE/RePOPE/HALLUSIONBench probes; grounded-prompting + cite-the-region prompts; refuse-when-unsure system prompts; eval gates in CI.

  • Deploy a fully air-gapped local multimodal stack · Advanced

    Ollama + Qwen2.5-VL-7B + whisper-large-v3-turbo + SigLIP 2 — on-prem, GPU-budgeted. The deployment regulated industries actually buy.
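The frame-sampling fallback promised above, as a minimal decord sketch. Uniform sampling; the 32-frame count and the meeting.mp4 path are illustrative choices, and the sampled frames go to whatever VLM client you use.

PYTHON
import numpy as np
from decord import VideoReader, cpu

def sample_frames(path: str, n_frames: int = 32) -> np.ndarray:
    """Uniformly sample n_frames RGB frames from a video for VLM input."""
    vr = VideoReader(path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, n_frames).astype(int)
    return vr.get_batch(indices).asnumpy()   # (n_frames, H, W, 3), uint8

frames = sample_frames("meeting.mp4")        # hypothetical input file
print(frames.shape)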

Career & income delta

Career moves
  • Title yourself credibly as 'multimodal AI engineer' — one of the 2026 hot search terms on senior IC postings.
  • Lead a Document AI / IDP initiative — the highest-ROI multimodal use case in 2026 ($27B+ market by 2030).
  • Own the voice-agent platform at a B2B SaaS — most product roadmaps have it; few teams have shipped it.
  • Pick up contracting work at $200-400/hr replacing fragile OCR + regex pipelines with one VLM call.
  • Move from a generic backend role into an AI-platform team — multimodal experience is the differentiator.
Income impact
  • $20-50K bump for senior ICs adding production multimodal to their resume in 2026.
  • $50-150K bump moving from a generic backend role to an AI-platform / IDP / voice-agent team.
  • Freelance / consulting rates: $200-400/hr — 'we have 5,000 PDFs and need them queryable' is the most common 2026 inquiry.
  • Enterprise demos / sales-engineering: closing one 6-figure deal per quarter often requires a working multimodal RAG over the customer's corpus.
  • Document AI specialists in regulated industries (finance, legal, healthcare) command 20-40% premiums over generic AI engineers.
Market resilience
  • Multimodal architecture skills (vision tower / projector / decoder mental model) survive every model swap.
  • ColPali / late-interaction is becoming a commodity skill — but the engineers who ship it FIRST own the platform decisions.
  • Air-gapped on-prem multimodal stacks remain in demand for any regulated industry — Ollama + Qwen2.5-VL is durable.
  • Eval discipline (5-bench scorecard, hallucination probes) carries forward to whatever 2027 model arrives.
  • Voice-agent latency engineering — sub-300ms loops — remains a moat; most teams will struggle to ship it.