CHEATSHEET · 01 · Multimodal master cheatsheet
Architectural primitives
- ·Vision tower: ViT (CLIP/SigLIP) or convolutional backbone — patchifies the image
- ·Projector: linear, MLP, or Q-Former — maps vision features into the LLM's embedding space
- ·LLM decoder: text generation conditioned on image + text tokens
- ·Native-multimodal: trained jointly across modalities (GPT-5, Gemini 2.5 Pro, Qwen3-VL)
- ·Adapter-stitched: vision frozen, projector trained — older but still common (LLaVA-1.5-style)
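A minimal PyTorch sketch of the adapter-stitched pattern: the vision tower stays frozen and only the projector learns to map patch features into the LLM's embedding space. The `vision_tower` and `llm` modules and the dimensions are hypothetical stand-ins, not any specific checkpoint.

```python
import torch
import torch.nn as nn

class AdapterStitchedVLM(nn.Module):
    """Toy sketch: frozen vision tower + trainable MLP projector + LLM decoder."""

    def __init__(self, vision_tower: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_tower = vision_tower.eval()        # e.g. a CLIP/SigLIP ViT
        for p in self.vision_tower.parameters():
            p.requires_grad = False                    # vision stays frozen
        self.projector = nn.Sequential(                # LLaVA-1.5-style 2-layer MLP
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                 # decoder-only LM (HF-style)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_tower(pixel_values)  # (B, n_patches, vision_dim)
        image_tokens = self.projector(patch_feats)     # (B, n_patches, llm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens first
        return self.llm(inputs_embeds=inputs)          # condition generation on both
```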
When to use which model (April 2026)
- ·GPT-5 — text + image + audio + video frames; best for general production with Realtime API
- ·Gemini 2.5 Pro — 1M-token context; best for long-video Q&A, audio-out, long-doc analysis
- ·Claude Opus 4.7 — highest image resolution (~3.75 MP); best for dense UI / diagram screenshots
- ·Qwen3-VL-235B — open SOTA on MMMU (80.6) and MathVista (85.8); use via vLLM
- ·Qwen2.5-VL-7B — sweet-spot open VLM; runs on a single A100; native dynamic resolution
- ·Llama 4 Scout — 10M-token context (open); experimental but unique in open weights
- ·Pixtral Large — Mistral's multimodal flagship; good document analytics; EU-friendly
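A minimal call sketch using litellm (covered in the library list below) to send an image to one of the hosted models above; the model string and image URL are placeholders for whatever your provider and litellm version expose.

```python
import litellm

# Placeholder model id and URL; litellm routes the same OpenAI-style payload to any provider.
response = litellm.completion(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this screenshot."},
            {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```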
Best-of-breed libraries
- ·transformers AutoModelForVision2Seq — local VLM inference (sketch after this list)
- ·vLLM — production VLM serving (Qwen, Pixtral, Llama 4)
- ·litellm — multimodal calls across providers (image_url + audio)
- ·LlamaIndex multimodal indexes — clean abstractions for image-page retrieval
- ·ColPali / ColQwen2 — late-interaction PDF retrieval (no OCR, no chunking)
- ·SigLIP 2 — multilingual image embeddings (Feb 2025, 109 langs)
- ·Florence-2-large — region-grounded OCR + dense captioning (MIT-licensed)
- ·Pixeltable — declarative incremental multimodal data ops
- ·Qdrant — vector DB with native ColPali multi-vector support
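A minimal local-inference sketch with `AutoModelForVision2Seq`; the LLaVA-1.5 checkpoint id and its `USER:/ASSISTANT:` prompt template are just one example, and other Vision2Seq checkpoints expect different chat templates.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"   # example checkpoint; any Vision2Seq model works
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("invoice.png")
prompt = "USER: <image>\nWhat is the total amount due? ASSISTANT:"  # LLaVA-1.5 template
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```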
Production guardrails
- ·Always validate outputs with Pydantic — VLMs hallucinate field names too (sketch after this list)
- ·Cap image resolution + count per request — costs scale with pixel tokens
- ·Pre-resize images server-side to model spec (e.g. 2048x2048 max for GPT-5)
- ·Token-budget audio segments (Whisper: <30s chunks for stable transcription)
- ·Enforce JSON schema on the wire AND in code — model + parser
- ·Log original-input hashes for deterministic replay
- ·Run hallucination probes (POPE, RePOPE) on every model upgrade
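A sketch of the resize → validate → log loop, assuming Pydantic v2 and Pillow; the schema fields, 2048-pixel cap, and file name are illustrative, and the VLM call itself is omitted.

```python
import hashlib
from PIL import Image
from pydantic import BaseModel, ValidationError

class InvoiceFields(BaseModel):
    """Schema enforced in code, not just on the wire."""
    vendor: str
    total: float
    currency: str

def preprocess(path: str, max_side: int = 2048) -> Image.Image:
    """Pre-resize server-side so pixel-token cost is capped by us, not the provider."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))      # in-place, preserves aspect ratio
    return img

def parse_or_reject(raw_json: str) -> InvoiceFields:
    """Validate the model's JSON; hallucinated or missing fields raise here."""
    try:
        return InvoiceFields.model_validate_json(raw_json)
    except ValidationError:
        raise                                # retry with a repair prompt or route to review

# Hash the original bytes so the exact input can be replayed deterministically.
with open("invoice.png", "rb") as f:
    input_digest = hashlib.sha256(f.read()).hexdigest()
```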
CHEATSHEET · 02 · Eval & benchmark scorecard · 2026
The 5-benchmark scorecard plus two hallucination probes (use this in CI; gate sketch after the list)
- ·MMMU — college-exam multidomain VQA. Anchor: Qwen3-VL-235B 80.6; frontier 80-85
- ·MathVista — visual math reasoning. Anchor: Qwen3-VL-235B 85.8
- ·Video-MME — 11s to 1hr video Q&A. Anchor: Gemini 2.5 Pro 84.8 (current leader)
- ·DocVQA — document VQA (saturated). Anchor: top models 95+ — use as floor, not differentiator
- ·MMLongBench-Doc — long-document multimodal (still hard). Anchor: Qwen3-VL-235B 57.0
- ·RePOPE — re-annotated POPE for object hallucination (use over original POPE)
- ·HallusionBench — language hallucination + visual illusion probes (still unsolved at frontier)
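A hypothetical CI gate over the scorecard; the floor values below are placeholders to calibrate against your own baseline runs, not published scores.

```python
# Hypothetical floors; calibrate against a baseline run, then fail CI on regression.
SCORECARD_FLOORS = {
    "mmmu": 55.0,
    "mathvista": 60.0,
    "videomme": 60.0,
    "docvqa": 90.0,             # saturated: a floor, not a differentiator
    "mmlongbench_doc": 35.0,
}

def check_scorecard(results: dict[str, float]) -> None:
    """Fail the pipeline if any benchmark drops below its floor."""
    failures = {name: score for name, score in results.items()
                if score < SCORECARD_FLOORS.get(name, 0.0)}
    if failures:
        raise SystemExit(f"Scorecard regression below floor: {failures}")
```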
What to skip / deprecated
- ·Original POPE — superseded by RePOPE (released 22 Apr 2025)
- ·DocVQA alone — saturated; pair with MMLongBench-Doc to differentiate
- ·SEED-Bench — superseded by MMMU-Pro for capability-mix testing
- ·Treating text-only GPT-4 as a 'multimodal baseline' — use GPT-4o or GPT-5 instead
- ·Pixtral 12B — deprecated; use Pixtral Large or Qwen2.5-VL
Library picks for running evals
- ·lmms-eval — runs MMMU/MathVista/MMBench/POPE in one call (open); invocation sketch after this list
- ·vlmeval (VLMEvalKit) — alternative runner; good Hugging Face integration
- ·OpenCompass — multi-bench orchestrator if you're running 10+ at once
- ·Inspect (UK AISI) — for safety-style probes alongside capability evals
- ·A 7B model + the 5-bench scorecard runs on a single A100 in <30 min
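A sketch of driving lmms-eval from Python; the model adapter name, task ids, and flags differ between lmms-eval versions, so treat them as assumptions and check `python -m lmms_eval --help` and the task registry first.

```python
import subprocess

# Flags and task names are illustrative; verify against your installed lmms-eval version.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "qwen2_5_vl",                                   # assumed adapter name
    "--model_args", "pretrained=Qwen/Qwen2.5-VL-7B-Instruct",
    "--tasks", "mmmu_val,mathvista_testmini,pope",             # assumed task ids
    "--batch_size", "1",
    "--output_path", "./eval_logs/",
]
subprocess.run(cmd, check=True)
```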
Latency anchors (production targets)
- ·Voice loop end-to-end: <300ms perceptual ceiling
- ·Deepgram Nova-3 STT first word: 150-184ms
- ·ElevenLabs Flash v2.5 TTS time-to-first-byte: ~75ms
- ·Deepgram Aura-2 TTS: ~90ms
- ·GPT-5 first-token typically 300-700ms — voice agents need streaming + interrupt handling
- ·ColPali retrieval over 25K pages: ~1s with HNSW-GPU