MMOD Course

Multimodal AI

Lessons: 8 modules
Total: 82m full study
Quick: 7m trailer
Projects: 8 Docker labs
CHEATSHEET · 01 · Multimodal master cheatsheet
Architectural primitives
  • Vision tower: ViT (CLIP/SigLIP) or convolutional backbone — patchifies the image
  • Projector: linear, MLP, or Q-Former — maps vision features into the LLM's embedding space
  • LLM decoder: text generation conditioned on image + text tokens
  • Native-multimodal: trained jointly across modalities (GPT-5, Gemini 2.5 Pro, Qwen3-VL)
  • Adapter-stitched: vision frozen, projector trained — older but still common (LLaVA-1.5-style); see the forward-pass sketch after this list
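To make the three primitives concrete, here is a minimal adapter-stitched forward pass as a PyTorch sketch. The class name, feature dimensions, and the assumption that the vision tower returns (batch, patches, dim) features are illustrative, not any specific model's API.

import torch
import torch.nn as nn

class AdapterStitchedVLM(nn.Module):
    """Illustrative sketch: frozen vision tower + trainable projector + LLM decoder."""

    def __init__(self, vision_tower: nn.Module, llm: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.vision_tower = vision_tower.eval()   # frozen ViT (SigLIP/CLIP-style encoder)
        for p in self.vision_tower.parameters():
            p.requires_grad = False
        self.projector = nn.Sequential(           # LLaVA-1.5-style two-layer MLP projector
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                             # decoder-only language model

    def forward(self, pixel_values, text_embeds):
        # 1. Patchify and encode the image -> (batch, num_patches, vision_dim)
        with torch.no_grad():
            patch_feats = self.vision_tower(pixel_values)
        # 2. Project vision features into the LLM's embedding space
        image_tokens = self.projector(patch_feats)   # (batch, num_patches, llm_dim)
        # 3. Prepend image tokens to the text embeddings and decode as usual
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)

Native-multimodal models skip this stitching step: the same backbone is trained jointly on all modalities, so there is no frozen tower and no separately trained projector.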
When to use which model (April 2026)
  • GPT-5 — text + image + audio + video frames; best for general production with Realtime API
  • Gemini 2.5 Pro — 1M-token context; best for long-video Q&A, audio-out, long-doc analysis
  • Claude Opus 4.7 — highest image resolution (~3.75 MP); best for dense UI / diagram screenshots
  • Qwen3-VL-235B — open SOTA on MMMU (80.6) and MathVista (85.8); use via vLLM
  • Qwen2.5-VL-7B — sweet-spot open VLM; runs on a single A100; native dynamic resolution
  • Llama 4 Scout — 10M-token context (open); experimental but unique in open weights
  • Pixtral Large — Mistral's multimodal flagship; good document analytics; EU-friendly (a provider-agnostic call sketch follows this list)
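The easiest way to A/B the hosted options above is one OpenAI-style interface. A minimal sketch using litellm, assuming provider API keys are already set in the environment; the model strings are examples and should match whatever identifiers your providers actually expose.

import litellm

def describe_image(image_url: str, model: str = "gpt-4o") -> str:
    """Send one text + image_url request through litellm; swap `model` to change provider."""
    response = litellm.completion(
        model=model,  # e.g. "gpt-4o" or "gemini/gemini-2.5-pro"; check provider docs for exact names
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart in two sentences."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=200,
    )
    return response.choices[0].message.content

# print(describe_image("https://example.com/chart.png"))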
Best-of-breed libraries
  • transformers AutoModelForVision2Seq — local VLM inference (usage sketch after this list)
  • vLLM — production VLM serving (Qwen, Pixtral, Llama 4)
  • litellm — multimodal calls across providers (image_url + audio)
  • LlamaIndex multimodal indexes — clean abstractions for image-page retrieval
  • ColPali / ColQwen2 — late-interaction PDF retrieval (no OCR, no chunking)
  • SigLIP 2 — multilingual image embeddings (Feb 2025, 109 langs)
  • Florence-2-large — region-grounded OCR + dense captioning (MIT-licensed)
  • Pixeltable — declarative incremental multimodal data ops
  • Qdrant — vector DB with native ColPali multi-vector support
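For local inference with the first library on the list, the generic transformers pattern looks roughly like this. The checkpoint name, the local file, and the chat-template message keys are assumptions; some VLM families need their own image-preprocessing helpers, so treat this as a sketch rather than a drop-in recipe.

from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example checkpoint; any Vision2Seq-compatible VLM works
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

image = Image.open("invoice_page.png")  # hypothetical local file
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the invoice total as plain text."},
    ],
}]

# Build the prompt from the model's chat template, then tokenize text + pixels together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])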
Production guardrails
  • Always validate outputs with Pydantic — VLMs hallucinate field names too (see the sketch after this list)
  • Cap image resolution + count per request — costs scale with pixel tokens
  • Pre-resize images server-side to model spec (e.g. 2048x2048 max for GPT-5)
  • Token-budget audio segments (Whisper: <30s chunks for stable transcription)
  • Enforce JSON schema on the wire AND in code — model + parser
  • Log original-input hashes for deterministic replay
  • Run hallucination probes (POPE, RePOPE) on every model upgrade
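A minimal sketch of two of these guardrails together: pre-resize on the server before upload, and validate the model's JSON against a Pydantic schema before anything downstream touches it. The field names and the 2048 px cap are illustrative, not a spec.

import json
from PIL import Image
from pydantic import BaseModel, ValidationError

MAX_SIDE = 2048  # illustrative cap; match your target model's documented limit

def preresize(path: str, out_path: str) -> str:
    """Downscale server-side so you never pay for more pixel tokens than needed."""
    img = Image.open(path)
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # preserves aspect ratio, never upscales
    img.save(out_path)
    return out_path

class InvoiceFields(BaseModel):
    vendor: str
    total: float
    currency: str

def parse_vlm_output(raw_text: str) -> InvoiceFields | None:
    """Enforce the schema in code even if the provider already enforced it on the wire."""
    try:
        return InvoiceFields.model_validate(json.loads(raw_text))
    except (json.JSONDecodeError, ValidationError):
        return None  # caller decides: retry, re-prompt, or route to human review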
CHEATSHEET · 02 · Eval & benchmark scorecard · 2026
The 5-benchmark scorecard (use this in CI; gate sketch after this list)
  • MMMU — college-exam multidomain VQA. Anchor: Qwen3-VL-235B 80.6; frontier 80-85
  • MathVista — visual math reasoning. Anchor: Qwen3-VL-235B 85.8
  • Video-MME — 11s to 1hr video Q&A. Anchor: Gemini 2.5 Pro 84.8 (current leader)
  • DocVQA — document VQA (saturated). Anchor: top models 95+ — use as floor, not differentiator
  • MMLongBench-Doc — long-document multimodal (still hard). Anchor: Qwen3-VL-235B 57.0
  • RePOPE — re-annotated POPE for object hallucination; add-on probe beyond the core five (use over original POPE)
  • HallusionBench — language hallucination + visual illusion; add-on probe (still unsolved at frontier)
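One way to wire the scorecard into CI, as the heading suggests: compare fresh eval numbers against the anchors and fail the build on regression. The floor values and the results-file layout here are assumptions; adapt them to whatever your eval runner emits.

import json
import sys

# Illustrative floors derived from the anchors above, minus a small tolerance.
FLOORS = {
    "mmmu": 78.0,
    "mathvista": 83.0,
    "video_mme": 82.0,
    "docvqa": 94.0,            # saturated: treat as a floor, not a differentiator
    "mmlongbench_doc": 55.0,
}

def gate(results_path: str) -> int:
    """Return a non-zero exit code if any benchmark fell below its floor."""
    results = json.load(open(results_path))   # assumed format: {"mmmu": 80.6, ...}
    failures = {k: v for k, v in results.items() if k in FLOORS and v < FLOORS[k]}
    for name, score in failures.items():
        print(f"FAIL {name}: {score:.1f} < floor {FLOORS[name]:.1f}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))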
What to skip / deprecated
  • Original POPE — superseded by RePOPE (released 22 Apr 2025)
  • DocVQA alone — saturated; pair with MMLongBench-Doc to differentiate
  • SEED-Bench — superseded by MMMU-Pro for capability-mix testing
  • GPT-4 (text-only) treated as 'multimodal baseline' — use GPT-4o or GPT-5 instead
  • Pixtral 12B — deprecated; use Pixtral Large or Qwen2.5-VL
Library picks for running evals
  • lmms-eval — runs MMMU/MathVista/MMBench/POPE in one call (open); invocation sketch after this list
  • vlmeval — alt runner; good HF integration
  • OpenCompass — multi-bench orchestrator if you're running 10+ at once
  • Inspect (UK AISI) — for safety-style probes alongside capability evals
  • Practical budget: the 5-bench scorecard on a 7B model fits on a single A100 in <30 min
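A representative lmms-eval invocation wrapped in a small Python launcher. The registered model and task identifiers vary across lmms-eval versions, so the ones below are assumptions; list the tasks your installed version ships before relying on them.

import subprocess

# Illustrative identifiers: confirm the registered model/task names in your lmms-eval install.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "qwen2_5_vl",
    "--model_args", "pretrained=Qwen/Qwen2.5-VL-7B-Instruct",
    "--tasks", "mmmu_val,mathvista_testmini",  # extend with the rest of the scorecard once names are confirmed
    "--batch_size", "1",
    "--output_path", "./eval_results",
    "--log_samples",
]
subprocess.run(cmd, check=True)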
Latency anchors (production targets)
  • Voice loop end-to-end: <300ms perceptual ceiling
  • Deepgram Nova-3 STT first word: 150-184ms
  • ElevenLabs Flash v2.5 TTS time-to-first-byte: ~75ms
  • Deepgram Aura-2 TTS: ~90ms
  • GPT-5 first-token typically 300-700ms — voice agents need streaming + interrupt handling (budget sketch after this list)
  • ColPali retrieval over 25K pages: ~1s with HNSW-GPU
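A quick budget check with the anchors above, under the simplifying assumption that the stages run sequentially and network overhead is ignored: chaining the per-stage "time to first output" figures already misses the 300 ms ceiling, which is why overlap and interrupt handling matter.

# Chain the per-stage "time to first output" anchors (milliseconds) end to end.
PERCEPTUAL_CEILING_MS = 300

stages = {
    "stt_first_word":  170,  # Deepgram Nova-3, midpoint of 150-184 ms
    "llm_first_token": 500,  # GPT-5-class first token, midpoint of 300-700 ms
    "tts_first_byte":   75,  # ElevenLabs Flash v2.5
}

total = sum(stages.values())  # ~745 ms even before network overhead
print(f"first audible byte after ~{total} ms; ceiling is {PERCEPTUAL_CEILING_MS} ms")
# Sequential stages alone miss the ceiling, so production voice agents overlap them:
# feed the LLM partial transcripts, start TTS on the first tokens, and support
# barge-in (stopping playback the moment the user speaks).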