CHEATSHEET · 01 · Multimodal master cheatsheet
Architectural primitives
- ·Vision tower: ViT (CLIP/SigLIP) or convolutional backbone — patchifies the image
- ·Projector: linear, MLP, or Q-Former — maps vision features into the LLM's embedding space
- ·LLM decoder: text generation conditioned on image + text tokens
- ·Native-multimodal: trained jointly across modalities (GPT-5, Gemini 2.5 Pro, Qwen3-VL)
- ·Adapter-stitched: vision frozen, projector trained — older but still common (LLaVA-1.5-style)
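A minimal PyTorch sketch of the adapter-stitched pattern: the vision tower stays frozen and only the projector learns to map patch features into the LLM's embedding space. The `vision_tower` and `llm` modules and the dimensions are hypothetical stand-ins, not any specific checkpoint.

```python
import torch
import torch.nn as nn

class AdapterStitchedVLM(nn.Module):
    """Toy sketch: frozen vision tower + trainable MLP projector + LLM decoder."""

    def __init__(self, vision_tower: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_tower = vision_tower.eval()        # e.g. a CLIP/SigLIP ViT
        for p in self.vision_tower.parameters():
            p.requires_grad = False                    # vision stays frozen
        self.projector = nn.Sequential(                # LLaVA-1.5-style 2-layer MLP
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                 # decoder-only LM (HF-style)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_tower(pixel_values)  # (B, n_patches, vision_dim)
        image_tokens = self.projector(patch_feats)     # (B, n_patches, llm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens first
        return self.llm(inputs_embeds=inputs)          # condition generation on both
```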
When to use which model (April 2026)
- ·GPT-5 — text + image + audio + video frames; best for general production with Realtime API
- ·Gemini 2.5 Pro — 1M-token context; best for long-video Q&A, audio-out, long-doc analysis
- ·Claude Opus 4.7 — highest image resolution (~3.75 MP); best for dense UI / diagram screenshots
- ·Qwen3-VL-235B — open SOTA on MMMU (80.6) and MathVista (85.8); use via vLLM
- ·Qwen2.5-VL-7B — sweet-spot open VLM; runs on a single A100; native dynamic resolution
- ·Llama 4 Scout — 10M-token context (open); experimental but unique in open weights
- ·Pixtral Large — Mistral's multimodal flagship; good document analytics; EU-friendly
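A minimal call sketch using litellm (covered in the library list below) to send an image to one of the hosted models above; the model string and image URL are placeholders for whatever your provider and litellm version expose.

```python
import litellm

# Placeholder model id and URL; litellm routes the same OpenAI-style payload to any provider.
response = litellm.completion(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this screenshot."},
            {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```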
Best-of-breed libraries
- ·transformers AutoModelForVision2Seq — local VLM inference (sketch after this list)
- ·vLLM — production VLM serving (Qwen, Pixtral, Llama 4)
- ·litellm — multimodal calls across providers (image_url + audio)
- ·LlamaIndex multimodal indexes — clean abstractions for image-page retrieval
- ·ColPali / ColQwen2 — late-interaction PDF retrieval (no OCR, no chunking)
- ·SigLIP 2 — multilingual image embeddings (Feb 2025, 109 langs)
- ·Florence-2-large — region-grounded OCR + dense captioning (MIT-licensed)
- ·Pixeltable — declarative incremental multimodal data ops
- ·Qdrant — vector DB with native ColPali multi-vector support
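A minimal local-inference sketch with `AutoModelForVision2Seq`; the LLaVA-1.5 checkpoint id and its `USER:/ASSISTANT:` prompt template are just one example, and other Vision2Seq checkpoints expect different chat templates.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"   # example checkpoint; any Vision2Seq model works
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("invoice.png")
prompt = "USER: <image>\nWhat is the total amount due? ASSISTANT:"  # LLaVA-1.5 template
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```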
Production guardrails
- ·Always validate outputs with Pydantic — VLMs hallucinate field names too (sketch after this list)
- ·Cap image resolution + count per request — costs scale with pixel tokens
- ·Pre-resize images server-side to model spec (e.g. 2048x2048 max for GPT-5)
- ·Token-budget audio segments (Whisper: <30s chunks for stable transcription)
- ·Enforce JSON schema on the wire AND in code — model + parser
- ·Log original-input hashes for deterministic replay
- ·Run hallucination probes (POPE, RePOPE) on every model upgrade
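A sketch of the resize → validate → log loop, assuming Pydantic v2 and Pillow; the schema fields, 2048-pixel cap, and file name are illustrative, and the VLM call itself is omitted.

```python
import hashlib
from PIL import Image
from pydantic import BaseModel, ValidationError

class InvoiceFields(BaseModel):
    """Schema enforced in code, not just on the wire."""
    vendor: str
    total: float
    currency: str

def preprocess(path: str, max_side: int = 2048) -> Image.Image:
    """Pre-resize server-side so pixel-token cost is capped by us, not the provider."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))      # in-place, preserves aspect ratio
    return img

def parse_or_reject(raw_json: str) -> InvoiceFields:
    """Validate the model's JSON; hallucinated or missing fields raise here."""
    try:
        return InvoiceFields.model_validate_json(raw_json)
    except ValidationError:
        raise                                # retry with a repair prompt or route to review

# Hash the original bytes so the exact input can be replayed deterministically.
with open("invoice.png", "rb") as f:
    input_digest = hashlib.sha256(f.read()).hexdigest()
```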
CHEATSHEET · 02 · Eval & benchmark scorecard · 2026
The 5-benchmark scorecard plus two hallucination probes (use this in CI; gate sketch after the list)
- ·MMMU — college-exam multidomain VQA. Anchor: Qwen3-VL-235B 80.6; frontier 80-85
- ·MathVista — visual math reasoning. Anchor: Qwen3-VL-235B 85.8
- ·Video-MME — 11s to 1hr video Q&A. Anchor: Gemini 2.5 Pro 84.8 (current leader)
- ·DocVQA — document VQA (saturated). Anchor: top models 95+ — use as floor, not differentiator
- ·MMLongBench-Doc — long-document multimodal (still hard). Anchor: Qwen3-VL-235B 57.0
- ·RePOPE — re-annotated POPE for object hallucination (use over original POPE)
- ·HallusionBench — language hallucination + visual illusion probes (still unsolved at frontier)
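A hypothetical CI gate over the scorecard; the floor values below are placeholders to calibrate against your own baseline runs, not published scores.

```python
# Hypothetical floors; calibrate against a baseline run, then fail CI on regression.
SCORECARD_FLOORS = {
    "mmmu": 55.0,
    "mathvista": 60.0,
    "videomme": 60.0,
    "docvqa": 90.0,             # saturated: a floor, not a differentiator
    "mmlongbench_doc": 35.0,
}

def check_scorecard(results: dict[str, float]) -> None:
    """Fail the pipeline if any benchmark drops below its floor."""
    failures = {name: score for name, score in results.items()
                if score < SCORECARD_FLOORS.get(name, 0.0)}
    if failures:
        raise SystemExit(f"Scorecard regression below floor: {failures}")
```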
What to skip / deprecated
- ·Original POPE — superseded by RePOPE (released 22 Apr 2025)
- ·DocVQA alone — saturated; pair with MMLongBench-Doc to differentiate
- ·SEED-Bench — superseded by MMMU-Pro for capability-mix testing
- ·Treating text-only GPT-4 as a 'multimodal baseline' — use GPT-4o or GPT-5 instead
- ·Pixtral 12B — deprecated; use Pixtral Large or Qwen2.5-VL
Library picks for running evals
- ·lmms-eval — runs MMMU/MathVista/MMBench/POPE in one call (open); invocation sketch after this list
- ·vlmeval (VLMEvalKit) — alternative runner; good Hugging Face integration
- ·OpenCompass — multi-bench orchestrator if you're running 10+ at once
- ·Inspect (UK AISI) — for safety-style probes alongside capability evals
- ·A 7B model + the 5-bench scorecard runs on a single A100 in <30 min
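A sketch of driving lmms-eval from Python; the model adapter name, task ids, and flags differ between lmms-eval versions, so treat them as assumptions and check `python -m lmms_eval --help` and the task registry first.

```python
import subprocess

# Flags and task names are illustrative; verify against your installed lmms-eval version.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "qwen2_5_vl",                                   # assumed adapter name
    "--model_args", "pretrained=Qwen/Qwen2.5-VL-7B-Instruct",
    "--tasks", "mmmu_val,mathvista_testmini,pope",             # assumed task ids
    "--batch_size", "1",
    "--output_path", "./eval_logs/",
]
subprocess.run(cmd, check=True)
```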
Latency anchors (production targets)
- ·Voice loop end-to-end: <300ms perceptual ceiling
- ·Deepgram Nova-3 STT first word: 150-184ms
- ·ElevenLabs Flash v2.5 TTS time-to-first-byte: ~75ms
- ·Deepgram Aura-2 TTS: ~90ms
- ·GPT-5 first-token typically 300-700ms — voice agents need streaming + interrupt handling
- ·ColPali retrieval over 25K pages: ~1s with HNSW-GPU