MMOD Course

Multimodal AI

Lessons: 8 modules
Total: 82m full study
Quick: 7m trailer
Projects: 8 docker labs

Skills you'll gain (10)
  • Read VLM model cards critically (Working)

    Identify vision tower / projector / decoder; modality coverage; native vs adapter-stitched; context window and resolution limits — before opening the API.

  • Pick a multimodal model from a 4-axis matrix (Production)

    Decide along modalities × context length × open/closed × cost. Match GPT-5 / Gemini 2.5 Pro / Claude Opus 4.7 / Qwen3-VL / Llama 4 Scout / Pixtral to the job.

  • Build OCR-free document extraction → Pydantic JSON (Production)

    Send base64 page images to GPT-5/Pixtral/Qwen2.5-VL with response_format=PydanticModel. Validate, retry, reject. Drop into AP automation flows.

  • Ship multimodal RAG with ColPali / late interaction (Production)

    ColQwen2 + Qdrant multi-vector index over PDF page images. Skip OCR and chunking entirely. Measured: ~1s search on 25K pages.

  • Engineer a sub-300ms voice loop (Production)

    Deepgram Nova-3 STT (~150ms first word) + GPT-5 streaming + ElevenLabs Flash v2.5 (~75ms TTFB), with VAD-driven barge-in. Or OpenAI Realtime API for the unified path.

  • Process long video with native 1M-token VLMs (Working)

    Gemini 2.5 Pro for hour-long video Q&A (Video-MME 84.8). Frame-sampling fallback with decord/PyAV when content exceeds context.

  • Build cross-modal search at scale (SigLIP 2 + Qdrant) (Production)

    SigLIP 2 NaFlex embeddings (109 langs) → Qdrant HNSW. Billion-scale-ready. Replaces CLIP for any new build.

  • Run a 5-benchmark VLM evaluation in CI (Production)

    MMMU + MathVista + Video-MME + DocVQA + RePOPE on every model swap. lmms-eval orchestrates; HTML report goes to Notion.

  • Detect & defend against multimodal hallucination (Advanced)

    POPE/RePOPE/HallusionBench probes; grounded and cite-the-region prompts; refuse-when-unsure system prompts; eval gates in CI.

  • Deploy a fully air-gapped local multimodal stack (Advanced)

    Ollama + Qwen2.5-VL-7B + whisper-large-v3-turbo + SigLIP 2 — on-prem, GPU-budgeted. The deployment regulated industries actually buy.
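The "Validate, retry, reject" step in the document-extraction skill can be sketched as a small loop. This is a minimal illustration, not the course's implementation: it swaps Pydantic for a standard-library dataclass so it runs without dependencies, and the `Invoice` fields, `call_model` callback, and `ExtractionRejected` exception are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    # Hypothetical target schema; a real flow would define its own fields.
    vendor: str
    invoice_number: str
    total: float


class ExtractionRejected(Exception):
    """Raised when no attempt produced a valid Invoice."""


def parse_invoice(raw: str) -> Invoice:
    """Validate one model response; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"vendor", "invoice_number", "total"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["vendor"], str) or not isinstance(data["invoice_number"], str):
        raise ValueError("vendor and invoice_number must be strings")
    if isinstance(data["total"], bool) or not isinstance(data["total"], (int, float)):
        raise ValueError("total must be a number")
    return Invoice(data["vendor"], data["invoice_number"], float(data["total"]))


def extract_with_retry(call_model, page_image_b64: str, max_attempts: int = 3) -> Invoice:
    """Call the model, validate, retry with the last error, then reject."""
    last_error = None
    for _ in range(max_attempts):
        # call_model is an assumed callback wrapping whichever VLM API is used;
        # it receives the previous validation error so the prompt can include it.
        raw = call_model(page_image_b64, last_error)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            last_error = str(exc)
    raise ExtractionRejected(f"rejected after {max_attempts} attempts: {last_error}")


def make_fake_model(responses):
    """Stand-in for a real VLM call: replays canned responses in order."""
    remaining = iter(responses)

    def call(page_image_b64, previous_error):
        return next(remaining)

    return call


fake_model = make_fake_model([
    "not json",
    '{"vendor": "Acme", "invoice_number": "INV-7", "total": 120.5}',
])
invoice = extract_with_retry(fake_model, "aGVsbG8=")
print(invoice)
```

Feeding the previous validation error back into the next call is one common design choice; a stricter pipeline might instead reject on the first failure and route the page to manual review.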
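The late-interaction scoring behind the ColPali skill can be shown in a few lines. The sketch below implements the MaxSim rule (for each query vector, take its best match among the page's vectors, then sum) in pure Python on toy 2-dimensional vectors. The vectors and page IDs are invented for illustration; in practice the embeddings would come from a model such as ColQwen2 and the search would run inside a multi-vector index.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, page_vecs):
    """Late interaction: sum over query vectors of the max similarity to any page vector."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in page_vecs]
    return sum(max(dot(qv, pv) for pv in p) for qv in q)


def rank_pages(query_vecs, pages):
    """Return (page_id, score) pairs sorted from best to worst."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: two query token vectors, two pages with patch vectors (illustrative only).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page-1": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "page-2": [[1.0, 1.0]],
}
ranking = rank_pages(query, pages)
print(ranking)
```

Here page-1 wins because each query vector finds an exact match among its patches, while page-2 only offers a diagonal vector that partially matches both.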
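The sub-300ms voice loop is easiest to reason about as a latency budget. The arithmetic below uses only the figures quoted in the skill description (~150ms STT first word, ~75ms TTS time to first byte) and assumes, as a simplification, that the stages run strictly one after another with no network overhead.

```python
# Target from the skill title: the whole loop should respond in under 300 ms.
BUDGET_MS = 300

# Approximate stage latencies quoted in the skill description.
stages_ms = {
    "stt_first_word": 150,   # Deepgram Nova-3
    "tts_first_byte": 75,    # ElevenLabs Flash v2.5
}

# Assumption: stages are serial, so whatever remains is the headroom
# for the LLM's first streamed token plus any network hops.
llm_headroom_ms = BUDGET_MS - sum(stages_ms.values())
print(f"LLM first-token headroom: {llm_headroom_ms} ms")
```

Under these assumptions the LLM has roughly 75ms to emit its first token, which is why the pipeline depends on streaming at every stage rather than waiting for complete outputs.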
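For the frame-sampling fallback in the long-video skill, one simple approach is uniform sampling: divide the video into equal segments and take the middle frame of each. The helper below is a hypothetical sketch of that idea (the function name and the frame budget are assumptions); it only computes indices and leaves decoding to a library such as decord or PyAV.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames indices spread evenly across the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the midpoint of each equal-length segment.
    return [int(step * i + step / 2) for i in range(max_frames)]


indices = sample_frame_indices(total_frames=10, max_frames=4)
print(indices)  # [1, 3, 6, 8]
```

Uniform sampling is content-blind; scene-change detection or keyframe selection may preserve more information for the same frame budget.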
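An eval gate for the 5-benchmark CI skill can be as small as a comparison against stored baselines. The sketch below is one possible shape, not the course's pipeline: the `gate` function, the tolerance, and all scores are placeholders invented for illustration, and in a real setup the numbers would be read from the evaluation harness's output.

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 1.0) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for bench, base_score in baseline.items():
        new_score = candidate.get(bench)
        if new_score is None:
            failures.append(f"{bench}: missing score")
        elif new_score < base_score - tolerance:
            failures.append(
                f"{bench}: {new_score:.1f} is more than {tolerance:.1f} below baseline {base_score:.1f}"
            )
    return failures


# Placeholder scores, not real benchmark results.
baseline = {"MMMU": 60.0, "MathVista": 65.0, "Video-MME": 70.0, "DocVQA": 90.0, "RePOPE": 85.0}
candidate = {"MMMU": 61.2, "MathVista": 62.5, "Video-MME": 70.4, "DocVQA": 89.5}

failures = gate(baseline, candidate)
for message in failures:
    print("FAIL", message)
# In CI, a non-empty list would typically end the job with a non-zero exit code.
```

Treating a missing score as a failure keeps a silently skipped benchmark from passing the gate.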
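POPE-style hallucination probes ask yes/no questions about whether an object is present in an image, so scoring reduces to binary classification metrics. The sketch below computes accuracy, precision, recall, F1, and the share of "yes" answers on a toy set of labels and predictions; the data is invented for illustration and the function name is hypothetical.

```python
def score_yes_no(labels: list[str], predictions: list[str]) -> dict:
    """Score yes/no probe answers, treating "yes" as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for gold, pred in pairs if gold == "yes" and pred == "yes")
    fp = sum(1 for gold, pred in pairs if gold == "no" and pred == "yes")
    tn = sum(1 for gold, pred in pairs if gold == "no" and pred == "no")
    fn = sum(1 for gold, pred in pairs if gold == "yes" and pred == "no")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the true positive rate suggests the model
        # tends to affirm objects that are not in the image.
        "yes_ratio": (tp + fp) / len(pairs),
    }


# Toy probe results (illustrative only).
labels = ["yes", "yes", "no", "no", "no", "yes"]
predictions = ["yes", "yes", "yes", "no", "no", "no"]
metrics = score_yes_no(labels, predictions)
print(metrics)
```

False positives are the hallucination signal here: each one is a case where the model claimed to see something that was absent.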