MMOD Course

Multimodal AI

Lessons: 8 modules
Total: 82m full study
Quick: 7m trailer
Projects: 8 docker labs

Hello VLM — image to typed JSON

A 30-line VLM playground that turns any image into a Pydantic-typed answer. Onboarding-grade.

snap/multimodal:hello
Repo · multimodal-hello-vlm
$ git clone https://github.com/snap-dev/multimodal-hello-vlm.git
docker-compose.yml
# docker-compose.yml — hello-vlm
services:
  vlm:
    image: python:3.12-slim
    working_dir: /app
    volumes:
      - ./src:/app/src:ro
      - ./samples:/app/samples:ro
      - ./requirements.txt:/app/requirements.txt:ro
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY:?set OPENAI_API_KEY in your shell}
      VLM_MODEL: ${VLM_MODEL:-gpt-5}
    command: >-
      bash -c "pip install -q -r requirements.txt &&
               python -m src.run --image samples/cat.jpg"
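The repo's `src/run.py` isn't shown here, but the core move in any image-to-JSON script like this is getting a local file into an OpenAI-style vision message. A minimal sketch, assuming a base64 `data:` URL payload (the names `image_to_data_url` and `vision_messages` are illustrative, not the repo's actual helpers):

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    """Encode a local image as a data: URL for an OpenAI-style vision message."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def vision_messages(image_path: str, prompt: str) -> list[dict]:
    """Build a chat messages list: one user turn with a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }]
```

The resulting list can be handed straight to a chat-completions call; routing through litellm is what lets `VLM_MODEL` swap providers without touching this payload.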
Run
~/multimodal-hello-vlm · zsh
$ docker compose run --rm vlm
[hello-vlm] result.summary='...' result.objects=[...] result.safe_for_work=True
What you'll observe
Container exits cleanly within 30 seconds
stdout shows a parsed Description with non-empty summary
result.objects is a non-empty list of strings
result.safe_for_work is a Python bool
Swapping VLM_MODEL=claude-opus-4-7 still works (litellm provider switch)
Lift this to your work

Drop in front of any image-input feature: a CMS that auto-tags uploads, a Slack bot that summarises pasted screenshots, a pre-moderation filter, or an admin tool that reads error screenshots. Change the Pydantic schema in src/schemas.py to match your domain — that's your wire contract.
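The actual contents of `src/schemas.py` aren't reproduced here, but a minimal sketch consistent with the run output above (`summary`, `objects`, `safe_for_work`) could look like this — field names match the output; the descriptions are illustrative:

```python
from pydantic import BaseModel, Field


class Description(BaseModel):
    """Wire contract for the VLM's answer — edit these fields to fit your domain."""

    summary: str = Field(description="One-sentence description of the image")
    objects: list[str] = Field(description="Objects visible in the image")
    safe_for_work: bool = Field(description="False if the image needs moderation")
```

With pydantic v2, the model's raw JSON reply can then be parsed and validated in one step via `Description.model_validate_json(reply)`, which raises a `ValidationError` instead of silently passing malformed output downstream.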