MMOD Course

Multimodal AI

Lessons: 8 modules
Total: 82m full study
Quick: 7m trailer
Projects: 8 docker labs

Hello VLM — image to typed JSON

A 30-line VLM playground that turns any image into a Pydantic-typed answer. Onboarding-grade.

snap/multimodal:hello
Repo · multimodal-hello-vlm
$ git clone https://github.com/snap-dev/multimodal-hello-vlm.git
docker-compose.yml
# docker-compose.yml — hello-vlm
services:
  vlm:
    image: python:3.12-slim
    working_dir: /app
    volumes:
      - ./src:/app/src:ro
      - ./samples:/app/samples:ro
      - ./requirements.txt:/app/requirements.txt:ro
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY:?set OPENAI_API_KEY in your shell}
      VLM_MODEL: ${VLM_MODEL:-gpt-5}
    command: >-
      bash -c "pip install -q -r requirements.txt &&
               python -m src.run --image samples/cat.jpg"
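The repo's `src/run.py` isn't shown here, but the core move in any image-to-JSON script like this is getting a local file into an OpenAI-style vision message. A minimal sketch, assuming a base64 `data:` URL payload (the names `image_to_data_url` and `vision_messages` are illustrative, not the repo's actual helpers):

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    """Encode a local image as a data: URL for an OpenAI-style vision message."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def vision_messages(image_path: str, prompt: str) -> list[dict]:
    """Build a chat messages list: one user turn with a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }]
```

The resulting list can be handed straight to a chat-completions call; routing through litellm is what lets `VLM_MODEL` swap providers without touching this payload.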
Run
~/multimodal-hello-vlm · zsh
$ docker compose run --rm vlm
[hello-vlm] result.summary='...' result.objects=[...] result.safe_for_work=True
What you'll observe
Container exits cleanly within 30 seconds
stdout shows a parsed Description with non-empty summary
result.objects is a non-empty list of strings
result.safe_for_work is a Python bool
Swapping VLM_MODEL=claude-opus-4-7 still works (litellm provider switch)
Lift this to your work

Drop in front of any image-input feature: a CMS that auto-tags uploads, a Slack bot that summarises pasted screenshots, a pre-moderation filter, or an admin tool that reads error screenshots. Change the Pydantic schema in src/schemas.py to match your domain — that's your wire contract.
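The actual contents of `src/schemas.py` aren't reproduced here, but a minimal sketch consistent with the run output above (`summary`, `objects`, `safe_for_work`) could look like this — field names match the output; the descriptions are illustrative:

```python
from pydantic import BaseModel, Field


class Description(BaseModel):
    """Wire contract for the VLM's answer — edit these fields to fit your domain."""

    summary: str = Field(description="One-sentence description of the image")
    objects: list[str] = Field(description="Objects visible in the image")
    safe_for_work: bool = Field(description="False if the image needs moderation")
```

With pydantic v2, the model's raw JSON reply can then be parsed and validated in one step via `Description.model_validate_json(reply)`, which raises a `ValidationError` instead of silently passing malformed output downstream.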