DLM Course

Domain LLM

Lessons · 8 modules
Total · 84m full study
Quick · 7m trailer
Projects · 8 docker labs

RAG vs Fine-tuning vs Hybrid bench harness

Same task, three implementations (prompt / RAG / SFT). One CSV your ADR can cite.

snap/domain-llm:rag-vs-ft
Repo · domain-llm-rag-vs-ft
$ git clone https://github.com/snap-dev/domain-llm-rag-vs-ft.git
docker-compose.yml
# docker-compose.yml — rag-vs-ft-bench
services:
  qdrant:
    image: qdrant/qdrant:v1.13.0
    ports: ["6333:6333"]
    volumes:
      - ./qdrant-data:/qdrant/storage
  vllm:
    image: vllm/vllm-openai:latest
    command: --model Qwen/Qwen2.5-7B-Instruct --enable-lora --lora-modules legal=/app/adapters/legal --max-loras 4 --port 8000
    volumes:
      - ./adapters:/app/adapters:ro
    ports: ["8000:8000"]
    deploy:
      resources:
        reservations:
          devices: [{ capabilities: ["gpu"] }]
  bench:
    image: python:3.12-slim
    working_dir: /app
    depends_on: [qdrant, vllm]
    volumes:
      - ./src:/app/src:ro
      - ./golden:/golden:ro
      - ./out:/out
      - ./requirements.txt:/app/requirements.txt:ro
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY:?}
      QDRANT_URL: http://qdrant:6333
      VLLM_URL: http://vllm:8000/v1
      JUDGE_MODEL: ${JUDGE_MODEL:-claude-opus-4-7}
      TASK: ${TASK:-legal-clause-classify}
    command: bash -c "pip install -q -r requirements.txt && python -m src.bench --task /golden/$${TASK}.json --out /out/decision-report.csv"
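How the bench hits the three approaches, as a minimal sketch — this is not the repo's src/bench.py. The embed helper (fastembed), the "legal-clauses" collection name, and the prompt wording are assumptions for illustration; the one thing taken from the compose file is that vLLM serves the LoRA adapter under the model name "legal".
three_approaches.py (illustrative)
# Sketch: prompt-only and RAG call the base model; SFT targets the LoRA adapter
# registered as "legal" via --lora-modules in the vllm service above.
import os
from openai import OpenAI
from qdrant_client import QdrantClient
from fastembed import TextEmbedding

vllm = OpenAI(base_url=os.environ["VLLM_URL"], api_key="unused")  # vLLM ignores the key
qdrant = QdrantClient(url=os.environ["QDRANT_URL"])
embedder = TextEmbedding()  # default small English embedding model (assumption)

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"

def ask(model: str, prompt: str) -> str:
    out = vllm.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return out.choices[0].message.content

def prompt_only(question: str) -> str:
    # Approach 1: base model, the task prompt and nothing else.
    return ask(BASE_MODEL, question)

def rag(question: str) -> str:
    # Approach 2: same base model, plus top-k chunks retrieved from Qdrant.
    vector = list(next(embedder.embed([question])))
    hits = qdrant.query_points(collection_name="legal-clauses", query=vector, limit=4).points
    context = "\n\n".join(h.payload["text"] for h in hits)
    return ask(BASE_MODEL, f"Context:\n{context}\n\nQuestion: {question}")

def sft(question: str) -> str:
    # Approach 3: the fine-tuned path — vLLM routes requests whose model name
    # matches a --lora-modules entry through that LoRA adapter.
    return ask("legal", question)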
Run
~/domain-llm-rag-vs-ft · zsh
$ docker compose up -d qdrant vllm && docker compose run --rm bench
[bench] prompt_only: ... rag: ... sft: ... · winner=...
What you'll observe
qdrant + vllm come up healthy with the legal LoRA loaded
out/decision-report.csv has one row per (approach × golden item) plus aggregate rows
Aggregate rows per approach: accuracy, latency_p50/p95, $/call, win rate
JUDGE_MODEL is used for open-ended outputs when accuracy isn't a hard match
Adding a new task is one JSON file in /golden
The bench container exits after printing the winning approach, leaving the CSV at /out
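Rolling the per-item rows up into those aggregates is a few lines of pandas. A sketch, assuming per-item columns named approach, item_id, correct, score, latency_ms and cost_usd — the repo's actual column names may differ, so check the CSV header first.
aggregate_sketch.py (illustrative)
# Sketch: compute per-approach accuracy, latency p50/p95, $/call and win rate
# from the per-item rows. Column names are assumptions, not the repo's schema.
import pandas as pd

rows = pd.read_csv("out/decision-report.csv")
per_item = rows[rows["approach"].isin(["prompt_only", "rag", "sft"])]

summary = per_item.groupby("approach").agg(
    accuracy=("correct", "mean"),
    latency_p50=("latency_ms", lambda s: s.quantile(0.50)),
    latency_p95=("latency_ms", lambda s: s.quantile(0.95)),
    cost_per_call=("cost_usd", "mean"),
)

# Win rate: share of golden items where the approach matched the best score on
# that item (ties count as a win for every tied approach).
best = per_item.groupby("item_id")["score"].transform("max")
per_item = per_item.assign(win=per_item["score"] >= best)
summary["win_rate"] = per_item.groupby("approach")["win"].mean()

print(summary.sort_values("accuracy", ascending=False))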
Lift this to your work

Use this as the evidence pack for any 'should we fine-tune?' debate. Drop your golden set into /golden, set TASK to the filename, and share /out/decision-report.csv with eng + product. It closes the debate without a 90-minute meeting. For new domains, copy the golden-set JSON schema (sketched below) and add 50-200 hand-labelled items.
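A sketch of what a new golden task file could look like. The field names (task, answer_type, labels, items, id, input, label) and the task name are assumptions for illustration — mirror whatever the shipped files in /golden actually use before copying.
make_golden_task.py (illustrative)
# Sketch: write a new golden task file that the bench can pick up via TASK.
# The schema below is an assumption, not the repo's documented format.
import json

task = {
    "task": "contract-renewal-classify",  # hypothetical task name
    "answer_type": "label",               # hard-match labels vs judge-scored free text
    "labels": ["auto_renewal", "manual_renewal", "no_renewal"],
    "items": [
        {
            "id": "cr-001",
            "input": "This Agreement shall renew automatically for successive one-year terms...",
            "label": "auto_renewal",
        },
        # ... add 50-200 hand-labelled items
    ],
}

with open("golden/contract-renewal-classify.json", "w") as f:
    json.dump(task, f, indent=2)
Then TASK=contract-renewal-classify docker compose run --rm bench reruns the comparison against the new golden set.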