DLM Course

Domain LLM

Lessons: 8 modules
Total: 84m (full study)
Quick: 7m (trailer)
Projects: 8 Docker labs
CHEATSHEET · 01 · Domain LLM · master cheatsheet
The decision matrix (in order)
  • Knowledge that changes weekly + needs citations → RAG
  • Output format / persona / structure won't stick after 50 prompt tries → SFT (usually LoRA)
  • Tone / refusal / safety alignment → DPO or KTO
  • Multi-step reasoning with verifiable reward (SQL, math, tests) → GRPO (RLVR)
  • Domain has its own vocabulary the base model never saw → CPT, then SFT
  • Knowledge AND voice → RAG + light SFT (the dominant 2026 production pattern)
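The matrix above is first-match-wins, which is easy to encode directly. A toy sketch: the signal names and the `choose_method` helper are illustrative, not any real library's API.

```python
# Toy encoding of the decision matrix, in priority order.
# Signal names are illustrative labels, not a real API.
RULES = [
    ("fresh_knowledge_with_citations", "RAG"),
    ("format_or_persona_wont_stick",   "SFT (usually LoRA)"),
    ("tone_refusal_safety",            "DPO or KTO"),
    ("verifiable_multistep_reward",    "GRPO (RLVR)"),
    ("unseen_domain_vocabulary",       "CPT, then SFT"),
    ("knowledge_and_voice",            "RAG + light SFT"),
]

def choose_method(signals):
    """Return the first matching adaptation method, in matrix order."""
    for signal, method in RULES:
        if signal in signals:
            return method
    return "prompting only (no training needed)"
```

Because the rules are ordered, a project that needs both fresh knowledge and tone work resolves to RAG first; you would then layer DPO on top as a second pass.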
Tooling picks (April 2026)
  • Single-GPU SFT/DPO → Unsloth + TRL (2x faster, 70% less VRAM)
  • Multi-node SFT → Axolotl + DeepSpeed ZeRO-3 OR torchtune + FSDP
  • Synthetic data → Distilabel + Magpie + Argilla curation
  • Eval → lm-eval-harness + Inspect AI + Ragas (RAG-specific)
  • Serving + LoRA hot-swap → vLLM `--enable-lora --max-loras N`
  • On-prem demo → Ollama (merged model) + Qdrant + Streamlit / Next.js UI
  • Training infra → SkyPilot / Modal / RunPod / Lambda / Together
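The LoRA hot-swap line expands to a launch command like the following. Adapter names and paths are placeholders; the flags come from vLLM's LoRA serving options, so verify them against your installed version.

```shell
# Serve one base model with multiple hot-swappable LoRA adapters.
# Adapter names/paths are placeholders for your own artifacts.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 16 \
  --lora-modules legal-sft=/adapters/legal-sft finance-sft=/adapters/finance-sft

# Per request, select an adapter by passing its name as the model:
#   {"model": "legal-sft", "messages": [...]}
```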
Hyperparameter defaults (sane starting points)
  • LoRA: r=16, alpha=32, dropout=0.05
  • QLoRA: 4-bit NF4 + double quantization
  • Target modules: all attention (q,k,v,o) + all MLP (gate,up,down)
  • SFT: lr 2e-4, cosine schedule, 3 epochs, warmup 10 steps, packing on
  • DPO: beta 0.1 (KL coefficient), lr 5e-7, 1-2 epochs
  • GRPO: group size 8, beta 0.04, KL clip 0.2
  • Train on completion only: mask the prompt tokens
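Collected as plain dicts, the defaults above look like this. The field names are my assumed mapping onto the peft `LoraConfig` and TRL `SFTConfig`/`DPOConfig`/`GRPOConfig` arguments they would feed; nothing here imports those libraries, so check the names against the versions you install.

```python
# Sane starting points from the cheatsheet, as plain dicts.
# Keys mirror peft/TRL config arguments (assumed mapping, verify).
LORA = dict(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention
                    "gate_proj", "up_proj", "down_proj"],     # MLP
)
SFT = dict(learning_rate=2e-4, lr_scheduler_type="cosine",
           num_train_epochs=3, warmup_steps=10, packing=True)
DPO = dict(beta=0.1, learning_rate=5e-7, num_train_epochs=2)
GRPO = dict(num_generations=8, beta=0.04, epsilon=0.2)  # group size, KL coeff, clip
```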
Data curation — the load-bearing skill
  • Quality > quantity. 1K great pairs beat 10K mediocre ones.
  • Dedupe with MinHash + LSH BEFORE training (`datasketch`).
  • Filter with a judge LLM (Claude/GPT) for instruction-following + correctness.
  • For SFT: include refusals + edge cases or the model loses safety behavior.
  • For DPO: chosen/rejected pairs from real user feedback > synthetic.
  • For CPT: dedupe + quality-filter the corpus aggressively (datatrove).
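In production you would use `datasketch`'s `MinHash`/`MinHashLSH` as the list says; to show what the technique actually does, here is a minimal stdlib-only sketch: hash word shingles under many seeds, keep the minimum per seed, and estimate Jaccard similarity from signature agreement.

```python
import hashlib
import re

def shingles(text, k=3):
    """Lowercased word k-grams: the set whose overlap we estimate."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text, num_perm=64):
    """Signature = min of a seeded 64-bit hash over all shingles, per seed."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles(text)))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate training pairs score high, unrelated ones near zero; `datasketch`'s LSH index then makes the pairwise lookup sub-linear so it scales to millions of rows.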
CHEATSHEET · 02 · Domain evals · benchmark scorecard 2026
Generic capability anchors
  • MMLU-Pro (~80-85% frontier) — multidomain reasoning
  • GPQA-Diamond (~70-80% frontier) — graduate-level science
  • IFEval (~85-90% frontier) — instruction-following
  • HLE / Humanity's Last Exam (<30% frontier as of late 2025) — reasoning ceiling
  • ARC-AGI-2 — intentionally hard for current LLMs
  • LMArena Elo — human preference (most-cited public leaderboard)
Vertical / domain benchmarks
  • Legal: LegalBench (Stanford CRFM, 162 tasks), LexGLUE, MMLU jurisprudence
  • Medical: HealthBench (OpenAI 2025) — physician-rated; supersedes MedQA-USMLE for serious work
  • Finance: FinanceBench (Patronus), FLUE, ConvFinQA, FinQA
  • Code: SWE-Bench Verified (~65-72% frontier 2026), LiveCodeBench (refreshed monthly), BigCodeBench
  • Math: AIME 2024/2025 — saturated; use the HLE math subset for differentiation
  • RAG faithfulness: Ragas faithfulness + answer-relevancy + context-precision
What to skip / deprecated as primary
  • MMLU original — too saturated to differentiate frontier models (use Pro)
  • MedQA-USMLE alone — saturated; pair with HealthBench-Hard
  • TruthfulQA — partly contaminated; use HaluEval / Ragas instead
  • MT-Bench — superseded by Arena-Hard and WildBench for LLM-judge benches
  • HumanEval / MBPP — floor tests; use SWE-Bench Verified or LiveCodeBench
Eval hygiene rules
  • Hold out a domain golden set the trainer NEVER sees (200-500 items).
  • Score with a frontier judge (Claude Opus 4.7 / GPT-5) — pin the judge version.
  • Run pairwise (chosen vs rejected) when possible — robust to score-scale drift.
  • Gate CI on regression: a drop of more than 2pp on any benchmark blocks the merge.
  • Re-eval on every model upgrade AND on every dataset version bump.
  • Track contamination: rotate held-out splits quarterly.
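The -2pp CI gate above fits in a few lines. A minimal sketch, assuming scores are percentages keyed by benchmark name; the `gate` function and threshold constant are illustrative, not part of any eval harness.

```python
# Block the merge if any benchmark regresses by more than 2 percentage
# points versus the pinned baseline. Names here are illustrative.
THRESHOLD_PP = 2.0

def gate(baseline, candidate):
    """Return (ok, regressions); regressions lists (benchmark, drop_pp)."""
    regressions = [
        (name, baseline[name] - candidate.get(name, 0.0))
        for name in baseline
        if baseline[name] - candidate.get(name, 0.0) > THRESHOLD_PP
    ]
    return (not regressions, regressions)
```

Wire this into CI so the baseline file only updates on an approved merge; re-running the gate on dataset version bumps (per the rule above) then comes for free.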