CHEATSHEET 01 · Domain LLM · master cheatsheet
The decision matrix (in order)
- Knowledge that changes weekly + needs citations → RAG
- Output format / persona / structure won't stick after 50 prompt tries → SFT (usually LoRA)
- Tone / refusal / safety alignment → DPO or KTO
- Multi-step reasoning with verifiable reward (SQL, math, tests) → GRPO (RLVR)
- Domain has its own vocabulary the base model never saw → CPT, then SFT
- Knowledge AND voice → RAG + light SFT (the dominant 2026 production pattern)
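Because the matrix is checked in order (first match wins), it collapses to a few ordered rules. A minimal sketch; the signal names and fallback are illustrative, not any library's API:

```python
# Illustrative only: the decision matrix above as ordered rules, first match wins.
def pick_adaptation(signals: set[str]) -> str:
    rules = [
        ({"fresh_knowledge", "citations"}, "RAG"),
        ({"format_wont_stick"}, "SFT (usually LoRA)"),
        ({"tone_or_safety"}, "DPO or KTO"),
        ({"verifiable_reward"}, "GRPO (RLVR)"),
        ({"unseen_vocabulary"}, "CPT, then SFT"),
        ({"knowledge", "voice"}, "RAG + light SFT"),
    ]
    for needed, method in rules:
        if needed <= signals:  # all required signals present
            return method
    return "prompting is probably enough"

print(pick_adaptation({"fresh_knowledge", "citations"}))  # -> RAG
```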
Tooling picks (April 2026)
- Single-GPU SFT/DPO → Unsloth + TRL (2x faster, 70% less VRAM)
- Multi-node SFT → Axolotl + DeepSpeed ZeRO-3 OR torchtune + FSDP
- Synthetic data → Distilabel + Magpie + Argilla curation
- Eval → lm-eval-harness + Inspect AI + Ragas (RAG-specific)
- Serving + LoRA hot-swap → vLLM `--enable-lora --max-loras N` (see the serving sketch after this list)
- On-prem demo → Ollama (merged model) + Qdrant + Streamlit / Next.js UI
- Training infra → SkyPilot / Modal / RunPod / Lambda / Together
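LoRA hot-swap in practice: one base model resident, adapters selected per request. A sketch on vLLM's offline Python API (the model name and adapter paths are placeholders); the server equivalent is `vllm serve` with the flags above:

```python
# Sketch: one base model, per-request LoRA adapters (vLLM offline API).
# Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_lora=True, max_loras=4)  # up to 4 adapters resident at once

params = SamplingParams(temperature=0.0, max_tokens=128)

# Each LoRARequest pins (name, unique int id, adapter path); loaded on demand.
legal = LoRARequest("legal-sft", 1, "/adapters/legal-lora")
medical = LoRARequest("medical-sft", 2, "/adapters/medical-lora")

print(llm.generate(["Summarize this clause:"], params,
                   lora_request=legal)[0].outputs[0].text)
print(llm.generate(["Triage this note:"], params,
                   lora_request=medical)[0].outputs[0].text)
```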
Hyperparameter defaults (sane starting points)
- LoRA: r=16, alpha=32, dropout=0.05
- QLoRA: 4-bit NF4 + double quantization
- Target modules: all attention (q,k,v,o) + all MLP (gate,up,down)
- SFT: lr 2e-4, cosine schedule, 3 epochs, warmup 10 steps, packing on
- DPO: beta 0.1 (strength of the implicit KL penalty against the reference model), lr 5e-7, 1-2 epochs
- GRPO: group size 8, beta 0.04 (KL coefficient), clip range 0.2
- Train on completion only: mask the prompt tokens (see the config sketch below)
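These defaults map directly onto peft/TRL configs. A minimal sketch assuming recent TRL/peft releases; the model name and dataset path are placeholders, and exact parameter names vary by TRL version:

```python
# Sketch: the defaults above as peft/TRL configs. Placeholder model/dataset;
# parameter names follow recent TRL releases and may differ in yours.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer, DPOConfig, GRPOConfig

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # all attention
                    "gate_proj", "up_proj", "down_proj"],     # all MLP
)

sft_args = SFTConfig(
    output_dir="sft-out",
    learning_rate=2e-4, lr_scheduler_type="cosine",
    num_train_epochs=3, warmup_steps=10,
    packing=True,  # pack short examples into full-length sequences
)

# Prompt/completion-format data lets recent TRL compute loss on the
# completion only (prompt tokens masked), per the last bullet above.
ds = load_dataset("json", data_files="train_pairs.jsonl", split="train")

trainer = SFTTrainer(model="meta-llama/Llama-3.1-8B-Instruct",
                     args=sft_args, train_dataset=ds, peft_config=lora)
trainer.train()

# Same defaults for the preference/RL stages:
dpo_args = DPOConfig(output_dir="dpo-out", beta=0.1,
                     learning_rate=5e-7, num_train_epochs=1)
grpo_args = GRPOConfig(output_dir="grpo-out", num_generations=8,
                       beta=0.04, epsilon=0.2)  # clip range
```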
Data curation — the load-bearing skill
- Quality > quantity: 1K great pairs beat 10K mediocre ones.
- Dedupe with MinHash + LSH BEFORE training (`datasketch`; sketch after this list).
- Filter with a judge LLM (Claude/GPT) for instruction-following + correctness.
- For SFT: include refusals + edge cases or the model loses safety behavior.
- For DPO: chosen/rejected pairs from real user feedback > synthetic.
- For CPT: dedupe + quality-filter the corpus aggressively (datatrove).
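The MinHash + LSH dedupe bullet as a minimal `datasketch` sketch; the Jaccard threshold and word-level shingling are illustrative choices, not prescribed values:

```python
# Sketch: drop near-duplicates before training with MinHash + LSH (datasketch).
# Threshold 0.8 and word-level shingles are illustrative choices.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def dedupe(docs: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if lsh.query(m):          # near-duplicate of an already-kept doc
            continue
        lsh.insert(f"doc-{i}", m)
        kept.append(doc)
    return kept
```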
CHEATSHEET 02 · Domain evals · benchmark scorecard 2026
Generic capability anchors
- MMLU-Pro (~80-85% frontier) — multidomain reasoning
- GPQA-Diamond (~70-80% frontier) — graduate-level science
- IFEval (~85-90% frontier) — instruction-following
- HLE / Humanity's Last Exam (<30% frontier as of late 2025) — reasoning ceiling
- ARC-AGI-2 — intentionally hard for current LLMs
- LMArena Elo — human preference (most-cited public leaderboard)
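To anchor a fine-tune against these, lm-eval-harness (from the tooling list) has a Python entry point. A sketch: the checkpoint path is a placeholder, and task names like `mmlu_pro` and `ifeval` follow recent harness versions, so verify them locally:

```python
# Sketch: score a checkpoint on capability anchors with lm-eval-harness.
# Task names follow recent harness versions; the checkpoint is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./sft-out,dtype=bfloat16",
    tasks=["mmlu_pro", "ifeval"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)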
Vertical / domain benchmarks
- Legal: LegalBench (Stanford CRFM, 162 tasks), LexGLUE, MMLU jurisprudence
- Medical: HealthBench (OpenAI 2025) — physician-rated; supersedes MedQA-USMLE for serious work
- Finance: FinanceBench (Patronus), FLUE, ConvFinQA, FinQA
- Code: SWE-Bench Verified (~65-72% frontier 2026), LiveCodeBench (refreshed monthly), BigCodeBench
- Math: AIME 2024/2025 — saturated; use the HLE math subset for differentiation
- RAG faithfulness: Ragas faithfulness + answer-relevancy + context-precision (sketch below)
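For the RAG row, Ragas scores per sample given question, answer, retrieved contexts, and a reference. A sketch using the classic metrics API; the column schema follows pre-1.0 Ragas (newer versions use `SingleTurnSample` plus an explicit LLM wrapper), so check your version:

```python
# Sketch: Ragas faithfulness / answer relevancy / context precision.
# Column schema follows older (pre-1.0) Ragas; newer versions differ.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

ds = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["30 days, per section 4.2."],
    "contexts": [["Section 4.2: either party may terminate with 30 days notice."]],
    "ground_truth": ["30 days"],
})

scores = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```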
What to skip / deprecated as primary
- MMLU original — too saturated to differentiate frontier models (use MMLU-Pro)
- MedQA-USMLE alone — saturated; pair with HealthBench-Hard
- TruthfulQA — partly contaminated; use HaluEval / Ragas instead
- MT-Bench — superseded by Arena-Hard and WildBench for LLM-judge benches
- HumanEval / MBPP — floor tests; use SWE-Bench Verified or LiveCodeBench
Eval hygiene rules
- Hold out a domain golden set the trainer NEVER sees (200-500 items).
- Score with a frontier judge (Claude Opus 4.7 / GPT-5) — pin the version.
- Run pairwise (chosen vs rejected) when possible — robust to scale drift.
- Gate CI on regression: a drop of 2pp or more on any benchmark blocks the merge (sketch below).
- Re-eval on every model upgrade AND on every dataset version bump.
- Track contamination: rotate held-out splits quarterly.
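The regression gate is a few lines in CI. A hedged sketch: the file names and the name-to-score JSON format are assumptions for illustration:

```python
# Sketch: block the merge if any benchmark drops >= 2 percentage points.
# baseline_scores.json / current_scores.json map benchmark name -> score
# (0-100); both file names and the format are assumptions.
import json
import sys

THRESHOLD_PP = 2.0

with open("baseline_scores.json") as f:
    baseline = json.load(f)
with open("current_scores.json") as f:
    current = json.load(f)

regressions = [
    f"{name}: {baseline[name]:.1f} -> {score:.1f}"
    for name, score in current.items()
    if name in baseline and baseline[name] - score >= THRESHOLD_PP
]

if regressions:
    print("Benchmark regressions, blocking merge:\n" + "\n".join(regressions))
    sys.exit(1)
print("No regressions >= 2pp; gate passed.")
```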