CHEATSHEET 01 · Domain LLM · master cheatsheet
The decision matrix (in order)
- Knowledge that changes weekly + needs citations → RAG
- Output format / persona / structure won't stick after 50 prompt tries → SFT (usually LoRA)
- Tone / refusal / safety alignment → DPO or KTO
- Multi-step reasoning with verifiable reward (SQL, math, tests) → GRPO (RLVR)
- Domain has its own vocabulary the base model never saw → CPT, then SFT
- Knowledge AND voice → RAG + light SFT (the dominant 2026 production pattern)
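Because the matrix is checked in order (first match wins), it collapses to a few ordered rules. A minimal sketch; the signal names and fallback are illustrative, not any library's API:

```python
# Illustrative only: the decision matrix above as ordered rules, first match wins.
def pick_adaptation(signals: set[str]) -> str:
    rules = [
        ({"fresh_knowledge", "citations"}, "RAG"),
        ({"format_wont_stick"}, "SFT (usually LoRA)"),
        ({"tone_or_safety"}, "DPO or KTO"),
        ({"verifiable_reward"}, "GRPO (RLVR)"),
        ({"unseen_vocabulary"}, "CPT, then SFT"),
        ({"knowledge", "voice"}, "RAG + light SFT"),
    ]
    for needed, method in rules:
        if needed <= signals:  # all required signals present
            return method
    return "prompting is probably enough"

print(pick_adaptation({"fresh_knowledge", "citations"}))  # -> RAG
```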
Tooling picks (April 2026)
- Single-GPU SFT/DPO → Unsloth + TRL (2x faster, 70% less VRAM)
- Multi-node SFT → Axolotl + DeepSpeed ZeRO-3 OR torchtune + FSDP
- Synthetic data → Distilabel + Magpie + Argilla curation
- Eval → lm-eval-harness + Inspect AI + Ragas (RAG-specific)
- Serving + LoRA hot-swap → vLLM `--enable-lora --max-loras N` (see the serving sketch after this list)
- On-prem demo → Ollama (merged model) + Qdrant + Streamlit / Next.js UI
- Training infra → SkyPilot / Modal / RunPod / Lambda / Together
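LoRA hot-swap in practice: one base model resident, adapters selected per request. A sketch on vLLM's offline Python API (the model name and adapter paths are placeholders); the server equivalent is `vllm serve` with the flags above:

```python
# Sketch: one base model, per-request LoRA adapters (vLLM offline API).
# Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_lora=True, max_loras=4)  # up to 4 adapters resident at once

params = SamplingParams(temperature=0.0, max_tokens=128)

# Each LoRARequest pins (name, unique int id, adapter path); loaded on demand.
legal = LoRARequest("legal-sft", 1, "/adapters/legal-lora")
medical = LoRARequest("medical-sft", 2, "/adapters/medical-lora")

print(llm.generate(["Summarize this clause:"], params,
                   lora_request=legal)[0].outputs[0].text)
print(llm.generate(["Triage this note:"], params,
                   lora_request=medical)[0].outputs[0].text)
```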
Hyperparameter defaults (sane starting points)
- LoRA: r=16, alpha=32, dropout=0.05
- QLoRA: 4-bit NF4 + double quantization
- Target modules: all attention (q,k,v,o) + all MLP (gate,up,down)
- SFT: lr 2e-4, cosine schedule, 3 epochs, warmup 10 steps, packing on
- DPO: beta 0.1 (strength of the implicit KL penalty against the reference model), lr 5e-7, 1-2 epochs
- GRPO: group size 8, beta 0.04 (KL coefficient), clip range 0.2
- Train on completion only: mask the prompt tokens (see the config sketch below)
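These defaults map directly onto peft/TRL configs. A minimal sketch assuming recent TRL/peft releases; the model name and dataset path are placeholders, and exact parameter names vary by TRL version:

```python
# Sketch: the defaults above as peft/TRL configs. Placeholder model/dataset;
# parameter names follow recent TRL releases and may differ in yours.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer, DPOConfig, GRPOConfig

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # all attention
                    "gate_proj", "up_proj", "down_proj"],     # all MLP
)

sft_args = SFTConfig(
    output_dir="sft-out",
    learning_rate=2e-4, lr_scheduler_type="cosine",
    num_train_epochs=3, warmup_steps=10,
    packing=True,  # pack short examples into full-length sequences
)

# Prompt/completion-format data lets recent TRL compute loss on the
# completion only (prompt tokens masked), per the last bullet above.
ds = load_dataset("json", data_files="train_pairs.jsonl", split="train")

trainer = SFTTrainer(model="meta-llama/Llama-3.1-8B-Instruct",
                     args=sft_args, train_dataset=ds, peft_config=lora)
trainer.train()

# Same defaults for the preference/RL stages:
dpo_args = DPOConfig(output_dir="dpo-out", beta=0.1,
                     learning_rate=5e-7, num_train_epochs=1)
grpo_args = GRPOConfig(output_dir="grpo-out", num_generations=8,
                       beta=0.04, epsilon=0.2)  # clip range
```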
Data curation — the load-bearing skill
- Quality > quantity: 1K great pairs beat 10K mediocre ones.
- Dedupe with MinHash + LSH BEFORE training (`datasketch`; sketch after this list).
- Filter with a judge LLM (Claude/GPT) for instruction-following + correctness.
- For SFT: include refusals + edge cases or the model loses safety behavior.
- For DPO: chosen/rejected pairs from real user feedback > synthetic.
- For CPT: dedupe + quality-filter the corpus aggressively (datatrove).
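The MinHash + LSH dedupe bullet as a minimal `datasketch` sketch; the Jaccard threshold and word-level shingling are illustrative choices, not prescribed values:

```python
# Sketch: drop near-duplicates before training with MinHash + LSH (datasketch).
# Threshold 0.8 and word-level shingles are illustrative choices.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def dedupe(docs: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if lsh.query(m):          # near-duplicate of an already-kept doc
            continue
        lsh.insert(f"doc-{i}", m)
        kept.append(doc)
    return kept
```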
CHEATSHEET 02 · Domain evals · benchmark scorecard 2026
Generic capability anchors
- MMLU-Pro (~80-85% frontier) — multidomain reasoning
- GPQA-Diamond (~70-80% frontier) — graduate-level science
- IFEval (~85-90% frontier) — instruction-following
- HLE / Humanity's Last Exam (<30% frontier as of late 2025) — reasoning ceiling
- ARC-AGI-2 — intentionally hard for current LLMs
- LMArena Elo — human preference (most-cited public leaderboard)
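To anchor a fine-tune against these, lm-eval-harness (from the tooling list) has a Python entry point. A sketch: the checkpoint path is a placeholder, and task names like `mmlu_pro` and `ifeval` follow recent harness versions, so verify them locally:

```python
# Sketch: score a checkpoint on capability anchors with lm-eval-harness.
# Task names follow recent harness versions; the checkpoint is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./sft-out,dtype=bfloat16",
    tasks=["mmlu_pro", "ifeval"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)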
Vertical / domain benchmarks
- Legal: LegalBench (Stanford CRFM, 162 tasks), LexGLUE, MMLU jurisprudence
- Medical: HealthBench (OpenAI 2025) — physician-rated; supersedes MedQA-USMLE for serious work
- Finance: FinanceBench (Patronus), FLUE, ConvFinQA, FinQA
- Code: SWE-Bench Verified (~65-72% frontier 2026), LiveCodeBench (refreshed monthly), BigCodeBench
- Math: AIME 2024/2025 — saturated; use the HLE math subset for differentiation
- RAG faithfulness: Ragas faithfulness + answer-relevancy + context-precision (sketch below)
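For the RAG row, Ragas scores per sample given question, answer, retrieved contexts, and a reference. A sketch using the classic metrics API; the column schema follows pre-1.0 Ragas (newer versions use `SingleTurnSample` plus an explicit LLM wrapper), so check your version:

```python
# Sketch: Ragas faithfulness / answer relevancy / context precision.
# Column schema follows older (pre-1.0) Ragas; newer versions differ.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

ds = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["30 days, per section 4.2."],
    "contexts": [["Section 4.2: either party may terminate with 30 days notice."]],
    "ground_truth": ["30 days"],
})

scores = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```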
What to skip / deprecated as primary
- MMLU original — too saturated to differentiate frontier models (use MMLU-Pro)
- MedQA-USMLE alone — saturated; pair with HealthBench-Hard
- TruthfulQA — partly contaminated; use HaluEval / Ragas instead
- MT-Bench — superseded by Arena-Hard and WildBench for LLM-judge benches
- HumanEval / MBPP — floor tests; use SWE-Bench Verified or LiveCodeBench
Eval hygiene rules
- Hold out a domain golden set the trainer NEVER sees (200-500 items).
- Score with a frontier judge (Claude Opus 4.7 / GPT-5) — pin the version.
- Run pairwise (chosen vs rejected) when possible — robust to scale drift.
- Gate CI on regression: a drop of 2pp or more on any benchmark blocks the merge (sketch below).
- Re-eval on every model upgrade AND on every dataset version bump.
- Track contamination: rotate held-out splits quarterly.
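The regression gate is a few lines in CI. A hedged sketch: the file names and the name-to-score JSON format are assumptions for illustration:

```python
# Sketch: block the merge if any benchmark drops >= 2 percentage points.
# baseline_scores.json / current_scores.json map benchmark name -> score
# (0-100); both file names and the format are assumptions.
import json
import sys

THRESHOLD_PP = 2.0

with open("baseline_scores.json") as f:
    baseline = json.load(f)
with open("current_scores.json") as f:
    current = json.load(f)

regressions = [
    f"{name}: {baseline[name]:.1f} -> {score:.1f}"
    for name, score in current.items()
    if name in baseline and baseline[name] - score >= THRESHOLD_PP
]

if regressions:
    print("Benchmark regressions, blocking merge:\n" + "\n".join(regressions))
    sys.exit(1)
print("No regressions >= 2pp; gate passed.")
```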