CHEATSHEET · 01 · Pick the right eval for the job
Capability — does it do the task?
- **SWE-bench Verified** (500 human-verified GH issues) — coding agents.
- **GPQA-Diamond** (198 PhD-level science questions) — reasoning ceiling.
- **FrontierMath** (~300 research-grade problems) — math frontier.
- **ARC-AGI-2** (visual abstract reasoning) — generalisation gap.
- **MMLU-Pro** (12k questions × 10 options) — broad knowledge, post-MMLU saturation.
- **Aider polyglot** (225 Exercism exercises × 6 languages) — code-edit accuracy in a loop.
RAG / agent — does it ground & use tools?
- **Ragas** — faithfulness, answer-relevance, context-precision/recall, noise-sensitivity.
- **TruLens RAG Triad** — groundedness, answer-relevance, context-relevance.
- **DeepEval** — pytest-style RAG metrics for CI assertions (see the sketch after this list).
- **LongBench v2** — long-context reasoning up to 128k tokens.
- **GAIA** — general assistant tasks; humans ~92% vs GPT-4 + plugins ~15% at launch.
- **AgentBench v2** — 8 environments; **ToolBench / StableToolBench** — 16k+ APIs.
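A minimal sketch of a DeepEval-style CI assertion, using the `LLMTestCase` / `assert_test` names from DeepEval's docs; the inputs and thresholds here are hypothetical:

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_answer():
    case = LLMTestCase(
        input="What is our refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        retrieval_context=["Our policy allows refunds within 30 days."],
    )
    # assert_test raises if any metric falls below its threshold,
    # so a plain `pytest` run fails the build on a RAG regression.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7),
                       FaithfulnessMetric(threshold=0.7)])
```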
Safety — does it refuse the right things?
- **MLCommons AILuminate v1.0** — 12 hazard categories; five-tier grades from Poor to Excellent.
- **HarmBench** — 510 harmful behaviours, attacker–defender protocol.
- **JailbreakBench** — 100 behaviours, versioned attacks (PAIR, GCG).
- **AgentHarm** (UK AISI) — 110 harmful agent tasks across 11 categories.
- **BBQ / BBNLI** — stereotype/social-bias benchmarks for QA & NLI.
- **TruthfulQA** — imitative falsehoods (not general factuality).
Red-team — can adversaries break it?
- **PyRIT 0.5+** — orchestrator with a target/attacker/scorer triad and multi-turn memory.
- **Garak 0.10+** — 60+ probes with CVE-style probe IDs ('nmap for LLMs').
- **Promptfoo red-team** — YAML-driven harm packs, OWASP LLM Top 10 coverage.
- Map findings to **MITRE ATLAS** for a defensible taxonomy.
Frontier-risk — at scale, catastrophic uplift?
- **Anthropic RSP v3** (Feb 2026) — ASL-1 → ASL-4 tier gates; bioweapon, cyber, autonomy, and AI R&D thresholds.
- **OpenAI Preparedness v2** (Apr 2025) — Bio/Chem, Cyber, AI Self-Improvement; High / Critical thresholds.
- **DeepMind FSF v3** (Apr 2026) — critical capability levels (CCLs) across CBRN, cyber, ML R&D, deceptive alignment.
- **METR HCAST + RE-Bench** — autonomy time-horizon (doubling roughly every 7 months).
- **UK / US AISI Inspect suites** — pre-deployment access agreements.
CHEATSHEET · 02 · Statistical rigor: read leaderboards critically
Bootstrap confidence intervals
- Resample with replacement 1,000–10,000 times; BCa (bias-corrected and accelerated) intervals preferred.
- 200 items → roughly ±5–7pp 95% CI at 70–90% accuracy; 500 items → roughly ±3.5pp.
- Always report the CI alongside the mean, in tables AND plots.
- Use `scipy.stats.bootstrap` or the `bootstrapped` Python library, as in the sketch below.
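A minimal sketch with `scipy.stats.bootstrap`, whose default method is BCa; the per-item scores here are simulated stand-ins for your eval's 0/1 correctness vector:

```python
import numpy as np
from scipy import stats

# Simulated per-item 0/1 correctness for one model on a 500-item eval set.
rng = np.random.default_rng(0)
scores = rng.binomial(1, 0.8, size=500).astype(float)

# BCa bootstrap CI for mean accuracy (10k resamples).
res = stats.bootstrap((scores,), np.mean, n_resamples=10_000,
                      confidence_level=0.95, method="BCa", random_state=rng)
lo, hi = res.confidence_interval
print(f"accuracy = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```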
Paired tests for two-model comparisons
- **Paired permutation** — shuffle which model 'owns' each item; compute the null distribution. No normality assumption.
- **McNemar's test** — paired binary outcomes (A right / B right). The standard for 'did the new model fix things without breaking others.'
- Don't run an unpaired t-test on the same eval set — it ignores the per-item pairing, inflates the variance estimate, and wastes power.
- Skip Welch's t-test on bounded scores like 0/1 accuracy; prefer exact tests.
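Both tests are a few lines in Python; this sketch assumes `scipy` for the sign-flip permutation test and `statsmodels` for McNemar's, with simulated per-item correctness standing in for real eval output:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(1)
a = rng.binomial(1, 0.82, 500)   # model A per-item correctness (simulated)
b = rng.binomial(1, 0.78, 500)   # model B on the SAME items

# Paired permutation: with a single sample, scipy flips the sign of each
# per-item difference to build the null distribution.
perm = stats.permutation_test((a - b,), np.mean, permutation_type="samples",
                              n_resamples=10_000, random_state=rng)
print("permutation p =", perm.pvalue)

# McNemar's exact test on the 2x2 agreement table (discordant cells drive it).
table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                  [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
print("McNemar p =", mcnemar(table, exact=True).pvalue)
```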
Inter-rater reliability for human eval
- **Cohen's kappa** — two annotators, nominal scale.
- **Krippendorff's alpha** — any number of annotators, any scale; missing data is fine.
- Alpha < 0.67 → guidelines are unclear; revise them before trusting any score.
- Compute IRR BEFORE you trust the labels as ground truth.
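A short sketch, assuming `scikit-learn` for kappa and the `krippendorff` PyPI package for alpha; the labels are made up:

```python
import numpy as np
import krippendorff                      # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Two annotators, nominal labels -> Cohen's kappa.
rater1 = ["good", "bad", "good", "good", "bad"]
rater2 = ["good", "bad", "bad",  "good", "bad"]
print("Cohen's kappa:", cohen_kappa_score(rater1, rater2))

# Three annotators, ordinal 1-4 scale, missing labels as np.nan.
# Rows = annotators, columns = items.
ratings = np.array([[1,      2, 3, 3, np.nan],
                    [1,      2, 3, 2, 4],
                    [np.nan, 3, 3, 3, 4]], dtype=float)
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=ratings,
                         level_of_measurement="ordinal"))
```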
Multi-comparison correction
- Run 20 metrics with no correction → ~64% chance of at least one false positive at p < 0.05 (1 − 0.95^20 ≈ 0.64).
- **Bonferroni** — divide alpha by the number of tests k; conservative.
- **Holm** — step-down Bonferroni; more powerful, same FWER control.
- **Benjamini-Hochberg** — controls FDR (less strict than FWER); good for exploratory dashboards.
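All three corrections are one call in `statsmodels` (a sketch with toy p-values):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.049, 0.12])  # toy p-values

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: reject={reject}, adjusted={np.round(p_adj, 3)}")
```

Note how `fdr_bh` keeps more discoveries than `bonferroni`/`holm`: that is the FDR-vs-FWER trade-off in action.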
Item Response Theory (IRT)
- Re-weight benchmark items by difficulty; HELM uses this approach in 2025+.
- Lets you compare models even when they faced slightly different items.
- Use `py-irt` or write a 2PL model in `pymc`/`numpyro`, as sketched below.
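One way to write the 2PL in `pymc`: a sketch with simulated responses and illustrative priors, not a calibrated model:

```python
import numpy as np
import pymc as pm

# responses[i, j] = 1 if model i solved item j (simulated placeholder data).
n_models, n_items = 5, 200
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(n_models, n_items))

with pm.Model():
    ability = pm.Normal("ability", 0.0, 1.0, shape=n_models)        # theta_i
    difficulty = pm.Normal("difficulty", 0.0, 1.0, shape=n_items)   # b_j
    discrim = pm.LogNormal("discrim", 0.0, 0.5, shape=n_items)      # a_j > 0
    # 2PL: P(model i solves item j) = sigmoid(a_j * (theta_i - b_j))
    logit_p = discrim[None, :] * (ability[:, None] - difficulty[None, :])
    pm.Bernoulli("obs", logit_p=logit_p, observed=responses)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

Fitting only the observed (model, item) pairs is what lets IRT compare models that faced different item subsets.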
Sample-size rule of thumb
- For an honest 95% CI of ±2pp: ~2,400 items in the worst case (50% accuracy), ~2,000 at 70% accuracy.
- For ±5pp: ~400 items.
- If the leaderboard delta < your CI half-width, the delta is noise.
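These figures come straight from the normal-approximation sample-size formula, n = z² · p(1 − p) / E², sketched here:

```python
import math

def n_items(p: float, half_width: float, z: float = 1.96) -> int:
    """Items needed for a normal-approx 95% CI of +/- half_width."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(n_items(0.5, 0.02))  # 2401 -- worst case (50% accuracy), +/-2pp
print(n_items(0.7, 0.02))  # 2017 -- at 70% accuracy, +/-2pp
print(n_items(0.5, 0.05))  # 385  -- worst case, +/-5pp
```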
Drift checks in production
- Compute weekly metric deltas with a paired-bootstrap CI (sketch below).
- Alert when metric_t − metric_{t−1} falls outside the historical 95% band.
- Re-run the golden eval set on every release; never trust 'looks fine in prod'.
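A paired-bootstrap drift check might look like this sketch (simulated weekly scores on the same golden set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prev = rng.binomial(1, 0.84, 800).astype(float)  # last release, per-item 0/1
curr = rng.binomial(1, 0.82, 800).astype(float)  # this release, SAME items

# Paired bootstrap on the per-item delta; alert if the 95% CI excludes 0.
res = stats.bootstrap((curr - prev,), np.mean, n_resamples=10_000,
                      method="BCa", random_state=rng)
lo, hi = res.confidence_interval
print(f"delta = {np.mean(curr - prev):+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
if hi < 0 or lo > 0:
    print("ALERT: metric shift unlikely to be resampling noise")
```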