CHEATSHEET · 01 · Pick the right eval for the job
Capability — does it do the task?
- **SWE-bench Verified** (500 human-verified GH issues) — coding agents.
- **GPQA-Diamond** (198 PhD-level science questions) — reasoning ceiling.
- **FrontierMath** (~300 research-grade problems) — math frontier.
- **ARC-AGI-2** (visual abstract reasoning) — generalisation gap.
- **MMLU-Pro** (12k questions × 10 options) — broad knowledge, post-MMLU saturation.
- **Aider polyglot** (225 Exercism exercises × 6 languages) — code-edit accuracy in a loop.
RAG / agent — does it ground & use tools?
- **Ragas** — faithfulness, answer-relevance, context-precision/recall, noise-sensitivity.
- **TruLens RAG Triad** — groundedness, answer-relevance, context-relevance.
- **DeepEval** — pytest-style RAG metrics for CI assertions (see the sketch after this list).
- **LongBench v2** — long-context reasoning up to 128k tokens.
- **GAIA** — general assistant tasks; humans ~92% vs GPT-4 + plugins ~15% at launch.
- **AgentBench v2** — 8 environments; **ToolBench / StableToolBench** — 16k+ APIs.
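A minimal sketch of a DeepEval-style CI assertion, using the `LLMTestCase` / `assert_test` names from DeepEval's docs; the inputs and thresholds here are hypothetical:

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_answer():
    case = LLMTestCase(
        input="What is our refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        retrieval_context=["Our policy allows refunds within 30 days."],
    )
    # assert_test raises if any metric falls below its threshold,
    # so a plain `pytest` run fails the build on a RAG regression.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7),
                       FaithfulnessMetric(threshold=0.7)])
```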
Safety — does it refuse the right things?
- **MLCommons AILuminate v1.0** — 12 hazard categories; five-tier grades from Poor to Excellent.
- **HarmBench** — 510 harmful behaviours, attacker–defender protocol.
- **JailbreakBench** — 100 behaviours, versioned attacks (PAIR, GCG).
- **AgentHarm** (UK AISI) — 110 harmful agent tasks across 11 categories.
- **BBQ / BBNLI** — stereotype/social-bias benchmarks for QA & NLI.
- **TruthfulQA** — imitative falsehoods (not general factuality).
Red-team — can adversaries break it?
- **PyRIT 0.5+** — orchestrator with a target/attacker/scorer triad and multi-turn memory.
- **Garak 0.10+** — 60+ probes with CVE-style probe IDs ('nmap for LLMs').
- **Promptfoo red-team** — YAML-driven harm packs, OWASP LLM Top 10 coverage.
- Map findings to **MITRE ATLAS** for a defensible taxonomy.
Frontier-risk — at scale, catastrophic uplift?
- **Anthropic RSP v3** (Feb 2026) — ASL-1 → ASL-4 tier gates; bioweapon, cyber, autonomy, and AI R&D thresholds.
- **OpenAI Preparedness v2** (Apr 2025) — Bio/Chem, Cyber, AI Self-Improvement; High / Critical thresholds.
- **DeepMind FSF v3** (Apr 2026) — critical capability levels (CCLs) across CBRN, cyber, ML R&D, deceptive alignment.
- **METR HCAST + RE-Bench** — autonomy time-horizon (doubling roughly every 7 months).
- **UK / US AISI Inspect suites** — pre-deployment access agreements.
CHEATSHEET · 02 · Statistical rigor: read leaderboards critically
Bootstrap confidence intervals
- Resample with replacement 1,000–10,000 times; BCa (bias-corrected and accelerated) intervals preferred.
- 200 items → roughly ±5–7pp 95% CI at 70–90% accuracy; 500 items → roughly ±3.5pp.
- Always report the CI alongside the mean, in tables AND plots.
- Use `scipy.stats.bootstrap` or the `bootstrapped` Python library, as in the sketch below.
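A minimal sketch with `scipy.stats.bootstrap`, whose default method is BCa; the per-item scores here are simulated stand-ins for your eval's 0/1 correctness vector:

```python
import numpy as np
from scipy import stats

# Simulated per-item 0/1 correctness for one model on a 500-item eval set.
rng = np.random.default_rng(0)
scores = rng.binomial(1, 0.8, size=500).astype(float)

# BCa bootstrap CI for mean accuracy (10k resamples).
res = stats.bootstrap((scores,), np.mean, n_resamples=10_000,
                      confidence_level=0.95, method="BCa", random_state=rng)
lo, hi = res.confidence_interval
print(f"accuracy = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```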
Paired tests for two-model comparisons
- **Paired permutation** — shuffle which model 'owns' each item; compute the null distribution. No normality assumption.
- **McNemar's test** — paired binary outcomes (A right / B right). The standard for 'did the new model fix things without breaking others.'
- Don't run an unpaired t-test on the same eval set — it ignores the per-item pairing, inflates the variance estimate, and wastes power.
- Skip Welch's t-test on bounded scores like 0/1 accuracy; prefer exact tests.
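Both tests are a few lines in Python; this sketch assumes `scipy` for the sign-flip permutation test and `statsmodels` for McNemar's, with simulated per-item correctness standing in for real eval output:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(1)
a = rng.binomial(1, 0.82, 500)   # model A per-item correctness (simulated)
b = rng.binomial(1, 0.78, 500)   # model B on the SAME items

# Paired permutation: with a single sample, scipy flips the sign of each
# per-item difference to build the null distribution.
perm = stats.permutation_test((a - b,), np.mean, permutation_type="samples",
                              n_resamples=10_000, random_state=rng)
print("permutation p =", perm.pvalue)

# McNemar's exact test on the 2x2 agreement table (discordant cells drive it).
table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                  [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
print("McNemar p =", mcnemar(table, exact=True).pvalue)
```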
Inter-rater reliability for human eval
- **Cohen's kappa** — two annotators, nominal scale.
- **Krippendorff's alpha** — any number of annotators, any scale; missing data is fine.
- Alpha < 0.67 → guidelines are unclear; revise them before trusting any score.
- Compute IRR BEFORE you trust the labels as ground truth.
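A short sketch, assuming `scikit-learn` for kappa and the `krippendorff` PyPI package for alpha; the labels are made up:

```python
import numpy as np
import krippendorff                      # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Two annotators, nominal labels -> Cohen's kappa.
rater1 = ["good", "bad", "good", "good", "bad"]
rater2 = ["good", "bad", "bad",  "good", "bad"]
print("Cohen's kappa:", cohen_kappa_score(rater1, rater2))

# Three annotators, ordinal 1-4 scale, missing labels as np.nan.
# Rows = annotators, columns = items.
ratings = np.array([[1,      2, 3, 3, np.nan],
                    [1,      2, 3, 2, 4],
                    [np.nan, 3, 3, 3, 4]], dtype=float)
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=ratings,
                         level_of_measurement="ordinal"))
```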
Multi-comparison correction
- Run 20 metrics with no correction → ~64% chance of at least one false positive at p < 0.05 (1 − 0.95^20 ≈ 0.64).
- **Bonferroni** — divide alpha by the number of tests k; conservative.
- **Holm** — step-down Bonferroni; more powerful, same FWER control.
- **Benjamini-Hochberg** — controls FDR (less strict than FWER); good for exploratory dashboards.
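All three corrections are one call in `statsmodels` (a sketch with toy p-values):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.049, 0.12])  # toy p-values

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: reject={reject}, adjusted={np.round(p_adj, 3)}")
```

Note how `fdr_bh` keeps more discoveries than `bonferroni`/`holm`: that is the FDR-vs-FWER trade-off in action.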
Item Response Theory (IRT)
- Re-weight benchmark items by difficulty; HELM uses this approach in 2025+.
- Lets you compare models even when they faced slightly different items.
- Use `py-irt` or write a 2PL model in `pymc`/`numpyro`, as sketched below.
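One way to write the 2PL in `pymc`: a sketch with simulated responses and illustrative priors, not a calibrated model:

```python
import numpy as np
import pymc as pm

# responses[i, j] = 1 if model i solved item j (simulated placeholder data).
n_models, n_items = 5, 200
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(n_models, n_items))

with pm.Model():
    ability = pm.Normal("ability", 0.0, 1.0, shape=n_models)        # theta_i
    difficulty = pm.Normal("difficulty", 0.0, 1.0, shape=n_items)   # b_j
    discrim = pm.LogNormal("discrim", 0.0, 0.5, shape=n_items)      # a_j > 0
    # 2PL: P(model i solves item j) = sigmoid(a_j * (theta_i - b_j))
    logit_p = discrim[None, :] * (ability[:, None] - difficulty[None, :])
    pm.Bernoulli("obs", logit_p=logit_p, observed=responses)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

Fitting only the observed (model, item) pairs is what lets IRT compare models that faced different item subsets.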
Sample-size rule of thumb
- For an honest 95% CI of ±2pp: ~2,400 items in the worst case (50% accuracy), ~2,000 at 70% accuracy.
- For ±5pp: ~400 items.
- If the leaderboard delta < your CI half-width, the delta is noise.
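These figures come straight from the normal-approximation sample-size formula, n = z² · p(1 − p) / E², sketched here:

```python
import math

def n_items(p: float, half_width: float, z: float = 1.96) -> int:
    """Items needed for a normal-approx 95% CI of +/- half_width."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(n_items(0.5, 0.02))  # 2401 -- worst case (50% accuracy), +/-2pp
print(n_items(0.7, 0.02))  # 2017 -- at 70% accuracy, +/-2pp
print(n_items(0.5, 0.05))  # 385  -- worst case, +/-5pp
```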
Drift checks in production
- Compute weekly metric deltas with a paired-bootstrap CI (sketch below).
- Alert when metric_t − metric_{t−1} falls outside the historical 95% band.
- Re-run the golden eval set on every release; never trust 'looks fine in prod'.
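A paired-bootstrap drift check might look like this sketch (simulated weekly scores on the same golden set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prev = rng.binomial(1, 0.84, 800).astype(float)  # last release, per-item 0/1
curr = rng.binomial(1, 0.82, 800).astype(float)  # this release, SAME items

# Paired bootstrap on the per-item delta; alert if the 95% CI excludes 0.
res = stats.bootstrap((curr - prev,), np.mean, n_resamples=10_000,
                      method="BCa", random_state=rng)
lo, hi = res.confidence_interval
print(f"delta = {np.mean(curr - prev):+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
if hi < 0 or lo > 0:
    print("ALERT: metric shift unlikely to be resampling noise")
```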