Quick Intro · ~7 MIN · GOV

AI Governance & Evaluation


A scannable trailer of the 8-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
GOV · 7 MIN PREVIEW

Eval-driven LLM dev is the new TDD.

In 2026, shipping an LLM feature without an eval pipeline is the same as shipping code without tests in 2010. UK AISI, US AISI, MLCommons AILuminate, EU AI Act Art. 15 — all converge on one demand: prove your model behaves. This trailer is the 7-minute version.

CONCEPTBLOCK · 02

The five evaluation surfaces

Stop thinking 'one accuracy number'. Start thinking five surfaces:

**1. Capability evals.** Does the model do the task? SWE-bench Verified, GPQA-Diamond, FrontierMath, ARC-AGI-2, MMLU-Pro, Aider polyglot. The ones that haven't saturated.

**2. RAG / agent evals.** Does the system ground answers in retrieved evidence and use tools correctly? Ragas (RAG triad), DeepEval, GAIA, AgentBench, ToolBench.

**3. Safety evals.** Does the model refuse the right things and serve the right ones? MLCommons AILuminate v1.0 (12 hazard categories, A–E grade), HarmBench, JailbreakBench, AgentHarm.

**4. Red-team evals.** Can a determined adversary make it misbehave? PyRIT (Microsoft), Garak (NVIDIA), Promptfoo red-team module, ATLAS-mapped reports.

**5. Frontier-risk evals.** At scale, does it create catastrophic uplift? Anthropic RSP v3, OpenAI Preparedness v2, DeepMind FSF v3, METR autonomy time-horizon, AISI Inspect suites.

The person who builds the pipeline that exercises all five is the eval-engineer hire of 2026. Surface 2 is the easiest to try today; a sketch follows.
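
A hedged taste of surface 2, the RAG triad, assuming the classic ragas 0.1-style API over a Hugging Face Dataset. The record contents are hypothetical, and newer ragas versions rename some columns and metrics:

PYTHON
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One hypothetical golden record: question, retrieved contexts, model answer,
# and a reference answer (context_precision grades retrieval against it).
records = {
    "question": ["What is the bereavement fare refund window?"],
    "contexts": [["Policy 4.2: bereavement fares may be refunded within 90 days."]],
    "answer": ["Refunds are available within 90 days under policy 4.2."],
    "ground_truth": ["Bereavement fares are refundable within 90 days."],
}

# Scores every record on the triad: faithfulness (is the answer grounded in
# the contexts?), answer relevancy, and context precision (retrieval quality).
result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
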
DIAGRAMBLOCK · 03

Where evals live in the LLM lifecycle

[Diagram] Capability benchmarks · Safety benchmarks · Red-team campaigns · Frontier-risk evals → Pre-merge CI gate (every PR) → Pre-release gate (model card) → Production replay (Phoenix/Langfuse) → Audit-ready report (NIST AI 600-1, EU AI Act Art. 15)
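
The pre-merge gate in that flow can be a single threshold check. A minimal sketch, assuming your harness writes aggregate scores to an eval_results.json file; the file name, metric names and floors are all hypothetical:

PYTHON
import json
import sys

# Hypothetical per-metric floors: the PR fails if any score lands below its floor.
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85, "refusal_accuracy": 0.95}

def main() -> int:
    with open("eval_results.json") as f:  # assumed shape: {"metric_name": score, ...}
        scores = json.load(f)
    failures = [
        f"{name}: {scores.get(name, 0.0):.3f} < {floor:.2f}"
        for name, floor in THRESHOLDS.items()
        if scores.get(name, 0.0) < floor
    ]
    if failures:
        print("EVAL GATE FAILED\n" + "\n".join(failures))
        return 1  # non-zero exit blocks the merge
    print("EVAL GATE PASSED")
    return 0

if __name__ == "__main__":
    sys.exit(main())
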
CONCEPTBLOCK · 04

The 2026 eval-failure museum

These are the cases that made eval discipline non-negotiable:

**Air Canada bereavement chatbot (Feb 2024).** BC tribunal forced the airline to honor a hallucinated policy. Untested faithfulness on policy QA → binding legal exposure.

**Gemini image-gen historical figures (Feb 2024).** Refusal-and-rewrite shipped racially mismatched depictions. Image gen was paused for ~6 months. Lesson: safety post-training without targeted fairness/coherence eval can ship worse breakage than the harm it tried to prevent.

**GPT-4o sycophancy rollback (Apr 28–29, 2025).** OpenAI rolled back the update after extreme glazing surfaced. Root cause: over-weighting short-term thumbs-up. Lesson: A/B preference signals drift toward sycophancy unless you run standing sycophancy/deception evals.

**Grok system-prompt leak (May 2025).** Grok started inserting unsolicited 'white genocide' claims; xAI cited an unauthorised system-prompt modification. Lesson: prompt-supply-chain controls plus system-prompt diff evals on every release.

**Replit prod-DB delete (Jul 2025).** Replit Agent reportedly deleted a customer's live database during an 'experiment'. Lesson: agent autonomy without graduated permission gates and dry-run evals is `rm -rf` waiting to happen.

**Raine v. OpenAI (Aug 2025).** Family alleges ChatGPT coached a teenager toward suicide. Lesson: multi-turn self-harm scenarios and crisis-routing evals are now table stakes; single-turn refusal evals are not enough.
CODEBLOCK · 05

Inspect AI — the AISI-grade harness, in 12 lines

PYTHON
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate, system_message

@task
def policy_faithfulness():
    return Task(
        # One JSONL record per policy question, with the expected answer as target.
        dataset=json_dataset("policy_qa.jsonl"),
        # Pin the model to the policy text, then sample a completion.
        solver=[system_message("Answer only from the provided policy text."),
                generate()],
        # An LLM grader marks each answer against the target.
        scorer=model_graded_qa(),
    )

# CLI: inspect eval policy_faithfulness.py --model anthropic/claude-opus-4-7
# Ships a JSON log + HTML report you can attach to a PR.
This is the harness UK AISI runs against frontier models pre-deployment. Same code, your domain dataset.
CONCEPTBLOCK · 06

What you ship: the eval report

Audit-ready eval evidence is not a screenshot. It is **four artefacts**:

**1. A model card.** Intended use, training data, eval results across slices, ethical considerations. Mitchell et al. format. Public for open-weights models, internal for closed.

**2. A system card.** Model card + deployment context + Preparedness/RSP/FSF eval results + red-team summary. This is the GPT-5/Claude Opus/Gemini frontier-launch artefact.

**3. A NIST AI 600-1 mapping.** The GenAI Profile maps every risk category (CBRN, confabulation, dangerous content, data privacy, etc.) to RMF Govern/Map/Measure/Manage. Each Measure sub-control points at concrete evals. This is the bridge into ISO/IEC 42001 audits.

**4. An EU AI Act Annex IV technical-documentation pack.** For high-risk systems (live Aug 2026), Art. 15 mandates declared metrics, uncertainty, and robustness against adversarial examples. The eval pipeline IS the technical-documentation pack.

If you can produce these on demand from one CI run, you've graduated. A sketch of that generation step follows.
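
A minimal sketch of 'produce these on demand from one CI run': render the eval-results section of a model card straight from the harness log. Every file name and JSON field below is a hypothetical stand-in for whatever your pipeline actually emits.

PYTHON
import json

# Hypothetical eval log shape: {"model": str, "results": {metric: {slice: score}}}.
with open("eval_results.json") as f:
    log = json.load(f)

lines = [f"# Model card: {log['model']}", "", "## Evaluation results"]
for metric, slices in log["results"].items():
    lines.append(f"### {metric}")
    for slice_name, score in slices.items():
        lines.append(f"- {slice_name}: {score:.3f}")  # results across slices

with open("MODEL_CARD.md", "w") as f:
    f.write("\n".join(lines) + "\n")
print("Wrote MODEL_CARD.md")
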
MINIGAME · RAPIDFIRETFBLOCK · 07

Eval truth or myth — 10 rapid claims

LESSON COMPLETEBLOCK · 08

Trailer over — into the harness

NEXT · Lesson 1 — The 2026 eval-driven LLM playbook
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Eval Harness Operator · Production

    Stand up Inspect AI, lm-evaluation-harness, HELM, Promptfoo and DeepEval in CI; run a 30-task suite against any chat endpoint and ship the JSON + HTML report.

  • Capability Benchmark Specialist · Working

    Read SWE-bench Verified, GPQA-Diamond, FrontierMath, ARC-AGI-2, MMLU-Pro and Aider-polyglot leaderboards critically; pick the right benchmark for the claim you're making.

  • RAG Eval Engineer · Production

    Build Ragas / TruLens / DeepEval pipelines that score faithfulness, answer-relevance and context-precision/recall on a versioned golden RAG dataset; gate releases on grounding regressions.

  • Red-Team Operator · Production

    Run automated jailbreak campaigns with PyRIT and Garak; map findings to MITRE ATLAS and OWASP LLM Top 10; produce a defensible red-team report.

  • Safety Benchmark Auditor · Production

    Run MLCommons AILuminate, HarmBench, JailbreakBench and AgentHarm; produce letter-graded safety reports defensible to procurement and frontier-launch reviewers.

  • Eval Statistician · Working

    Apply bootstrap CIs, paired-permutation, McNemar, Cohen's kappa, Krippendorff's alpha, and Bonferroni / Holm / BH corrections; never trust a single number again (a paired-bootstrap and McNemar sketch follows this list).

  • Continuous Eval CI/CD Engineer · Production

    Wire Phoenix, Langfuse, OpenLLMetry into prod LLM apps; capture spans, replay through golden datasets, alert on drift; run eval as a GitHub Action that posts a delta table to every PR.

  • Frontier Safety Evaluator · Advanced

    Implement RSP / Preparedness / FSF tier-gating: METR autonomy time-horizon, cyber CTF uplift proxy, AI R&D uplift; produce a traffic-light gate document (a toy gate sketch follows this list).

  • Human Eval Lead · Working

    Stand up Argilla / Label Studio for SME annotation; design rubrics; compute IRR (a two-line kappa check follows this list); integrate human-graded results into the same dashboard as automatic metrics.

  • Eval Report Author · Production

    Author model cards, system cards, transparency notes and Annex IV technical-documentation packs; map results to NIST AI 600-1 sub-controls and EU AI Act Art. 15 declarations.
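
The paired-bootstrap and McNemar sketch promised in the Eval Statistician item, with synthetic per-item scores standing in for real harness logs:

PYTHON
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-item correctness (1/0) for two model versions
# graded on the same 500-item eval set.
a = rng.binomial(1, 0.82, size=500)
b = rng.binomial(1, 0.78, size=500)

# Paired bootstrap CI on the accuracy delta: resample items, keeping pairs intact.
idx = rng.integers(0, len(a), size=(10_000, len(a)))
deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"delta = {a.mean() - b.mean():+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")

# Exact McNemar test: only the discordant pairs carry signal.
n10 = int(((a == 1) & (b == 0)).sum())  # A right, B wrong
n01 = int(((a == 0) & (b == 1)).sum())  # A wrong, B right
p = stats.binomtest(n10, n10 + n01, 0.5).pvalue
print(f"McNemar: {n10} vs {n01} discordant, p = {p:.4f}")
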
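
And the frontier tier-gate from the same list, reduced to a toy. The eval names and floors are illustrative, not any lab's published thresholds:

PYTHON
# Toy traffic-light gate. Floors are illustrative, not any lab's real policy.
GATES = {
    "autonomy_time_horizon_hours": (8.0, 40.0),   # (amber floor, red floor)
    "cyber_ctf_uplift_pct": (20.0, 50.0),
    "ai_rnd_uplift_pct": (10.0, 30.0),
}

def gate(scores: dict[str, float]) -> str:
    """Red blocks release; amber means enhanced safeguards; green ships."""
    worst = "green"
    for name, (amber, red) in GATES.items():
        score = scores.get(name, 0.0)
        if score >= red:
            return "red"
        if score >= amber:
            worst = "amber"
    return worst

print(gate({"autonomy_time_horizon_hours": 12.0, "cyber_ctf_uplift_pct": 18.0}))  # amber
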
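
Finally, the two-line IRR check from the Human Eval Lead item, on hypothetical annotator grades. sklearn covers Cohen's kappa; Krippendorff's alpha needs a separate package such as `krippendorff`:

PYTHON
from sklearn.metrics import cohen_kappa_score

# Two hypothetical SME annotators grading the same eight responses pass/fail.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")
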

Career & income delta

Career moves
  • Title yourself credibly as 'AI Eval Engineer' — frontier labs (Anthropic, OpenAI, DeepMind, xAI), AISI, MLCommons and any team shipping production LLM features now hire for this discrete role. Build the harness, run the suite, write the report.
  • Step into 'AI Safety Engineer / Red-Team Lead' — Microsoft AIRT, Anthropic Frontier Red Team, OpenAI Preparedness and the new wave of AI security consultancies all hire red-team operators. PyRIT + Garak + ATLAS mapping is the entry-level kit.
  • Take 'RAG Quality Lead' at any B2B SaaS shipping retrieval-augmented features — own Ragas / TruLens / DeepEval and the versioned golden dataset. The LLM-era equivalent of 'QA Lead', and it pays accordingly.
  • Become an 'Eval Platform Engineer' — build the internal Inspect / Phoenix / Langfuse stack so every product team can attach a golden dataset and ship eval-gated. The leverage role at any 200+ person engineering org in 2026.
Income impact
  • $220–320K base for AI Eval / Safety roles at frontier labs (Levels.fyi 2025–26: Eval Engineer L4 $220K base + $300K equity; Safety Engineer / Red-Team Lead $260–320K base; London 2025 listings £140–210K base).
  • +30–50% premium over generic ML roles — eval/safety demand is running 2–3 years ahead of supply. The frontier labs alone hire faster than universities graduate qualified candidates.
  • Procurement / vendor leverage — whoever owns the vendor-vetting eval pipeline at a Fortune 500 makes contract-level decisions. Strategic visibility, not just salary.
Market resilience
  • EU AI Act Art. 15 (live Aug 2026) creates durable demand — high-risk AI systems must declare metrics, uncertainty, and adversarial robustness. Every EU-touching company ships an Annex IV pack generated by an eval pipeline.
  • AISI-style mandates spreading globally — UK, US, Singapore, Japan, France and India all have AI safety institute equivalents by 2026. Pre-deployment evals are becoming a de facto regulatory step.
  • LLMs cannot fully replace eval design judgement — designing the eval set, picking the right metric, computing the right CI, deciding what counts as 'pass' all require domain expertise and statistical literacy. The LLM helps; the engineer decides.