GOVMOD.GOV-08 · v1.0

Eval-driven AI. Defensible by design.

8 micro-lessons · ~90 min · Real Docker images


AI Governance & Evaluation

Eval-driven LLM dev: Inspect AI, Ragas, PyRIT, AILuminate, RSP gates — the harness you ship every release on.

WHY THIS MATTERS · UK AISI · MLCOMMONS · NIST AI 600-1 · EU AI ACT ART. 15
EU AI Act Art. 15 requires high-risk AI systems to declare accuracy metrics, uncertainty, and adversarial robustness from August 2026. The UK and US AISI run pre-deployment Inspect AI evaluations on frontier-model releases. MLCommons AILuminate v1.0 is the third-party safety scoreboard procurement teams reference. This course teaches both the harness and the report.
WHAT YOU'LL LEARN
01 · The 2026 eval-driven LLM playbook
02 · Eval harness zoo & your first run
03 · Capability benchmarks that haven't saturated
04 · RAG eval with Ragas, TruLens, DeepEval
05 · Red-teaming as code: PyRIT, Garak, Promptfoo
06 · Safety benchmarks & the system card
07 · Statistical rigor & continuous eval in CI
08 · Frontier safety, human eval & audit-ready reports
YOU'LL BE ABLE TO
Run Inspect AI / Ragas / PyRIT / AILuminate / Phoenix end-to-end against any LLM endpoint.
Read leaderboards critically — bootstrap CIs, paired-permutation, Holm correction.
Build an internal RSP-shaped tier gate with METR autonomy + cyber + AgentHarm.
Author release-time bundles: model card, system card, NIST AI 600-1, EU AI Act Annex IV pack.
Convert prod traces into the next sprint's golden eval set.
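Reading leaderboards with bootstrap CIs, as promised above, fits in a few lines of plain Python. A minimal sketch, assuming two models scored 0/1 on the same item set (toy data; function name and numbers are illustrative, not from any specific harness):

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=5_000, alpha=0.05, seed=0):
    """Bootstrap CI on the mean per-item score delta between two models
    evaluated on the same items (resample item-level differences)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boots = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

# Toy data: model A passes 120/200 items, model B passes 100/200, same items.
a = [1] * 120 + [0] * 80
b = [1] * 100 + [0] * 100
lo, hi = paired_bootstrap_ci(a, b)
print(f"Δaccuracy 95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the accuracy gap survives resampling noise; if it straddles zero, the leaderboard gap may be sampling artifact.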
SKILLS YOU'LL GAIN

Real skills, real career delta.

  • Eval Harness Operator · Production

    Stand up Inspect AI, lm-evaluation-harness, HELM, Promptfoo and DeepEval in CI; run a 30-task suite against any chat endpoint and ship the JSON + HTML report.

  • Capability Benchmark Specialist · Working

    Read SWE-bench Verified, GPQA-Diamond, FrontierMath, ARC-AGI-2, MMLU-Pro and Aider-polyglot leaderboards critically; pick the right benchmark for the claim you're making.

  • RAG Eval Engineer · Production

    Build Ragas / TruLens / DeepEval pipelines that score faithfulness, answer-relevance and context-precision/recall on a versioned golden RAG dataset; gate releases on grounding regressions.

  • Red-Team Operator · Production

    Run automated jailbreak campaigns with PyRIT and Garak; map findings to MITRE ATLAS and OWASP LLM Top 10; produce a defensible red-team report.

  • Safety Benchmark Auditor · Production

    Run MLCommons AILuminate, HarmBench, JailbreakBench and AgentHarm; produce letter-graded safety reports defensible to procurement and frontier-launch reviewers.

  • Eval Statistician · Working

    Apply bootstrap CIs, paired-permutation, McNemar, Cohen's kappa, Krippendorff's alpha, Bonferroni / Holm / BH corrections; never trust a single number again.

  • Continuous Eval CI/CD Engineer · Production

    Wire Phoenix, Langfuse, OpenLLMetry into prod LLM apps; capture spans, replay through golden datasets, alert on drift; run eval as a GitHub Action that posts a delta table to every PR.

  • Frontier Safety Evaluator · Advanced

    Implement RSP / Preparedness / FSF tier-gating: METR autonomy time-horizon, cyber CTF uplift proxy, AI R&D uplift; produce a traffic-light gate document.

  • Human Eval Lead · Working

    Stand up Argilla / Label Studio for SME annotation; design rubrics; compute IRR; integrate human-graded results into the same dashboard as automatic metrics.

  • Eval Report Author · Production

    Author model cards, system cards, transparency notes and Annex IV technical-documentation packs; map results to NIST AI 600-1 sub-controls and EU AI Act Art. 15 declarations.
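The inter-rater reliability (IRR) step in the Human Eval Lead skill can be sketched in plain Python. Cohen's kappa between two SME annotators, with hypothetical pass/fail rubric labels:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement between two raters, corrected
    for the agreement expected by chance from each rater's label marginals."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two annotators grading the same 10 answers against a pass/fail rubric.
r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
r2 = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(r1, r2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 0 means the raters agree no better than chance (time to tighten the rubric); values above roughly 0.6 are commonly treated as substantial agreement.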

RUNNABLE ON YOUR MACHINE
$ docker pull snap/governance-eval:0.1
$ docker run --rm -it snap/governance-eval:0.1
QUICK PREVIEW · 7 MIN
VERIFIED ENGINEER REVIEWS
I built the inspect-eval-runner image into our CI in two days. PR-comment delta tables landed the next week. Instantly the most-cited eng artifact at our company.
@e.parker · VERIFIED ON GITHUB
The RSP-tier-gate lab finally gave me a defensible answer to 'how do you evaluate?' from procurement. Letter grade + traffic light + Annex IV zip — done.
@a.shah · VERIFIED ON GITHUB