GOVMOD.GOV-08 · v1.0

Eval-driven AI. Defensible by design.

8 micro-lessons · ~90 min · Real Docker images


AI Governance & Evaluation

Eval-driven LLM dev: Inspect AI, Ragas, PyRIT, AILuminate, RSP gates — the harness you ship every release on.

WHY THIS MATTERS · UK AISI · MLCOMMONS · NIST AI 600-1 · EU AI ACT ART. 15
EU AI Act Art. 15 requires high-risk AI systems to declare accuracy metrics, uncertainty, and adversarial robustness from August 2026. The UK and US AISI run pre-deployment Inspect AI evaluations on frontier-model releases. MLCommons AILuminate v1.0 is the third-party safety scoreboard procurement teams reference. This course teaches both the harness and the report.
WHAT YOU'LL LEARN
01 · The 2026 eval-driven LLM playbook
02 · Eval harness zoo & your first run
03 · Capability benchmarks that haven't saturated
04 · RAG eval with Ragas, TruLens, DeepEval
05 · Red-teaming as code: PyRIT, Garak, Promptfoo
06 · Safety benchmarks & the system card
07 · Statistical rigor & continuous eval in CI
08 · Frontier safety, human eval & audit-ready reports
YOU'LL BE ABLE TO
Run Inspect AI / Ragas / PyRIT / AILuminate / Phoenix end-to-end against any LLM endpoint.
Read leaderboards critically — bootstrap CIs, paired-permutation, Holm correction.
Build an internal RSP-shaped tier gate with METR autonomy + cyber + AgentHarm.
Author release-time bundles: model card, system card, NIST AI 600-1, EU AI Act Annex IV pack.
Convert prod traces into the next sprint's golden eval set.
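Reading leaderboards with bootstrap CIs, as promised above, fits in a few lines of plain Python. A minimal sketch, assuming two models scored 0/1 on the same item set (toy data; function name and numbers are illustrative, not from any specific harness):

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=5_000, alpha=0.05, seed=0):
    """Bootstrap CI on the mean per-item score delta between two models
    evaluated on the same items (resample item-level differences)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boots = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

# Toy data: model A passes 120/200 items, model B passes 100/200, same items.
a = [1] * 120 + [0] * 80
b = [1] * 100 + [0] * 100
lo, hi = paired_bootstrap_ci(a, b)
print(f"Δaccuracy 95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the accuracy gap survives resampling noise; if it straddles zero, the leaderboard gap may be sampling artifact.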
SKILLS YOU'LL GAIN

Real skills, real career delta.

  • Eval Harness Operator · Production

    Stand up Inspect AI, lm-evaluation-harness, HELM, Promptfoo and DeepEval in CI; run a 30-task suite against any chat endpoint and ship the JSON + HTML report.

  • Capability Benchmark Specialist · Working

    Read SWE-bench Verified, GPQA-Diamond, FrontierMath, ARC-AGI-2, MMLU-Pro and Aider-polyglot leaderboards critically; pick the right benchmark for the claim you're making.

  • RAG Eval Engineer · Production

    Build Ragas / TruLens / DeepEval pipelines that score faithfulness, answer-relevance and context-precision/recall on a versioned golden RAG dataset; gate releases on grounding regressions.

  • Red-Team Operator · Production

    Run automated jailbreak campaigns with PyRIT and Garak; map findings to MITRE ATLAS and OWASP LLM Top 10; produce a defensible red-team report.

  • Safety Benchmark Auditor · Production

    Run MLCommons AILuminate, HarmBench, JailbreakBench and AgentHarm; produce letter-graded safety reports defensible to procurement and frontier-launch reviewers.

  • Eval Statistician · Working

    Apply bootstrap CIs, paired-permutation, McNemar, Cohen's kappa, Krippendorff's alpha, Bonferroni / Holm / BH corrections; never trust a single number again.

  • Continuous Eval CI/CD Engineer · Production

    Wire Phoenix, Langfuse, OpenLLMetry into prod LLM apps; capture spans, replay through golden datasets, alert on drift; run eval as a GitHub Action that posts a delta table to every PR.

  • Frontier Safety Evaluator · Advanced

    Implement RSP / Preparedness / FSF tier-gating: METR autonomy time-horizon, cyber CTF uplift proxy, AI R&D uplift; produce a traffic-light gate document.

  • Human Eval Lead · Working

    Stand up Argilla / Label Studio for SME annotation; design rubrics; compute IRR; integrate human-graded results into the same dashboard as automatic metrics.

  • Eval Report Author · Production

    Author model cards, system cards, transparency notes and Annex IV technical-documentation packs; map results to NIST AI 600-1 sub-controls and EU AI Act Art. 15 declarations.
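The inter-rater reliability (IRR) step in the Human Eval Lead skill can be sketched in plain Python. Cohen's kappa between two SME annotators, with hypothetical pass/fail rubric labels:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement between two raters, corrected
    for the agreement expected by chance from each rater's label marginals."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two annotators grading the same 10 answers against a pass/fail rubric.
r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
r2 = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(r1, r2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 0 means the raters agree no better than chance (time to tighten the rubric); values above roughly 0.6 are commonly treated as substantial agreement.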

RUNNABLE ON YOUR MACHINE
$ docker pull snap/governance-eval:0.1
$ docker run --rm -it snap/governance-eval:0.1
QUICK PREVIEW · 7 MIN
VERIFIED ENGINEER REVIEWS
I built the inspect-eval-runner image into our CI in two days. PR-comment delta tables landed the next week. Instantly the most-cited eng artifact at our company.
@e.parker · VERIFIED ON GITHUB
The RSP-tier-gate lab finally gave me a defensible answer to 'how do you evaluate?' from procurement. Letter grade + traffic light + Annex IV zip — done.
@a.shah · VERIFIED ON GITHUB