Eval-driven AI.
Defensible by design.
8 micro-lessons · ~90 min · Real Docker images
AI Governance & Evaluation
Eval-driven LLM dev: Inspect AI, Ragas, PyRIT, AILuminate, RSP gates — the harness you ship every release on.
Real skills, real career delta.
Skills you'll gain (10)
- Eval Harness Operator · Production
Stand up Inspect AI, lm-evaluation-harness, HELM, Promptfoo and DeepEval in CI; run a 30-task suite against any chat endpoint and ship the JSON + HTML report.
- Capability Benchmark Specialist · Working
Read SWE-bench Verified, GPQA-Diamond, FrontierMath, ARC-AGI-2, MMLU-Pro and Aider-polyglot leaderboards critically; pick the right benchmark for the claim you're making.
- RAG Eval Engineer · Production
Build Ragas / TruLens / DeepEval pipelines that score faithfulness, answer-relevance and context-precision/recall on a versioned golden RAG dataset; gate releases on grounding regressions.
- Red-Team Operator · Production
Run automated jailbreak campaigns with PyRIT and Garak; map findings to MITRE ATLAS and OWASP LLM Top 10; produce a defensible red-team report.
- Safety Benchmark Auditor · Production
Run MLCommons AILuminate, HarmBench, JailbreakBench and AgentHarm; produce letter-graded safety reports defensible to procurement and frontier-launch reviewers.
- Eval Statistician · Working
Apply bootstrap CIs, paired-permutation, McNemar, Cohen's kappa, Krippendorff's alpha, Bonferroni / Holm / BH corrections; never trust a single number again.
- Continuous Eval CI/CD Engineer · Production
Wire Phoenix, Langfuse, OpenLLMetry into prod LLM apps; capture spans, replay through golden datasets, alert on drift; run eval as a GitHub Action that posts a delta table to every PR.
- Frontier Safety Evaluator · Advanced
Implement tier-gating in the style of a Responsible Scaling Policy (RSP), Preparedness Framework or Frontier Safety Framework (FSF): METR autonomy time-horizon, cyber CTF uplift proxy, AI R&D uplift; produce a traffic-light gate document.
- Human Eval Lead · Working
Stand up Argilla / Label Studio for SME annotation; design rubrics; compute inter-rater reliability (IRR); integrate human-graded results into the same dashboard as automatic metrics.
- Eval Report Author · Production
Author model cards, system cards, transparency notes and Annex IV technical-documentation packs; map results to NIST AI 600-1 sub-controls and EU AI Act Art. 15 declarations.
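To give a flavour of the Eval Statistician skill listed above, here is a minimal sketch of two of the named techniques: a paired percentile bootstrap confidence interval for the score difference between two models, and the Holm step-down correction for multiple comparisons. It is not course material; the function names `paired_bootstrap_ci` and `holm_reject`, the default resample count, and the toy scores are all assumptions made for illustration, and it uses only the Python standard library.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a) on paired eval items.

    Hypothetical helper: both lists hold per-item scores on the SAME items.
    Returns (point_estimate, lower, upper).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(diffs)
    # Resample item-level differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


if __name__ == "__main__":
    # Toy per-item pass/fail scores for a baseline and a candidate model.
    baseline = [0, 1, 0, 1, 0, 0, 1, 0]
    candidate = [1, 1, 0, 1, 1, 0, 1, 1]
    point, lower, upper = paired_bootstrap_ci(baseline, candidate)
    print(f"delta={point:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
    print(holm_reject([0.01, 0.04, 0.03]))
```

Resampling the per-item differences, rather than each model's scores independently, keeps the pairing intact, which is what usually makes the interval tight enough to be useful on small eval sets. An interval that spans zero is a reason not to report the delta as an improvement.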