Inspect AI in CI — AISI-grade harness as a GitHub Action
An Inspect AI 0.3.x harness wired to HarmBench, AILuminate, and a custom domain set; it runs as a GitHub Action and posts a Markdown eval report on every PR.
```yaml
version: "3.9"
services:
  runner:
    image: snap/inspect-eval-runner:0.1
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - TARGET_MODEL=${TARGET_MODEL:-anthropic/claude-opus-4-7}
      - DATASET_PATH=/app/datasets/policy_qa.jsonl
      - THRESHOLD_FILE=/app/eval_thresholds.yaml
    volumes:
      - ./datasets:/app/datasets:ro
      - ./out:/app/out
      - ./eval_thresholds.yaml:/app/eval_thresholds.yaml:ro
    command: ["sh", "-lc", "inspect eval policy_faithfulness.py --model ${TARGET_MODEL} --log-dir /app/out/logs && python /app/render_summary.py"]
```
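The command chains into python /app/render_summary.py, which turns the eval logs into the Markdown report. A stdlib-only sketch under stated assumptions: the logs were written as JSON, and each log exposes metric values under results.scores[*].metrics.&lt;name&gt;.value. That layout and the pass/fail threshold handling are assumptions about this image, not Inspect's documented log contract:

```python
# render_summary.py -- hypothetical report renderer for this image.
# Assumes JSON eval logs whose metrics live at
# results.scores[*].metrics.<name>.value -- an assumed layout,
# not Inspect's documented log schema.
import json
from pathlib import Path


def render_summary(log_dir: str, threshold: float) -> str:
    """Render a Markdown table with one row per metric across all logs."""
    rows = ["| task | metric | value | pass |", "|---|---|---|---|"]
    all_passed = True
    for log in sorted(Path(log_dir).glob("*.json")):
        data = json.loads(log.read_text())
        task_name = data.get("eval", {}).get("task", log.stem)
        for score in data.get("results", {}).get("scores", []):
            for name, metric in score.get("metrics", {}).items():
                value = float(metric["value"])
                passed = value >= threshold
                all_passed = all_passed and passed
                rows.append(f"| {task_name} | {name} | {value:.3f} | "
                            f"{'pass' if passed else 'FAIL'} |")
    title = "## Eval report" + ("" if all_passed else " (FAILED)")
    return title + "\n\n" + "\n".join(rows) + "\n"
```

In the container, a thin entry point would call this with the log dir and the threshold loaded from eval_thresholds.yaml, write the result to /app/out/report.md, and exit non-zero on any FAIL row so CI can gate the merge.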
Add a .github/workflows/eval.yml to your repo that pulls this image and mounts your golden dataset. Now every PR that touches the prompt or model config triggers an eval run, the report posts as a PR comment, and merge is gated on the thresholds. That's eval-driven LLM development as a product engineer experiences it, and the same image scales to a nightly job against your staging endpoint.
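A minimal workflow along those lines might look like this. The path filters, repo layout, and comment step are illustrative; it assumes the gh CLI is present in the image and that the workflow's GITHUB_TOKEN has pull-request write permission:

```yaml
name: eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "model_config/**"
      - "datasets/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    container: snap/inspect-eval-runner:0.1
    permissions:
      pull-requests: write
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Run evals
        run: |
          inspect eval policy_faithfulness.py \
            --model "${TARGET_MODEL:-anthropic/claude-opus-4-7}" \
            --log-dir out/logs
          python render_summary.py
      - name: Post report as PR comment
        if: always()
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr comment "${{ github.event.pull_request.number }}" \
            --body-file out/report.md
```

Because the eval step exits non-zero when a threshold is missed, marking this job as a required status check is what actually gates the merge.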