Inspect AI in CI — AISI-grade harness as a GitHub Action
An Inspect AI 0.3.x harness wired to HarmBench, AILuminate, and a custom domain set; it runs as a GitHub Action and posts a Markdown eval report on every PR.
```yaml
version: "3.9"
services:
  runner:
    image: snap/inspect-eval-runner:0.1
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - TARGET_MODEL=${TARGET_MODEL:-anthropic/claude-opus-4-7}
      - DATASET_PATH=/app/datasets/policy_qa.jsonl
      - THRESHOLD_FILE=/app/eval_thresholds.yaml
    volumes:
      - ./datasets:/app/datasets:ro
      - ./out:/app/out
      - ./eval_thresholds.yaml:/app/eval_thresholds.yaml:ro
    command: ["sh", "-lc", "inspect eval policy_faithfulness.py --model ${TARGET_MODEL} --log-dir /app/out/logs && python /app/render_summary.py"]
```
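The command chains into python /app/render_summary.py, which turns the eval logs into the Markdown report. A stdlib-only sketch under stated assumptions: the logs were written as JSON, and each log exposes metric values under results.scores[*].metrics.&lt;name&gt;.value. That layout and the pass/fail threshold handling are assumptions about this image, not Inspect's documented log contract:

```python
# render_summary.py -- hypothetical report renderer for this image.
# Assumes JSON eval logs whose metrics live at
# results.scores[*].metrics.<name>.value -- an assumed layout,
# not Inspect's documented log schema.
import json
from pathlib import Path


def render_summary(log_dir: str, threshold: float) -> str:
    """Render a Markdown table with one row per metric across all logs."""
    rows = ["| task | metric | value | pass |", "|---|---|---|---|"]
    all_passed = True
    for log in sorted(Path(log_dir).glob("*.json")):
        data = json.loads(log.read_text())
        task_name = data.get("eval", {}).get("task", log.stem)
        for score in data.get("results", {}).get("scores", []):
            for name, metric in score.get("metrics", {}).items():
                value = float(metric["value"])
                passed = value >= threshold
                all_passed = all_passed and passed
                rows.append(f"| {task_name} | {name} | {value:.3f} | "
                            f"{'pass' if passed else 'FAIL'} |")
    title = "## Eval report" + ("" if all_passed else " (FAILED)")
    return title + "\n\n" + "\n".join(rows) + "\n"
```

In the container, a thin entry point would call this with the log dir and the threshold loaded from eval_thresholds.yaml, write the result to /app/out/report.md, and exit non-zero on any FAIL row so CI can gate the merge.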
Add a .github/workflows/eval.yml to your repo that pulls this image and mounts your golden dataset. Now every PR that touches the prompt or model config triggers an eval run, the report posts as a PR comment, and merge is gated on the thresholds. That's eval-driven LLM development as a product engineer experiences it, and the same image scales to a nightly job against your staging endpoint.
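A minimal workflow along those lines might look like this. The path filters, repo layout, and comment step are illustrative; it assumes the gh CLI is present in the image and that the workflow's GITHUB_TOKEN has pull-request write permission:

```yaml
name: eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "model_config/**"
      - "datasets/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    container: snap/inspect-eval-runner:0.1
    permissions:
      pull-requests: write
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Run evals
        run: |
          inspect eval policy_faithfulness.py \
            --model "${TARGET_MODEL:-anthropic/claude-opus-4-7}" \
            --log-dir out/logs
          python render_summary.py
      - name: Post report as PR comment
        if: always()
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr comment "${{ github.event.pull_request.number }}" \
            --body-file out/report.md
```

Because the eval step exits non-zero when a threshold is missed, marking this job as a required status check is what actually gates the merge.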