INTROBLOCK · 01
QA · 7 MIN PREVIEW
AI for QA / Test Engineers
LLM-as-judge done right. Evaluation harnesses that bite. Adversarial probes in CI. Regression benches that survive model upgrades.
CONCEPTBLOCK · 02
Testing LLMs is testing distributions, not values
A traditional test asserts an exact value: assertEqual(2 + 2, 4). LLM tests assert a distribution: 'over 200 questions, recall@10 must be >= 0.78 and answer-faithfulness must be >= 0.85'. The unit-test reflex (one prompt, one expected output) is wrong — it's flaky and tells you nothing. The fix is to build evaluation suites with real labels, score them with a mix of reference-based metrics (BLEU, ROUGE, semantic similarity) and LLM-as-judge, and gate merges on aggregate scores moving the right direction.
TIPScore your evaluations with both a strict numeric metric AND an LLM judge. The numeric one catches drift; the judge catches subtle quality.
WATCH OUTLLM-as-judge with the same model that generated the answer is a correlation trap. Use a different family for judging.
DIAGRAMBLOCK · 03
Evaluation harness anatomy
Dual scoring (judge + metric). One gate. PR fails if scores drop.
CODEBLOCK · 04
LLM-as-judge in 12 lines
PYTHON1from openai import OpenAI
2import json
3client = OpenAI()
4
5JUDGE_PROMPT = """You are a strict grader. Given a question, a reference, and a candidate answer,
6return JSON {"faithful": 0/1, "complete": 0/1, "reason": "..."}."""
7
8def judge(q, ref, ans):
9 out = client.chat.completions.create(
10 model="gpt-4o", # different family from SUT
11 response_format={"type": "json_object"},
12 messages=[{"role": "system", "content": JUDGE_PROMPT},
13 {"role": "user", "content": f"Q: {q}\nREF: {ref}\nANS: {ans}"}])
14 return json.loads(out.choices[0].message.content)
Use a different model family for the judge. response_format=json_object guarantees parseable output.
CHEATSHEETBLOCK · 05
Five things to remember
01Score distributions, not single prompts. Aim for >=200 cases per suite.
02Dual-score: a numeric metric + an LLM judge.
03Judge with a different model family than the SUT.
04Pin the case set in git. Drift in the dataset = drift in the meaning of pass.
05Adversarial probes: known-bad prompts must fail closed (refusal, sanitisation).
MINIGAME · RAPIDFIRETFBLOCK · 06
True or false: 6 seconds each
An LLM can grade itself reliably.
CLAIM 1/5 · READY · scroll into view
LESSON COMPLETEBLOCK · 07
QA mental model: locked.
NEXTLLM-as-judge harness with pytest
WHAT YOU'LL WALK AWAY WITH
Real skills, real career delta.
Skills you'll gain
04- Build LLM-as-judge that doesn't hallucinateWorking
Outcome from completing the course: build llm-as-judge that doesn't hallucinate.
- Set up regression benches that biteWorking
Outcome from completing the course: set up regression benches that bite.
- Run adversarial probes in CIWorking
Outcome from completing the course: run adversarial probes in ci.
- Eval harnessesWorking
Covered in lesson sequence — drop-in ready.
Career & income delta
Career moves
- Lead a AI for QA / Test Engineers initiative on your team — most orgs have it on the roadmap and few have shipped it.
- Consulting work at $150-300/hr — 'QA shipped to production' is a sought-after specialty in 2026.
- Move from generic IC to platform/AI-platform team where AI for QA / Test Engineers expertise is the entry ticket.
Income impact
- $15-40K bump for senior ICs adding AI for QA / Test Engineers to their resume.
- Freelance / consulting demand for the same skill: $150-300/hr in 2026.
- Closing enterprise deals often hinges on demonstrating the production patterns from this course.
Market resilience
- AI for QA / Test Engineers is a durable skill across model and framework consolidations.
- Production guardrails (cost caps, observability, audit, evals) carry forward to whatever the 2027 stack is.
- Core patterns transfer to cloud, on-prem, and hybrid deployments.