AND · Course

AI-native software development

Lessons: 10 modules
Total: 104m full study
Quick: 7m trailer
Projects: 7 docker labs

agent-dev-shell · CLI bake-off harness for 5 coding agents

Side by side: claude-code, cursor-agent, aider, codex CLI, and cline run against the SAME bug ticket. One consolidated run report.

Image: snap/ai-native-dev:agent-dev-shell · Repo: ai-native-dev-agent-shell
$ git clone https://github.com/snap-dev/ai-native-dev-agent-shell.git
docker-compose.yml
services:
  shell:
    image: snap/ai-native-dev:agent-dev-shell
    working_dir: /work
    volumes:
      - ./repo:/work:rw
      - ./tasks:/tasks:ro
      - ./runs:/runs:rw
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?set ANTHROPIC_API_KEY}
      OPENAI_API_KEY:    ${OPENAI_API_KEY:-}
      CURSOR_API_KEY:    ${CURSOR_API_KEY:-}
      AGENT:             ${AGENT:-claude}
      TASK:              ${TASK:-SAMPLE-1287}
    command: >-
      bash -lc "./scripts/run-agent.sh $${AGENT} /tasks/$${TASK}.md /runs/$${AGENT}-$${TASK}"
  compare:
    image: snap/ai-native-dev:agent-dev-shell
    depends_on: [shell]
    volumes: ["./runs:/runs:ro"]
    command: python /opt/compare.py /runs --out /runs/report.html
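The shell service delegates to ./scripts/run-agent.sh, which ships inside the image. A minimal sketch of what that dispatcher might look like; the per-agent CLI invocations and flags below are assumptions for illustration, not the real script:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/run-agent.sh: dispatch a short agent name
# to its CLI and record output under the run directory. Flags are illustrative.
set -euo pipefail

agent="${1:-claude}"
task_file="${2:-/tasks/SAMPLE-1287.md}"
out_dir="${3:-/tmp/demo-run}"

# Map the short agent name to a CLI invocation (assumed names/flags).
agent_cmd() {
  case "$1" in
    claude) echo "claude -p" ;;
    aider)  echo "aider --yes --message-file" ;;
    cursor) echo "cursor-agent" ;;
    codex)  echo "codex exec" ;;
    cline)  echo "cline" ;;
    *)      return 1 ;;
  esac
}

mkdir -p "$out_dir"
cmd="$(agent_cmd "$agent")" || { echo "unsupported agent: $agent" >&2; exit 2; }
# The real script would feed the task file to the agent and capture
# diff.patch / cost.json / verify.log; here we just log the plan.
echo "[plan] would run: $cmd with $task_file -> $out_dir" | tee "$out_dir/run.log"
```

With no arguments it defaults to the claude/SAMPLE-1287 pairing, mirroring the compose defaults above.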
Run
~/ai-native-dev-agent-shell · zsh
$ AGENT=claude TASK=SAMPLE-1287 docker compose up --abort-on-container-exit shell
Progress is streamed as [read]/[plan]/[edit]/[verify] markers; each agent's diff, cost, and verify status are saved under /runs.
What you'll observe
Container exits 0 within 3 minutes for SAMPLE-1287
Chosen agent runs read → plan → edit → verify in order
/runs/<agent>-<task>/diff.patch contains the diff
/runs/<agent>-<task>/cost.json has tokens-in/out/usd
/runs/<agent>-<task>/verify.log shows full test output
Re-running with AGENT=cursor or AGENT=aider produces a comparable run
The compare service emits report.html summarizing cost / pass / latency per run
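A small host-side helper (not part of the image; the artifact names match the list above) to sanity-check that a finished run produced all three artifacts:

```shell
#!/usr/bin/env bash
# check_run DIR: verify a run directory holds a non-empty diff.patch,
# cost.json, and verify.log. Returns non-zero if anything is missing.
check_run() {
  local run="$1" f ok=0
  for f in diff.patch cost.json verify.log; do
    if [ -s "$run/$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f"
      ok=1
    fi
  done
  return "$ok"
}
```

Usage: `check_run runs/claude-SAMPLE-1287` before trusting a row in report.html.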
Lift this to your work

A reproducible bake-off harness is the first step every team serious about coding agents takes: it turns "which agent should we use?" into a measured question. Drop your real bug tickets into ./tasks/, point ./repo at your monorepo, and run it nightly. By week 2 you'll know which agent, model, and mode wins for each kind of ticket in YOUR codebase. Keep this image alive; it pays for itself the first time someone says "we should switch from X to Y".
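The nightly habit can be sketched as a driver loop over every agent and every ticket file. A sketch assuming the compose file on this page; it prints one compose invocation per agent × task (pipe the output to bash, or replace echo with direct execution):

```shell
#!/usr/bin/env bash
# Hypothetical nightly bake-off driver: prints the compose invocations
# rather than running them, so the plan can be previewed without Docker.
set -uo pipefail
shopt -s nullglob   # skip the loop cleanly when ./tasks/ is empty

bakeoff() {
  local agent task_file task
  for agent in claude cursor aider codex cline; do
    for task_file in tasks/*.md; do
      task="$(basename "$task_file" .md)"
      echo "AGENT=$agent TASK=$task docker compose up --abort-on-container-exit shell"
    done
  done
  # after all runs, rebuild the comparison report
  echo "docker compose run --rm compare"
}

bakeoff
```

Each run lands in its own /runs/<agent>-<task>/ directory, so the compare step sees the full agent × task matrix.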