agent-dev-shell · CLI bake-off harness for 5 coding agents
Side-by-side: claude-code, cursor-agent, aider, codex CLI, cline against the SAME bug ticket. One run report.
services:
  shell:
    image: snap/ai-native-dev:agent-dev-shell
    working_dir: /work
    volumes:
      - ./repo:/work:rw
      - ./tasks:/tasks:ro
      - ./runs:/runs:rw
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?set ANTHROPIC_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
      CURSOR_API_KEY: ${CURSOR_API_KEY:-}
      AGENT: ${AGENT:-claude}
      TASK: ${TASK:-SAMPLE-1287}
    # $$ escapes Compose interpolation, so the container's shell expands AGENT and TASK.
    command: >-
      bash -lc "./scripts/run-agent.sh $${AGENT} /tasks/$${TASK}.md /runs/$${AGENT}-$${TASK}"
  compare:
    image: snap/ai-native-dev:agent-dev-shell
    # Wait for the agent run to finish, not merely start, before comparing.
    depends_on:
      shell:
        condition: service_completed_successfully
    # Read-write: compare.py writes report.html into /runs.
    volumes: ["./runs:/runs:rw"]
    command: python /opt/compare.py /runs --out /runs/report.html
A reproducible bake-off harness turns 'which agent should we use' into a measured question, which makes it a sensible first step for any team serious about coding agents. Drop your real bug tickets into ./tasks (mounted at /tasks), put your monorepo at ./repo (mounted at /work), and run the harness nightly. After a couple of weeks of runs you should have data on which agent / model / mode wins for which kind of ticket in YOUR codebase. Keep this image around; it pays for itself the first time someone says 'we should switch from X to Y'.
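The nightly run described above could be driven by a small script that sweeps every ticket against every agent. What follows is only a sketch under stated assumptions: the agent identifiers other than `claude` (the compose default) are hypothetical, since the names accepted by `run-agent.sh` are not documented here, and it assumes each run leaves ./repo in a state the next run can start from.

```python
from pathlib import Path
from typing import List, Sequence
import subprocess

# Hypothetical agent identifiers; only "claude" is confirmed by the compose defaults.
AGENTS = ("claude", "cursor", "aider", "codex", "cline")


def build_commands(tasks_dir: str, agents: Sequence[str] = AGENTS) -> List[List[str]]:
    """Return one `docker compose run` argv per (task, agent) pair."""
    # Task IDs are the ticket file names without the .md suffix.
    task_ids = sorted(p.stem for p in Path(tasks_dir).glob("*.md"))
    commands = []
    for task in task_ids:
        for agent in agents:
            commands.append([
                "docker", "compose", "run", "--rm",
                "-e", f"AGENT={agent}",
                "-e", f"TASK={task}",
                "shell",
            ])
    return commands


if __name__ == "__main__":
    for argv in build_commands("./tasks"):
        # check=False: one failing agent should not abort the rest of the sweep.
        subprocess.run(argv, check=False)
    # Build the report once at the end; --no-deps skips starting `shell` again.
    subprocess.run(
        ["docker", "compose", "run", "--rm", "--no-deps", "compare"],
        check=False,
    )
```

Scheduling this from cron or a CI nightly job keeps the results accumulating under ./runs, one directory per agent and ticket.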