SREMOD.SRE-07 · v1.0
Make LLMs
observable.
Make incidents survivable.
7 micro-lessons · ~54 min · Real Docker images
MONITORING CONSOLE · LIVE
[Demo console: API / LLM / DB / CACHE / QUEUE services · LATENCY p95 142 ms · COST $4.20/h · ERRORS 12/min · RPS 1,840]
SRE ROLE TRACK
AI for SREs / Platform
LLM observability, cost / latency monitoring, incident playbooks.
WHY THIS MATTERS · SNAP INTERNAL
Highest D-30 retention (76%) of any role track — SREs come back daily.
01 · LLM observability
02 · Cost / latency monitoring
03 · Incident playbooks
04 · Rate limit design
Wire LLM tracing into existing OTel
Bound cost + latency at the gateway
Run an LLM incident drill
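The first outcome above (wire LLM tracing into existing OTel) boils down to one pattern: wrap every LLM call in a span that records latency and token counts. Below is a stdlib-only sketch of that pattern; in the lessons the spans would be emitted through the OpenTelemetry SDK, and `llm_span`, `SpanRecord`, and `fake_llm_call` are illustrative names, not course code.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class SpanRecord:
    """One recorded LLM call: name, wall-clock latency, key/value attributes."""
    name: str
    latency_ms: float = 0.0
    attributes: dict = field(default_factory=dict)

RECORDS: list[SpanRecord] = []  # stand-in for an OTel span exporter

@contextmanager
def llm_span(name, model):
    """Record latency and token attributes for a single LLM call."""
    rec = SpanRecord(name, attributes={"llm.model": model})
    start = time.perf_counter()
    try:
        yield rec
    finally:
        rec.latency_ms = (time.perf_counter() - start) * 1000
        RECORDS.append(rec)

def fake_llm_call(prompt):
    # placeholder for a real client call; returns a response with usage counts
    return {"text": "ok",
            "prompt_tokens": len(prompt.split()),
            "completion_tokens": 1}

with llm_span("chat.completion", model="demo-model") as span:
    resp = fake_llm_call("summarize this incident report")
    span.attributes["llm.prompt_tokens"] = resp["prompt_tokens"]
    span.attributes["llm.completion_tokens"] = resp["completion_tokens"]
```

Because the span is a context manager, latency is captured even when the call raises, which is exactly when you most need the trace.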
SKILLS YOU'LL GAIN
Real skills, real career delta.
- Wire LLM tracing into existing OTel (course outcome)
- Bound cost + latency at the gateway (course outcome)
- Run an LLM incident drill (course outcome)
- LLM observability (covered in lesson sequence, drop-in ready)
- Cost / latency monitoring (covered in lesson sequence, drop-in ready)
- Incident playbooks (covered in lesson sequence, drop-in ready)
- Rate limit design (covered in lesson sequence, drop-in ready)
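The two gateway skills above (bound cost + latency, rate limit design) can be sketched as a single admission guard: a token bucket caps request rate, and a rolling budget caps $/h spend. This is an illustrative pattern under assumed numbers, not the course's gateway implementation; `GatewayGuard` and its parameters are hypothetical.

```python
import time

class GatewayGuard:
    """Admit a request only if it fits both the rate limit and the cost budget.

    Token-bucket rate limiting (refilled continuously at `rps`) combined
    with an hourly cost window capped at `cost_per_hour`.
    """

    def __init__(self, rps, cost_per_hour, now=time.monotonic):
        self.now = now                    # injectable clock for testing
        self.rate = float(rps)
        self.tokens = float(rps)          # bucket starts full
        self.last = now()
        self.budget = cost_per_hour
        self.spent = 0.0
        self.window_start = now()

    def allow(self, est_cost):
        t = self.now()
        # refill the bucket for the time elapsed since the last check
        self.tokens = min(self.rate, self.tokens + (t - self.last) * self.rate)
        self.last = t
        # reset the cost window every hour
        if t - self.window_start >= 3600:
            self.spent, self.window_start = 0.0, t
        # reject if either bound would be exceeded
        if self.tokens < 1 or self.spent + est_cost > self.budget:
            return False
        self.tokens -= 1
        self.spent += est_cost
        return True

# usage with a simulated clock so the sketch is deterministic
t = [0.0]
guard = GatewayGuard(rps=2, cost_per_hour=1.0, now=lambda: t[0])
allowed = [guard.allow(0.4) for _ in range(3)]  # third call exhausts the bucket
```

Rejecting at the gateway keeps a runaway prompt loop from becoming both a latency incident and a billing incident; the same `allow()` check covers both.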
$ docker pull snap/ai-sre:lesson-01
$ docker run --rm -it snap/ai-sre:lesson-01
Incident-drill lesson is now our quarterly tabletop.
@sre_maya · VERIFY ON GITHUB
OTel-for-LLMs lesson saved a vendor renewal.
@devops_jules · VERIFY ON GITHUB
LESSONS 7
HOURS ~0.9
LEARNERS 980
THIS WEEK +16%