SREMOD.SRE-07 · v1.0
Make LLMs observable.
Make incidents survivable.
7 micro-lessons · ~54 min · Real Docker images
[Monitoring console mock-up: service nodes API, LLM, DB, CACHE, QUEUE with live metrics — LATENCY p95 142 ms · COST $4.20/h · ERRORS 12/min · RPS 1,840]
SREROLE TRACK
AI for SREs / Platform
LLM observability, cost / latency monitoring, incident playbooks.
WHY THIS MATTERS · SNAP INTERNAL
Highest D-30 retention (76%) of any role track — SREs come back daily.
01 · LLM observability
02 · Cost / latency monitoring
03 · Incident playbooks
04 · Rate limit design
Wire LLM tracing into existing OTel
Bound cost + latency at the gateway
Run an LLM incident drill
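The "bound cost + latency at the gateway" exercise above can be sketched as a small, stdlib-only budget guard. This is an illustrative example only — the class name, thresholds, and API are assumptions, not the course's actual implementation:

```python
import time

class LlmBudgetGuard:
    """Hypothetical gateway-side guard: sheds LLM calls once an hourly
    cost budget is exhausted, and flags calls that blow a latency bound.
    Illustrative sketch; not taken from the lesson material."""

    def __init__(self, max_cost_per_hour: float, max_latency_s: float):
        self.max_cost_per_hour = max_cost_per_hour
        self.max_latency_s = max_latency_s
        self.window_start = time.monotonic()
        self.spent = 0.0

    def allow(self, estimated_cost: float) -> bool:
        # Simple fixed window: reset the spend counter every hour.
        now = time.monotonic()
        if now - self.window_start >= 3600:
            self.window_start = now
            self.spent = 0.0
        if self.spent + estimated_cost > self.max_cost_per_hour:
            return False  # over budget: shed or queue the request
        self.spent += estimated_cost
        return True

    def within_latency_bound(self, latency_s: float) -> bool:
        # Check an observed call latency against the bound so alerts can fire.
        return latency_s <= self.max_latency_s
```

In a real gateway the same checks would sit in middleware, with the spend window backed by shared state (e.g. Redis) rather than a single process.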
$ docker pull snap/ai-sre:lesson-01
$ docker run --rm -it snap/ai-sre:lesson-01
Incident-drill lesson is now our quarterly tabletop.
— @sre_maya
OTel-for-LLMs lesson saved a vendor renewal.
— @devops_jules
LESSONS: 7 · HOURS: ~0.9 · LEARNERS: 980 · THIS WEEK: +16%