SRE · MOD.SRE-07 · v1.0

Make LLMs observable.
Make incidents survivable.

7 micro-lessons · ~54 min · Real Docker images

MONITORING CONSOLE · CONSOLE.A · LIVE · t+0s
API · LLM · DB · CACHE · QUEUE
LATENCY p95 142 ms · COST $4.20/h · ERRORS 12/min · RPS 1,840
SRE · ROLE TRACK

AI for SREs / Platform

LLM observability, cost / latency monitoring, incident playbooks.

WHY THIS MATTERS · SNAP INTERNAL
Highest D-30 retention (76%) of any role track — SREs come back daily.
WHAT YOU'LL LEARN
01 LLM observability
02 Cost / latency monitoring
03 Incident playbooks
04 Rate limit design
YOU'LL BE ABLE TO
Wire LLM tracing into existing OTel
Bound cost + latency at the gateway
Run an LLM incident drill
SKILLS YOU'LL GAIN

Real skills, real career delta.

07
  • Wire LLM tracing into existing OTel · Working

    Outcome from completing the course: wire LLM tracing into existing OTel.

  • Bound cost + latency at the gateway · Working

    Outcome from completing the course: bound cost + latency at the gateway.

  • Run an LLM incident drill · Working

    Outcome from completing the course: run an LLM incident drill.

  • LLM observability · Working

    Covered in the lesson sequence; drop-in ready.

  • Cost / latency monitoring · Working

    Covered in the lesson sequence; drop-in ready.

  • Incident playbooks · Working

    Covered in the lesson sequence; drop-in ready.

  • Rate limit design · Working

    Covered in the lesson sequence; drop-in ready.
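
To give a flavor of the rate-limit-design and cost-bounding skills listed above, here is a minimal sketch of a token bucket that a gateway could consult before forwarding an LLM request. It is not taken from the course material: the `TokenBucket` class, the `demo` helper, and all numbers are hypothetical, and the "units" could be requests, LLM tokens, or cents, depending on what you want to bound.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills `rate` units per second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start full: allows an initial burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available; return False to signal rejection."""
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


def demo() -> list[bool]:
    """Replay four requests against a fake clock (hypothetical numbers)."""
    t = [0.0]
    bucket = TokenBucket(rate=10, capacity=100, clock=lambda: t[0])
    results = [bucket.allow(60), bucket.allow(60)]  # second exceeds remaining 40
    t[0] = 2.0                                      # 2 s later: +20 units refilled
    results += [bucket.allow(60), bucket.allow(1)]
    return results
```

In this sketch `capacity` sets the largest burst the gateway tolerates and `rate` sets the sustained spend; a rejected call would typically map to an HTTP 429 or a queueing decision, which is left out here.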

RUNNABLE ON YOUR MACHINE
$ docker pull snap/ai-sre:lesson-01
$ docker run --rm -it snap/ai-sre:lesson-01
QUICK PREVIEW · 7 MIN
VERIFIED ENGINEER REVIEWS
Incident-drill lesson is now our quarterly tabletop.
@sre_maya · VERIFY ON GITHUB
OTel-for-LLMs lesson saved a vendor renewal.
@devops_jules · VERIFY ON GITHUB
LESSONS 7
HOURS ~0.9
LEARNERS 980
THIS WEEK +16%