INTROBLOCK · 01
SRE · 7 MIN PREVIEW
AI for SREs / Platform
LLM observability. Cost / latency monitoring at the gateway. Incident playbooks. Rate limits that bite without breaking UX.
CONCEPTBLOCK · 02
LLMs need new SLOs, not new SLO frameworks
Your SLO toolkit (error budgets, percentile latency, availability targets) still applies; what's new is the set of dimensions to observe: token throughput, cost per request, prompt vs. completion latency split, retries per call, and (sampled) hallucination rate. The trick is wiring these into your existing OTel + Prometheus + Grafana stack so on-call doesn't need a new tab. The OpenLLMetry semantic conventions give you the attribute names; the gateway is the natural enforcement point for cost and rate limits.
TIP: Set both a cost SLO ($/1k requests) and a latency SLO (p95 TTFT, time to first token). They drift in opposite directions when something is wrong; see the rules sketch below.
WATCH OUT: Don't trust per-route token counts from app code. The gateway is the single source of truth; apps can lie or forget.
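A minimal Prometheus rules sketch for those two SLOs, assuming the gateway exports a llm_cost_usd_total counter, a llm_requests_total counter, and a llm_ttft_seconds histogram (hypothetical metric names; substitute whatever your gateway actually emits, and treat the thresholds as placeholders):
YAML
groups:
  - name: llm-slos
    rules:
      # Cost SLO: dollars per 1k requests over the last hour, per model.
      - record: llm:cost_per_1k_requests:1h
        expr: |
          1000 * sum by (model) (rate(llm_cost_usd_total[1h]))
            / sum by (model) (rate(llm_requests_total[1h]))
      - alert: LLMCostSLOBreach
        expr: llm:cost_per_1k_requests:1h > 0.50
        for: 15m
        labels: {severity: page}
        annotations:
          summary: "Cost per 1k requests above $0.50 for {{ $labels.model }}"
      # Latency SLO: p95 time-to-first-token, per model.
      - alert: LLMTTFTSLOBreach
        expr: |
          histogram_quantile(0.95,
            sum by (model, le) (rate(llm_ttft_seconds_bucket[5m]))) > 2
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "p95 TTFT above 2s for {{ $labels.model }}"
Putting the recorded cost series next to the TTFT quantile on one Grafana panel is what makes the opposite-drift pattern from the tip visible at a glance.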
DIAGRAMBLOCK · 03
LLM gateway: observe, throttle, fail over
One gateway. All policy. Nothing the app needs to know.
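There is no single standard schema for gateway policy; the sketch below uses invented key names purely to show the three responsibilities living in one config, outside the app:
YAML
# Hypothetical gateway policy. Every key name here is illustrative,
# not a real product's schema.
observe:
  emit_otlp_spans: true            # one llm span per upstream call
  count_tokens_at_gateway: true    # gateway-counted, never app-reported
throttle:
  rate_limits:
    - match: {tenant: "*", model: "gpt-4o-mini"}
      tokens_per_minute: 200000
      requests_per_minute: 600
    - match: {tenant: "batch-jobs", model: "*"}
      tokens_per_minute: 50000     # cap the noisy internal tenant harder
failover:
  providers:
    - {name: openai, priority: 1}
    - {name: azure-openai, priority: 2}  # same models, separate quota pool
  trigger:
    on_status: [429, 500, 503]
    on_timeout_ms: 10000
The app sends one request to one endpoint; observation, throttling, and failover all happen behind it.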
CODEBLOCK · 04
OpenLLMetry attribute names you should standardise
YAML
# Standard attributes on every llm span
llm.request.model: "gpt-4o-mini"
llm.request.temperature: 0.0
llm.usage.prompt_tokens: 1402
llm.usage.completion_tokens: 318
llm.usage.total_tokens: 1720
llm.response.finish_reason: "stop"
llm.cost_usd: 0.000387
gen_ai.system: "openai"
gen_ai.operation.name: "chat"
These map cleanly to Prometheus histograms (latency, tokens) and counters (cost). Use the gen_ai.* names for vendor-neutral dashboards.
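One concrete way to do that mapping, sketched with the OpenTelemetry Collector's spanmetrics connector (bucket boundaries and dimension choices are assumptions to tune for your traffic):
YAML
# Collector config: turn llm spans into latency/call-count metrics.
receivers:
  otlp:
    protocols:
      grpc: {}
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100ms, 500ms, 1s, 2s, 5s, 15s, 60s]
    dimensions:
      - name: gen_ai.system          # vendor-neutral dashboard key
      - name: gen_ai.operation.name
      - name: llm.request.model
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]       # connector bridges the two pipelines
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
Note that spanmetrics only produces duration histograms and call counts; token and cost counters still have to be emitted directly, which is another argument for the gateway owning them.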
CHEATSHEETBLOCK · 05
Five things to remember
01 The gateway is the single source of truth for tokens + cost.
02 Rate-limit by tenant + by model. Per-IP is meaningless for service traffic.
03 Multi-provider failover is an SRE concern, not an app concern.
04 Sample hallucinations with an LLM-as-judge in production.
05 An LLM incident drill belongs in your quarterly tabletop rotation (see the drill card after this list).
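For point 05, a drill card can be as small as this (the format and scenario are invented for illustration):
YAML
# Hypothetical tabletop drill card.
drill: llm-provider-brownout
scenario: >
  Primary provider returns 429 on 40% of requests starting at 14:00.
  Failover engages; p95 TTFT doubles; cost per 1k requests jumps to
  the secondary provider's pricing.
inject:
  - "t+0m: gateway error-rate alert fires"
  - "t+5m: cost-per-1k recording rule crosses its threshold"
expect:
  - on-call confirms failover from gateway dashboards, not app logs
  - tenant rate limits are tightened before the error budget is gone
  - status-page entry posted within 15 minutes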
MINIGAME · RAPIDFIRETFBLOCK · 06
True or false: 6 seconds each
OTel can capture LLM token counts as span attributes.
LESSON COMPLETEBLOCK · 07
LLM-platform mental model: locked.
NEXT: LLM observability with OTel + Grafana
WHAT YOU'LL WALK AWAY WITH
Real skills, real career delta.
Skills you'll gain
- Wire LLM tracing into existing OTel
- Bound cost + latency at the gateway
- Run an LLM incident drill
- LLM observability
- Cost / latency monitoring
- Incident playbooks
- Rate limit design
All covered in the lesson sequence, drop-in ready.
Career & income delta
Career moves
- Lead an AI for SREs / Platform initiative on your team; most orgs have it on the roadmap and few have shipped it.
- Consulting work at $150-300/hr; SREs who have shipped LLM platforms to production are a sought-after specialty in 2026.
- Move from generic IC to a platform/AI-platform team, where AI for SREs / Platform expertise is the entry ticket.
Income impact
- $15-40K bump for senior ICs adding AI for SREs / Platform to their resume.
- Freelance / consulting demand for the same skill: $150-300/hr in 2026.
- Closing enterprise deals often hinges on demonstrating the production patterns from this course.
Market resilience
- AI for SREs / Platform is a durable skill across model and framework consolidations.
- Production guardrails (cost caps, observability, audit, evals) carry forward to whatever the 2027 stack is.
- Core patterns transfer to cloud, on-prem, and hybrid deployments.