Skills you'll gain
- Model selection & routing (Production)
Pick frontier vs mini vs reasoning vs local models against latency, cost, and quality budgets — and route requests between them in one service.
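A routing policy like this can be sketched as a budget filter over model profiles; the model names, prices, and latencies below are illustrative assumptions, not measured figures:

```python
# Minimal routing sketch: pick the cheapest model that satisfies the
# quality, latency, and cost budgets. All numbers are assumed for illustration.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    usd_per_1k_tokens: float   # blended price (assumed)
    p50_latency_ms: int        # typical latency (assumed)
    quality_tier: int          # 1 = mini/local, 2 = frontier, 3 = reasoning

MODELS = [
    ModelProfile("local-small", 0.0, 120, 1),
    ModelProfile("hosted-mini", 0.15, 400, 1),
    ModelProfile("hosted-frontier", 2.50, 900, 2),
    ModelProfile("hosted-reasoning", 10.00, 4000, 3),
]

def route(task_complexity: int, latency_budget_ms: int,
          cost_budget_usd_per_1k: float) -> ModelProfile:
    """Return the cheapest model meeting all three budgets."""
    candidates = [
        m for m in MODELS
        if m.quality_tier >= task_complexity
        and m.p50_latency_ms <= latency_budget_ms
        and m.usd_per_1k_tokens <= cost_budget_usd_per_1k
    ]
    if not candidates:
        raise ValueError("no model satisfies the budgets; relax one constraint")
    return min(candidates, key=lambda m: (m.usd_per_1k_tokens, m.p50_latency_ms))
```

Because the policy is plain data plus a filter, it can live in one service and be unit-tested like any other code path.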
- Cost-bounded LLM features (Production)
Ship features with hard token budgets, max_tokens caps, and per-request $ tracking — defensible in a finance review.
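The budget guard pattern can be sketched as a wrapper that refuses calls whose worst-case cost exceeds the budget, then records actual spend; the prices, the 4-chars-per-token heuristic, and the `complete` client are assumptions for illustration:

```python
# Per-request cost guard sketch; prices and token heuristic are assumed.
PRICE_PER_1K_INPUT = 0.50   # USD, illustrative
PRICE_PER_1K_OUTPUT = 1.50  # USD, illustrative

class BudgetExceeded(RuntimeError):
    pass

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def guarded_call(prompt: str, max_tokens: int, budget_usd: float,
                 complete) -> tuple[str, float]:
    """Refuse calls whose worst-case cost exceeds the budget; return (reply, actual cost)."""
    worst_case = (estimate_tokens(prompt) * PRICE_PER_1K_INPUT
                  + max_tokens * PRICE_PER_1K_OUTPUT) / 1000
    if worst_case > budget_usd:
        raise BudgetExceeded(f"worst case ${worst_case:.4f} > budget ${budget_usd:.4f}")
    reply = complete(prompt, max_tokens=max_tokens)  # hypothetical LLM client
    actual = (estimate_tokens(prompt) * PRICE_PER_1K_INPUT
              + estimate_tokens(reply) * PRICE_PER_1K_OUTPUT) / 1000
    return reply, actual
```

Logging the returned `actual` figure per request is what makes the per-request $ tracking auditable.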
- Prompt engineering (Working)
Author and maintain prompts that survive 3+ revisions: zero-shot, few-shot, CoT, structured-output, role design, anti-drift patterns.
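One way prompts survive repeated revision is to keep them as versioned data rather than inline strings; the classifier task, labels, and examples below are illustrative, not from any particular course exercise:

```python
# Sketch of a versioned few-shot prompt builder. Keeping examples as data
# makes revisions diffable and testable; all contents are illustrative.
FEW_SHOT_EXAMPLES = [
    ("I love this phone", "positive"),
    ("Battery died in a day", "negative"),
]

def build_prompt(text: str, version: str = "v3") -> str:
    """Assemble role instruction + few-shot examples + the new input."""
    shots = "\n".join(f"Review: {r}\nLabel: {l}" for r, l in FEW_SHOT_EXAMPLES)
    return (f"# sentiment-classifier {version}\n"
            "You label product reviews as positive or negative.\n"
            f"{shots}\nReview: {text}\nLabel:")
```

The version marker in the header is what lets an eval suite pin results to a specific prompt revision.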
- Structured output (Production)
Build JSON-mode + Pydantic + Instructor services that validate on every turn and retry on schema failure.
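The validate-and-retry loop behind Instructor-style structured output can be sketched without the libraries; here `validate_invoice` stands in for a Pydantic model and `complete` is a hypothetical LLM call:

```python
# Sketch of schema-validated extraction with retry-on-failure.
# `validate_invoice` stands in for a Pydantic model; `complete` is hypothetical.
import json

def validate_invoice(data: dict) -> dict:
    if not isinstance(data.get("invoice_id"), str):
        raise ValueError("invoice_id must be a string")
    if not isinstance(data.get("total"), (int, float)):
        raise ValueError("total must be a number")
    return data

def structured_call(prompt: str, complete, max_retries: int = 2) -> dict:
    """Parse the model's JSON reply; on failure, retry with the error fed back."""
    last_error = ""
    for _ in range(max_retries + 1):
        raw = complete(prompt + last_error)
        try:
            return validate_invoice(json.loads(raw))
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = f"\nYour last reply was invalid ({exc}); return valid JSON only."
    raise RuntimeError("schema validation failed after retries")
```

Feeding the validation error back into the retry prompt is the same loop Instructor automates with `response_model`.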
- Function calling & tool use (Working)
Wire single and parallel tool-use, design idempotent tool contracts, and decide when an agent is the wrong answer.
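A tool contract can be sketched as a registry plus a dispatcher over model-requested calls; the tool name, return shape, and call format below are illustrative assumptions, not a specific provider's API:

```python
# Sketch of a tool registry with an idempotent tool contract.
# The dispatch format is illustrative, not a specific provider's schema.
import json

TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city: str) -> dict:
    # Idempotent contract: same input, same output, no side effects.
    return {"city": city, "temp_c": 21}

def dispatch(tool_calls: list[dict]) -> list[dict]:
    """Execute the model's requested tool calls and collect results."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["name"]]
        results.append({"name": call["name"],
                        "result": fn(**json.loads(call["arguments"]))})
    return results
```

Keeping tools idempotent is what makes parallel tool-use and retries safe: re-running a call cannot corrupt state.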
- Eval-driven LLM development (Production)
Write Promptfoo / DeepEval suites and gate releases on regression — turn prompts into testable, versioned artifacts.
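The release gate can be sketched in the spirit of Promptfoo/DeepEval: run a prompt variant against a fixed case set and block the deploy if the score drops below the baseline; the cases and the contains-substring scorer are illustrative:

```python
# Sketch of an eval-gated release: score a prompt against fixed cases
# and refuse any regression below the baseline. Scorer is illustrative.
def run_suite(predict, cases: list[dict]) -> float:
    """Fraction of cases whose prediction contains the expected substring."""
    passed = sum(1 for c in cases if c["expect"] in predict(c["input"]))
    return passed / len(cases)

def gate_release(predict, cases: list[dict], baseline: float) -> bool:
    """True only if the new variant scores at or above the recorded baseline."""
    return run_suite(predict, cases) >= baseline
```

Checking the case file and baseline score into version control is what turns the prompt into a testable, versioned artifact.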
- Streaming & latency engineering (Working)
Implement chunked SSE, partial-JSON streaming, and cut perceived chat latency from 3s+ to under 500ms.
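The metric that matters for perceived latency is time-to-first-token; a minimal sketch of measuring it over a chunked stream (the generator stands in for an SSE response):

```python
# Sketch: consume a chunked stream and measure time-to-first-token,
# the number that "perceived latency" work optimizes.
import time

def first_token_latency(stream):
    """Return (seconds until first chunk, full concatenated text)."""
    start = time.monotonic()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.monotonic() - start  # perceived latency ends here
        parts.append(chunk)
    return first, "".join(parts)
```

A UI that renders each chunk as it arrives makes the first number the one users feel, regardless of total generation time.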
- Caching (prompt + semantic) (Production)
Design cacheable prefixes, set up prompt caching and Redis-backed semantic cache — verified 40-70% spend reduction.
- LLM observability (Production)
Stand up LiteLLM + Prometheus + Grafana + Loki to trace every call with prompt hash, tokens, cost, and provider id.
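The trace record a gateway like LiteLLM exports can be sketched as a wrapper; the field names, the 4-chars-per-token estimate, and the `complete` client are illustrative assumptions:

```python
# Sketch of a tracing wrapper recording the fields an LLM gateway would
# export to Prometheus/Loki. Field names and token estimate are assumed.
import hashlib
import time

TRACES: list[dict] = []

def traced_call(prompt: str, provider: str, complete,
                usd_per_1k_tokens: float) -> str:
    start = time.monotonic()
    reply = complete(prompt)  # hypothetical LLM client
    tokens = (len(prompt) + len(reply)) // 4  # crude token estimate
    TRACES.append({
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "provider": provider,
        "tokens": tokens,
        "cost_usd": tokens * usd_per_1k_tokens / 1000,
        "latency_ms": (time.monotonic() - start) * 1000,
    })
    return reply
```

Hashing the prompt instead of logging it verbatim keeps traces joinable across calls without leaking user content into the log store.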
- Safety & prompt-injection defence (Working)
Apply NeMo Guardrails / Llama Guard, run a red-team drill on your own service, and ship a hardened input/output filter chain.
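The filter-chain shape can be sketched with plain heuristics; the patterns below are illustrative and no substitute for NeMo Guardrails or Llama Guard, but they show the input-check / output-redact layering:

```python
# Sketch of a layered input/output filter chain. The patterns are
# illustrative heuristics, not a replacement for a real guardrail model.
import re

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal your system prompt",
]
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")  # example credential shape

def check_input(user_text: str) -> str:
    """Block inputs matching known injection phrasings before they reach the model."""
    for pat in INJECTION_PATTERNS:
        if re.search(pat, user_text, re.IGNORECASE):
            raise ValueError("possible prompt injection blocked")
    return user_text

def check_output(model_text: str) -> str:
    """Redact anything in the reply that looks like a leaked credential."""
    return SECRET_PATTERN.sub("[REDACTED]", model_text)
```

A red-team drill then amounts to throwing known attack strings at `check_input` and leaked-secret shapes at `check_output` and counting what slips through.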
- Local-first deployment (Working)
Run Ollama / vLLM with small models (Phi-4, Llama-3.2) and route to hosted APIs only on overflow — works offline, beats compliance reviews.
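Overflow routing can be sketched as a capacity check in front of the local model; the `local` and `hosted` clients and the in-flight limit are hypothetical stand-ins for an Ollama/vLLM backend and a hosted API:

```python
# Sketch of local-first routing: serve from the local model while it has
# capacity, overflow to a hosted API otherwise. Clients are hypothetical.
class LocalFirstRouter:
    def __init__(self, local, hosted, max_local_inflight: int = 4):
        self.local, self.hosted = local, hosted
        self.max_local_inflight = max_local_inflight
        self.inflight = 0

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (backend used, reply); hosted only when local is saturated."""
        if self.inflight < self.max_local_inflight:
            self.inflight += 1
            try:
                return "local", self.local(prompt)
            finally:
                self.inflight -= 1
        return "hosted", self.hosted(prompt)
```

With the hosted path behind a capacity gate, the default data flow never leaves the machine, which is the property compliance reviewers care about.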
- When NOT to generate (Production)
Replace LLM calls with regex / SQL / classifiers / embeddings where deterministic — shown to cut spend 30%+ on real audits.