CHEATSHEET · 01 · GenAI master cheatsheet
Model selection
- Smallest model that beats your eval bar wins
- Frontier models for hard reasoning, mini models for routing
- Reasoning models (o3, claude-opus-4-7-thinking) only for math/logic; they cost 3-8× more
- Latency budget often matters more than the leaderboard
Cost discipline
- Always set max_tokens — it's the only hard cap
- Log prompt + completion tokens on every call
- Cache repeated prompt prefixes (system prompts, few-shot blocks)
- Output tokens cost 2-3× input — write prompts that ask for terse answers
- Route easy queries to mini, hard ones to frontier — a 60-80% spend cut
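The routing bullet above can be sketched as a function. A minimal sketch: the model ids and the length/keyword difficulty heuristic are illustrative assumptions, not a recommendation — real routers usually use a cheap classifier or a mini model as the judge.

```python
MINI = "mini-model"          # placeholder id for a cheap model
FRONTIER = "frontier-model"  # placeholder id for a frontier model

def route(query: str) -> str:
    """Crude difficulty heuristic: long queries or reasoning keywords
    go to the frontier model; everything else goes to mini."""
    hard_markers = ("prove", "derive", "step by step", "why")
    if len(query.split()) > 60 or any(m in query.lower() for m in hard_markers):
        return FRONTIER
    return MINI
```

Even a heuristic this crude captures the shape of the 60-80% savings: most traffic is easy and never touches the expensive model.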
Prompt engineering
- Role + task + format + examples + constraints — the 5-block template
- Temperature 0 for classification / extraction / agents
- Temperature 0.7-1.0 only for creative output
- Few-shot (3-5 examples) usually beats zero-shot by 20-30 accuracy points
- Order examples easy → hard; the model anchors on the last one
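The 5-block template assembles mechanically. A sketch of one way to lay it out — the block order and labels are assumptions, not a canonical format:

```python
def build_prompt(role, task, fmt, examples, constraints):
    """Assemble the 5-block prompt: role, task, format, examples, constraints.
    `examples` is a list of (input, output) pairs, ordered easy -> hard."""
    example_lines = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return (
        f"{role}\n\n"
        f"Task: {task}\n"
        f"Format: {fmt}\n\n"
        f"Examples:\n{example_lines}\n\n"
        f"Constraints: {constraints}"
    )
```

Keeping the template in code (rather than hand-edited strings) makes prompt changes diffable, which the eval-gating bullet below depends on.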
Structured output
- Use JSON-mode or response_format=json_schema — never regex parsing
- Pydantic + Instructor for retry-on-validation-failure
- Schema first — write the shape before the prompt
- On failure: log the bad output, don't silently 500
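The retry-on-validation-failure pattern that Instructor wraps boils down to a short loop. A stdlib-only sketch (real code would use a Pydantic model instead of the hand-rolled `validate`; `generate` stands in for any LLM call that accepts error feedback):

```python
import json

REQUIRED = {"title": str, "priority": int}  # toy schema for illustration

def validate(obj):
    """Minimal stand-in for a Pydantic model: check keys and types."""
    for key, typ in REQUIRED.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"field {key!r} must be {typ.__name__}")
    return obj

def parse_with_retry(generate, max_attempts=3):
    """Call generate(feedback), validate its JSON; on failure, feed the
    error message back into the next attempt instead of silently failing."""
    feedback = None
    for _ in range(max_attempts):
        raw = generate(feedback)
        try:
            return validate(json.loads(raw))
        except (ValueError, json.JSONDecodeError) as err:
            feedback = str(err)  # log this too — never swallow bad output
    raise RuntimeError("schema validation failed after retries")
```

The key design point: the validation error goes back into the prompt, so the model gets a concrete reason its last answer was rejected.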
Production patterns
- Stream for >500-token outputs
- Retry with backoff on 429/503/529 — exponential, jittered
- Time-out at 30s — past that, something is wrong
- Eval-gate every prompt change in CI (Promptfoo / DeepEval)
- Trace every call — provider id, prompt hash, completion, tokens, $
- Fallback to a smaller / local model on overload
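The jittered exponential backoff bullet, as a loop. A sketch under stated assumptions: `call` is any zero-arg wrapper returning `(status, body)` — a real SDK client would raise typed exceptions instead, and you'd catch those rather than inspect status codes.

```python
import random
import time

RETRYABLE = {429, 503, 529}

def call_with_backoff(call, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry on retryable status codes with full-jitter exponential backoff."""
    for attempt in range(max_retries):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        # full jitter: sleep a random fraction of the capped exponential delay
        sleep(min(cap, base * 2 ** attempt) * random.random())
    raise RuntimeError("exhausted retries")
```

Full jitter (random fraction of the delay, not delay plus a small wobble) is what actually de-synchronizes a thundering herd of retrying clients.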
Safety & guardrails
- Treat user input as hostile — prompt injection is real
- Llama Guard / NeMo Guardrails on input AND output
- Strip PII before logs
- Never let an LLM emit raw SQL — parameterize
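What "parameterize" means concretely: any value that passed through a model (or a user) is bound as a parameter, never interpolated into the SQL string. A minimal sqlite3 sketch (table and column names are illustrative):

```python
import sqlite3

def find_user(conn, raw_name):
    """Bind untrusted (e.g. LLM-produced) values as parameters."""
    # Injectable anti-pattern: f"SELECT ... WHERE name = '{raw_name}'"
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (raw_name,))
    return cur.fetchall()
```

With the placeholder, a classic injection payload is just a weird name that matches nothing, instead of a clause that matches everything.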
When NOT to generate
- Use regex / SQL / classifier when behavior is deterministic
- Embeddings for stable-class classification — 100× cheaper
- Classical ML for tabular + numeric
- Safety-critical paths shouldn't be generative
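The embeddings-instead-of-generation bullet usually means nearest-centroid classification: embed each class's examples once, average them, then classify new inputs by cosine similarity — one cheap embedding call per query instead of a generation. A stdlib sketch with toy 2-d vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(vec, class_centroids):
    """Return the class whose centroid is most similar to `vec`."""
    return max(class_centroids, key=lambda c: cosine(vec, class_centroids[c]))
```

Because the class set is stable, the centroids are computed once offline; the per-query cost is one embedding plus a handful of dot products.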
CHEATSHEET · 02 · Decision flowcharts you'll memorize
Pick a model in 4 questions
- 1. Latency budget < 1s? → mini / local SLM
- 2. Output > 500 tokens? → stream (frontier or mini)
- 3. Math / logic / multi-step? → reasoning model (only if you can pay)
- 4. Tabular / classification? → don't use an LLM, train a classifier
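The 4-question flowchart, transcribed directly as code so it can live next to the router. The return labels are placeholders, not model ids:

```python
def pick_model(latency_s, out_tokens, needs_reasoning, is_tabular):
    """The 4-question model-picking flowchart, in question order."""
    if latency_s < 1:
        return "mini / local SLM"
    if out_tokens > 500:
        return "frontier or mini, streamed"
    if needs_reasoning:
        return "reasoning model"
    if is_tabular:
        return "classifier, not an LLM"
    return "default mini"
```

Encoding the flowchart as a function also makes the decision testable: each branch gets an assertion instead of living in someone's head.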
Pick a pattern in 3 questions
- 1. Static knowledge? → fine-tune (rare; reach for it last)
- 2. Fresh / private docs? → RAG
- 3. Same task, varying inputs? → prompt + few-shot
Diagnose a misbehaving LLM feature
- Eval scores drifting? → prompt regressed; bisect recent commits
- Rising p99? → bigger prompt; trim the system block or cache the prefix
- Cost spike? → check completion_tokens; set max_tokens
- Hallucinated facts? → no retrieval; add RAG or a guardrail