LLMOps Curriculum
Agentic architecture, MCP tool design, Claude Code workflows, context management. ~70% of exam surface overlaps existing daily practice.
Foundational GenAI-on-AWS vocabulary. Pre-SAA warmup that fills the Bedrock/SageMaker gap SAA skims over.
Locked legibility cert. Standalone study (Cantrill + Tutorials Dojo); curriculum Phase 7 deploys provide hands-on.
You don't start here at zero. Name what you've built in LLMOps terms:
- reasonable-ux runs Opus as an "advisor" over Sonnet/Haiku "executors." That's model routing + tiered inference — literally chapter one of any LLMOps book.
- reasons-qagent pairs Playwright with a Claude vision model for scoring. That's grounded evals — the "measure quality with a judge model" pattern.
- An `agent-patterns/` reference catalog is LLMOps knowledge management.
- Claude Code skills (`/build-prompt`, `/session-review`, `/checkpoint`) are prompt engineering + workflow discipline with feedback loops.
- A post-commit pipeline (`sync.py` → validators) is observability thinking — pointed at a vault instead of an LLM.
What's missing is vocabulary, telemetry, and a gateway — not foundational skill.
- Reframe existing work in LLMOps vocabulary 20 min
- Where: today's daily note, short section titled "LLMOps vocabulary map"
- Output: 5-row table mapping each existing artifact to its LLMOps term
- Acceptance: you can say out loud "I already do X — it's called Y" for each row without hesitation
- Flag the one gap that feels embarrassing and write why 10 min
- Where: same daily note, one sentence
- Output: e.g., "I've been vibes-evaluating reasonable-ux output for months — no golden set, no deterministic assertions"
- Acceptance: gap is specific enough to become a Phase 3 task, not vague
Everything in LLMOps fits in one of four layers. If you can name the layer, you can shop for tools.
| Layer | What it does | First tool |
|---|---|---|
| Gateway | Abstracts "which model/provider" away from your code. Adds fallbacks + cost tracking. | LiteLLM |
| Observability | Records every call: prompt, response, cost, latency, who/when. | Langfuse |
| Eval | Scores output quality against a known-good set. Deterministic first, LLM-judge second. | Ragas (RAG) or DeepEval (general) |
| Structured outputs | Forces the LLM to return a typed object, not free-form text. | Instructor |
If a tool claims to "do LLMOps" and you can't place it in one of these four boxes, it's either an orchestration framework (higher layer) or hype.
- Read Applied LLMs sections 1 + 3 45 min
- Output: highlight 3 lines that surprise you; paste them into the daily note
- Acceptance: you could defend each highlight in one sentence
- Skim ZenML's 457 case study roundup 20 min
- Goal: confirm the "LangChain-as-never-extracted-scaffolding" pathology is real across industries
- Acceptance: you can name 2 other recurring pathologies from the writeup
- Build the 4-layer tool-slotting table 30 min
- Where: `40-reference/llmops-tools-map.md` (new file — reference material, not a daily note)
- Output: 10+ tools slotted into Gateway / Observability / Eval / Structured Outputs / Orchestration. Each row: tool name, layer, one-line purpose, install-now vs defer vs skip.
- Acceptance: given a new tool name you haven't seen, you can slot it in < 30 seconds
Hands-on. Pick one project per session — don't try to retrofit both the same afternoon.
Recommended order: reasonable-ux first (most active, richest telemetry will follow). Retrofit reasons-qagent the following weekend (~1 hr — Langfuse already up, mostly copy-paste config).
- Wrap existing Claude calls through LiteLLM. Add Langfuse instrumentation. Wrap scoring outputs in Instructor.
- Save a session prompt in the project's CLAUDE.md: "this repo uses LiteLLM + Langfuse + Instructor — routing lives in `config/models.py`."
- Checkpoint: you can now see every call's cost + latency in a dashboard. You went from "it works" to "I can measure it."
- Stand up self-hosted Langfuse 30 min
- Where: run the docker-compose from the langfuse repo on the heavy-AI homelab host — not on the Mac
- Output: Langfuse UI reachable on LAN; project created; public/secret keys copied into a `.env` you git-ignore
- Acceptance: you can open the Langfuse dashboard and see "0 traces"
- Install the three deps in reasonable-ux venv 15 min
- Command: `uv add litellm langfuse instructor` (or pip equivalent)
- Acceptance: `python -c "import litellm, langfuse, instructor"` runs clean
- Find and wrap Claude-call entry points through LiteLLM 1 hr
- How: grep for `anthropic.`, `Anthropic(`, `client.messages.create` to find call sites
- Replacement: `litellm.completion(model="claude-sonnet-4-6", messages=...)` — LiteLLM speaks the OpenAI schema, so response parsing will need touching (sketch below)
- Centralize: routing config lives in `config/models.py` — tiers declared there, not sprinkled at call sites
- Acceptance: one full reasonable-ux audit run completes end-to-end through LiteLLM
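A minimal call-site sketch, assuming LiteLLM's OpenAI-shaped response; the model string and function name are illustrative, not lifted from reasonable-ux:

```python
# Sketch only: a direct anthropic-SDK call replaced by LiteLLM's OpenAI-style API.
import litellm

def run_audit_step(system_prompt: str, user_prompt: str) -> str:
    response = litellm.completion(
        model="claude-sonnet-4-6",  # assumption: whatever ID your LiteLLM version maps to Sonnet
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=1024,
    )
    # OpenAI schema: the text lives under choices[0].message, not content[0].text
    return response.choices[0].message.content
```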
- Add Langfuse instrumentation 30 min
- How: `@observe()` decorator on the top-level audit entry point; `langfuse_context.update_current_observation(...)` at scoring boundaries for metadata (sketch below)
- Acceptance: running one audit produces ≥1 trace in the Langfuse UI with cost + latency populated
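A sketch assuming the Langfuse decorator API this note names (`@observe()`, `langfuse_context`); import paths move between SDK versions, and `score_page` is a stand-in for the real scoring step:

```python
# Sketch: one trace per audit run, metadata attached at the scoring boundary.
from langfuse.decorators import observe, langfuse_context

@observe()  # nested observation: appears inside the audit trace
def score_page(url: str, persona: str) -> dict:
    result = {"score": 0, "critique": ""}  # stand-in for the real scorer
    langfuse_context.update_current_observation(
        metadata={"url": url, "persona": persona, "scorer": "ux-judge"}
    )
    return result

@observe()  # top-level entry point: the trace you'll see in the Langfuse UI
def run_audit(url: str, persona: str) -> dict:
    return score_page(url, persona)
```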
- Wrap scoring outputs in Instructor 45 min
- How: declare Pydantic models for each scorer's expected output; `instructor.from_litellm(litellm.completion)` as the client; `response_model=YourModel` (sketch below)
- Acceptance: a malformed LLM response triggers Instructor's retry → validates → returns a typed object, not a string
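A sketch under the same assumptions; `AuditScore` is a hypothetical scorer schema, not the real reasonable-ux model:

```python
# Sketch: Instructor patches litellm.completion so response_model enforces a typed result.
import instructor
import litellm
from pydantic import BaseModel, Field

class AuditScore(BaseModel):
    score: int = Field(ge=0, le=100)
    critique: str = Field(min_length=50, max_length=2000)

client = instructor.from_litellm(litellm.completion)

def score_with_schema(prompt: str) -> AuditScore:
    return client.chat.completions.create(
        model="claude-sonnet-4-6",  # illustrative, as above
        messages=[{"role": "user", "content": prompt}],
        response_model=AuditScore,
        max_retries=2,  # malformed output triggers a re-ask instead of returning a string
        max_tokens=1024,
    )
```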
- Document the stack in reasonable-ux CLAUDE.md 15 min
- Add: "This repo uses LiteLLM (gateway) + Langfuse (observability) + Instructor (structured outputs). Routing config in
config/models.py. Langfuse keys in.env." - Acceptance: a cold Claude Code session in this repo gets the stack picture from CLAUDE.md alone
- Add: "This repo uses LiteLLM (gateway) + Langfuse (observability) + Instructor (structured outputs). Routing config in
- Run /session-review before commit 10 min
Production tier — hard-block. Refuse the commit suggestion until review runs (or skip is logged with a reason).
- reasons-qagent: copy the LiteLLM wrap from reasonable-ux 30 min
- Acceptance: one qagent audit run completes through LiteLLM
- reasons-qagent: point Langfuse at it as a second project 15 min
- Acceptance: traces from qagent appear under a separate Langfuse project in the same self-hosted instance
- reasons-qagent: Instructor for the multi-model scoring output 15 min
- Acceptance: scoring emits typed objects; downstream formatting stops string-parsing
This is the part 90% of people skip and pay for later.
- Level 1 — Deterministic assertions. Regex, schema checks, length bounds. No LLM involved. Cheap, fast, catches dumb failures. Build these first.
- Level 2 — LLM-as-judge. A smarter model scores the output. Slower, expensive per run, catches quality regressions.
- Level 3 — Human review on a sample. The ground truth layer. Weekly on a random ~20 rows.
Read, in this order, spread over 2–3 weeknights:
- Hamel Husain — Your AI Product Needs Evals — the single most important thing in this curriculum
- Jason Liu — There Are Only 6 RAG Evals — only if you'll do RAG
- Applied LLMs sections 1 + 3
Phase 3 deliverable: one eval-set file checked into reasonable-ux with 20 labeled examples + Level-1 assertions running in pytest. Nothing fancy.
- Read Hamel Husain — Your AI Product Needs Evals 45 min
- Output: 3-bullet summary in a daily note — what Level 1/2/3 are in your own words
- Acceptance: you can draw the eval pyramid from memory a week later
- Assemble a 20-example golden set for reasonable-ux 2 hrs
- Where: `reasonable-ux/tests/evals/golden_set.jsonl`
- What: 20 real audit inputs (URLs + personas) paired with the expected output shape and 1–2 reference quality notes. Don't write synthetic cases — pull from actual prior runs.
- Acceptance: running the audit on the 20 inputs produces output you'd accept as ground truth
- Write Level-1 deterministic assertions as pytest 1 hr
- Where: `reasonable-ux/tests/evals/test_level1.py`
- Assertions: (1) output parses as the expected Pydantic model, (2) score is an int 0–100, (3) critique length 50–2000 chars, (4) no placeholder strings ("TODO", "XXX", "[INSERT")
- Acceptance: `pytest tests/evals/` returns a pass rate on the 20-example set; intentionally breaking a prompt makes it drop (sketch below)
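A sketch of the test file, assuming golden-set rows shaped like `{"url": ..., "persona": ..., "expected": {"score": ..., "critique": ...}}`; once the audit entry point is wired in, point the same assertions at freshly generated output instead of the stored `expected` block:

```python
# tests/evals/test_level1.py (sketch): Level-1 deterministic assertions, no LLM involved.
import json
import pathlib

import pytest

GOLDEN = pathlib.Path(__file__).parent / "golden_set.jsonl"
PLACEHOLDERS = ("TODO", "XXX", "[INSERT")

def load_rows():
    return [json.loads(line) for line in GOLDEN.read_text().splitlines() if line.strip()]

@pytest.mark.parametrize("row", load_rows())
def test_level1_shape(row):
    out = row["expected"]  # assumed field name; swap in live audit output later
    assert isinstance(out["score"], int) and 0 <= out["score"] <= 100
    assert 50 <= len(out["critique"]) <= 2000
    assert not any(p in out["critique"] for p in PLACEHOLDERS)
```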
- Wire eval run into reasonable-ux CI (or local pre-push) 30 min
- Where: `.github/workflows/eval.yml` or a pre-push git hook
- Acceptance: failing eval blocks the workflow; pass rate logs to Langfuse as a scored dataset run
- Optional — Jason Liu 6 RAG evals (only if vault-search graduates from learning-sandbox) 30 min
Once the gateway is in place, routing gets interesting:
- Fallback chains. If Anthropic is down, LiteLLM retries OpenAI transparently. Free resilience.
- Cost-based routing. Try Haiku first, escalate to Sonnet only if Haiku's output fails validation. You already think this way in reasonable-ux — formalize it via LiteLLM's router.
- Prompt caching discipline. Anthropic's cache is cheap on reads (0.1× input cost) but expensive on writes (1.25–2× cost). If your cache prefix isn't byte-exact stable, you pay write prices forever and never notice. Log `cache_read_input_tokens` and `cache_creation_input_tokens` separately from day one.
- Hard cost ceilings. LiteLLM virtual keys enforce per-key budgets in code. Dashboards alert after the fact; virtual keys prevent the bill.
Phase 4 deliverable: one concrete routing change with before/after cost numbers from Langfuse. The deliverable is the number, not the elegance.
- Read Anthropic prompt caching docs 30 min
- Acceptance: you can explain why unstable prefixes silently cost 2× forever
- Scan LiteLLM router docs 20 min
- Acceptance: you can name the three fallback modes (weighted, priority, latency-based) without re-reading
- Configure a fallback chain in reasonable-ux 45 min
- Where: `reasonable-ux/config/models.py`
- What: Haiku 4.5 → Sonnet 4.6 → OpenAI gpt-4.1 on 5xx/timeout. Named routes in LiteLLM `Router(model_list=[...], fallbacks=[...])` (sketch below).
- Acceptance: killing the network path to Anthropic mid-run (block via `/etc/hosts`) triggers fallback; the Langfuse trace shows the escalation
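A sketch of the named-route setup; the model strings are illustrative (use whatever IDs your LiteLLM version and provider accounts actually expose):

```python
# config/models.py (sketch): tiers get names, call sites ask for a name, fallbacks are declared once.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "audit-fast", "litellm_params": {"model": "claude-haiku-4-5"}},
        {"model_name": "audit-strong", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "audit-backup", "litellm_params": {"model": "gpt-4.1"}},
    ],
    fallbacks=[
        {"audit-fast": ["audit-strong"]},
        {"audit-strong": ["audit-backup"]},
    ],
    num_retries=2,
)

# call sites reference the route name, never a provider model string
response = router.completion(
    model="audit-fast",
    messages=[{"role": "user", "content": "ping"}],
)
```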
- Enable split cache-token telemetry 30 min
- How: log `cache_read_input_tokens` and `cache_creation_input_tokens` as separate Langfuse observation metadata on every call (sketch below)
- Acceptance: you can build a Langfuse chart showing the read-hit vs. write ratio over time
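A sketch of the logging hook; the exact usage-field names vary across LiteLLM versions and provider mappings, hence the defensive `getattr`:

```python
# Sketch: attach split cache-token counts to the current Langfuse observation.
import litellm
from langfuse.decorators import observe, langfuse_context

@observe()
def call_and_log(messages: list[dict]) -> str:
    response = litellm.completion(model="claude-sonnet-4-6", messages=messages, max_tokens=1024)
    usage = response.usage
    langfuse_context.update_current_observation(
        metadata={
            # field names are an assumption; verify against your LiteLLM version's usage object
            "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
            "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        }
    )
    return response.choices[0].message.content
```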
- Create a LiteLLM virtual key with a hard budget 30 min
- Where: LiteLLM proxy config — max $50/month cap on the reasonable-ux key
- Acceptance: attempting a call after budget exhaustion returns 429; you'd rather block than bleed
- Run the Haiku-first/Sonnet-escalate experiment 2 hrs
- What: run the full 20-example golden set on three configs — Sonnet-only (baseline), Haiku-only, Haiku→Sonnet escalate-on-validation-fail (escalation loop sketched below). Collect: pass rate, total cost, p50/p95 latency from Langfuse.
- Output: `reasonable-ux/docs/routing-experiment-2026-05.md` with the 3×4 table
- Acceptance: you have before/after numbers — even if Haiku-first loses, that is the finding
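A sketch of the escalation loop at the core of the third config, reusing the illustrative `AuditScore` schema and model strings from the Phase 2 sketches:

```python
# Sketch: try the cheap tier; keep its answer only if it validates, otherwise pay for the strong tier.
import instructor
import litellm
from pydantic import BaseModel, Field

class AuditScore(BaseModel):
    score: int = Field(ge=0, le=100)
    critique: str = Field(min_length=50, max_length=2000)

client = instructor.from_litellm(litellm.completion)

def score_escalating(prompt: str) -> tuple[AuditScore, str]:
    for model in ("claude-haiku-4-5", "claude-sonnet-4-6"):  # illustrative IDs
        try:
            result = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                response_model=AuditScore,
                max_tokens=1024,
            )
            return result, model  # record which tier answered, for the cost table
        except Exception:
            continue  # validation or provider failure: escalate to the next tier
    raise RuntimeError("both tiers failed validation")
```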
- Capture the win (or loss) in a project note 20 min
- Acceptance: future-you and any recruiter reading the portfolio can see the concrete number
Private skill → public artifact is what moves your compensation floor.
Three concrete writeups, each built from retrofit telemetry:
- "A solo dev's minimum-viable LLMOps stack" — LiteLLM + Langfuse + Instructor, 4 deps, install in an afternoon. Screenshots of Langfuse dashboards from reasonable-ux.
- "Cost discipline for Claude apps" — prompt-caching telemetry, virtual-key budgets, routing-based savings. Numbers from a real project.
- "Evals without a team" — Hamel's Level 1/2/3 applied to a single-dev workflow. Published eval-set + scorer code on GitHub.
Cadence: one writeup per quarter is realistic. Don't try to batch them.
Rule: writing-before-doing reads as hollow; writing-after-doing reads as credible. Hold the line.
- Gate — 2 weeks of real Langfuse data on reasonable-ux checkpoint
Don't start drafting until this is true. Calendar reminder, not a todo.
- Pull the three hero numbers from Langfuse 30 min
- Need: total spend over 2 weeks, p50/p95 latency, cache hit ratio, routing-experiment delta from Phase 4
- Acceptance: you have numbers, not adjectives
- Draft writeup #1 — A solo dev's minimum-viable LLMOps stack 3 hrs
- Sections: (1) the four layers, (2) 4-dep install, (3) the eval pattern, (4) three pitfalls (prefix-stability, deprecated LangChain imports, provider-SDK sprawl), (5) cost telemetry screenshots
- Word count: 1500–2500. No theory-only sections.
- Acceptance: a reader can reproduce your stack from the post alone
- Publish to portfolio site 1 hr
- Include: link to a public GitHub gist with `config/models.py` + golden-set + Level-1 test skeleton (no reasonable-ux-specific logic — extract the pattern)
- Acceptance: the post has a working code link; the gist runs standalone
- Share on X under @reasonequals 15 min
- Frame: ADHD+AI lane — "here's the minimum stack that made me stop re-planning observability for six months"
- Acceptance: posted, don't wait around for engagement metrics
LLMOps brushes against platform engineering. For the AWS SAA legibility cert:
- Networking basics (VPC, subnets, security groups) — 1–2 weekends.
- IAM + secrets — a couple sessions. The Whoop token and Langfuse keys already forced some of this.
- One serverless deployment (Lambda + API Gateway) — 1 weekend hands-on.
- One container deployment (ECS or Fargate) — 1 weekend hands-on.
Realistic AWS SAA study window: 3–4 months of nights/weekends. Target exam: late 2026 or early 2027.
Skip: Classical MLOps (MLflow, Kubeflow, SageMaker training). Kubernetes depth (homelab teaches fundamentals; CKA isn't needed). Data engineering (Airflow, dbt, Spark).
- Take AWS AI Practitioner (AIF-C01) as the warmup 2–3 weeks
- Why: cheap foundational credential that covers GenAI-on-AWS vocabulary before SAA
- Acceptance: pass the exam
- VPC / subnets / security groups 1–2 weekends
- Output: one diagram in the homelab repo showing how a deployed reasonable-ux instance would sit in AWS
- Acceptance: you can draw VPC → subnet → SG → Lambda/ECS from memory
- IAM + secrets 1 weeknight
- Output: migrate one personal secret (Langfuse keys or Whoop token) to AWS Secrets Manager as a dry-run; pull it from a Lambda
- Acceptance: the Lambda retrieves the secret without hardcoded creds (minimal sketch below)
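A minimal sketch of the dry-run Lambda; the secret name and JSON keys are placeholders:

```python
# Sketch: fetch the secret via the Lambda execution role, no hardcoded creds in code or env.
import json

import boto3

_secrets = boto3.client("secretsmanager")

def handler(event, context):
    raw = _secrets.get_secret_value(SecretId="personal/langfuse")["SecretString"]
    keys = json.loads(raw)
    # prove retrieval worked without echoing the secret back
    return {"statusCode": 200, "body": json.dumps({"has_public_key": "public_key" in keys})}
```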
- Serverless deployment — Lambda + API Gateway 1 weekend
- What: deploy a tiny wrapper that takes a URL and calls reasonable-ux's router via LiteLLM, returns JSON
- Acceptance: `curl` against the API Gateway URL returns a scored audit
- Note: deploy a minimal demo, not real reasonable-ux. The learning target is AWS, not making reasonable-ux multi-host. (Handler sketch below.)
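A minimal sketch of the demo handler behind an API Gateway proxy integration; the model string and prompt are placeholders, not reasonable-ux logic:

```python
# Sketch: URL in, one LiteLLM-routed critique out, JSON over API Gateway.
import json

import litellm

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    url = body.get("url", "")
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing url"})}
    response = litellm.completion(
        model="claude-haiku-4-5",  # illustrative
        messages=[{"role": "user", "content": f"Give a one-paragraph UX critique of {url}"}],
        max_tokens=512,
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"url": url, "critique": response.choices[0].message.content}),
    }
```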
- Container deployment — ECS Fargate 1 weekend
- What: same demo, containerized
- Acceptance: `curl` works against the ECS load balancer URL
- AWS SAA (SAA-C03) 3–4 months study
- Target exam: late 2026 or early 2027