Why LLM Routing Matters
Every LLM-powered application has the same hidden problem: you're using one model for every task, even though tasks vary wildly in complexity.
A simple "classify this as spam or not spam" prompt doesn't need Claude Sonnet or GPT-4o. A /usr/bin/bash.04/MTok model handles it at 99% accuracy. But a complex multi-step reasoning task absolutely needs the flagship model, and getting cheap is just slow failure.
The result: you're either wasting money on over-provisioning, or getting silent failures from under-provisioning. Usually both at the same time, on different parts of your pipeline.
LLM routing solves this by classifying each prompt's complexity before routing it to the cheapest model that can actually handle it.
The Architecture
The multi-LLM cost optimizer I built uses three layers:
- Complexity Classifier (Pydantic AI + Claude Haiku)
- Model Router (LiteLLM + dynamic pricing lookup)
- Cost Tracker (Real-time spend logging)
The key insight: the classifier (using a cheap fast model) pays for itself when it prevents expensive routing on simple tasks.
The Core Pattern
Pydantic AI structured outputs are what make the classification reliable:
Without structured outputs, you are back to parsing free-text, and the classifier becomes another source of bugs. With Pydantic AI, you get a typed object back or an exception - no ambiguity.
The router then picks the model based on the classified category:
The Real Trade-offs
Classification latency adds overhead. The complexity classifier runs before every routed call - around 200-400ms depending on the model. For interactive apps, cache classifications by semantic similarity so repeated similar prompts skip the classifier.
Edge cases are real. Code-heavy prompts, domain-specific jargon, and ambiguous short prompts are where classifiers misfire. Build a feedback loop to log misclassifications so you can tune the routing thresholds over time.
Cheap models fail silently. A simple model routing a task it cannot handle won't throw an error - it will just give you a worse answer. Add output validation downstream, not just routing logic upstream.
Cold-start cost. LiteLLM manages provider connections. First call to a new provider has connection overhead. Warm up your most-used routes at startup.
When to Use This Pattern
This pattern is high-value when:
- You have mixed workloads: classification, summarization, generation, reasoning
- Your API costs are already meaningful and growing
- You have multiple providers available (Anthropic, OpenAI, Groq all supported)
- You want a single FastAPI endpoint that handles routing transparently
It adds complexity, so a single-model setup is fine when workloads are homogeneous or costs are still low.
The Template
I packaged this as a drop-in FastAPI + pydantic-ai template that you can have running in under 10 minutes. It includes the complexity classifier, LiteLLM router, cost tracker, and a /stats endpoint for real-time spend visibility.
Get it at: https://reactance0083.gumroad.com/l/ztmlv
If you have questions about the routing logic or want to adapt it to a specific use case, open an issue on the GitHub repo: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer













