The invoice that started it
In February, our monthly LLM inference bill was dominated by a single pattern: every service in the stack sent every request to the same model. JSON extraction, entity tagging, document summarisation, multi-step reasoning: all of it hit a 27B-parameter model (or worse, a cloud frontier API at $10 per million tokens).
The 27B produced correct output for all of these tasks. But most of them did not need it.
The bookkeeper principle
A CFO could handle accounts receivable, but a bookkeeper is 50x cheaper and does the job just as well. You would not pay CFO rates for bookkeeping.
LLM workloads follow the same distribution. In our production traffic, roughly 80% of requests were structured tasks: extract these fields from this document, summarise this text in three bullet points, classify this support ticket. A 9B model handles all of these with equivalent output quality.
The remaining 20% genuinely needed the larger model: multi-step reasoning, code generation with complex constraints, synthesis across long context windows.
The cost arithmetic
| Model | Cost per 1M tokens | Typical tasks |
|---|---|---|
| Local 9B (vLLM, quantised) | $0.005 | Extraction, classification, summarisation |
| Local 27B (vLLM, quantised) | $0.02 | Reasoning, code generation, creative writing |
| Gemini Flash (cloud) | $0.60 | Overflow, burst capacity |
| Gemini Pro / GPT-4 (cloud) | $10.00 | Reference validation only |
If 80% of your traffic moves from the $0.60 tier to the $0.005 tier, those requests get more than 99% cheaper and the blended bill drops by roughly 80%. Even moving from $0.02 to $0.005 on the easy tasks frees the 27B's GPU capacity for the requests that actually benefit from it.
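To make that concrete, here is the arithmetic with the illustrative rates from the table above; your traffic mix and prices will differ.

```go
package main

import "fmt"

func main() {
	// Illustrative per-million-token rates from the table above.
	cloudFlash := 0.60 // cloud tier, $/1M tokens
	local9B := 0.005   // local 9B tier, $/1M tokens

	// Per-request saving on the traffic that moves tiers: ~99%.
	fmt.Printf("saving on moved requests: %.1f%%\n", 100*(1-local9B/cloudFlash))

	// Blended cost when 80% of traffic moves to the cheap tier: ~79% lower.
	before := cloudFlash
	after := 0.2*cloudFlash + 0.8*local9B
	fmt.Printf("blended: $%.3f -> $%.3f per 1M tokens (%.0f%% lower)\n",
		before, after, 100*(1-after/before))
}
```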
How the routing works
Kronaxis Router is a single Go binary that sits between your applications and your model backends. Every incoming request passes through a lightweight classifier (rule-based, no LLM call, under 1ms overhead) that assigns a task category:
- Structured extraction: JSON schema present, output format constrained, short expected output
- Summarisation: short expected output relative to input, condensation signals
- Classification/tagging: enumerated output set, single-label patterns
- Reasoning: multi-step instructions, long expected output, analysis keywords
- Code generation: code block formatting, language specifications
Each category maps to a model tier in config. The classifier is deliberately conservative: ambiguous cases route to the higher tier.
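As an illustration only, and not the actual Kronaxis classifier, a rule-based pass over an OpenAI-style chat request could look roughly like this; the signal phrases, thresholds, and tier names are assumptions:

```go
package router

import "strings"

type Tier int

const (
	TierLocal9B  Tier = iota // extraction, classification, summarisation
	TierLocal27B             // reasoning, code generation
	TierCloud                // overflow, burst capacity
)

// ChatRequest is a minimal slice of an OpenAI-compatible request body.
type ChatRequest struct {
	Messages       []struct{ Role, Content string } `json:"messages"`
	ResponseFormat *struct{ Type string }           `json:"response_format,omitempty"`
	MaxTokens      int                              `json:"max_tokens,omitempty"`
}

// classify assigns a tier from cheap rule-based signals. Ambiguous requests
// fall through to the higher tier, mirroring the conservative behaviour
// described above.
func classify(req ChatRequest) Tier {
	prompt := strings.ToLower(lastUserMessage(req))

	switch {
	case req.ResponseFormat != nil && req.ResponseFormat.Type == "json_object",
		strings.Contains(prompt, "extract the following fields"):
		return TierLocal9B // structured extraction
	case strings.Contains(prompt, "summarise") || strings.Contains(prompt, "summarize"):
		return TierLocal9B // summarisation
	case strings.Contains(prompt, "classify") && req.MaxTokens > 0 && req.MaxTokens <= 32:
		return TierLocal9B // classification / tagging
	case strings.Contains(prompt, "```") || strings.Contains(prompt, "write a function"):
		return TierLocal27B // code generation
	default:
		return TierLocal27B // ambiguous or multi-step: route up, not down
	}
}

func lastUserMessage(req ChatRequest) string {
	for i := len(req.Messages) - 1; i >= 0; i-- {
		if req.Messages[i].Role == "user" {
			return req.Messages[i].Content
		}
	}
	return ""
}
```

The point is that every signal is cheap to compute from the request body itself, which is what keeps the classification overhead under a millisecond.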
The quality validation loop
Trusting a cheap model blindly is a bad idea. Model performance varies across tasks, and what works today might degrade after a provider update or a data distribution shift.
Kronaxis Router samples a configurable percentage of routed requests (default 5%) and sends the same prompt to both the assigned model and a reference model. Results feed into a sliding window per task category. If the cheap model's quality drops below the configured threshold, that category auto-promotes to the next tier.
This closes the feedback loop. You get cost savings by default with an automatic safety net.
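A minimal sketch of the tracking side of that loop, assuming a fixed-size score window per task category and a 0-to-1 agreement score against the reference model (names and shapes are assumptions, not the real codebase):

```go
package router

// qualityWindow tracks a sliding window of agreement scores (0.0-1.0)
// between a cheap model and the reference model for one task category.
type qualityWindow struct {
	scores []float64
	size   int
}

func (w *qualityWindow) add(score float64) {
	w.scores = append(w.scores, score)
	if len(w.scores) > w.size {
		w.scores = w.scores[1:]
	}
}

func (w *qualityWindow) mean() float64 {
	if len(w.scores) == 0 {
		return 1.0 // no evidence yet: assume fine, keep sampling
	}
	sum := 0.0
	for _, s := range w.scores {
		sum += s
	}
	return sum / float64(len(w.scores))
}

// shouldPromote reports whether a category has dropped below its quality
// threshold and should be routed to the next tier up.
func (w *qualityWindow) shouldPromote(threshold float64) bool {
	return len(w.scores) == w.size && w.mean() < threshold
}
```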
Response caching
Deterministic requests (same prompt, same parameters, temperature 0) get the same output. The router caches responses keyed on a hash of the request body. For extraction and classification tasks where the output should be identical for identical input, this eliminates redundant inference calls entirely.
In our workload, the cache hit rate on extraction tasks is around 30%, which is another meaningful cost reduction on top of the routing savings.
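A sketch of the key derivation under those assumptions: hash the raw request body, and only treat explicitly temperature-0 requests as cacheable.

```go
package router

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// cacheKey derives a cache key from the raw request body. It returns
// ok=false for requests that are not deterministic and so not cacheable.
func cacheKey(body []byte) (key string, ok bool) {
	var req struct {
		Temperature *float64 `json:"temperature"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", false
	}
	// Only explicit temperature-0 requests count as deterministic.
	if req.Temperature == nil || *req.Temperature != 0 {
		return "", false
	}
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:]), true
}
```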
Budget enforcement
Set a daily dollar limit per service. When the limit is hit, the router does not fail. It downgrades to a cheaper model. Your pipeline keeps running, just on a smaller model, instead of returning 429 errors at 3pm.
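Conceptually it is a one-line policy. This sketch reuses the Tier type from the classifier example above; the names are illustrative rather than the real config schema.

```go
package router

// pickTier applies the per-service daily budget. Instead of rejecting
// requests once the budget is exhausted, it steps down to the cheapest
// tier so the pipeline keeps running.
func pickTier(requested Tier, spentToday, dailyLimit float64) Tier {
	if dailyLimit > 0 && spentToday >= dailyLimit && requested > TierLocal9B {
		return TierLocal9B // downgrade rather than return 429
	}
	return requested
}
```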
Batch API routing: another 50% off
Seven providers (OpenAI, Anthropic, Gemini, Mistral, Groq, Together, Fireworks) offer 50% discounts on batch API requests. The catch is that they require a different submission flow.
Kronaxis Router handles this transparently. Tag a request as bulk priority and it auto-submits to the provider's batch endpoint. For overnight enrichment jobs, training data generation, or any latency-insensitive workload, this halves your cloud costs on top of the routing savings.
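A rough sketch of the split, with the bulk priority tag and the queue shape as illustrative assumptions rather than the router's actual API:

```go
package router

import "time"

// A bulk-priority request is not sent inline. It is queued and later
// flushed to the provider's batch endpoint, trading latency for the
// roughly 50% discount.
type pendingBatch struct {
	requests [][]byte
	deadline time.Time
}

// route decides between the synchronous path and the batch queue based on
// a priority tag supplied by the caller (illustrative field, see above).
func route(body []byte, priority string, batch *pendingBatch) (inline bool) {
	if priority != "bulk" {
		return true // latency-sensitive: send straight to the chosen backend
	}
	batch.requests = append(batch.requests, body)
	if batch.deadline.IsZero() {
		batch.deadline = time.Now().Add(15 * time.Minute) // flush window
	}
	return false // submitted via the provider's batch API at flush time
}
```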
How this compares to alternatives
We built this because nothing else solved the actual problem. Here is an honest comparison.
vs LiteLLM
LiteLLM is the most established open-source LLM gateway. It supports 100+ providers, has virtual API keys, team-based spend tracking, and a web UI. If you need to unify a dozen different provider APIs behind one endpoint, LiteLLM does that well.
Where it does not compete: it is a Python process (300MB+ memory, ~2K req/s versus our 22K), it does not do cost-based routing automatically (you pick the model per request), and it has no quality validation loop, no batch API aggregation, no response caching, and no LoRA-aware routing. LiteLLM is a universal gateway. Kronaxis Router is a cost optimiser.
vs OpenRouter
Zero setup, huge model catalogue. But they add a margin on every request (typically 5-20%), your data goes through their servers, and you cannot route to local models. If cost reduction is the goal, adding a middleman margin goes in the wrong direction.
vs Portkey
Strong on observability: request logging, prompt management, guardrails, A/B testing. Paid plans start at $99/month. Not focused on cost-based routing, does not support local models or batch APIs.
vs Martian / Not Diamond
ML-trained classifiers to pick the best model per request. More sophisticated than rule-based classification. But both are SaaS (closed source, usage-based pricing), add network latency for classification, and do not support local models or batch routing.
Summary
| Feature | Kronaxis Router | LiteLLM | OpenRouter | Portkey |
|---|---|---|---|---|
| Self-hosted | Yes | Yes | No | No |
| Open source | Apache 2.0 | MIT | No | No |
| Cost-based routing | Automatic | Manual | Some | Manual |
| Quality validation | Closed loop | No | No | No |
| Batch API (50% off) | 7 providers | No | No | No |
| Response caching | Yes | No | No | No |
| Budget enforcement | Downgrade | Alerts | No | Alerts |
| LoRA routing | Yes | No | No | No |
| Local models | Ollama + vLLM | vLLM | No | No |
| Memory | 2MB | 300MB+ | N/A | N/A |
| Throughput | 22K req/s | ~2K req/s | N/A | N/A |
| Provider count | 4 types | 100+ | 200+ | 15+ |
| Price | Free | Free / $150+ | Margin | $99+/mo |
What it is not
Kronaxis Router does not normalise 100 provider APIs; LiteLLM does that. It speaks the OpenAI-compatible API, which covers vLLM, Ollama, and every major cloud provider.
It does not do prompt engineering, output parsing, or chain orchestration. It solves one problem: which model should handle this request, and what happens when that model is unavailable or underperforming.
Getting started
```bash
# Install (Linux/macOS)
curl -fsSL https://raw.githubusercontent.com/Kronaxis/kronaxis-router/main/install.sh | bash

# Auto-detect your backends and generate config
kronaxis-router init

# Start
kronaxis-router
```
The init command probes for Ollama and vLLM instances and checks your environment for cloud API keys. It generates a config with backends, routing rules, budgets, and rate limits. Also available via Homebrew, Go install, or Docker.
For Claude Code and Cursor users: kronaxis-router init --claude or kronaxis-router init --cursor configures the MCP server automatically.
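Once the router is running, anything that speaks the OpenAI API can point at it instead of the upstream provider. A minimal sketch using Go's standard library; the port and the "auto" model name are placeholders, so check the generated config for the real values.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// OpenAI-compatible chat completion sent to the local router, which
	// picks the backend model according to its routing rules.
	// The address and model value below are placeholders.
	body := []byte(`{
	  "model": "auto",
	  "messages": [{"role": "user", "content": "Classify this ticket: refund request"}],
	  "temperature": 0
	}`)

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```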
Single binary. 81 tests. 22K req/s. 2MB memory. Apache 2.0.