
Stop Paying Frontier Prices for Tasks a Local Model Handles Fine

The invoice that started it

In February, our monthly LLM inference bill was dominated by a single pattern: every service in the stack sent every request to the same model. JSON extraction, entity tagging, document summarisation, multi-step reasoning: all hitting a 27B parameter model (or worse, a cloud frontier API at $10/million tokens).

The 27B produced correct output for all of these tasks. But most of them did not need it.

The bookkeeper principle

A CFO can fill in accounts receivable. But a bookkeeper is 50x cheaper and does the job just as well. You would not pay CFO rates for bookkeeping.

LLM workloads follow the same distribution. In our production traffic, roughly 80% of requests were structured tasks: extract these fields from this document, summarise this text in three bullet points, classify this support ticket. A 9B model handles all of these with equivalent output quality.

The remaining 20% genuinely needed the larger model: multi-step reasoning, code generation with complex constraints, synthesis across long context windows.

The cost arithmetic

Model                          Cost per 1M tokens   Typical tasks
Local 9B (vLLM, quantised)     $0.005               Extraction, classification, summarisation
Local 27B (vLLM, quantised)    $0.02                Reasoning, code generation, creative writing
Gemini Flash (cloud)           $0.60                Overflow, burst capacity
Gemini Pro / GPT-4 (cloud)     $10.00               Reference validation only

Moving a request from the $0.60 tier to the $0.005 tier cuts its cost by more than 99%. If 80% of your traffic makes that move, the blended cost of the whole workload drops by roughly 79%. Even moving the easy tasks from $0.02 to $0.005 frees the 27B's GPU capacity for the requests that actually benefit from it.
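The arithmetic is worth checking by hand. The sketch below uses the per-token prices from the table; everything else is plain arithmetic:

```go
package main

import "fmt"

func main() {
	// Prices from the table above, in dollars per 1M tokens.
	local9B := 0.005
	cloudFlash := 0.60

	// Moving one request from the cloud tier to the local 9B tier:
	saving := 1 - local9B/cloudFlash
	fmt.Printf("per-request saving: %.1f%%\n", saving*100)

	// Blended cost when 80% of traffic makes that move:
	blended := 0.8*local9B + 0.2*cloudFlash
	fmt.Printf("blended cost: $%.3f per 1M tokens (%.0f%% below all-cloud)\n",
		blended, (1-blended/cloudFlash)*100)
}
```

Running this gives a per-request saving of about 99.2% and a blended cost of $0.124 per 1M tokens, roughly 79% below sending everything to the cloud tier.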

How the routing works

Kronaxis Router is a single Go binary that sits between your applications and your model backends. Every incoming request passes through a lightweight classifier (rule-based, no LLM call, under 1ms overhead) that assigns a task category such as extraction, classification, summarisation, or reasoning.

Each category maps to a model tier in config. The classifier is deliberately conservative: ambiguous cases route to the higher tier.
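The post doesn't publish the classifier's rules, so the sketch below is illustrative rather than the router's actual implementation: a few cheap substring checks, with ambiguous prompts falling through to the higher tier, as the conservative default requires. The category and tier names mirror the cost table above.

```go
package main

import (
	"fmt"
	"strings"
)

// classify assigns a task category from cheap string checks — no LLM call,
// so the overhead is microseconds. The rules are illustrative only.
func classify(prompt string) string {
	p := strings.ToLower(prompt)
	switch {
	case strings.Contains(p, "extract") && strings.Contains(p, "field"):
		return "extraction"
	case strings.Contains(p, "classify"):
		return "classification"
	case strings.Contains(p, "summarise") || strings.Contains(p, "summarize"):
		return "summarisation"
	default:
		return "reasoning" // ambiguous → route to the higher tier
	}
}

// tierFor mirrors the category-to-tier mapping kept in config.
var tierFor = map[string]string{
	"extraction":     "local-9b",
	"classification": "local-9b",
	"summarisation":  "local-9b",
	"reasoning":      "local-27b",
}

func main() {
	for _, p := range []string{
		"Extract the invoice fields from this document",
		"Summarise this text in three bullet points",
		"Plan a multi-step refactor of this module",
	} {
		c := classify(p)
		fmt.Printf("%-14s -> %s\n", c, tierFor[c])
	}
}
```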

The quality validation loop

Trusting a cheap model blindly is a bad idea. Model performance varies across tasks, and what works today might degrade after a provider update or a data distribution shift.

Kronaxis Router samples a configurable percentage of routed requests (default 5%) and sends the same prompt to both the assigned model and a reference model. Results feed into a sliding window per task category. If the cheap model's quality drops below the configured threshold, that category auto-promotes to the next tier.

This closes the feedback loop. You get cost savings by default with an automatic safety net.
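The sliding-window mechanics can be sketched in a few lines. The type name, window size, and threshold below are illustrative, not the router's configuration surface; the shape matches the behaviour described above: record sampled pass/fail comparisons per category, and promote the category when the pass rate falls below the threshold.

```go
package main

import "fmt"

// categoryHealth keeps a sliding window of quality-check results for one
// task category. Illustrative sketch — names are not the router's API.
type categoryHealth struct {
	window  []bool  // true = cheap model matched the reference model
	size    int     // window length
	minPass float64 // promote to the next tier below this pass rate
}

func (h *categoryHealth) record(pass bool) {
	h.window = append(h.window, pass)
	if len(h.window) > h.size {
		h.window = h.window[1:] // drop the oldest sample
	}
}

// shouldPromote reports whether this category should move up a tier.
func (h *categoryHealth) shouldPromote() bool {
	if len(h.window) < h.size {
		return false // not enough samples yet
	}
	passes := 0
	for _, ok := range h.window {
		if ok {
			passes++
		}
	}
	return float64(passes)/float64(len(h.window)) < h.minPass
}

func main() {
	h := &categoryHealth{size: 20, minPass: 0.9}
	for i := 0; i < 20; i++ {
		h.record(i%4 != 0) // 75% pass rate — below the 90% threshold
	}
	fmt.Println("promote category:", h.shouldPromote())
}
```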

Response caching

Deterministic requests (same prompt, same parameters, temperature 0) get the same output. The router caches responses keyed on a hash of the request body. For extraction and classification tasks where the output should be identical for identical input, this eliminates redundant inference calls entirely.

In our workload, the cache hit rate on extraction tasks is around 30%, which is another meaningful cost reduction on top of the routing savings.
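A minimal sketch of that keying scheme, assuming an OpenAI-style request body (the struct fields and the cache-only-at-temperature-zero rule come from the description above; the rest is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// request holds the fields that determine the output. Illustrative subset
// of an OpenAI-compatible request body.
type request struct {
	Model       string  `json:"model"`
	Prompt      string  `json:"prompt"`
	Temperature float64 `json:"temperature"`
}

// cacheKey returns a stable key for deterministic requests, and false for
// sampled (temperature > 0) requests, which must never be cached.
func cacheKey(r request) (string, bool) {
	if r.Temperature != 0 {
		return "", false
	}
	b, _ := json.Marshal(r) // struct field order makes this canonical
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), true
}

func main() {
	a, _ := cacheKey(request{Model: "local-9b", Prompt: "Extract the fields", Temperature: 0})
	b, _ := cacheKey(request{Model: "local-9b", Prompt: "Extract the fields", Temperature: 0})
	_, ok := cacheKey(request{Model: "local-9b", Prompt: "Extract the fields", Temperature: 0.7})
	fmt.Println("identical requests share a key:", a == b)
	fmt.Println("sampled requests are cacheable:", ok)
}
```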

Budget enforcement

Set a daily dollar limit per service. When the limit is hit, the router does not fail. It downgrades to a cheaper model. Your pipeline keeps running, just on a smaller model, instead of returning 429 errors at 3pm.
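The downgrade-instead-of-fail behaviour looks roughly like this. The tier names reuse the cost table above; the budget type and downgrade chain are an illustrative sketch, not the router's config format:

```go
package main

import "fmt"

// budget tracks a service's daily spend against its dollar limit.
type budget struct {
	limit, spent float64 // dollars per day
}

// pickModel returns the preferred model while under budget, and steps down
// the tier ladder instead of failing once the limit is hit.
func pickModel(preferred string, b budget) string {
	downgrade := map[string]string{
		"gemini-pro":   "gemini-flash",
		"gemini-flash": "local-27b",
		"local-27b":    "local-9b",
	}
	if b.spent < b.limit {
		return preferred
	}
	if cheaper, ok := downgrade[preferred]; ok {
		return cheaper // over budget: keep running on a cheaper model
	}
	return preferred // already the cheapest tier — nothing to downgrade to
}

func main() {
	fmt.Println(pickModel("gemini-flash", budget{limit: 50, spent: 12}))
	fmt.Println(pickModel("gemini-flash", budget{limit: 50, spent: 50}))
}
```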

Batch API routing: another 50% off

Seven providers (OpenAI, Anthropic, Gemini, Mistral, Groq, Together, Fireworks) offer 50% discounts on batch API requests. The catch is they require a different submission flow.

Kronaxis Router handles this transparently. Tag a request as bulk priority and it auto-submits to the provider's batch endpoint. For overnight enrichment jobs, training data generation, or any latency-insensitive workload, this halves your cloud costs on top of the routing savings.
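The routing decision reduces to a priority check. The "bulk" tag and the endpoint paths below are illustrative (check the router's config reference for the real field names), but the shape of the trade is the one described above: batch submission in exchange for latency.

```go
package main

import "fmt"

// routeByPriority picks between the synchronous endpoint and the provider's
// batch endpoint. Tag and paths are illustrative, not the router's API.
func routeByPriority(priority string) (endpoint string, discount float64) {
	if priority == "bulk" {
		// Latency-insensitive: submit via the batch flow at 50% off.
		return "/v1/batches", 0.5
	}
	return "/v1/chat/completions", 0.0
}

func main() {
	ep, d := routeByPriority("bulk")
	fmt.Printf("bulk        -> %s (%.0f%% off)\n", ep, d*100)
	ep, d = routeByPriority("interactive")
	fmt.Printf("interactive -> %s (%.0f%% off)\n", ep, d*100)
}
```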

How this compares to alternatives

We built this because nothing else solved the actual problem. Here is an honest comparison.

vs LiteLLM

LiteLLM is the most established open-source LLM gateway. It supports 100+ providers, has virtual API keys, team-based spend tracking, and a web UI. If you need to unify a dozen different provider APIs behind one endpoint, LiteLLM does that well.

Where it does not compete: it is a Python process (300MB+ memory, ~2K req/s vs our 22K), it does not do intelligent cost-based routing (you pick the model per request), no quality validation loop, no batch API aggregation, no response caching, no LoRA-aware routing. LiteLLM is a universal gateway. Kronaxis Router is a cost optimiser.

vs OpenRouter

Zero setup, huge model catalogue. But they add a margin on every request (typically 5-20%), your data goes through their servers, and you cannot route to local models. If cost reduction is the goal, adding a middleman margin goes in the wrong direction.

vs Portkey

Strong on observability: request logging, prompt management, guardrails, A/B testing. Paid plans start at $99/month. Not focused on cost-based routing, does not support local models or batch APIs.

vs Martian / Not Diamond

ML-trained classifiers to pick the best model per request. More sophisticated than rule-based classification. But both are SaaS (closed source, usage-based pricing), add network latency for classification, and do not support local models or batch routing.

Summary

Feature                 Kronaxis Router   LiteLLM        OpenRouter   Portkey
Self-hosted             Yes               Yes            No           No
Open source             Apache 2.0        MIT            No           No
Cost-based routing      Automatic         Manual         Some         Manual
Quality validation      Closed loop       No             No           No
Batch API (50% off)     7 providers       No             No           No
Response caching        Yes               No             No           No
Budget enforcement      Downgrade         Alerts         No           Alerts
LoRA routing            Yes               No             No           No
Local models            Ollama + vLLM     vLLM           No           No
Memory                  2MB               300MB+         N/A          N/A
Throughput              22K req/s         ~2K req/s      N/A          N/A
Provider count          4 types           100+           200+         15+
Price                   Free              Free / $150+   Margin       $99+/mo

What it is not

It does not normalise 100 provider APIs; LiteLLM does that. It speaks the OpenAI-compatible API, which covers vLLM, Ollama, and every major cloud provider.

It does not do prompt engineering, output parsing, or chain orchestration. It solves one problem: which model should handle this request, and what happens when that model is unavailable or underperforming.

Getting started

# Install (Linux/macOS)
curl -fsSL https://raw.githubusercontent.com/Kronaxis/kronaxis-router/main/install.sh | bash

# Auto-detect your backends and generate config
kronaxis-router init

# Start
kronaxis-router

The init command probes for Ollama and vLLM instances and checks your environment for cloud API keys. It generates a config with backends, routing rules, budgets, and rate limits. Also available via Homebrew, Go install, or Docker.

For Claude Code and Cursor users: kronaxis-router init --claude or kronaxis-router init --cursor configures the MCP server automatically.

Single binary. 81 tests. 22K req/s. 2MB memory. Apache 2.0.

GitHub: github.com/Kronaxis/kronaxis-router

Try Kronaxis Router

Single binary. 81 tests. Apache 2.0. Cut your LLM costs by 90%.
