The invoice that started it
In February, our monthly LLM inference bill was dominated by a single pattern: every service in the stack sent every request to the same model. JSON extraction, entity tagging, document summarisation, multi-step reasoning: all of it hit a 27B-parameter model (or worse, a cloud frontier API at $10 per million tokens).
The 27B produced correct output for all of these tasks. But most of them did not need it.
The bookkeeper principle
A CFO could handle accounts receivable, but a bookkeeper is 50x cheaper and does the job just as well. You would not pay CFO rates for bookkeeping.
LLM workloads follow the same distribution. In our production traffic, roughly 80% of requests were structured tasks: extract these fields from this document, summarise this text in three bullet points, classify this support ticket. A 9B model handles all of these with equivalent output quality.
The remaining 20% genuinely needed the larger model: multi-step reasoning, code generation with complex constraints, synthesis across long context windows.
The cost arithmetic
| Model | Cost per 1M tokens | Typical tasks |
|---|---|---|
| Local 9B (vLLM, quantised) | $0.005 | Extraction, classification, summarisation |
| Local 27B (vLLM, quantised) | $0.02 | Reasoning, code generation, creative writing |
| Gemini Flash (cloud) | $0.60 | Overflow, burst capacity |
| Gemini Pro / GPT-4 (cloud) | $10.00 | Reference validation only |
If 80% of your traffic moves from the $0.60 tier to the $0.005 tier, those requests get more than 99% cheaper and the blended bill drops by roughly 80%. Even moving from $0.02 to $0.005 on the easy tasks frees the 27B's GPU capacity for the requests that actually benefit from it.
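To make that concrete, here is the arithmetic with the illustrative rates from the table above; your traffic mix and prices will differ.

```go
package main

import "fmt"

func main() {
	// Illustrative per-million-token rates from the table above.
	cloudFlash := 0.60 // cloud tier, $/1M tokens
	local9B := 0.005   // local 9B tier, $/1M tokens

	// Per-request saving on the traffic that moves tiers: ~99%.
	fmt.Printf("saving on moved requests: %.1f%%\n", 100*(1-local9B/cloudFlash))

	// Blended cost when 80% of traffic moves to the cheap tier: ~79% lower.
	before := cloudFlash
	after := 0.2*cloudFlash + 0.8*local9B
	fmt.Printf("blended: $%.3f -> $%.3f per 1M tokens (%.0f%% lower)\n",
		before, after, 100*(1-after/before))
}
```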
How the routing works
Kronaxis Router is a single Go binary that sits between your applications and your model backends. Every incoming request passes through a lightweight classifier (rule-based, no LLM call, under 1ms overhead) that assigns a task category:
- Structured extraction: JSON schema present, output format constrained, short expected output
- Summarisation: short expected output relative to input, condensation signals
- Classification/tagging: enumerated output set, single-label patterns
- Reasoning: multi-step instructions, long expected output, analysis keywords
- Code generation: code block formatting, language specifications
Each category maps to a model tier in config. The classifier is deliberately conservative: ambiguous cases route to the higher tier.
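As an illustration only, and not the actual Kronaxis classifier, a rule-based pass over an OpenAI-style chat request could look roughly like this; the signal phrases, thresholds, and tier names are assumptions:

```go
package router

import "strings"

type Tier int

const (
	TierLocal9B  Tier = iota // extraction, classification, summarisation
	TierLocal27B             // reasoning, code generation
	TierCloud                // overflow, burst capacity
)

// ChatRequest is a minimal slice of an OpenAI-compatible request body.
type ChatRequest struct {
	Messages       []struct{ Role, Content string } `json:"messages"`
	ResponseFormat *struct{ Type string }           `json:"response_format,omitempty"`
	MaxTokens      int                              `json:"max_tokens,omitempty"`
}

// classify assigns a tier from cheap rule-based signals. Ambiguous requests
// fall through to the higher tier, mirroring the conservative behaviour
// described above.
func classify(req ChatRequest) Tier {
	prompt := strings.ToLower(lastUserMessage(req))

	switch {
	case req.ResponseFormat != nil && req.ResponseFormat.Type == "json_object",
		strings.Contains(prompt, "extract the following fields"):
		return TierLocal9B // structured extraction
	case strings.Contains(prompt, "summarise") || strings.Contains(prompt, "summarize"):
		return TierLocal9B // summarisation
	case strings.Contains(prompt, "classify") && req.MaxTokens > 0 && req.MaxTokens <= 32:
		return TierLocal9B // classification / tagging
	case strings.Contains(prompt, "```") || strings.Contains(prompt, "write a function"):
		return TierLocal27B // code generation
	default:
		return TierLocal27B // ambiguous or multi-step: route up, not down
	}
}

func lastUserMessage(req ChatRequest) string {
	for i := len(req.Messages) - 1; i >= 0; i-- {
		if req.Messages[i].Role == "user" {
			return req.Messages[i].Content
		}
	}
	return ""
}
```

The point is that every signal is cheap to compute from the request body itself, which is what keeps the classification overhead under a millisecond.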
The quality validation loop
Trusting a cheap model blindly is a bad idea. Model performance varies across tasks, and what works today might degrade after a provider update or a data distribution shift.
Kronaxis Router samples a configurable percentage of routed requests (default 5%) and sends the same prompt to both the assigned model and a reference model. Results feed into a sliding window per task category. If the cheap model's quality drops below the configured threshold, that category auto-promotes to the next tier.
This closes the feedback loop. You get cost savings by default with an automatic safety net.
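A minimal sketch of the tracking side of that loop, assuming a fixed-size score window per task category and a 0-to-1 agreement score against the reference model (names and shapes are assumptions, not the real codebase):

```go
package router

// qualityWindow tracks a sliding window of agreement scores (0.0-1.0)
// between a cheap model and the reference model for one task category.
type qualityWindow struct {
	scores []float64
	size   int
}

func (w *qualityWindow) add(score float64) {
	w.scores = append(w.scores, score)
	if len(w.scores) > w.size {
		w.scores = w.scores[1:]
	}
}

func (w *qualityWindow) mean() float64 {
	if len(w.scores) == 0 {
		return 1.0 // no evidence yet: assume fine, keep sampling
	}
	sum := 0.0
	for _, s := range w.scores {
		sum += s
	}
	return sum / float64(len(w.scores))
}

// shouldPromote reports whether a category has dropped below its quality
// threshold and should be routed to the next tier up.
func (w *qualityWindow) shouldPromote(threshold float64) bool {
	return len(w.scores) == w.size && w.mean() < threshold
}
```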
Response caching
Deterministic requests (same prompt, same parameters, temperature 0) get the same output. The router caches responses keyed on a hash of the request body. For extraction and classification tasks where the output should be identical for identical input, this eliminates redundant inference calls entirely.
In our workload, the cache hit rate on extraction tasks is around 30%, which is another meaningful cost reduction on top of the routing savings.
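A sketch of the key derivation under those assumptions: hash the raw request body, and only treat explicitly temperature-0 requests as cacheable.

```go
package router

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// cacheKey derives a cache key from the raw request body. It returns
// ok=false for requests that are not deterministic and so not cacheable.
func cacheKey(body []byte) (key string, ok bool) {
	var req struct {
		Temperature *float64 `json:"temperature"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", false
	}
	// Only explicit temperature-0 requests count as deterministic.
	if req.Temperature == nil || *req.Temperature != 0 {
		return "", false
	}
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:]), true
}
```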
Budget enforcement
Set a daily dollar limit per service. When the limit is hit, the router does not fail. It downgrades to a cheaper model. Your pipeline keeps running, just on a smaller model, instead of returning 429 errors at 3pm.
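Conceptually it is a one-line policy. This sketch reuses the Tier type from the classifier example above; the names are illustrative rather than the real config schema.

```go
package router

// pickTier applies the per-service daily budget. Instead of rejecting
// requests once the budget is exhausted, it steps down to the cheapest
// tier so the pipeline keeps running.
func pickTier(requested Tier, spentToday, dailyLimit float64) Tier {
	if dailyLimit > 0 && spentToday >= dailyLimit && requested > TierLocal9B {
		return TierLocal9B // downgrade rather than return 429
	}
	return requested
}
```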
Batch API routing: another 50% off
Seven providers (OpenAI, Anthropic, Gemini, Mistral, Groq, Together, Fireworks) offer 50% discounts on batch API requests. The catch is that they require a different submission flow.
Kronaxis Router handles this transparently. Tag a request as bulk priority and it auto-submits to the provider's batch endpoint. For overnight enrichment jobs, training data generation, or any latency-insensitive workload, this halves your cloud costs on top of the routing savings.
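A rough sketch of the split, with the bulk priority tag and the queue shape as illustrative assumptions rather than the router's actual API:

```go
package router

import "time"

// A bulk-priority request is not sent inline. It is queued and later
// flushed to the provider's batch endpoint, trading latency for the
// roughly 50% discount.
type pendingBatch struct {
	requests [][]byte
	deadline time.Time
}

// route decides between the synchronous path and the batch queue based on
// a priority tag supplied by the caller (illustrative field, see above).
func route(body []byte, priority string, batch *pendingBatch) (inline bool) {
	if priority != "bulk" {
		return true // latency-sensitive: send straight to the chosen backend
	}
	batch.requests = append(batch.requests, body)
	if batch.deadline.IsZero() {
		batch.deadline = time.Now().Add(15 * time.Minute) // flush window
	}
	return false // submitted via the provider's batch API at flush time
}
```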
How this compares to alternatives
We built this because nothing else solved the actual problem. Here is an honest comparison.
vs LiteLLM
LiteLLM is the most established open-source LLM gateway. It supports 100+ providers, has virtual API keys, team-based spend tracking, and a web UI. If you need to unify a dozen different provider APIs behind one endpoint, LiteLLM does that well.
Where it does not compete: it is a Python process (300MB+ memory, ~2K req/s versus our 22K), it does not do cost-based routing automatically (you pick the model per request), and it has no quality validation loop, no batch API aggregation, no response caching, and no LoRA-aware routing. LiteLLM is a universal gateway. Kronaxis Router is a cost optimiser.
vs OpenRouter
Zero setup, huge model catalogue. But they add a margin on every request (typically 5-20%), your data goes through their servers, and you cannot route to local models. If cost reduction is the goal, adding a middleman margin goes in the wrong direction.
vs Portkey
Strong on observability: request logging, prompt management, guardrails, A/B testing. Paid plans start at $99/month. Not focused on cost-based routing, does not support local models or batch APIs.
vs Martian / Not Diamond
ML-trained classifiers to pick the best model per request. More sophisticated than rule-based classification. But both are SaaS (closed source, usage-based pricing), add network latency for classification, and do not support local models or batch routing.
Summary
| Feature | Kronaxis Router | LiteLLM | OpenRouter | Portkey |
|---|---|---|---|---|
| Self-hosted | Yes | Yes | No | No |
| Open source | Apache 2.0 | MIT | No | No |
| Cost-based routing | Automatic | Manual | Some | Manual |
| Quality validation | Closed loop | No | No | No |
| Batch API (50% off) | 7 providers | No | No | No |
| Response caching | Yes | No | No | No |
| Budget enforcement | Downgrade | Alerts | No | Alerts |
| LoRA routing | Yes | No | No | No |
| Local models | Ollama + vLLM | vLLM | No | No |
| Memory | 2MB | 300MB+ | N/A | N/A |
| Throughput | 22K req/s | ~2K req/s | N/A | N/A |
| Provider count | 4 types | 100+ | 200+ | 15+ |
| Price | Free | Free / $150+ | Margin | $99+/mo |
What it is not
Kronaxis Router does not normalise 100 provider APIs; LiteLLM does that. It speaks the OpenAI-compatible API, which covers vLLM, Ollama, and every major cloud provider.
It does not do prompt engineering, output parsing, or chain orchestration. It solves one problem: which model should handle this request, and what happens when that model is unavailable or underperforming.
Getting started
```bash
# Install (Linux/macOS)
curl -fsSL https://raw.githubusercontent.com/Kronaxis/kronaxis-router/main/install.sh | bash

# Auto-detect your backends and generate config
kronaxis-router init

# Start
kronaxis-router
```
The init command probes for Ollama and vLLM instances and checks your environment for cloud API keys. It generates a config with backends, routing rules, budgets, and rate limits. Also available via Homebrew, Go install, or Docker.
For Claude Code and Cursor users: kronaxis-router init --claude or kronaxis-router init --cursor configures the MCP server automatically.
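Once the router is running, anything that speaks the OpenAI API can point at it instead of the upstream provider. A minimal sketch using Go's standard library; the port and the "auto" model name are placeholders, so check the generated config for the real values.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// OpenAI-compatible chat completion sent to the local router, which
	// picks the backend model according to its routing rules.
	// The address and model value below are placeholders.
	body := []byte(`{
	  "model": "auto",
	  "messages": [{"role": "user", "content": "Classify this ticket: refund request"}],
	  "temperature": 0
	}`)

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```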
Single binary. 81 tests. 22K req/s. 2MB memory. Apache 2.0.