
Open-sourced a layer that cuts ~87% of LLM API input tokens (GPT-5.5 & Opus 4.8, real billed tokens) - proxy + MCP plugin for Claude Code/Codex
if you build on the LLM APIs, a big chunk of every request is tokens the model doesn't need - resent system prompts + history, whole files dumped into context, easy calls routed to the frontier model. i built a vendor-neutral layer that strips that, and measured it on the providers' own billed tokens (heavy tasks):
gpt-5.5: 16,875 -> 2,232 input tokens (86.8% fewer), quality 3/3 -> 3/3
opus 4.8: 26,573 -> 3,343 (87.4% fewer), 3/3 -> 3/3
two ways to drop it in:
- OpenAI/Anthropic-compatible proxy - point base_url at it, keep your key. every request gets the levers applied + an X-TRL-Tokens-Saved header.
- MCP plugin for Claude Code / Codex - the agent gets retrieve_code(query) / explain_symbol(name) and pulls only the relevant AST slices instead of dumping whole files. since Claude Code and Codex bill by tokens, that stretches your weekly cap.
four levers under the hood: prefix caching, tail compression with a deterministic guard that re-injects any number the compressor drops, AST/text retrieval, and cascading verifiable steps to a local model.
honest negatives, in the repo: static embeddings didn't beat plain keyword retrieval in my eval; a real 3B compressor dropped ~1/3 of load-bearing numbers before i added the guard; suites are small + favorable. Apache-2.0, free, reproducible benchmark included (validate/heavy_bench.py). repo: https://github.com/AryanGonsalves/trl-token-reduction - would love people to break it.