u/Substantial_Load_690

I tested privacy-aware routing with 4 AI agents: 2 stayed local, 2 went to Claude: Trooper
▲ 5 r/LLM_Gateways+2 crossposts

4 agents, mixed routing: some cloud, some local

Been experimenting with per-request privacy routing in Trooper. Wanted to see if it actually works when you need some requests to stay local but don't want to give up Claude for everything else.

Ran 4 agents. Two asked about public stuff (OAuth vulnerabilities, Redis vs Memcached). Two handled internal data (API keys, customer names).

Agent 1 - Claude:

"Top 3 OAuth2 vulnerabilities?"
Public knowledge, let Claude handle it.

Agent 2 - Qwen (local):

"Format this: api_key=sk-prod-xxxx, vault_url=https://vault.acme.io"
Has credentials. Stays on my machine.

Agent 3 - Claude:

"Redis or Memcached for sessions?"
General question, use cloud.

Agent 4 - Qwen (local):

"Summarize: 47 tickets. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"
Customer names. Can't send that to Anthropic.

Everything worked. Cloud agents took 2-4 seconds. Local ones were faster (1-2s). The credentials and customer names never hit the network.

Why bother

I don't want my entire coding session local. Qwen is good but Claude is better for complex stuff.

I just want specific messages to stay on my hardware when they contain:

  • Internal service URLs
  • API keys or tokens
  • Customer data
  • Anything I wouldn't put in a blog post

The per-request control is the point. Not "all local" or "all cloud" — mix them based on what you're asking.
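For anyone thinking about routing heuristics, here is a minimal sketch of a rule-based check for the categories listed above. The patterns and the `shouldStayLocal` name are assumptions made up for illustration; they are not Trooper's actual rules, and they will miss things like bare customer names, which still need an explicit "keep this local" decision.

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitivePatterns are illustrative heuristics only. They are assumptions
// for this sketch, not rules taken from Trooper.
var sensitivePatterns = []*regexp.Regexp{
	// API-key-like tokens such as sk-prod-xxxx
	regexp.MustCompile(`\bsk-[A-Za-z0-9_-]{4,}`),
	// key=value or key: value style credentials
	regexp.MustCompile(`(?i)\b(api[_-]?key|token|secret|password)\s*[=:]`),
	// URLs whose host looks internal
	regexp.MustCompile(`(?i)https?://[^\s/]*\b(vault|internal|corp)\b`),
}

// shouldStayLocal reports whether msg matches any sensitive pattern.
func shouldStayLocal(msg string) bool {
	for _, p := range sensitivePatterns {
		if p.MatchString(msg) {
			return true
		}
	}
	return false
}

func main() {
	msgs := []string{
		"Top 3 OAuth2 vulnerabilities?",
		"Format this: api_key=sk-prod-xxxx, vault_url=https://vault.acme.io",
	}
	for _, msg := range msgs {
		fmt.Printf("local=%v  %q\n", shouldStayLocal(msg), msg)
	}
}
```

A check like this is best treated as a safety net on top of the manual choice, not a replacement for it: regexes catch obvious credentials and URLs but not free-text PII.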

How it's different from my last post

Last time I showed what happens when Claude quota runs out. Trooper falls back to Ollama automatically.

This is proactive. You tell it "keep this one local" before sending. Different problem.

Both use the same context system so the local model knows what happened in the cloud part of the conversation.

What doesn't work great

Qwen isn't Claude. It's fast and fine for formatting/parsing/summarization. But if you need deep reasoning, route to Claude.

You need Ollama running. I use qwen2.5:3b (2GB, fast enough) or 7b if I want better quality.

Repo: https://github.com/shouvik12/trooper

Still iterating on this. Let me know if you hit edge cases or have ideas for better routing heuristics.

u/Substantial_Load_690 — 10 days ago
▲ 6 r/Anthropic+1 crossposts

Trooper went from API proxy to handling real Claude conversations: mid-chat, one flag, sensitive messages never leave your machine

Trooper started as a proxy for Claude API calls. When quota hits, it falls back to local Ollama with context preserved.

Today it handles real human conversations — and I shipped a feature worth talking about.

The scenario:

You're mid-conversation with Claude through Trooper. Architecture decisions, authentication design, system planning. Everything going to the cloud.

Then a sensitive detail comes up. An internal service URL. A proprietary system. Something you'd rather not send outside your machine.

Previously your only options were stop the conversation or send it anyway.

Now there's a third:

"x_force_local": true

One field in the request body. That specific message routes to local Ollama with full session context intact. Next message goes back to Claude automatically. No restart. No lost context. No interruption.
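As a rough sketch of what setting that field could look like from a Go client: only `x_force_local` comes from the post. The surrounding body shape (`model`, `messages`), the type names, and the placeholder model string are assumptions for illustration, so check the repo for the real request schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message and chatRequest mirror a generic chat request body. Only the
// x_force_local field is taken from the post; the rest is assumed.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model      string    `json:"model"`
	Messages   []message `json:"messages"`
	ForceLocal bool      `json:"x_force_local,omitempty"`
}

// buildRequest serializes one user turn, optionally pinning it to the
// local model via x_force_local.
func buildRequest(model, prompt string, forceLocal bool) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:      model,
		Messages:   []message{{Role: "user", Content: prompt}},
		ForceLocal: forceLocal,
	})
}

func main() {
	body, err := buildRequest("placeholder-model", "Where should the vault client live?", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With `omitempty`, ordinary turns serialize without the field at all, so only the turns you explicitly flag carry it.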

Left terminal: a real conversation, four turns. Claude handles the architecture and auth questions. The developer marks one turn as local, so the sensitive vault URL stays on the machine and Ollama answers it. Claude summarises afterwards.

Right terminal — Trooper routing decisions in real time:

🔒 Developer requested local-only (x_force_local) — skipping cloud
🔒 Local: ollama (force_local) | privacy mode | session saved: 18 tokens

What actually happened with the vault detail:

The raw vault URL never left the machine: Ollama handled it locally. Trooper's SITREP then created a compressed abstraction of the session state, and Claude received that abstraction, not the raw message.

What's different here vs LiteLLM or Bifrost:

The individual pieces exist elsewhere — local routing, fallback, context compression all appear in fragments across different tools. What's different is the composition:

A session-stateful LLM router that enables per-turn execution locality with a shared compressed memory layer enabling cross-provider continuity.

The differentiator: execution locality is a runtime decision inside a persistent conversation state machine.

They route between clouds. Trooper routes to your machine. Per message. Mid-conversation. Without breaking anything.

Three reasons to use x_force_local:

  • Privacy — sensitive payload never leaves the machine
  • Cost control — force expensive turns to local
  • Offline mode — keep working when cloud is unavailable

How context is preserved:

Trooper uses a 3-layer memory system across provider switches:

  • Anchor — first 2 turns, always preserved verbatim
  • SITREP — compressed abstraction of middle turns
  • Tail — recent turns within token budget

The local model always knows where the conversation was — without receiving raw history from the cloud session.
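The three-layer split can be sketched roughly like this. It is a simplified illustration, not Trooper's code: the `Turn` type, the `compact` function, and the 4-characters-per-token estimate are all assumptions, and the middle slice is just returned here where the real thing would be summarized into a SITREP.

```go
package main

import "fmt"

// Turn is one conversation message. The type and the token estimate
// below are assumptions for this sketch.
type Turn struct {
	Role, Content string
}

// estimateTokens is a crude stand-in for a real tokenizer.
func estimateTokens(t Turn) int { return len(t.Content)/4 + 1 }

// compact splits history into anchor (first 2 turns, kept verbatim),
// middle (to be summarized), and tail (newest turns that fit tailBudget).
func compact(turns []Turn, tailBudget int) (anchor, middle, tail []Turn) {
	const anchorTurns = 2
	if len(turns) <= anchorTurns {
		return turns, nil, nil
	}
	anchor = turns[:anchorTurns]

	// Walk backwards from the newest turn, keeping turns while they fit.
	start, used := len(turns), 0
	for i := len(turns) - 1; i >= anchorTurns; i-- {
		cost := estimateTokens(turns[i])
		if used+cost > tailBudget {
			break
		}
		used += cost
		start = i
	}
	return anchor, turns[anchorTurns:start], turns[start:]
}

func main() {
	var turns []Turn
	for i := 1; i <= 8; i++ {
		turns = append(turns, Turn{
			Role:    "user",
			Content: fmt.Sprintf("turn %d: some message text padded out a bit", i),
		})
	}
	anchor, middle, tail := compact(turns, 25)
	fmt.Printf("anchor=%d middle=%d (to SITREP) tail=%d\n", len(anchor), len(middle), len(tail))
}
```

Filling the tail from the newest turn backwards means the most recent context is always the last thing to be dropped.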

Repo: https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 12 days ago

10 agents hit Claude at 16:08:31. All 10 recovered on Ollama by 16:08:32. One second. (Trooper)

Following up on my earlier posts about Trooper.

Wanted to see how it handles real concurrent load. Spun up 10 named agents at the same time: research, summarizer, code-review, data-analyst, writer, qa, planner, memory, classifier, monitor. All of them hitting Trooper simultaneously.

https://preview.redd.it/3vxvp57eu30h1.png?width=1080&format=png&auto=webp&s=928ae0ae32435daaa5b73299bfd0d41da645bad6

Every single agent:

  • Hit Claude at 16:08:31 simultaneously
  • Got credit_balance error
  • Fell back to Ollama within 1 second
  • Preserved context
  • Kept going

No dropped sessions. No resets. No manual intervention.

failure → fallback → continue

Across all 10. At the same time.

Repo: https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 13 days ago
▲ 5 r/ollama

🪖 Trooper load testing (Claude + Ollama): TEM behavior is consistent

Under repeated load tests, Trooper shows a stable execution pattern across provider failure + context pressure.

I’m calling this:

⚙️ TEM — Trooper Execution Model

A stable execution loop across LLMs, beyond just routing.

📌 Observed behavior (repeated across sessions)

Fallback:

Claude 400 → Ollama fallback

  • Claude fails (rate/credit)
  • Ollama continues execution in same session
  • no reset, no context loss

Context Compaction:

context compaction trigger

  • when context exceeds budget (~7k > 6k)
  • middle history is compressed, not dropped

SITREP :

sitrep output

Session is augmented with a structured summary:

  • intent
  • open loops
  • actions
  • resolution state

🧠 What TEM is (as observed)

TEM is the execution layer behavior where:

  • provider failure does not interrupt execution
  • fallback is transparent to the session
  • memory is converted into structured state (SITREP)
  • long context is compressed, not truncated

 

🔁 Execution pattern under load

failure → fallback → continue → compress → continue

This is stable across runs, not conditional behavior.

💡 Definition

TEM is the behavior of a multi-provider LLM system in which execution continuity is preserved across failures by compressing state under context limits, rather than retaining the full context.

 📎 Why this matters

In practice this is not just routing logic: it behaves like a persistent execution loop over unreliable models, with structured memory as the invariant.

Repo: https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 14 days ago
▲ 25 r/ProxyUseCases+5 crossposts

Most people use Ollama as a primary local model.
I ended up using it differently — as a continuation layer when the cloud fails.

Here's how a real session played out:

Turn 1 - Claude failed (credit_balance)
Trooper detected the error, fell back to Ollama, and carried full context:

X-Trooper-Decision: ollama (fallback: credit_balance)
X-Trooper-Summary: claude → ollama (credit_balance) | context ✓
X-Trooper-Session-Saved: 12 tokens

Turns 2–6 — simple queries (local only)
Rule-based classifier detected simple turns. Ollama handled all of them directly.
Cloud was never contacted again.

X-Trooper-Decision: ollama (simple turn) | cloud skipped
X-Trooper-Session-Saved: 76 tokens

Ollama answered all 6 turns in this session, and for 5 of them the cloud was never contacted.

The key problem with fallback

Local fallback usually fails because the model starts cold — no context.

What fixes it

Before sending to Ollama, Trooper compacts the session into a structured SITREP:

{
  "intent": "building a go proxy",
  "stage": "in_progress",
  "open_loops": ["streaming pending"],
  "recent_actions": ["deploy monday"],
  "confidence": 1.00
}
  • extracted rule-based
  • no cloud LLM call
  • no added latency
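If you want to consume a SITREP like the one above from Go, a struct along these lines would do it. The JSON field names come from the example; the `Sitrep` type name, `parseSitrep`, and the prompt wording in `asSystemPrompt` are assumptions for this sketch, and it does not attempt to reproduce the rule-based extraction itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sitrep mirrors the JSON shape shown above. The Go names are assumed.
type Sitrep struct {
	Intent        string   `json:"intent"`
	Stage         string   `json:"stage"`
	OpenLoops     []string `json:"open_loops"`
	RecentActions []string `json:"recent_actions"`
	Confidence    float64  `json:"confidence"`
}

// parseSitrep decodes a SITREP JSON document.
func parseSitrep(raw string) (Sitrep, error) {
	var s Sitrep
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// asSystemPrompt renders the summary as one context line that could be
// handed to the fallback model. The wording is illustrative only.
func (s Sitrep) asSystemPrompt() string {
	return fmt.Sprintf("Session so far: intent=%q stage=%s open_loops=%v recent_actions=%v",
		s.Intent, s.Stage, s.OpenLoops, s.RecentActions)
}

const sample = `{
  "intent": "building a go proxy",
  "stage": "in_progress",
  "open_loops": ["streaming pending"],
  "recent_actions": ["deploy monday"],
  "confidence": 1.00
}`

func main() {
	s, err := parseSitrep(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.asSystemPrompt())
}
```

Keeping the summary as a typed struct rather than free text makes it cheap to build and easy to render differently per provider.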

So Ollama doesn’t restart the conversation — it continues it.

What this turns Ollama into

  • Reliability layer → absorbs cloud failures
  • Execution layer → handles simple prompts locally
  • Cost layer → avoids unnecessary API calls

Not just a local alternative — a fallback infrastructure layer.

There’s been some early organic pull on this:

379 clones, 166 unique cloners, 1,319 views, 196 visitors in ~14 days
No launch post — just devs finding it and trying it.

What Trooper is

A drop-in proxy. Zero dependencies. Pure Go.

Your app → Trooper → Claude
                   → fallback → Ollama
                   → continues seamlessly

Curious if others here are using Ollama this way — as fallback infra rather than primary?

https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 16 days ago
▲ 2 r/ollama+1 crossposts

Shipped v3.0 today based on feedback from the thread yesterday.

Three things added:

  • Circuit breaker — if a provider fails 3 times in 60s, Trooper skips it automatically. No more wasted round trips hitting a known-dead provider on every request.
  • Observability log lines — every request now surfaces what happened clearly in the terminal
  • X-Trooper-Summary header — one line on every response showing exactly what Trooper did
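For readers curious how a "3 failures in 60s" circuit breaker can work, here is a minimal sliding-window sketch. It is an illustration under assumptions, not the code in the repo: the `Breaker` type, its methods, and the `at` and `newDefaultBreaker` helpers are all hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Breaker skips a provider after maxFailures failures inside window.
// Names and structure are assumptions for this sketch.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	window      time.Duration
	failures    map[string][]time.Time
}

func NewBreaker(maxFailures int, window time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, window: window, failures: map[string][]time.Time{}}
}

// newDefaultBreaker uses the thresholds from the post: 3 failures in 60s.
func newDefaultBreaker() *Breaker { return NewBreaker(3, 60*time.Second) }

// at is a demo helper that turns a second offset into a timestamp.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// RecordFailure notes one failed call to provider at time now.
func (b *Breaker) RecordFailure(provider string, now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provider] = append(b.prune(provider, now), now)
}

// Open reports whether provider should be skipped at time now.
func (b *Breaker) Open(provider string, now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provider] = b.prune(provider, now)
	return len(b.failures[provider]) >= b.maxFailures
}

// prune drops failures older than the window. Caller must hold b.mu.
func (b *Breaker) prune(provider string, now time.Time) []time.Time {
	kept := b.failures[provider][:0]
	for _, t := range b.failures[provider] {
		if now.Sub(t) < b.window {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	b := newDefaultBreaker()
	for _, sec := range []int64{0, 10, 20} {
		b.RecordFailure("claude", at(sec))
	}
	fmt.Println("skip claude at t=25s:", b.Open("claude", at(25)))
	fmt.Println("skip claude at t=65s:", b.Open("claude", at(65)))
}
```

Because old failures age out of the window, the breaker closes again on its own once the provider has been quiet for long enough.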

Still zero dependencies, single Go binary.

github.com/shouvik12/trooper

u/Substantial_Load_690 — 19 days ago

If you use OpenAI regularly you've probably hit rate limits or run out of credits mid-conversation. Trooper is a Go proxy that handles this automatically — when OpenAI hits quota, it falls back to local Ollama and carries the conversation context with it.

v2.1 adds context compaction — when fallback happens, Trooper compacts the full session history into three layers before sending to Ollama:

  • Anchor: first 2 turns, never dropped
  • SITREP: structured rule-based summary of the middle (intent, open issues, recent actions, resolved items)
  • Tail: last N turns verbatim

All within a 6144 token budget. Triggers automatically, no config needed.

Other v2.1 fixes:

  • Live streaming fixed: tokens pipe through in real time
  • Health checks are free: switched from inference requests to GET /models
  • Session memory leak fixed: 24hr TTL with background cleanup
  • Binds to 127.0.0.1 by default

Zero dependencies, single Go binary, no Python, no YAML.

The codebase is ~850 lines — if you want to contribute, PRs are welcome.

GitHub: https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 20 days ago

I shared an earlier version of Trooper here a few weeks ago — v2.1 is out with some significant upgrades worth sharing.

For those who missed it: Trooper is a proxy that sits in front of your cloud LLM and automatically falls back to local Ollama when quota runs out. Single Go binary, zero external dependencies, binds to 127.0.0.1 by default.

Selfhosted-friendly by design:

  • Single Go binary, zero external dependencies — pure stdlib
  • Binds to 127.0.0.1 by default — not exposed on your network
  • Docker support — single docker compose up
  • No cloud dependency on fallback — your data stays on your machine

v2.1 adds context compaction — when the session exceeds the token budget, Trooper compacts it into three layers before sending to Ollama:

  • Anchor — first 2 turns, never dropped
  • SITREP — rule-based structured summary of the middle turns
  • Tail — last N turns verbatim

Triggers automatically. No config needed. You see it in the logs:

📦  Context compaction triggered — 1532 tokens exceeds 6144 budget
    Anchor turns   : 2
    Middle turns   : 4 → SITREP
    Recent turns   : 2
    SITREP         : intent="building go proxy" confidence=1.00

Other v2.1 fixes — streaming via io.Copy, free health checks via GET /models, session TTL with background cleanup.
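The session TTL with background cleanup can be pictured with a small sketch like the one below. This is an assumption-heavy illustration, not the repo's implementation: `SessionStore`, its methods, and the `atHour` and `newDailyStore` helpers are hypothetical names, and the store only tracks last-seen times instead of real session state. The 24-hour TTL is the figure mentioned for v2.1.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SessionStore tracks when each session was last used. Names are assumed.
type SessionStore struct {
	mu       sync.Mutex
	ttl      time.Duration
	lastSeen map[string]time.Time
}

func NewSessionStore(ttl time.Duration) *SessionStore {
	return &SessionStore{ttl: ttl, lastSeen: map[string]time.Time{}}
}

// newDailyStore uses a 24-hour TTL.
func newDailyStore() *SessionStore { return NewSessionStore(24 * time.Hour) }

// atHour is a demo helper that turns an hour offset into a timestamp.
func atHour(h int64) time.Time { return time.Unix(h*3600, 0) }

// Touch marks a session as used at time now.
func (s *SessionStore) Touch(id string, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastSeen[id] = now
}

// Len returns the number of live sessions.
func (s *SessionStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.lastSeen)
}

// Sweep deletes sessions idle for at least the TTL and returns the count.
func (s *SessionStore) Sweep(now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for id, seen := range s.lastSeen {
		if now.Sub(seen) >= s.ttl {
			delete(s.lastSeen, id)
			removed++
		}
	}
	return removed
}

// Run sweeps on every tick until stop is closed; start it in a goroutine.
func (s *SessionStore) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case now := <-ticker.C:
			s.Sweep(now)
		case <-stop:
			return
		}
	}
}

func main() {
	s := newDailyStore()
	s.Touch("a", atHour(0))
	s.Touch("b", atHour(12))
	fmt.Println("removed:", s.Sweep(atHour(25)), "remaining:", s.Len())
}
```

In a long-running proxy you would start `Run` once at startup, for example `go store.Run(time.Hour, stop)`, so idle sessions cannot accumulate forever.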

The codebase is ~850 lines across two files — easy to self-host, easy to audit, easy to contribute to.

GitHub: https://github.com/shouvik12/trooper

u/Substantial_Load_690 — 20 days ago
▲ 13 r/GeminiAI+4 crossposts

Built Trooper to solve a problem I kept hitting — Claude's API quota running out mid-conversation and breaking my app flow.

I built it in an evening, using Claude as my coding assistant throughout. Claude helped with the Go proxy architecture, the fallback logic, and debugging the streaming response handling.

What Trooper does:

  • Sits between your app and Claude's API
  • When Claude returns a 429 or 402, silently reroutes to a local Ollama model
  • Preserves full conversation history across the switch
  • Zero code changes in your app — just point your base URL at localhost:3000
  • Streaming support
  • Configurable fallback trigger codes
  • 401 errors surface properly — bad keys never masked

Getting good traction from the builder community, which has been encouraging.

It's free and open source.

GitHub: github.com/shouvik12/trooper

Happy to answer questions or take feedback from the Claude community.

u/Substantial_Load_690 — 23 days ago

I built Trooper after hitting Claude's free tier limit mid-conversation one too many times.

It's a lightweight Go proxy that sits between your app and any LLM API (Claude, GPT, Groq, Mistral). When the primary provider hits a quota or rate limit, it automatically reroutes to a local Ollama model — full conversation context intact. Zero code changes in your app, just point your base URL at localhost:3000.

What it does:

  • Works with any LLM provider, not just Claude
  • Preserves full conversation history across the switch
  • Streaming support
  • Configurable fallback trigger codes (429, 402, 529, 400 by default)
  • 401 errors surface properly — bad keys are never masked
  • Single Go binary, Docker support
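The trigger-code behavior in the list above boils down to a small decision function. Here is a sketch of that idea, assuming a simple set lookup; the `shouldFallback` and `defaultTriggers` names are hypothetical, and the actual config format lives in the repo.

```go
package main

import (
	"fmt"
	"net/http"
)

// defaultTriggers holds the default fallback codes listed above.
var defaultTriggers = map[int]bool{
	http.StatusTooManyRequests: true, // 429
	http.StatusPaymentRequired: true, // 402
	529:                        true, // non-standard code, no net/http constant
	http.StatusBadRequest:      true, // 400
}

// shouldFallback reports whether a provider response with this status
// should be rerouted to the local model. A 401 is never rerouted, so a
// bad API key always surfaces to the caller.
func shouldFallback(status int, triggers map[int]bool) bool {
	if status == http.StatusUnauthorized {
		return false
	}
	return triggers[status]
}

func main() {
	for _, code := range []int{200, 401, 402, 429, 529} {
		fmt.Printf("%d -> fallback=%v\n", code, shouldFallback(code, defaultTriggers))
	}
}
```

Checking 401 before the lookup means that even a misconfigured trigger list cannot hide an authentication problem behind a silent fallback.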

Tested this morning — mid-conversation context preserved perfectly across the switch. Built in an evening.

GitHub: https://github.com/shouvik12/trooper

Happy to answer questions or take feedback

What's next: MCP integration with Claude AI

u/Substantial_Load_690 — 29 days ago