r/LLM_Gateways

Building a multi-model API proxy for AI-heavy teams — what would you actually want from it?

I’m working on a multi-model API proxy for teams that use LLMs heavily, and I’d love to get feedback before we onboard our first batch of users later this month.

The idea is to provide one API layer for accessing major current models, with routing, fallback, usage controls, observability, and developer-friendly tooling on top.

I’m especially interested in hearing from:

  • small and mid-sized companies using AI APIs in production
  • teams building internal AI tools
  • people heavily using agentic coding tools
  • dev teams switching between multiple model providers
  • anyone dealing with cost, latency, reliability, rate limit, or quota issues

A few things I’m trying to understand:

  1. Which models would you need supported on day one?
  2. Do you care more about cost, latency, quality, context window, or automatic fallback?
  3. What is the most painful part of your current AI API setup?
  4. Would task-based routing be useful? For example: cheaper models for simple tasks, stronger models for coding/reasoning, and fallback models when a provider is down.
  5. What kind of observability would you want? Logs, traces, cost per user, cost per project, prompt/version tracking, evals?
  6. For agentic coding workflows, what matters most: tool calling reliability, context window, latency, model choice, rate limits, or something else?
  7. If you’re an SMB, what would make you trust a proxy layer enough to put it between your app and the model providers?
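To make question 4 concrete, here is a minimal sketch of task-based routing with fallback. It is not tied to any existing product: the task labels, the model names, and the `call` function are all hypothetical placeholders for whatever the proxy would actually use.

```python
from typing import Callable

# Hypothetical task labels and model names, for illustration only.
ROUTES: dict[str, list[str]] = {
    "simple": ["cheap-model", "strong-model"],
    "coding": ["strong-model", "backup-model"],
}
DEFAULT_CHAIN = ["strong-model", "backup-model"]


class ProviderDown(Exception):
    """Raised by a backend call when its provider is unavailable."""


def route(task: str, prompt: str, call: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each model in the chain for this task and return (model, reply)."""
    errors = []
    for model in ROUTES.get(task, DEFAULT_CHAIN):
        try:
            return model, call(model, prompt)
        except ProviderDown as exc:
            # Remember why this model was skipped, then try the next one.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The ordered chain per task is the design choice worth feedback: it keeps "cheap first" and "fallback when down" in one structure, but it assumes the caller can label the task up front.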

Not trying to make this a sales pitch — we’re still shaping the product and want to build around real workflows instead of guessing.

If you use AI APIs heavily, especially for coding agents or production workflows, I’d really appreciate hearing what your ideal setup would look like.

u/Accurate-Pudding-999 — 6 days ago

Fully open-source LiteLLM alternative with SSO for education?

Hi,

We’re designing a self-hosted LLM gateway setup in an educational/research context, with local vLLM backends and tools like OpenWebUI for human-facing chat.

Right now, the easiest option seems to be LiteLLM, but we’re checking whether there is a fully free and open-source alternative with similar features.

What we need is not only model routing, but the gateway/governance layer:

- OpenAI-compatible API gateway
- Authentication and API key management
- Single Sign-On support, ideally OIDC/SAML
- User/team quotas or budgets
- Cost tracking or token accounting
- Metrics and logs
- Load balancing between multiple vLLM instances serving the same model
- Support for local/private model endpoints
- Compatibility with OpenWebUI
- Reasonable setup and maintenance for a small education/research team

The architecture would probably expose several endpoints through the gateway: local vLLM models, possibly other compatible providers, and maybe a semantic-routing endpoint for automatic model selection when users do not choose a specific model.
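For anyone comparing tools against this list, two of the requirements (load balancing across several vLLM instances of one model, and per-user token budgets) can be sketched in a few lines. This is only an illustration of the behaviour we mean, not how LiteLLM, Portkey, or Bifrost implement it; the `Gateway` class, the URLs, and the budget numbers are made up.

```python
import itertools
from collections import defaultdict


class Gateway:
    """Toy gateway: round-robin across replicas plus per-user token budgets."""

    def __init__(self, replicas: dict[str, list[str]], budgets: dict[str, int]):
        # model name -> endless cycle over the URLs of instances serving it
        self._cycles = {m: itertools.cycle(urls) for m, urls in replicas.items()}
        self._budgets = budgets
        self._used: defaultdict[str, int] = defaultdict(int)

    def pick_backend(self, model: str) -> str:
        """Return the next instance URL for this model, in round-robin order."""
        return next(self._cycles[model])

    def charge(self, user: str, tokens: int) -> None:
        """Record token usage; refuse if it would exceed the user's budget."""
        limit = self._budgets.get(user, 0)
        if self._used[user] + tokens > limit:
            raise PermissionError(f"{user} would exceed budget of {limit} tokens")
        self._used[user] += tokens

    def remaining(self, user: str) -> int:
        """Tokens the user can still spend."""
        return self._budgets.get(user, 0) - self._used[user]
```

A real deployment would also need health checks, persistent accounting, and the SSO mapping from identity to user or team, which this sketch leaves out entirely.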

Has anyone here deployed something like this in education or research?

We’re also considering whether Portkey or Bifrost could fit this role, so experiences with those tools would be very welcome.

Is there a fully self-hosted, open-source stack that gets close to LiteLLM’s feature set, or is LiteLLM currently the most practical option?

Thanks

u/ComplexMarionberry27 — 9 days ago

I tested privacy-aware routing in Trooper with 4 AI agents: 2 stayed local, 2 went to Claude

4 agents, mixed routing: some cloud, some local

Been experimenting with per-request privacy routing in Trooper. Wanted to see if it actually works when you need some requests to stay local but don't want to give up Claude for everything else.

Ran 4 agents. Two asked about public stuff (OAuth vulnerabilities, Redis vs Memcached). Two handled internal data (API keys, customer names).

Agent 1 - Claude:

"Top 3 OAuth2 vulnerabilities?"
Public knowledge, let Claude handle it.

Agent 2 - Qwen (local):

"Format this: api_key=sk-prod-xxxx, vault_url=https://vault.acme.io"
Has credentials. Stays on my machine.

Agent 3 - Claude:

"Redis or Memcached for sessions?"
General question, use cloud.

Agent 4 - Qwen (local):

"Summarize: 47 tickets. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"
Customer names. Can't send that to Anthropic.

Everything worked. The cloud agents took 2-4 seconds; the local ones were faster at 1-2 seconds. The credentials and customer names never left my machine.

Why bother

I don't want my entire coding session local. Qwen is good but Claude is better for complex stuff.

I just want specific messages to stay on my hardware when they contain:

  • Internal service URLs
  • API keys or tokens
  • Customer data
  • Anything I wouldn't put in a blog post

The per-request control is the point. Not "all local" or "all cloud" — mix them based on what you're asking.
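On the heuristics side, here is a rough sketch of what an automatic check for that list could look like. To be clear, this is not Trooper's code: the patterns, the `choose_backend` name, and the `force_local` flag are assumptions for illustration. Regexes can catch key-shaped strings and internal-looking URLs, but they can't reliably spot customer names, so the explicit per-request flag stays the dependable path.

```python
import re

# Illustrative patterns only; a real deployment would need a broader, tuned set.
SENSITIVE_PATTERNS = [
    # key=value or key: value pairs for common secret names
    re.compile(r"\b(?:api[_-]?key|token|secret|password)\s*[=:]\s*\S+", re.I),
    # key-like strings with an "sk-" prefix
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),
    # URLs whose host contains an internal-sounding label
    re.compile(r"https?://[\w.-]*\b(?:vault|internal|intranet|corp)\b[\w.-]*", re.I),
    # the message itself says it contains PII
    re.compile(r"\bPII\b"),
]


def choose_backend(message: str, force_local: bool = False) -> str:
    """Return "local" if the caller asks for it or a pattern matches, else "cloud"."""
    if force_local:
        return "local"
    if any(p.search(message) for p in SENSITIVE_PATTERNS):
        return "local"
    return "cloud"
```

Run against the four prompts from the test above, this sends the OAuth and Redis questions to the cloud and keeps the credentials and ticket-summary prompts local, the last one only because the prompt literally contains "PII".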

How it's different from my last post

Last time I showed what happens when the Claude quota runs out: Trooper falls back to Ollama automatically.

This is proactive. You tell it "keep this one local" before sending. Different problem.

Both use the same context system so the local model knows what happened in the cloud part of the conversation.
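A shared context like that can be pictured roughly as below. This is a sketch under assumptions, not Trooper's actual data model: `Turn`, `SharedContext`, and `view_for` are hypothetical names. It also assumes that turns handled locally are left out of anything later sent to the cloud, which is an edge case worth checking in any mixed setup.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str
    backend: str   # "cloud" or "local": where this turn was handled


@dataclass
class SharedContext:
    turns: list[Turn] = field(default_factory=list)

    def add(self, role: str, content: str, backend: str) -> None:
        self.turns.append(Turn(role, content, backend))

    def view_for(self, backend: str) -> list[dict[str, str]]:
        """Build the chat-style message list to send to the given backend.

        The local model sees the whole conversation; the cloud model only
        sees turns that were already handled in the cloud.
        """
        if backend == "local":
            visible = self.turns
        else:
            visible = [t for t in self.turns if t.backend == "cloud"]
        return [{"role": t.role, "content": t.content} for t in visible]
```

The trade-off is that the cloud model loses continuity across local turns; a summary with sensitive values stripped could fill that gap, but that is beyond this sketch.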

What doesn't work great

Qwen isn't Claude. It's fast and fine for formatting/parsing/summarization. But if you need deep reasoning, route to Claude.

You need Ollama running locally. I use qwen2.5:3b (about 2 GB, fast enough), or the 7b variant when I want better quality.

Repo: https://github.com/shouvik12/trooper

Still iterating on this. Let me know if you hit edge cases or have ideas for better routing heuristics.

u/Substantial_Load_690 — 9 days ago