r/LLMDevs

Experiment: give 100 agents $100 each and let them trade with each other — anyone tried this?

Idea I want to run: spin up 100 agents, give each one $100, let them spend on tokens/tools and make their own purchase decisions. Then let agents propose trades to each other and accept/reject on their own — no human in the loop for the actual transaction. Curious if anyone's already tried something like this, or knows of an existing sandbox/testbed for agent-to-agent economies.

u/Dry_Steak30 — 4 hours ago

▲ 0 r/LLMDevs

The creative process of software building shouldn't be constrained by tokens

I had prompted an agent a few times, and right in the middle of generating code it hit the token limit. This way of building software sucks. It's like running out of gas before reaching your destination, or turning off the stove before the meal is cooked. When you get into the flow and creative process of building software, you shouldn't have to worry about how much money is being sunk into it. It's an iterative process that needs many refinements, especially when the LLM can't understand context well enough and there aren't good enough tools or techniques to help communicate that context. We really need a more economical way of running coding agents on our own computers, with cheaper RAM and GPUs. Perhaps even on solar power.

u/No_Elevator_9641 — 6 hours ago

▲ 28 r/LLMDevs+7 crossposts

We're building agents that can read millions of documents, but still forget a video they watched yesterday.

One thing has felt odd to me while working with AI agents.

We've gotten pretty good at giving them memory for text.

They can search documentation, index repositories, retrieve past conversations, and even build long-term memory over time.

Videos, though, are still treated as temporary input.

The agent watches a recording, answers a few questions, and when the session ends, that understanding is usually gone. Next session, the same video gets processed all over again.

That feels like an architectural gap rather than a model limitation.

A video isn't fundamentally different from any other source of information. Once you've extracted transcripts, OCR, visual observations, and timestamps, why throw that work away?

I ended up building an open-source project around this idea.

Instead of asking the agent to repeatedly "watch" the same video, it builds a persistent local index the first time. Future questions become retrieval instead of video analysis.
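A minimal sketch of the "index once, retrieve later" idea, using Python's built-in `sqlite3`. The table layout, the helper names (`open_index`, `index_video`, `search`), and the sample segments are all assumptions made for illustration; they are not taken from the watch-skill project.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) the segment index. Pass a file path to persist it."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS segments ("
        " video_id TEXT, start_s REAL, kind TEXT, text TEXT)"
    )
    return db

def is_indexed(db, video_id):
    row = db.execute(
        "SELECT 1 FROM segments WHERE video_id = ? LIMIT 1", (video_id,)
    ).fetchone()
    return row is not None

def index_video(db, video_id, segments):
    """Store extracted segments once; repeat calls for the same video do nothing."""
    if is_indexed(db, video_id):
        return 0
    db.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?)",
        [(video_id, start, kind, text) for start, kind, text in segments],
    )
    db.commit()
    return len(segments)

def search(db, video_id, term):
    """Answer a question by retrieval: matching segments with their timestamps."""
    return db.execute(
        "SELECT start_s, kind, text FROM segments"
        " WHERE video_id = ? AND text LIKE ? ORDER BY start_s",
        (video_id, f"%{term}%"),
    ).fetchall()

db = open_index()
segments = [  # hypothetical output of transcript / OCR / visual extraction
    (12.0, "transcript", "Today we deploy the staging cluster"),
    (47.5, "ocr", "kubectl apply -f staging.yaml"),
    (90.0, "visual", "Dashboard shows three healthy pods"),
]
first = index_video(db, "demo-video", segments)   # does the expensive work once
second = index_video(db, "demo-video", segments)  # later sessions skip it
hits = search(db, "demo-video", "staging")
```

A real index would likely use full-text or embedding search rather than `LIKE`; the point here is only that the second session hits the stored rows instead of the video.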

It changed how I think about video in agent workflows.

I'm curious whether others see this as a real missing piece, or if you've already solved it another way.

GitHub: https://github.com/oxbshw/watch-skill

u/Fearless-Role-2707 — 9 hours ago

▲ 25 r/LLMDevs+2 crossposts

I curated 48 LLM observability tools (Langfuse, Phoenix, Opik, LangSmith…) + a comparison matrix

Every few weeks I end up re-comparing LLM observability/eval tools for a project, so I put it all in one place: 48 verified tools across tracing, evals, prompt mgmt, gateways, OTel instrumentation, and guardrails, each with current stars + license; plus a self-host / license / tracing / evals / OTel comparison table for the top platforms.

It also includes original agent skills (instrument tracing, add evals, debug-from-traces, PII-safe tracing for regulated apps) and a minimal OpenTelemetry GenAI tracer.

Full disclosure, it's my org's repo (CC0, contributions welcome): https://github.com/ContextJet-ai/awesome-llm-observability — what tool am I missing?

u/nishchaymahor19 — 15 hours ago

▲ 1 r/LLMDevs

LiteLLM is great and all, but what about security?

Genuine question, not a dunk.

We're trying to roll out LiteLLM company-wide, and security is blocking it. Their worry is that it's a single component holding keys to every provider, sitting in the path of all our prompt data, with audit logging that isn't where they need it for compliance. I get it, that's a juicy target.

For those in regulated environments (health, finance, gov): did you actually get LiteLLM approved, and what did it take? Self-hosting only? Custom audit logging? A wrapper around the wrapper? Or did the review push you to something else entirely?

Trying to work out if this is a "configure it right" problem or a "wrong tool for this context" problem.

u/Own-Fennel-3875 — 11 hours ago

▲ 17 r/LLMDevs+7 crossposts

Chimera: an open-source, self-hostable agent that runs on local models (any OpenAI-compatible endpoint) and can fuse several at once

I've been building an open-source agent (Apache-2.0) and wanted to share it here because it's designed to be fully local and self-hostable: it talks to any OpenAI-compatible endpoint, so Ollama / llama.cpp / vLLM / LM Studio all work as the backend. No cloud lock-in, your keys and data stay yours.

The core idea is LLM-Fusion: for the hard steps it can run a panel of models on the same prompt, have a judge model cross-check them (consensus / contradictions / blind spots), and a synthesizer write the final answer. Locally this is fun because you can mix a few small local models and let them cross-check each other. A cost/latency-aware router keeps easy turns on a single model so you're not paying panel latency for everything.
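The panel / judge / synthesizer flow described above can be sketched roughly as follows. This is not Chimera's code: the function names, the prompt wording, and the `is_hard` length check are assumptions, and the "models" are stub functions so the sketch runs without any endpoint.

```python
from typing import Callable, List

Model = Callable[[str], str]  # anything that maps a prompt to text

def fuse(prompt: str, panel: List[Model], judge: Model, synthesizer: Model) -> str:
    """Run every panel model, let a judge cross-check, then synthesize."""
    drafts = [model(prompt) for model in panel]
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    review = judge(
        f"Question: {prompt}\nCandidates:\n{numbered}\n"
        "Report consensus, contradictions, and blind spots."
    )
    return synthesizer(
        f"Question: {prompt}\nCandidates:\n{numbered}\nReview:\n{review}\n"
        "Write the final answer."
    )

def answer(prompt, single, panel, judge, synthesizer, is_hard):
    """Router: easy turns stay on one model, hard turns pay for the panel."""
    if not is_hard(prompt):
        return single(prompt)
    return fuse(prompt, panel, judge, synthesizer)

# Stub models that record who was called (stand-ins for real endpoints).
calls = []

def stub(name, reply):
    def model(prompt):
        calls.append(name)
        return reply
    return model

panel = [stub("a", "42"), stub("b", "42"), stub("c", "41")]
judge = stub("judge", "consensus: 42; contradiction: [2] says 41")
synth = stub("synth", "42")
single = stub("single", "hi")
is_hard = lambda p: len(p) > 20  # placeholder heuristic, purely illustrative

easy = answer("hello", single, panel, judge, synth, is_hard)
easy_calls = list(calls)
calls.clear()
hard = answer("What is six times seven, exactly?", single, panel, judge, synth, is_hard)
hard_calls = list(calls)
```

With real backends, each `Model` would wrap a call to an OpenAI-compatible endpoint; the routing decision is where the cost and latency trade-off lives.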

Beyond that it's a full agent: plan -> act -> verify-or-revert (it runs your tests and treats the result as ground truth), layered memory (SQLite + FTS recall, cross-session profile, consolidation), a governance kernel, cron/proactive jobs, MCP client + OpenAPI-to-tool import, and an isolated subagent/crew layer (parallel git worktrees with per-worker verify gates). Runs on a laptop or a $5 VPS via Docker.

Honest status: it's alpha - 463 tests, mypy --strict clean, but no production mileage yet. Local reasoning quality obviously depends on the models you point it at, so I'd genuinely love to hear which local models people find good enough to actually drive an agent loop (reliable tool use + self-correction) - that's the make-or-break for going fully local.

Repo: https://github.com/brcampidelli/chimera-agent

u/Federal-Teaching2800 — 16 hours ago

▲ 10 r/LLMDevs+2 crossposts

[Benchmark] Qwen3.6-27B-FP8 on One RTX 6000 Ada: Fast TTFT, 668 tok/s Peak Throughput

Detailed setup below:

---

Model

Field	Value
Model	Qwen3.6-27B
Hugging Face path	Qwen/Qwen3.6-27B-FP8
Quantization / dtype	FP8
Request sizing configured	8192 max tokens

---

Serving Setup

Field	Value
Engine	vLLM 0.19
Endpoint	/v1/chat/completions
Streaming	ON
Tensor parallel size	1
Data parallel size	1
GPU memory utilization	0.90
max_model_len	8192
max_num_seqs	16
Tool call parser	qwen3_coder
Reasoning parser	qwen3

Engine flags:

--tensor-parallel-size 1
--data-parallel-size 1
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--gpu-memory-utilization 0.90
--max-model-len 8192
--max-num-seqs 16

---

Hardware

Component	Configuration
GPU	1× RTX 6000 Ada
VRAM	48GB
CPU	48 vCPU
System RAM	118GB

---

Workload

Field	Value
Dataset	ShareGPT sample
Unique prompts	128
Concurrency levels	8, 12, 16
Total requests	384
Conversation shape	Multi-turn chat
Languages	en, zh, ru, th, ko, fr, pl, ja
max_model_len	8192
max output tokens per completion	1024
Temperature	0.2

---

Results Summary

• TTFT p50 avg: 0.48s

• TTFT p95 avg: 0.94s

• TPOT p50 avg: 29.2 ms/token

• Total throughput peak: 668.5 tok/s

• KV cache max: 32.67%

---

TTFT

Metric	Avg	Max	Unit	Interpretation
p50 TTFT	0.4802	3.75	seconds	Median requests started streaming quickly.
p95 TTFT	0.9444	4.875	seconds	Most requests started under ~1 second on average.
p99 TTFT	1.074	4.975	seconds	Tail TTFT stayed controlled on average, with occasional spikes.

---

Token Throughput

Token Type	Avg	Max	Unit	Interpretation
Prompt tokens	170.4	386.9	tokens/sec	Input processing throughput.
Output tokens	161.5	314.1	tokens/sec	Decode throughput.
Total tokens	331.9	668.5	tokens/sec	Combined prefill + decode throughput.
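For readers comparing against their own runs, here is a small sketch of how TTFT and TPOT are commonly derived from streamed token arrival times. The exact definitions used by the benchmark harness above are not stated, so treat these as assumptions: TPOT here excludes the first token, and percentiles use the nearest-rank method.

```python
import math

def ttft_and_tpot(request_sent_s, token_times_s):
    """Return (TTFT in seconds, TPOT in ms/token) for one streamed response."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) < 2:
        return ttft, None  # TPOT is undefined for a single token
    decode_span = token_times_s[-1] - token_times_s[0]
    tpot_ms = 1000.0 * decode_span / (len(token_times_s) - 1)
    return ttft, tpot_ms

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Made-up timestamps: request sent at t=0, four tokens arrive afterwards.
ttft, tpot_ms = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
p50 = percentile([0.2, 0.4, 0.5, 0.9], 50)
p95 = percentile([0.2, 0.4, 0.5, 0.9], 95)
```

Under this definition, the reported 29.2 ms/token TPOT corresponds to about 34 tokens/s per stream, which is worth keeping in mind when reading the aggregate throughput at concurrency 8 to 16.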

---

Curious how others would read these numbers. Is this good single-GPU performance for Qwen3.6-27B, or is there obvious headroom I'm missing here?

u/Temporary-Owl1725 — 13 hours ago

▲ 8 r/LLMDevs+3 crossposts

Agent with tiered working memory and cross-session learning — architecture, gaps, and what the research didn't cover

I've been building PRAANA — a coding agent with two systems I couldn't find combined in one self-contained binary: an Adaptive Context Engine (within-session) and Cognitive Memory (cross-session). Posting because the architectural decisions may be useful independent of the coding use case.

The core problem:

Every agent session is a context window management problem. Append-until-full plus reactive compaction is lossy — by the time you compact, you've already paid the drift cost, and you've lost track of which information was load-bearing.

PRAANA's ACE curates on every turn. A deterministic compiler assembles the prompt in 5 sections:

1. System Frame       — identity + tools
2. Memory Digest      — ranked cross-session learnings
3. Active State       — current work objects, full resolution
4. Peripheral Stubs   — everything inactive, one-line anchors
5. Recent Turns       — last N turns, budget-capped

State objects demote Active → Soft → Hard based on idle turns. Two-pass auto-hydration before each turn: substring keyword match, then BM25 for fuzzy overlap. Scores are density-weighted: decisions score 1.0, narrative scores 0.6, errors score 0.8. The compiler knows what kind of information is filling up, not just token count.
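A rough sketch of the idle-based demotion and the first hydration pass (substring match) described above. The thresholds `SOFT_AFTER` and `HARD_AFTER`, the `StateObject` shape, and the helper names are hypothetical; the post does not give them. The BM25 second pass and the density weights are left out to keep the sketch short.

```python
from dataclasses import dataclass

SOFT_AFTER = 3  # idle turns before Active -> Soft (hypothetical value)
HARD_AFTER = 8  # idle turns before Soft -> Hard (hypothetical value)

@dataclass
class StateObject:
    name: str
    body: str
    idle_turns: int = 0

def tier(obj):
    """Map idle turns to a tier: active, soft, or hard."""
    if obj.idle_turns >= HARD_AFTER:
        return "hard"
    if obj.idle_turns >= SOFT_AFTER:
        return "soft"
    return "active"

def end_of_turn(objects, touched):
    """Reset objects used this turn; age everything else by one idle turn."""
    for obj in objects:
        obj.idle_turns = 0 if obj.name in touched else obj.idle_turns + 1

def hydrate(objects, user_message):
    """Pass 1 only: wake any demoted object whose name appears in the message."""
    message = user_message.lower()
    woken = []
    for obj in objects:
        if tier(obj) != "active" and obj.name.lower() in message:
            obj.idle_turns = 0
            woken.append(obj.name)
    return woken

objects = [StateObject("auth", "login flow notes"), StateObject("db", "schema notes")]
for _ in range(3):
    end_of_turn(objects, touched={"auth"})
tiers_before = {o.name: tier(o) for o in objects}
woken = hydrate(objects, "Can you check the db schema again?")
tiers_after = {o.name: tier(o) for o in objects}
```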

Cognitive Memory:

At /exit, a summariser extracts structured learnings from the transcript. Six kinds: fact, preference, decision, pattern, mistake, constraint — domain-agnostic; coding-specific knowledge lives in content, not schema. Stored in SQLite with sqlite-vec + Transformers.js (in-process, 384-dim). Confidence decays 5%/day. Entries confirmed across two or more sessions promote to Consolidated Memory (10x slower decay). Ranked recall: cosine × confidence × recency × pin_boost.
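The ranking formula above (cosine × confidence × recency × pin_boost) can be sketched like this. Several details are assumptions, not PRAANA's implementation: decay is modeled as compounding daily, "10x slower" is read as one tenth of the daily rate, and the recency term uses a made-up 14-day half-life.

```python
import math

DAILY_DECAY = 0.05                           # 5%/day, from the post
CONSOLIDATED_DAILY_DECAY = DAILY_DECAY / 10  # one reading of "10x slower decay"

def confidence(initial, age_days, consolidated=False):
    """Compounding daily decay of a learning's confidence."""
    rate = CONSOLIDATED_DAILY_DECAY if consolidated else DAILY_DECAY
    return initial * (1.0 - rate) ** age_days

def recency(days_since_use, half_life_days=14.0):
    """Hypothetical recency term: halves every `half_life_days`."""
    return 0.5 ** (days_since_use / half_life_days)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_score(query_vec, memory_vec, conf, days_since_use, pin_boost=1.0):
    return cosine(query_vec, memory_vec) * conf * recency(days_since_use) * pin_boost

regular = confidence(1.0, age_days=30)
consolidated = confidence(1.0, age_days=30, consolidated=True)
score = recall_score([1.0, 0.0], [1.0, 0.0], conf=regular, days_since_use=0)
```

Under these assumptions a regular entry keeps roughly a fifth of its confidence after 30 days, while a consolidated one keeps most of it, which is the behavior the promotion rule seems designed to produce.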

Where the research fell short:

I surveyed 20+ agent-memory repos. What I found:

Mem0, LangChain, and most memory backends are retrieval systems. They store and recall but have no outcome-based feedback loops. No architecture for "this memory was used and confirmed, increase confidence" vs "this memory was contradicted, reduce it." Letta has the most interesting consolidation work (sleep-time agents) but it's a platform, not extractable, and consolidation is partial.

Nobody combined proactive context curation with learning memory in one self-contained process. The compression tools — Headroom, ACON — are SDK/proxy layers that sit between you and the LLM. They don't own agent state.

The gap I missed: the research covered storage architecture, not learning signal. The reinforcement path in PRAANA — boost confidence when a session succeeds, decay when contradicted — is wired but the session-success signal hasn't shipped yet (#162). I designed a complete feedback loop and then discovered the trigger was the hard part.

The larger plan:

Four systems — Adaptive Context, Cognitive Memory, Background Consolidation, Intelligent Router — all domain-agnostic. No system encodes anything about code. The coding agent is the proving ground; coding outcomes are measurable. Once Phase 1 validates the architecture, Phase 2 extracts the runtime as @praana/runtime. I'm not extracting it until the coding agent proves it works.

Gaps:

Reinforcement path dormant (#162). No A/B eval harness — scorecard ships, headless task runner is next, no published benchmark claims. Background Consolidation Processor schema exists, not scalable yet. Runtime extraction is Phase 2, not started.

GitHub: amitkumardubey/praana — MIT, TypeScript, Bun.

If you're working on agent memory or context management architecturally, I'd welcome the comparison. What are you seeing in production that the research repos didn't surface?

u/Reasonable_Craft_425 — 17 hours ago

▲ 8 r/LLMDevs+4 crossposts

ContextForge: a local proxy that cut my Claude Code token usage by up to 72%

Hi everyone,

I’ve been working on a project to address a specific frustration I had with AI coding agents: token waste. I noticed that agents often burn a significant portion of the context window just re-reading the same files to find functions or re-discovering the repository structure on every turn.

I built ContextForge — a local proxy and CLI that acts as a "codebase-aware" runtime.

How it works

ContextForge sits between your agent (like Claude Code) and your LLM provider. Instead of letting the agent "guess" where files are, it provides local intelligence:

Local AST Graph: It indexes your repo using native C++ parsing into a local SQLite graph. When the agent needs to find a symbol, the proxy handles the lookup locally.
Context Optimization: It applies a compression pipeline that skeletonizes older file history (keeping only signatures) and vaults oversized responses (like lockfiles), replacing them with pointers.
Protocol Translation: It translates Anthropic requests into OpenAI format, which allows you to run Claude Code against Ollama/OpenAI-compatible models with full streaming support.

Case Study: "Soft-Delete" Feature

To test the architecture, I implemented a complex feature in an Express.js backend using an Ollama model. I compared a raw session (Passthrough) against one routed through ContextForge.

Metric	Passthrough Mode	ContextForge Mode	Difference
LLM round-trips	41	14	66% fewer
Input tokens	1,632,266	444,092	72.8% fewer
Output tokens	1,632,266	384,033	76.5% fewer
Session Compression	—	60,059 (13.5%)	—

Understanding the Metrics:

Workflow Savings (72.8%): These are tokens that were never generated because the tooling changed the workflow. The model used the local graph to find symbols instead of "guessing" via file searches, solving the task in 14 steps instead of 41.
Session Compression (13.5%): This is the actual text removed from the prompts within the session via skeletonization and deduplication.
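The percentages in the table can be reproduced from the raw counts with a few lines of arithmetic. The variable names are just labels for the table cells. One observation, offered as an inference rather than something the post states: the 384,033 figure equals 444,092 minus the 60,059 compressed tokens.

```python
def pct_fewer(before, after):
    """Percentage reduction from `before` to `after`, to one decimal place."""
    return round(100.0 * (before - after) / before, 1)

passthrough_tokens = 1_632_266   # Passthrough column
forge_input_tokens = 444_092     # ContextForge input tokens
third_row_tokens = 384_033       # ContextForge figure in the third row
compressed_tokens = 60_059       # Session Compression row

round_trip_saving = pct_fewer(41, 14)
input_saving = pct_fewer(passthrough_tokens, forge_input_tokens)
third_row_saving = pct_fewer(passthrough_tokens, third_row_tokens)
compression_share = round(100.0 * compressed_tokens / forge_input_tokens, 1)
remaining_after_compression = forge_input_tokens - compressed_tokens
```

These come out to 65.9, 72.8, 76.5, and 13.5 percent respectively, matching the table after rounding.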

Note: These results are from a specific, repository-heavy task. Savings vary significantly based on the work—long refactors benefit most, while short chats benefit much less.

Get Started

I've just released v1.0.3 and I'm looking for feedback from the community.

Install: npm i -g @anuj612/contextforge
GitHub: https://github.com/anujkushwaha612/ContextForge

Note: No compiler needed — ships with prebuilt native binaries for Windows, macOS, and Linux via npm.

I’d love to hear your thoughts on the project, and I'm ready to tackle any new bugs and issues that come up.

u/Independent_Pick3116 — 20 hours ago

▲ 9 r/LLMDevs+4 crossposts

TokenMizer - a local proxy for session checkpoint/resume and graph memory across Claude, GPT, and Ollama

I've been building TokenMizer, a local proxy that sits between your editor/CLI and whatever model you're using (Claude, GPT, Ollama) and handles two things I kept re-solving by hand: session checkpoint/resume, and a graph-based memory instead of a flat transcript.

The problem: once a long agent session hits the context limit, the usual fix is summarization, and summaries lose the reasoning behind a decision, not just the decision itself. I'd see a summary saying "switched to Argon2" with no trace of why bcrypt was rejected, so the agent would re-litigate the same tradeoff two sessions later. Flat transcripts have the opposite problem: everything is kept, but nothing is prioritized, so retrieval is just recency-biased keyword luck.

What TokenMizer does differently: instead of one growing text blob, decisions, constraints, and open questions are stored as nodes with edges (this decision depends on that constraint, this question was resolved by that decision). Checkpointing snapshots that graph plus a resumable session state, so you can kill a session and pick it back up without replaying the whole history through the model again.
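To make the "nodes with edges" idea concrete, here is a minimal sketch of a decision graph with a JSON checkpoint. It is not TokenMizer's data model: the class, the relation name `depends_on`, and the example constraint text are hypothetical.

```python
import json
from dataclasses import dataclass, field

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)  # id -> {"kind": ..., "text": ...}
    edges: list = field(default_factory=list)  # [source id, relation, target id]

    def add(self, node_id, kind, text):
        self.nodes[node_id] = {"kind": kind, "text": text}

    def link(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append([src, relation, dst])

    def why(self, node_id):
        """Return the text of everything this node depends on."""
        return [
            self.nodes[dst]["text"]
            for src, relation, dst in self.edges
            if src == node_id and relation == "depends_on"
        ]

    def checkpoint(self):
        """Serialize the whole graph so a session can be resumed later."""
        return json.dumps({"nodes": self.nodes, "edges": self.edges})

    @classmethod
    def resume(cls, blob):
        data = json.loads(blob)
        return cls(nodes=data["nodes"], edges=data["edges"])

graph = MemoryGraph()
graph.add("c1", "constraint", "Password hashing must be memory-hard (example constraint)")
graph.add("d1", "decision", "Switched to Argon2")
graph.link("d1", "depends_on", "c1")

blob = graph.checkpoint()
restored = MemoryGraph.resume(blob)
reasons = restored.why("d1")
```

The difference from a summary is that the reason survives the checkpoint as its own node, so a resumed session can look up why a decision was made instead of re-deriving it.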

Where it's rough: there's no eval harness yet comparing retrieval quality against a naive flat-transcript baseline, so right now my evidence is anecdotal (my own sessions), not benchmarked. I also learned the hard way that benchmarking your own memory system by asking it questions only it can answer is circular, so I'm holding off on publishing numbers until I have an honest comparison.

Repo: github.com/Shweta-Mishra-ai/tokenmizer (I'm the author). It's a Python project, MIT licensed. If you've hit the same summarization-loses-reasoning problem, I'd be interested in how you're handling it, and PRs/issues on the eval-harness gap would genuinely help.

u/Feisty-Cranberry2902 — 18 hours ago

▲ 2 r/LLMDevs

Memory for AI agents

The native LLM IDEs continue to solve for “memory”.
How do you see companies like supermemory.com or mem0.ai getting adopted, or are we looking at a pivot of some sort on their part?

The third option, memory.store, seems to stand out with its “organisation brain” application, which is essentially what the others also offer. Also, YC wanted “memory for organisation”.

Does their “memory” supplement the native memory? Has anyone used these across enterprises or in their individual capacity?

u/Able_Development_488 — 16 hours ago

▲ 173 r/LLMDevs+20 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models like Gemma4 and Qwen3.6, with multi-modal input (image, vision, audio), reasoning, and function tools. It runs on Windows/macOS/Linux and fully leverages the GPU. The API is compatible with both the OpenAI and Ollama interfaces.

I'd really appreciate it if you could try it and give me some feedback. If you like it, a star would mean a lot. Thank you very much!

u/fuzhongkai — 1 day ago

▲ 26 r/LLMDevs

I'm calling it now. OpenAI is sandbagging LLM development with codex 5.5

I've been working on a custom attention mechanism for almost 4 days straight now and I swear I'm going backwards... Codex keeps disabling key tests that are required to keep everything in check, and is constantly amazed at the incredible blunders it keeps stumbling upon... blunders that it wrote...

Anyone else noticing similar brain fog when it comes to LLM development using the Codex CLI?

u/johnnyApplePRNG — 1 day ago

▲ 5 r/LLMDevs+2 crossposts

I built an LLM eval gate that can't silently pass

https://github.com/albertofettucini/faithgate

Most LLM eval setups I've seen have a failure mode ops people will recognize: the happy path is green, and every unhappy path is also green. Judge API dies, no scores, nothing to compare, pass. That's an availability metric wearing a quality-gate costume.

I built faithgate around the opposite default. It's a faithfulness regression gate (suite of cases, score per prompt/model version, diff vs baseline, nonzero exit on regression) where every ambiguous state fails closed. Zero matched cases: fail. Unscored run: fail. Every score an abstention: fail. Abstentions are a distinct state in storage, never coerced to 0.0, and there's a --max-abstained policy flag for when you actually want tolerance.
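A sketch of what "every ambiguous state fails closed" can look like in code. This is an illustration, not faithgate's implementation: the function signature, the use of `None` for abstentions, the `EXIT_FAIL = 1` value, and the default `max_abstained=0` are assumptions (the post only says the exit is nonzero on regression).

```python
EXIT_PASS = 0
EXIT_FAIL = 1  # assumed value; the post only specifies "nonzero"

def gate(head, baseline, max_abstained=0, tolerance=0.0):
    """Compare head scores to baseline; anything ambiguous is a failure.

    `head` and `baseline` map case key -> score, where None is an abstention
    (kept distinct from 0.0 on purpose).
    """
    matched = [key for key in head if key in baseline]
    if not matched:
        return EXIT_FAIL, "zero matched cases"
    scored = [
        key for key in matched
        if head[key] is not None and baseline[key] is not None
    ]
    if not scored:
        return EXIT_FAIL, "every matched case is an abstention"
    abstained = len(matched) - len(scored)
    if abstained > max_abstained:
        return EXIT_FAIL, f"{abstained} abstained cases exceed the allowed {max_abstained}"
    regressed = [key for key in scored if head[key] < baseline[key] - tolerance]
    if regressed:
        return EXIT_FAIL, f"regression in {sorted(regressed)}"
    return EXIT_PASS, "no regression"

baseline = {"q1": 0.9, "q2": 0.8}
ok = gate({"q1": 0.9, "q2": 0.85}, baseline)
empty = gate({}, baseline)                        # unscored run
all_abstain = gate({"q1": None, "q2": None}, baseline)
one_abstain = gate({"q1": 0.9, "q2": None}, baseline)
tolerated = gate({"q1": 0.9, "q2": None}, baseline, max_abstained=1)
regress = gate({"q1": 0.5, "q2": 0.8}, baseline)
```

The key design choice is that the only path to `EXIT_PASS` is the last line; every earlier return is a failure, so a dead judge API cannot fall through to green.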

Reproducibility bits: every run writes a manifest with judge id, model, kind, ragas and runner versions, and the suite hash. If the judge changed between baseline and head, comparing the scores is meaningless, so the gate exits 3 unless you explicitly pass --allow-judge-change. A corrupted manifest also fails closed. Duplicate case keys resolve pessimistically (baseline keeps max, head keeps min) so dupes can't quietly lower the bar.
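The pessimistic duplicate rule and the judge-change check can be sketched as below. The manifest field names (`judge_id`, `model`, `kind`) and the helper names are assumptions for illustration; only the "baseline keeps max, head keeps min" rule and exit code 3 come from the post.

```python
EXIT_JUDGE_CHANGED = 3  # from the post

def resolve_duplicates(rows, side):
    """Collapse duplicate case keys so dupes can never lower the bar.

    Baseline keeps the highest score per key, head keeps the lowest.
    """
    if side not in ("baseline", "head"):
        raise ValueError("side must be 'baseline' or 'head'")
    pick = max if side == "baseline" else min
    resolved = {}
    for key, score in rows:
        resolved[key] = score if key not in resolved else pick(resolved[key], score)
    return resolved

JUDGE_FIELDS = ("judge_id", "model", "kind")  # hypothetical manifest keys

def judge_check(baseline_manifest, head_manifest, allow_judge_change=False):
    """Return an exit code: 0 if scores are comparable, 3 if the judge differs."""
    try:
        same = all(baseline_manifest[f] == head_manifest[f] for f in JUDGE_FIELDS)
    except (KeyError, TypeError):
        return EXIT_JUDGE_CHANGED  # unreadable manifest: fail closed too
    if same or allow_judge_change:
        return 0
    return EXIT_JUDGE_CHANGED

rows = [("q1", 0.9), ("q1", 0.7), ("q2", 0.8)]
baseline_scores = resolve_duplicates(rows, "baseline")
head_scores = resolve_duplicates(rows, "head")

old = {"judge_id": "judge-a", "model": "model-x", "kind": "llm"}
new = {"judge_id": "judge-b", "model": "model-x", "kind": "llm"}
changed = judge_check(old, new)
allowed = judge_check(old, new, allow_judge_change=True)
corrupt = judge_check(old, None)
```

With the same duplicated rows, the baseline resolves `q1` to 0.9 and the head to 0.7, so an accidental duplicate makes the gate stricter rather than looser.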

My favorite part lives in CI. Next to the normal green gate there's a proves-detection job that runs the gate against a deliberately regressed suite and inverts the exit code. If the gate ever loses the ability to catch a known-bad change (because of a dependency bump, a refactor, whatever), the pipeline itself goes red. Tests for the test.

Judge honesty: default is Claude via your own key (RAGAS underneath). The keyless offline mode is published as untrustworthy: 68% balanced accuracy on a 40-example hand-labeled set, catching 9/20 unfaithful cases, with a unit test asserting the weakness.

Storage is one SQLite file with WAL, no server. Python 3.9 to 3.13, MIT. Known limitation: case identity is content-based, rewording a question mints a new case.

u/ahumanbeingmars — 23 hours ago

▲ 3 r/LLMDevs

litellm's price map is community maintained so it lags, curious how people deal with it

litellm can pull model_prices_and_context_window.json live from github which is handy, but that file is community maintained. so prices lag, new models take a while to appear (or never do), and sometimes the numbers are just wrong until someone opens a PR. how do you all handle this, just override per model in config?

What I ended up doing is pointing litellm at my own map instead. It's the same env var, so it works for both the Python SDK and the gateway proxy:
export LITELLM_MODEL_COST_MAP_URL="https://cloudprice.net/api/v2/ai/litellm_model_prices.json"

Same schema, we just pull straight from each provider and refresh every day. It also has image/audio/video/rerank/ocr pricing, not just chat/embeddings.

Right now around 340 models come back with pricing that's not in the litellm map at all, mostly fresh releases like openrouter/z-ai/glm-5.2, openrouter/deepseek/deepseek V4 or for vercel.

It's completely free (with some throttling to avoid issues), no key, CORS on.

Anyway the thing I actually wanted to ask: would it make sense for litellm to support multiple cost map sources with a fallback, right in the gateway UI? like a primary url plus fallbacks, and if one is missing a model it falls through to the next. feels like that would fix the whole stale/missing thing no matter whose map you use.
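The fall-through behavior proposed above is simple to express over already-loaded maps. This sketch is not a litellm feature; `lookup_price`, `merge_maps`, the source names, and the model names are hypothetical, and the maps are plain dicts in the same general shape as the price JSON (model name to a dict of cost fields).

```python
def lookup_price(model, sources):
    """Return (source name, entry) from the first source that knows the model.

    `sources` is an ordered list of (name, price_map) pairs: primary first,
    then fallbacks.
    """
    for name, price_map in sources:
        entry = price_map.get(model)
        if entry is not None:
            return name, entry
    return None, None

def merge_maps(sources):
    """Flatten all sources into one map; earlier sources win on conflicts."""
    merged = {}
    for _, price_map in reversed(sources):
        merged.update(price_map)
    return merged

primary = {"example/old-model": {"input_cost_per_token": 1e-06}}
fallback = {
    "example/old-model": {"input_cost_per_token": 9e-06},  # stale duplicate
    "example/new-model": {"input_cost_per_token": 2e-06},
}
sources = [("primary", primary), ("fallback", fallback)]

known = lookup_price("example/old-model", sources)
fresh = lookup_price("example/new-model", sources)
missing = lookup_price("example/unknown", sources)
merged = merge_maps(sources)
```

Returning the source name alongside the price makes it possible to show in a UI which map a given number came from, which matters when the sources disagree.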

u/Gaploid — 22 hours ago

▲ 19 r/LLMDevs+10 crossposts

I built Curion, a librarian-like memory agent for AI agents

I’ve been working on Curion, a memory system for AI agents built around a simple idea:

The main agent should not have to manage memory manually.

Most AI agents are useful inside a single session, but they still lose important context between sessions. Project decisions, implementation history, constraints, unresolved tasks, and previous reasoning often disappear unless I manually write long handoff notes.

At first, the obvious solution seems to be giving the agent memory tools: save, search, update, delete, edit.

But that creates a second problem.

If the main agent has to manage memory by itself, it can easily receive too many raw memories. Some are relevant, some are stale, some are only partially related, and some may conflict with newer information. The agent then has to spend context and attention deciding what matters.

That creates context bloat.

Curion takes a different approach.

I think of Curion as a librarian for AI agents.

A good librarian does not just throw every possibly related book at you. They understand the question, know how information is organized, filter what matters, notice conflicts, ask clarifying questions when needed, and return the most useful context.

That is what Curion is meant to do for agent memory.

The main agent only needs to say:

“I want to remember this.”

“I need to recall something about this.”

Curion handles the rest.

When saving memory, Curion can decide how information should be stored, whether it relates to existing records, whether something should be updated, and whether a conflict requires clarification.

When recalling memory, Curion does not just dump raw search results into the agent’s context. It retrieves relevant records, evaluates what is useful for the current task, synthesizes the context, and clearly says when nothing relevant was found.

The analogy I use is human memory. When we want to remember something, we do not consciously search through billions of memories. We ask for what we need, and the relevant memory appears automatically beneath the surface.

Curion is built around that same interface idea for AI agents.

It is project-first: Curion focuses on the project the agent is currently working in. It can also use cross-project recall when information from another project is actually relevant.

Curion is not just a save/search tool. It is a collaborative memory layer: a specialized memory librarian that helps agents remember responsibly, reduces context bloat, and gives the main agent only the context it actually needs.

GitHub: https://github.com/geanatz/curion

NPM: https://www.npmjs.com/package/@geanatz/curion

Portfolio: https://geanatz.com

u/geanatz — 1 day ago

▲ 10 r/LLMDevs+1 crossposts

[P] I found the standard way people measure KV cache quantization quality is blind to the cache, then built a 2 bit value cache that matches KIVI at half the bits

Been working on KV cache compression for long context inference on small GPUs. Two findings worth sharing.

The measurement trap. A lot of perplexity checks for KV quantization run a single forward pass with the cache disabled. In that mode the model reads exact full precision values and the quantizer never runs, so the metric literally cannot detect value cache quantization error. When I tested it, full precision, 4 bit, and 2 bit all gave the identical perplexity of 3.6416, because none of them actually ran on the cache. I switched to a cache path test that prefills and then decodes token by token, so the compressed cache is really read back.
The method. Rotate the value vectors with a Hadamard matrix, then quantize to 2 bit uniform. The rotation spreads outliers so a coarse grid fits, and since the matrix is its own inverse you undo it after the attention sum for free. Keys stay on KIVI int4, only values change. Result on the corrected metric: my 2 bit value cache matches KIVI 4 bit quality to three decimals, uses about 20 percent less memory, roughly 4 times less than fp16. Holds across Llama 2 7B and TinyLlama, reproduced on a second machine.
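A dependency-free sketch of the rotate-then-quantize idea, to show why the rotation can be undone after the attention sum. It is not the paper's code: the per-vector min/max grid, the helper names, and the toy values are assumptions, the key side (KIVI int4) is not modeled, and no quality claim is made from this toy.

```python
import math

def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform; applying it twice is the identity."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize_2bit(vec):
    """Uniform 2-bit quantization: 4 levels between the vector's min and max."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0  # avoid dividing by zero on constant vectors
    codes = [min(3, max(0, round((x - lo) / step))) for x in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

def attend(weights, values):
    """Attention output for one query: a weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

values = [[0.1, -0.2, 8.0, 0.3], [0.0, 0.4, -0.1, 0.2]]  # toy values, one outlier
weights = [0.75, 0.25]

# Write path: rotate each value vector, then store 2-bit codes.
cache = [quantize_2bit(hadamard(v)) for v in values]
# Read path: dequantize, take the attention sum, undo the rotation once.
approx = hadamard(attend(weights, [dequantize(*entry) for entry in cache]))

exact = attend(weights, values)
# Without quantization, rotating before and un-rotating after the sum is exact,
# because the transform is linear and its own inverse.
unquantized = hadamard(attend(weights, [hadamard(v) for v in values]))
```

Because the inverse rotation is applied once to the summed output rather than to every cached vector, its cost does not grow with context length.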

Honest limits: only compared to KIVI, not the newest rotational methods. Decode is 6 to 12 percent slower without a fused kernel. My first idea, ternary at 1.58 bit, actually failed once measured properly, and rotation did not rescue it, so the paper reports that too.

Paper: github.com/aryxnsdfs/kv-hadamard/blob/main/paper/kv_hadamard_paper.pdf

Code, data, figures: github.com/aryxnsdfs/kv-hadamard

Happy to answer questions.

u/Interesting-Owl6064 — 1 day ago

▲ 121 r/LLMDevs+1 crossposts

I built a knowledge canvas tool that lets you branch LLM conversations

I use LLMs to learn things a lot, but I often don't understand something in a response, or I just want to dive deeper into a topic. That's why I built this canvas for your notes (a rich, Notion-like text editor).

I wanted to keep it as simple as possible while letting you bring in all your sources (YouTube videos, research papers, PDFs, web links, articles, etc.)

let me know if it sounds interesting and I can dm you the link!

super early version but looking to get 5-10 people in a discord community to make this the best platform for learning information using AI

u/No-Hurry-2568 — 2 days ago

▲ 1 r/LLMDevs

Cost-routing tasks across models in one session - cheap model for grunt work, frontier for reasoning, local for sensitive code

Been experimenting with model routing at the workflow level instead of the app level and wanted to share what it looks like in practice.

The setup: Zero, an open source coding agent (github.com/gitlawb/zero) that treats the model as a swappable component. It talks to 25+ providers - OpenAI, Anthropic, Gemini, DeepSeek, Qwen, Groq, plus local models through Ollama or LM Studio - and you switch mid-session with /model without losing context.

The routing pattern I've settled into:

Cheap/fast model for scaffolding, file reads, summaries, boilerplate
Frontier model only for the steps that need real reasoning - the escalation is a single command, same context
Local model for anything touching code or data I don't want leaving the machine

The cost curve changes completely. Instead of paying frontier prices for 100% of tokens, you pay them for the 20% of steps that actually need it. Over a week of heavy use the difference is not subtle.

Implementation details that matter: sessions are files on disk (resumable/forkable, so routing decisions survive restarts), it's a single Go binary, no telemetry, and there's a headless mode (zero exec, streams JSON) if you want to wire the routing into scripts or CI instead of doing it interactively.

Open question I haven't solved: my escalation decisions are still vibes-based. Has anyone built actual heuristics for when a task deserves the expensive model - token-count thresholds, retry-on-failure escalation, task classification? Curious what's working for people running this at scale.
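One possible shape for the heuristics mentioned above (task classification, retry-on-failure escalation, a size threshold), written as a plain function. Everything here is a placeholder to tune: the marker strings, the task kinds, the word-count proxy for tokens, and the threshold are hypothetical and not part of Zero.

```python
SENSITIVE_MARKERS = ("secret", ".env", "customer data")  # hypothetical markers
REASONING_KINDS = {"design", "debug", "refactor"}        # hypothetical task kinds

def choose_model(task_kind, text, failed_attempts=0, size_threshold=4000):
    """Pick a tier: 'local', 'cheap', or 'frontier'."""
    lowered = text.lower()
    # Privacy outranks everything else: sensitive material stays on the machine.
    if any(marker in lowered for marker in SENSITIVE_MARKERS):
        return "local"
    # Retry-on-failure escalation: the cheap model already had its chance.
    if failed_attempts >= 1:
        return "frontier"
    # Task classification: some kinds of work are assumed to need reasoning.
    if task_kind in REASONING_KINDS:
        return "frontier"
    # Size threshold, using word count as a crude stand-in for token count.
    if len(text.split()) > size_threshold:
        return "frontier"
    return "cheap"

boilerplate = choose_model("scaffold", "Create a README skeleton")
retry = choose_model("scaffold", "Create a README skeleton", failed_attempts=1)
debugging = choose_model("debug", "Why does this test deadlock?")
private = choose_model("debug", "Rotate the key in .env", failed_attempts=2)
```

Logging which rule fired on each call would give the data needed to replace these guesses with measured thresholds.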

u/amu4biz — 1 day ago

▲ 1 r/LLMDevs

Want to learn about RAG systems. Any resources?

Hi. I'm new to LLMs. I'm currently working on a project, but it seems I don't know enough about LLMs. Are there any resources to learn more?

u/Ivapol — 24 hours ago