u/Adr-740

I kept running into the same pattern in production:

LLMs being used for classification-style tasks like:

- intent detection

- tagging

- moderation

…but most of those calls are actually very simple.

So I tested it.

On a standard benchmark (Banking77):

→ roughly 90% or more of inputs can be handled by a lightweight ML model

→ while keeping ~95% agreement with the LLM's labels
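Those two numbers trade off against each other through a confidence threshold on the cheap model. Here is a minimal sketch of how they could be measured. It assumes the cheap model outputs a label plus a confidence score and that the LLM's label is the reference. The post doesn't say whether agreement is computed over all inputs or only over the ones the cheap model handles, so the sketch reports both. `Example`, `coverage_and_agreement`, and the data are hypothetical toy values, not Banking77 results.

```python
from dataclasses import dataclass


@dataclass
class Example:
    cheap_label: str   # label predicted by the lightweight model
    cheap_conf: float  # its confidence in that label, in [0, 1]
    llm_label: str     # label the LLM assigned (treated as the reference)


def coverage_and_agreement(examples, threshold):
    """Return (coverage, routed_agreement, overall_agreement) at a threshold.

    coverage: fraction of inputs the cheap model would handle.
    routed_agreement: agreement with the LLM on those handled inputs only.
    overall_agreement: agreement over all inputs, where deferred inputs
        go to the LLM and therefore agree by construction.
    """
    routed = [e for e in examples if e.cheap_conf >= threshold]
    n = len(examples)
    coverage = len(routed) / n if n else 0.0
    matches = sum(e.cheap_label == e.llm_label for e in routed)
    routed_agreement = matches / len(routed) if routed else 1.0
    deferred = n - len(routed)
    overall_agreement = (matches + deferred) / n if n else 1.0
    return coverage, routed_agreement, overall_agreement


# Toy data for illustration only.
data = [
    Example("card_lost", 0.97, "card_lost"),
    Example("balance", 0.91, "balance"),
    Example("refund", 0.88, "transfer_failed"),
    Example("top_up", 0.45, "top_up"),
]
for t in (0.5, 0.9):
    print(t, coverage_and_agreement(data, t))
```

On this toy data, raising the threshold from 0.5 to 0.9 drops coverage from 0.75 to 0.5 but lifts agreement on the routed inputs from 2/3 to 1.0.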

Built a small library around that idea:

→ It learns from your LLM outputs

→ routes “easy” cases to a cheap model

→ keeps hard ones on the LLM

→ with a quality guarantee: you set the agreement threshold it has to meet
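Here is one way the routing and the user-set threshold could fit together. This is a sketch, not the library's actual API. It picks the lowest confidence threshold whose agreement with the LLM on a held-out set of logged LLM outputs stays at or above the target, then sends only inputs above that threshold to the cheap model. `calibrate_threshold`, `route`, `toy_cheap_model`, and `toy_llm` are hypothetical names. The cheap model is assumed to be any classifier already trained on past (input, LLM label) pairs, and "quality" is assumed to mean agreement on the routed inputs. The sketch only enforces the target on the held-out set.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signatures: the cheap model returns (label, confidence);
# the LLM call returns a label.
CheapModel = Callable[[str], Tuple[str, float]]
LLMCall = Callable[[str], str]


def calibrate_threshold(
    cheap_model: CheapModel,
    validation: Sequence[Tuple[str, str]],  # (input text, LLM label) pairs
    target_agreement: float,
) -> float:
    """Lowest confidence threshold whose routed agreement meets the target.

    Returns inf (route nothing to the cheap model) if no threshold qualifies.
    """
    scored = []
    for text, llm_label in validation:
        label, conf = cheap_model(text)
        scored.append((conf, label == llm_label))
    # Try candidate thresholds from most to least permissive.
    for threshold in sorted({conf for conf, _ in scored}):
        routed = [ok for conf, ok in scored if conf >= threshold]
        if sum(routed) / len(routed) >= target_agreement:
            return threshold
    return float("inf")


def route(text: str, cheap_model: CheapModel, call_llm: LLMCall,
          threshold: float) -> Tuple[str, str]:
    """Return (label, source), where source is 'cheap' or 'llm'."""
    label, conf = cheap_model(text)
    if conf >= threshold:
        return label, "cheap"
    return call_llm(text), "llm"


# Toy stand-ins for illustration only.
def toy_cheap_model(text: str) -> Tuple[str, float]:
    if "card" in text:
        return "card_lost", 0.95
    if "refund" in text:
        return "refund", 0.80
    return "other", 0.30


def toy_llm(text: str) -> str:
    return "llm_label"


validation = [
    ("lost my card", "card_lost"),
    ("card stolen", "card_lost"),
    ("refund please", "transfer_failed"),
    ("where is my money", "transfer_failed"),
]
t = calibrate_threshold(toy_cheap_model, validation, target_agreement=0.95)
print(t)                                                     # 0.95
print(route("lost my card", toy_cheap_model, toy_llm, t))   # ('card_lost', 'cheap')
print(route("refund please", toy_cheap_model, toy_llm, t))  # ('llm_label', 'llm')
```

Choosing the lowest qualifying threshold sends as much traffic as possible to the cheap model, while held-out agreement stays at or above the target.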

Result:

massive cost reduction (most calls never hit the LLM) without noticeable quality degradation

Fully open-sourced here:

Would love feedback from people running high-volume LLM pipelines. Curious whether you’re seeing the same pattern.