
I've been suffering from horrible performance when using my NanoGPT subscription with models like GLM 5.1 and Gemma 4 due to requests being routed to a provider with a huge delay even for simple requests.
I'm talking about saying "Hi" and having to wait 50 seconds to get a hello back. I often get routed to providers that take 40x longer than expected. I know subscription usage means worse providers, but that should mean a few seconds, not tens of seconds.
I sent a message to the CEO, who I've seen active on Reddit, asking if NanoGPT has ways to evaluate the providers and temporarily block the ones that are clearly overloaded/unresponsive, instead of just defaulting to the cheapest.
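To illustrate the kind of check I mean, here's a minimal sketch of latency-aware routing. This is not how NanoGPT actually works (I have no idea what their routing looks like internally). The class, the function, the 5-second threshold and the provider names and prices in the usage example are all made up for illustration.

```python
from collections import deque
from statistics import median


class ProviderStats:
    """Tracks the most recent response times for one backend provider (hypothetical)."""

    def __init__(self, name, price, window=5):
        self.name = name
        self.price = price
        self.latencies = deque(maxlen=window)  # keeps only the last `window` samples

    def record(self, seconds):
        self.latencies.append(seconds)

    def is_healthy(self, max_median_seconds):
        # A provider with no samples yet gets the benefit of the doubt.
        if not self.latencies:
            return True
        return median(self.latencies) <= max_median_seconds


def pick_provider(providers, max_median_seconds=5.0):
    """Pick the cheapest provider among those that are currently responsive."""
    healthy = [p for p in providers if p.is_healthy(max_median_seconds)]
    # If every provider looks slow, fall back to all of them rather than failing.
    candidates = healthy or providers
    return min(candidates, key=lambda p: p.price)


# Usage with made-up numbers: the cheap provider is slow, so it gets skipped.
cheap_slow = ProviderStats("provider_a", price=0.10)
pricier_fast = ProviderStats("provider_b", price=0.15)
for s in (50.0, 55.0, 60.0):
    cheap_slow.record(s)
pricier_fast.record(1.5)
chosen = pick_provider([cheap_slow, pricier_fast])
```

Because the window only holds the last few samples, a slow provider isn't excluded for good. It becomes eligible again once its recent responses are fast, though a real router would also have to keep sending it the occasional request to find that out.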
I also asked if I and other people will continue to have this issue or if this is something that is going to be fixed. After two weeks, the experience is still pretty bad and I haven't gotten a reply at all, so I'll probably be cancelling my subscription, especially since the $8 -> $12 price increase.
It's very disappointing that I can't exclude the bad provider without switching to pay-as-you-go pricing - which basically makes the subscription useless for me. NanoGPT doesn't even tell the user which provider was used, so even if excluding providers were possible, I'd have to manually benchmark and compare all of the providers to determine which one is the sucky one - even though that's literally what I'm supposed to be paying NanoGPT for, to route my requests.
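For anyone who wants to do that manual benchmarking anyway, a minimal timing helper could look like the sketch below. `time_call` is a name I made up, and `send` stands for whatever function you write to send the test prompt to a given endpoint. No real API call is shown here.

```python
import time


def time_call(send, repeats=3):
    """Return the wall-clock duration in seconds of each of `repeats` calls to `send`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        send()  # assumption: `send` blocks until the full response has arrived
        durations.append(time.perf_counter() - start)
    return durations
```

Run it a few times per provider with the same short prompt, since a single request doesn't tell you much when routing changes between calls.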
I realize that if you don't know what I mean by provider and routing, this might not make much sense. Basically, NanoGPT and OpenRouter resell compute capacity (inference) from other "backend providers" like deepinfra, novita, parasail etc., forwarding your request to them. To make the most money, they of course often route requests to whichever provider is cheapest, resulting in stuff like this.
So to avoid this I'm either going to switch to using an inference provider directly, or use a subscription service that does better provider quality control for routing.
Here's a screenshot showing how we can deduce, from the format of one of the fields in the API response, that the requests taking 50 to 60 seconds come from a different provider than the one taking 1.5 seconds (all of them for the same simple prompt): https://i.ibb.co/sdyP0n24/image.png
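If you want to repeat this kind of comparison yourself, here's a rough sketch of the idea: reduce the field to its "shape" and compare latencies per shape. I'm assuming the field is an ID-like string; the sample values below are invented for illustration and are not the ones from the screenshot. Only the rough latencies (50 to 60 seconds vs. about 1.5 seconds) match what I saw.

```python
import re
from collections import defaultdict
from statistics import median


def id_shape(value):
    """Reduce an ID to its format, e.g. 'ab12-c3' becomes '<4>-<2>'."""
    # Each run of letters/digits is replaced by its length; separators are kept.
    return re.sub(r"[A-Za-z0-9]+", lambda m: f"<{len(m.group())}>", value)


def median_latency_by_shape(samples):
    """Group (id, seconds) pairs by ID format and return the median latency per format."""
    groups = defaultdict(list)
    for response_id, seconds in samples:
        groups[id_shape(response_id)].append(seconds)
    return {shape: median(values) for shape, values in groups.items()}


# Invented sample data: two ID formats with very different latencies.
samples = [
    ("chatcmpl-8f3a2b1c", 52.0),
    ("chatcmpl-19de77a0", 58.3),
    ("0a1b2c3d-1111-2222-3333-444455556666", 1.5),
    ("9f8e7d6c-aaaa-bbbb-cccc-ddddeeeeffff", 1.6),
]
for shape, seconds in median_latency_by_shape(samples).items():
    print(shape, seconds)
```

This only works if each provider uses a consistent ID format and the formats differ between providers, so treat a match as a hint rather than proof.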
Edit: seems like OpenCode Go uses only official providers plus fireworks and deepinfra for GLM. I'll test that out next, it's cheaper too.
Edit: OpenCode Go is not any better for GLM 5.1 (huge delays) - so either zai or deepinfra is out of compute. Kimi k2.6 works perfectly though, with moonshot being the only provider.