What the April Anthropic 529 incident revealed about LLM gateway reliability posture
If you were on-call during the late-April Opus 4.7 capacity squeeze, you probably already have opinions about this. If you were not, the short version: a lot of teams discovered mid-incident that what they had configured as failover did not actually work as a reliability mechanism.
I spent most of last weekend going through public post-mortems from teams that ran through it, partly to figure out whether our own posture would have held up. The pattern I kept seeing was that teams routing through a gateway found it degrading in lockstep with the upstream it was proxying. The gateway had nowhere else to send the traffic, so it was effectively a more polished error returner. Some teams had multi-provider failover configured, but failover only helps if the secondary provider is healthy, and for a few hours in the worst part of the window the major providers were degrading in a correlated fashion.
This is making me think about gateways less as developer-convenience infrastructure and more as a serious dependency choice, with the same kind of reliability questions you would ask of a database or a queue. Three concrete questions I now ask when I evaluate one.
Does the gateway hold any inference capacity itself, or is its entire serving capacity contingent on healthy upstreams? For workloads where a 30-minute degraded window is tolerable, pure-proxy is fine. For tier-1 inference, it is not. Some newer entrants keep their own compute on the back end as a degraded-mode fallback. It is not frontier-class and not a substitute, but it turns "service fully down" into "service in degraded mode", where quality drops but requests still get answered. Whether that distinction matters depends on what you are building.
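To make the distinction concrete, here is a minimal sketch of what degraded-mode fallback looks like from the caller's side. Everything in it is hypothetical (the `UpstreamUnavailable` exception, the `Completion` shape, the backend callables); it is not how any particular gateway implements this, just the routing logic under the assumption that the fallback tier is only used once every upstream has failed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class UpstreamUnavailable(Exception):
    """Hypothetical error a backend raises when it cannot serve (e.g. overloaded)."""


@dataclass
class Completion:
    text: str
    served_by: str
    degraded: bool  # True only when the answer came from the fallback tier


# A backend is any callable that takes a prompt and returns text (assumption).
Backend = Callable[[str], str]
Named = Tuple[str, Backend]


def complete(prompt: str,
             upstreams: Sequence[Named],
             fallback: Optional[Named] = None) -> Completion:
    """Try each upstream in order; use the fallback only if all of them fail."""
    for name, call in upstreams:
        try:
            return Completion(call(prompt), served_by=name, degraded=False)
        except UpstreamUnavailable:
            continue  # this upstream is down, try the next one
    if fallback is not None:
        name, call = fallback
        # Lower quality, but the caller gets an answer and knows it is degraded.
        return Completion(call(prompt), served_by=name, degraded=True)
    # Pure-proxy case: nowhere left to send the traffic.
    raise UpstreamUnavailable("all upstreams failed and no fallback is configured")
```

The `degraded` flag is the part I would not skip: downstream code needs to know it got the weaker answer so it can decide whether to show it, queue a redo, or drop it.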
What is the migration blast radius if you need to bypass the gateway entirely? Most gateways normalize traffic to a single API shape, which is useful until the incident requires hitting a provider directly. Then the normalization becomes a migration tax. We did not have a good answer for this in our runbook before April and I am still working on what the right one is.
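One way to shrink that tax, sketched under assumptions: application code only ever calls a route picked from a small registry, and the bypass is a config flip rather than a code change. The `LLM_ROUTE` variable and the route names are made up for illustration, and the callables stand in for whatever actually talks to the gateway or to a provider's native API.

```python
import os
from typing import Callable, Mapping

# A route is any callable that takes a prompt and returns text (assumption).
Route = Callable[[str], str]


def select_route(routes: Mapping[str, Route],
                 env: Mapping[str, str] = os.environ,
                 default: str = "gateway") -> Route:
    """Pick the serving path from config; LLM_ROUTE is a hypothetical variable name."""
    name = env.get("LLM_ROUTE", default)
    try:
        return routes[name]
    except KeyError:
        raise ValueError(
            f"unknown LLM_ROUTE {name!r}; expected one of {sorted(routes)}"
        ) from None
```

This only helps if the direct route is written and exercised before the incident. The translation from the normalized shape to the provider-native one still has to live somewhere; the sketch just keeps it behind one switch instead of scattered through call sites.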
Can you charge back by on-call team after the incident? If your gateway only gives you workspace-level rollups, you cannot answer which team consumed extra budget retrying through the degradation. That is a gap that quietly becomes a problem at scale.
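As a rough sketch of the data you would need: if every request record carries a team tag and an attempt number, splitting retry spend out per team is a small aggregation. The `UsageRecord` fields here are assumptions, not any gateway's actual export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    team: str        # hypothetical per-request tag set by the caller
    attempt: int     # 1 for the first try, 2 or more for retries
    cost_usd: float


def chargeback(records: Iterable[UsageRecord]) -> Dict[str, Dict[str, float]]:
    """Sum cost per team, keeping first-try spend separate from retry spend."""
    totals = defaultdict(lambda: {"first_try_usd": 0.0, "retry_usd": 0.0})
    for r in records:
        bucket = "first_try_usd" if r.attempt == 1 else "retry_usd"
        totals[r.team][bucket] += r.cost_usd
    return dict(totals)
```

With workspace-level rollups neither field exists per request, which is exactly why the question cannot be answered after the fact.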
I do not have a clean conclusion. Mostly April reminded me that reliability decisions for AI infra need to be made against the messy case, not the healthy one. The gateway category is still maturing toward primitives that genuinely help under degradation rather than just under healthy load.
Would be useful to hear how teams here are writing runbooks for this failure mode specifically, especially the bilateral-degradation case where two upstreams degrade at the same time. Mine still feels weak on it.
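As a possible starting point for that runbook branch, here is a sketch of the first decision: given recent per-provider error rates, is failover even an option? The 20% threshold and the three state names are arbitrary assumptions for illustration, not tuned values.

```python
from typing import Mapping


def classify(error_rates: Mapping[str, float], threshold: float = 0.2) -> str:
    """Classify incident state from per-provider error rates in the range 0.0 to 1.0."""
    degraded = [p for p, rate in error_rates.items() if rate >= threshold]
    if not degraded:
        return "healthy"
    if len(degraded) < len(error_rates):
        # At least one provider is still below the threshold, so failover can help.
        return "failover_possible"
    # Every configured provider is degraded: failover just moves the errors around.
    return "correlated_degradation"
```

The third state is the one that needs its own runbook section (shed load, serve degraded, or bypass), since retrying across providers in that state mostly burns budget.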