bug was invisible in staging for four months. showed up in production on day one.
new client. e-commerce checkout flow. we'd tested it thoroughly. every payment path, every edge case we could think of. staging was clean. QA signed off. we shipped.
within six hours of go-live, orders were occasionally completing on the frontend but never hitting the order management system. customers getting confirmation emails. warehouse getting nothing. nobody knew until a customer called asking where their order was.
pulled everything apart. the issue was environment-specific database connection pooling behavior. under real concurrent load, connections were timing out silently and the fallback wasn't logging the failure, just swallowing it and returning success upstream.
staging never had more than three people on it at once. production had real traffic. that was the entire difference.
four months of clean tests meant nothing because we'd never tested the thing the environment actually does differently from staging: volume.
lesson we don't forget now. load test in an environment that mirrors production connection limits before anything ships.