Maybe I'm overengineering this, but managing AI workloads in production feels weirdly fragmented right now.
I have:
- normal app monitoring
- separate GPU metrics
- separate prompt/version tracking
- separate model evaluation logs
- separate cost dashboards
- and then random scripts duct-taped between all of them
The actual inference part is becoming easier than managing the infrastructure around it.
Curious if people are converging on a stack yet, or if everyone else also has a pile of semi-connected tooling.