u/fearless_expert216

I have a Python data service (gunicorn, 7 workers, 3 pod replicas, static) that the compute service calls during ML workflows. The heavy endpoint reads large datasets from S3 and processes them in-memory.

What I see in Prometheus:

- Request rate stays roughly flat during ML workflows

- p99 duration spikes to several minutes during heavy workflows

- Errors stay at zero

I suspect the high p99 is dominated by I/O wait on S3, and that under enough concurrent load in-flight requests would queue at the worker level, making horizontal autoscaling useful. But I want to confirm this with a load test before deciding which metric to scale on.

My questions:

Is sending varying levels of concurrent heavy requests and watching how key metrics (request duration, worker saturation, CPU, memory) respond a sound way to find the saturation point? Or is there a better-established approach for I/O-bound services?

For a service that pins workers waiting on S3, which metric tends to be the most predictive trigger for autoscaling? Custom worker saturation (queue length), or latency itself?

Using Prometheus with the gunicorn StatsD exporter. Open to suggestions about additional instrumentation worth adding before the test.

How to load test an I/O-bound service to choose the right autoscaling metric in Kubernetes?