u/DoubtfulJoe

I had access to a few machines with different CPU architectures and GPU configurations and ran benchmarks on several KVM/libvirt GPU passthrough VMs to see how much vCPU pinning and guest NUMA topology actually matter.


TL;DR: vCPU pinning was usually the safest starting point. Exposing NUMA topology to the guest could unlock much higher CPU-side memory bandwidth on multi-NUMA hosts, but it was not a guaranteed win. In some multi-GPU cases it made collective bandwidth worse when traffic had to cross NUMA nodes through host memory.


Tested systems: H200, MI350X, RTX PRO 6000, RTX 4090, RTX 4080, RTX 5090.
Compared:
- default-ish libvirt/QEMU layout
- vCPU pinning only
- vCPU pinning + guest NUMA topology
- vCPU pinning + guest NUMA + SMT siblings
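
For anyone who hasn't touched these knobs in libvirt: pinning lives in `<cputune>`/`<vcpupin>`, and guest NUMA is described by `<cpu><numa><cell>` plus `<numatune><memnode>`. Below is a rough Python sketch that generates those fragments. The helper names, CPU numbers, and memory sizes are made up for illustration and are not taken from the benchmark scripts.

```python
import xml.etree.ElementTree as ET


def build_pinning_xml(pin_map):
    """Build a <cputune> fragment from {guest vCPU index: host CPU id}."""
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in sorted(pin_map.items()):
        # One <vcpupin> per guest vCPU, each tied to a single host CPU.
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return ET.tostring(cputune, encoding="unicode")


def build_guest_numa_xml(cells):
    """Build <cpu><numa> and <numatune> fragments.

    cells: list of (guest vCPU range, memory in KiB, host NUMA node) tuples.
    """
    cpu = ET.Element("cpu")
    numa = ET.SubElement(cpu, "numa")
    numatune = ET.Element("numatune")
    for cell_id, (cpus, mem_kib, host_node) in enumerate(cells):
        # Guest-visible NUMA cell.
        ET.SubElement(numa, "cell", id=str(cell_id), cpus=cpus,
                      memory=str(mem_kib), unit="KiB")
        # Back that cell's memory with one specific host node.
        ET.SubElement(numatune, "memnode", cellid=str(cell_id),
                      mode="strict", nodeset=str(host_node))
    return ET.tostring(cpu, encoding="unicode"), ET.tostring(numatune, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical layout: 4 vCPUs pinned to host CPUs 8-11, split over two cells.
    print(build_pinning_xml({0: 8, 1: 9, 2: 10, 3: 11}))
    cpu_xml, numatune_xml = build_guest_numa_xml([
        ("0-1", 16 * 1024 * 1024, 0),
        ("2-3", 16 * 1024 * 1024, 1),
    ])
    print(cpu_xml)
    print(numatune_xml)
```

The "pinning only" case would use just the first fragment; the guest NUMA cases add the other two. The SMT variant would additionally need a `<topology>` element with `threads='2'`, which this sketch leaves out.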


Takeaways:
- Guest NUMA exposure made a huge difference for CPU-side memory bandwidth. STREAM improved by roughly 2x to 7x on some multi-NUMA hosts.
- vCPU pinning alone helped latency/noise in some cases, but did not unlock the big memory-bandwidth gains by itself.
- Exposing SMT siblings mostly made the guest look bigger without making it much faster. CPU throughput improved by only about 2 to 3 percent in the larger tests.
- GPU compute benchmarks, like FP16 matmul, barely changed across CPU/NUMA configs.
- NCCL was the tricky part. Plain vCPU pinning helped a lot on some systems, while guest NUMA hurt badly on others.
- NCCL was tested with `NCCL_P2P_DISABLE=1`, so these results emphasize host-memory / PCIe behavior rather than direct GPU-to-GPU links.
- One RTX 5090 setup was clearly misconfigured or platform-limited: PCIe bandwidth looked like Gen 2, so I treated those results as diagnostic rather than representative.
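
If you want to sanity-check a suspicious PCIe link like that 5090 one, Linux exposes the negotiated and maximum link speed in sysfs (`current_link_speed` and `max_link_speed` under `/sys/bus/pci/devices/<addr>/`). Here is a small Python sketch that maps those strings to a PCIe generation. The function names are my own, and this is not how the benchmark scripts detect it. Also keep in mind that GPUs often drop the link speed at idle to save power, so read it while the GPU is under load.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s -> PCIe generation.
_GTS_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5, 64.0: 6}


def pcie_gen_from_speed(speed_text):
    """Parse a sysfs string such as '5.0 GT/s PCIe' into a generation number.

    Returns None if the rate is not in the table above.
    """
    rate = float(speed_text.split()[0])
    return _GTS_TO_GEN.get(rate)


def link_status(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Compare the negotiated and maximum link generation of one PCI device."""
    dev = Path(sysfs_root) / pci_addr
    current = pcie_gen_from_speed((dev / "current_link_speed").read_text())
    maximum = pcie_gen_from_speed((dev / "max_link_speed").read_text())
    below_max = current is not None and maximum is not None and current < maximum
    return {"current_gen": current, "max_gen": maximum, "below_max": below_max}
```

Calling `link_status("0000:01:00.0")` (placeholder address, use whatever `lspci` shows for your GPU) returns both generations and a `below_max` flag. A link that stays at Gen 2 under load on a card that reports a higher maximum is worth investigating before trusting any bandwidth numbers.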


The i9-14900K desktop baseline did very well in some CPU benchmarks, especially single-threaded sysbench, where its boost behavior helped it beat the older Zen 2 EPYC in certain cases.


In a nutshell, for GPU-heavy passthrough VMs, I’d start with `vcpu_pin` and keep the guest single-NUMA. I’d only expose guest NUMA after checking where the GPUs physically sit and confirming the workload actually benefits from the extra host memory bandwidth. I plan to write a follow-up post on an algorithm for choosing the right vCPU/NUMA config based on the hardware and workload, one that also takes other tenants on the host and hugepages into account.
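
For the "where do the GPUs physically sit" check, the host kernel reports each PCI device's NUMA node in `/sys/bus/pci/devices/<addr>/numa_node` (`-1` means the platform gave no locality info). A minimal Python sketch that groups GPUs by host node is below. The helper names are hypothetical and it only reports the grouping; it does not try to pick a config for you.

```python
from collections import defaultdict
from pathlib import Path


def read_numa_node(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the host NUMA node of a PCI device, or None if the kernel reports -1."""
    node = int((Path(sysfs_root) / pci_addr / "numa_node").read_text().strip())
    return None if node < 0 else node


def group_gpus_by_node(pci_addrs, sysfs_root="/sys/bus/pci/devices"):
    """Map each host NUMA node (or None for unknown) to the GPUs attached to it."""
    groups = defaultdict(list)
    for addr in pci_addrs:
        groups[read_numa_node(addr, sysfs_root)].append(addr)
    return dict(groups)
```

Run it on the host with the addresses of the GPUs you pass through, for example `group_gpus_by_node(["0000:01:00.0", "0000:41:00.0"])` (placeholder addresses). If the result has more than one key, the GPUs span NUMA nodes, which is the situation where host-memory collectives may have to cross nodes.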


Full write-up:
https://medium.com/itnext/gpu-vm-performance-do-vcpu-pinning-and-numa-topology-really-matter-1b2093f4b45a


Benchmark scripts:
https://github.com/6erun/vcpu_benchmarks
Benchmarked vCPU pinning / guest NUMA for GPU passthrough VMs