PCIe topology for GPU/InfiniBand VMs
Hi everyone,
I'm working on an OpenStack deployment with several GPU-enabled nodes, each having a fairly complex PCIe topology connecting 8x H200 GPUs to 4x ConnectX-7 InfiniBand NICs.
PCI passthrough is working correctly and inside the VM we can see all GPUs, NVSwitches, and NICs without issues.
However, the default libvirt XML generated by Nova is not enough to reach near bare-metal performance for distributed AI workloads. We need to:
- pin guest memory to the correct NUMA nodes
- pin vCPUs appropriately
- create a guest PCIe topology that closely mirrors the host topology
NVIDIA documents this approach here:
Without these adjustments, topology-aware libraries like NCCL cannot correctly compute optimal communication graphs, and microbenchmark performance is significantly worse than bare metal.
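To make the three adjustments above concrete, here is a rough Python sketch (not our actual script) of the kind of XML patching involved. It uses standard libvirt domain elements (`<cputune>/<vcpupin>`, `<numatune>/<memnode>`, and `pcie-expander-bus` controllers), but the function name, the CPU numbers, the NUMA node IDs and the `busNr` values are made-up placeholders. It also assumes the guest NUMA cells are already defined under `<cpu><numa>`, and it leaves out the root ports and hostdev `<address>` changes that a real topology would need.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the first child named `tag`, creating it if it is missing."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def patch_domain_xml(xml_text, vcpu_pins, mem_nodes, expander_buses):
    """Hypothetical helper: add pinning and expander buses to a domain XML.

    vcpu_pins:      {guest vCPU index: host cpuset string}
    mem_nodes:      {guest NUMA cell id: host nodeset string}
    expander_buses: list of (busNr, guest NUMA node) tuples
    """
    root = ET.fromstring(xml_text)

    # Pin each guest vCPU to a host CPU set, replacing any existing pins.
    cputune = get_or_create(root, "cputune")
    for old in cputune.findall("vcpupin"):
        cputune.remove(old)
    for vcpu, cpuset in sorted(vcpu_pins.items()):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=cpuset)

    # Bind the memory of each guest NUMA cell to a host NUMA node.
    numatune = get_or_create(root, "numatune")
    for old in numatune.findall("memnode"):
        numatune.remove(old)
    for cell, nodeset in sorted(mem_nodes.items()):
        ET.SubElement(numatune, "memnode", cellid=str(cell),
                      mode="strict", nodeset=nodeset)

    # Add one PCIe expander bus per entry, tied to a guest NUMA node.
    devices = get_or_create(root, "devices")
    for bus_nr, node in expander_buses:
        ctrl = ET.SubElement(devices, "controller",
                             type="pci", model="pcie-expander-bus")
        target = ET.SubElement(ctrl, "target", busNr=str(bus_nr))
        ET.SubElement(target, "node").text = str(node)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder input; a real domain XML dumped from libvirt is much larger.
    sample = "<domain type='kvm'><name>demo</name><devices/></domain>"
    print(patch_domain_xml(sample, {0: "4", 1: "5"}, {0: "0"}, [(64, 0)]))
```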
Our current workflow is roughly:
- create the VM normally through Nova
- intercept/dump the libvirt XML from nova_libvirt
- patch the XML with a custom script following the NVIDIA recommendations
- restart the domain with virsh
After this, performance becomes extremely close to bare metal and everything works well.
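For reference, the dump/redefine/restart part of that workflow boils down to something like the following Python sketch. It only uses standard `virsh` subcommands (`dumpxml`, `define`, `destroy`, `start`); the function names are hypothetical, and it assumes `virsh` is reachable from wherever the script runs (for example inside the nova_libvirt container).

```python
import subprocess


def dump_xml(domain):
    """Return the current libvirt XML of `domain` as a string."""
    result = subprocess.run(["virsh", "dumpxml", domain],
                            check=True, capture_output=True, text=True)
    return result.stdout


def redefine_commands(domain, patched_xml_path):
    """Build the virsh calls that load the patched XML and restart the domain."""
    return [
        ["virsh", "define", patched_xml_path],  # update the persistent config
        ["virsh", "destroy", domain],           # hard power-off of the guest
        ["virsh", "start", domain],             # boot with the patched config
    ]


def apply_patched_xml(domain, patched_xml_path):
    """Run the redefine/restart sequence, stopping at the first failure."""
    for cmd in redefine_commands(domain, patched_xml_path):
        subprocess.run(cmd, check=True)
```

A call would look like `apply_patched_xml("instance-0000002a", "/tmp/patched.xml")`, where both the domain name and the path are placeholders.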
The problem is that any Nova-driven operation (soft reboot, hard reboot, cold migration, etc.) regenerates the libvirt XML, so we need to repeat the entire procedure every time.
My question is:
Does Nova expose any mechanism to deeply customize or persist libvirt XML configuration for instances?
I know about flavor/image metadata and extra specs, but they seem too limited for this level of topology customization. Ideally we'd like a cleaner and more OpenStack-native approach than patching XML after instance creation.
Has anyone here tackled something similar for high-performance GPU/NVLink/InfiniBand workloads?
Thanks!