u/mjf-89

PCIe topology for GPU/Infiniband VMs

Hi everyone,

I'm working on an OpenStack deployment with several GPU-enabled nodes, each having a fairly complex PCIe topology connecting 8x H200 GPUs to 4x ConnectX-7 InfiniBand NICs.

PCI passthrough is working correctly and inside the VM we can see all GPUs, NVSwitches, and NICs without issues.

However, the default libvirt XML generated by Nova is not enough to get near bare-metal performance for distributed AI workloads. We need to:

- pin guest memory to the correct NUMA nodes

- pin vCPUs appropriately

- create a guest PCIe topology that closely mirrors the host topology
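To make the three items above concrete, here is a minimal Python sketch of the kind of libvirt elements involved: `<vcpupin>` under `<cputune>`, `<memnode>` under `<numatune>`, and `pcie-expander-bus` controllers tied to a guest NUMA node. It is not our actual script. The function name, the CPU/NUMA numbers, and the controller index and `busNr` values are made up for illustration, and it assumes the guest NUMA cells are already defined in the domain and that the chosen controller indices do not collide with ones Nova already generated.

```python
import xml.etree.ElementTree as ET


def add_topology_hints(domain, vcpu_to_host_cpu, guest_numa_to_host_numa, expander_buses):
    """Add pinning and PCIe expander bus elements to a libvirt <domain> element.

    expander_buses is a list of (controller_index, bus_nr, guest_numa_node)
    tuples; all values are chosen by the caller (hypothetical in this sketch).
    """
    # vCPU pinning: one <vcpupin> per guest vCPU.
    cputune = domain.find("cputune")
    if cputune is None:
        cputune = ET.SubElement(domain, "cputune")
    for vcpu, host_cpu in sorted(vcpu_to_host_cpu.items()):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))

    # Memory pinning: bind each guest NUMA cell to one host NUMA node.
    numatune = domain.find("numatune")
    if numatune is None:
        numatune = ET.SubElement(domain, "numatune")
    for cell, host_node in sorted(guest_numa_to_host_numa.items()):
        ET.SubElement(numatune, "memnode", cellid=str(cell),
                      mode="strict", nodeset=str(host_node))

    # PCIe expander buses: extra root buses associated with a guest NUMA node,
    # so passthrough devices can later be placed behind the matching bus.
    devices = domain.find("devices")
    if devices is None:
        devices = ET.SubElement(domain, "devices")
    for index, bus_nr, guest_numa in expander_buses:
        ctrl = ET.SubElement(devices, "controller", type="pci",
                             index=str(index), model="pcie-expander-bus")
        target = ET.SubElement(ctrl, "target", busNr=str(bus_nr))
        ET.SubElement(target, "node").text = str(guest_numa)
    return domain


# Toy domain, not a real Nova-generated one.
domain = ET.fromstring("<domain type='kvm'><name>demo</name><devices/></domain>")
add_topology_hints(domain, {0: 4, 1: 5}, {0: 0}, [(10, 32, 0)])
print(ET.tostring(domain, encoding="unicode"))
```

The real configuration also has to attach root ports and the passthrough devices to the right expander bus, which this sketch leaves out.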

NVIDIA documents this approach here:

https://docs.nvidia.com/ai-enterprise/planning-resource/optimizing-vm-configuration-ai-inference/latest/configuring-vms.html#virtual-cpu-configuration

Without these adjustments, topology-aware libraries like NCCL cannot correctly compute optimal communication graphs, and microbenchmark performance is significantly worse than bare metal.

Our current workflow is roughly:

- create the VM normally through Nova

- intercept/dump the libvirt XML from nova_libvirt

- patch the XML with a custom script following the NVIDIA recommendations

- restart the domain with virsh
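In rough Python terms, the steps above look something like the sketch below. This is a simplified illustration, not our production script: `repatch_domain` and `run_virsh` are hypothetical names, the `patch` callback stands in for the NVIDIA-recommended changes, and it assumes `virsh` is reachable directly from wherever the script runs (in a containerized deployment the commands and the temp file path would have to live inside the libvirt container). Note that ElementTree may rewrite the namespace prefix on the Nova metadata block when serializing.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET


def run_virsh(*args):
    """Run a virsh subcommand and return its stdout; raises on failure."""
    result = subprocess.run(["virsh", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def repatch_domain(domain_name, patch, virsh=run_virsh):
    """Dump the persistent XML, apply `patch`, redefine and restart the domain."""
    # 1. Dump the persistent (inactive) definition that Nova generated.
    root = ET.fromstring(virsh("dumpxml", "--inactive", domain_name))

    # 2. Apply the topology changes in place (caller-supplied callback).
    patch(root)

    # 3. Write the patched XML to a temp file and redefine the domain from it.
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
        handle.write(ET.tostring(root, encoding="unicode"))
        path = handle.name
    virsh("define", path)

    # 4. Power-cycle the domain so the new definition takes effect.
    virsh("destroy", domain_name)
    virsh("start", domain_name)
    return path
```

Calling it would look like `repatch_domain("instance-0000002a", my_patch)`, where both the domain name and `my_patch` are placeholders. The `virsh` parameter is injectable so the flow can be dry-run with a fake runner before touching a real hypervisor.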

After this, performance becomes extremely close to bare metal and everything works well.

The problem is that any Nova-driven operation (soft reboot, hard reboot, cold migration, etc.) regenerates the libvirt XML, so we need to repeat the entire procedure every time.

My question is:

Does Nova expose any mechanism to deeply customize or persist libvirt XML configuration for instances?

I know about flavor/image metadata and extra specs, but they seem too limited for this level of topology customization. Ideally we'd like a cleaner and more OpenStack-native approach than patching XML after instance creation.
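For context, the extra specs I'm referring to are along these lines. The small Python sketch below just renders an `openstack flavor set` command from a dict. The values, the flavor name `gpu.8xh200`, and the PCI alias names `h200` and `cx7` are illustrative assumptions, not our real configuration.

```python
import shlex

# Illustrative values only; the alias names must match whatever is
# configured in nova.conf (hypothetical here).
EXTRA_SPECS = {
    "hw:cpu_policy": "dedicated",             # pin vCPUs to host cores
    "hw:numa_nodes": "2",                     # expose two guest NUMA nodes
    "hw:mem_page_size": "1GB",                # back guest memory with huge pages
    "pci_passthrough:alias": "h200:8,cx7:4",  # request passthrough devices
}


def flavor_set_command(flavor, specs):
    """Build an `openstack flavor set` command line for the given extra specs."""
    cmd = ["openstack", "flavor", "set"]
    for key, value in specs.items():
        cmd += ["--property", f"{key}={value}"]
    cmd.append(flavor)
    return shlex.join(cmd)


print(flavor_set_command("gpu.8xh200", EXTRA_SPECS))
```

These cover CPU pinning and guest NUMA layout, but as far as I can tell none of them let you describe which PCIe bus or switch a passthrough device sits behind in the guest, which is the part we are patching by hand.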

Has anyone here tackled something similar for high-performance GPU/NVLink/InfiniBand workloads?

Thanks!

u/mjf-89 — 13 days ago