NCCL topology on the VM of H200 #256
Comments
Well, the closer the VM looks to the underlying physical system, the less work you will have getting NCCL to perform... In particular, since NUMA does have performance implications, I would expose it if possible, unless all your GPUs/NICs are attached to a single NUMA node, which doesn't seem to be the case (you didn't include a complete topo file, but the included part seems to show just 3 GPUs on NUMA node 0). My general suggestion for such issues is to run NCCL with
I would take the baremetal config file, adjust the bus IDs of the GPUs and NICs to match what they are in the VM, and not worry about the PCIe switch IDs. Your goal is simply to tell NCCL which devices are close to each other. The VM doesn't expose the PCIe switches, so their IDs shouldn't matter as long as they don't conflict with anything else. Also, make sure you've read up on ACS/ATS in NCCL's troubleshooting guide (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html).
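The bus ID adjustment described above could be scripted roughly as follows. This is only a sketch: it assumes the dumped topology uses nested `<pci>` elements with a `busid` attribute, and both the XML fragment and the baremetal-to-VM mapping are made-up placeholders, not values from the attached files.

```python
import xml.etree.ElementTree as ET


def remap_bus_ids(xml_text, busid_map):
    """Return the topology XML with every mapped busid attribute rewritten."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        old = node.get("busid")
        if old in busid_map:
            node.set("busid", busid_map[old])
    return ET.tostring(root, encoding="unicode")


# Illustrative fragment only; not taken from the attached dumps.
BAREMETAL_XML = """<system version="1">
  <cpu numaid="0">
    <pci busid="0000:16:00.0" class="0x060400">
      <pci busid="0000:18:00.0" class="0x030200"/>
      <pci busid="0000:19:00.0" class="0x020000"/>
    </pci>
  </cpu>
</system>"""

# Hypothetical baremetal -> VM mapping for one GPU and one NIC.
# The switch entry (0000:16:00.0) is left alone: it only groups the two devices.
VM_BUSID_MAP = {
    "0000:18:00.0": "0000:05:00.0",
    "0000:19:00.0": "0000:06:00.0",
}

vm_xml = remap_bus_ids(BAREMETAL_XML, VM_BUSID_MAP)
print(vm_xml)
```

The switch ID is kept as a pure grouping label here; it would still need checking against the VM's real bus IDs so that it does not collide with an existing device.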
Thank you very much, Kamil. Your advice helped me. I assigned 4 NUMA nodes to the VM, just like on the baremetal system, applied vcpu pinning, and specified an NCCL_TOPO_FILE in which I gave the PCIe switches unused IDs. The generated NCCL graph dump file also seemed correct, and the performance improved, but it is still not as expected.
Then I enabled ATS on the CX7 NICs, but I encountered the error message as follows:
I checked both VMs with the command 'ulimit -l', and the output is 'unlimited'.
The sysctl parameters in my VM are as follows. Is there any missing configuration?
It would be good to see the PCI info output from the NCCL INFO logs (NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH), which would show the PCI-E topology. I don't have any experience of how to configure ACS and ATS in combination.
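One way to collect those logs is to build the environment once and reuse it for every run. The sketch below only assembles the variables; `nccl_debug_env` is a hypothetical helper name, the file names are placeholders, and the launch command is left out because the thread does not show it.

```python
import os


def nccl_debug_env(base_env=None, topo_file=None, dump_file=None):
    """Return a copy of the environment with NCCL topology logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,ENV,GRAPH"
    if topo_file:
        # Hand-edited topology that NCCL should load instead of detecting one.
        env["NCCL_TOPO_FILE"] = topo_file
    if dump_file:
        # Where NCCL writes the topology it ended up using.
        env["NCCL_TOPO_DUMP_FILE"] = dump_file
    return env


# Placeholder file names; base_env={} keeps the printed output short.
env = nccl_debug_env(base_env={}, topo_file="vm_topo.xml", dump_file="system.xml")
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with whatever command starts all_reduce_perf on the two VMs.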
The detailed information about the PCIe ATS capability of the Mellanox ConnectX-7 on my VM is as follows.
root@207-vm:/home/fs# lspci -vvv | grep -i ats
We have 2 H200 servers connected through an IP switch. We ran nccl_test, and the all_reduce_perf script worked well and delivered the expected performance on the baremetal system.
Then we created a virtual machine via KVM on each server, with all 8 GPUs and NICs passed through. The performance was much worse, although the VM has the same versions of the NVIDIA driver, CUDA, NCCL and nv_peer_mem as the baremetal system.
I know this is related to GDR, and we may need to specify NCCL_TOPO_FILE when running the all_reduce_perf script. The PCIe topology seen on the VM (lspci -tv) is shown below and differs from the one seen on the baremetal system. In particular, the PCIe switches that the GPUs and NICs are connected to cannot be seen.
We don't know how to generate a proper topology file for NCCL on the VM. We have dumped the XML topology on both the baremetal system and the VM (NCCL_TOPO_DUMP_FILE=system.xml); please refer to the attached files. We can edit the XML topology dumped on the baremetal system and adjust the PCI IDs to match what is inside the VM, but what should the PCIe switch IDs (pci id = 16, 27, 38 and so on) be? We don't see them at all on the VM.
We also found that the PCIe link speed is 16 GT/s in the XML topology dumped on the VM, but 32 GT/s on the baremetal system.
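A comparison like this can be automated across the two dumps. The following is a rough sketch, assuming each `<pci>` element carries `busid` and `link_speed` attributes; the XML fragments and the bus ID mapping are invented for illustration and do not come from the attached files.

```python
import xml.etree.ElementTree as ET


def link_speeds(xml_text):
    """Map busid -> link_speed for every <pci> element that reports a speed."""
    root = ET.fromstring(xml_text)
    return {
        node.get("busid"): node.get("link_speed")
        for node in root.iter("pci")
        if node.get("link_speed")
    }


def speed_mismatches(baremetal_xml, vm_xml, busid_map):
    """List devices whose reported link speed differs between the two dumps."""
    bare = link_speeds(baremetal_xml)
    vm = link_speeds(vm_xml)
    result = []
    for bare_id, vm_id in busid_map.items():
        if bare.get(bare_id) != vm.get(vm_id):
            result.append((bare_id, vm_id, bare.get(bare_id), vm.get(vm_id)))
    return result


# Illustrative fragments only; the real dumps contain many more attributes.
BAREMETAL = '<system><pci busid="0000:18:00.0" link_speed="32.0 GT/s PCIe"/></system>'
VM = '<system><pci busid="0000:05:00.0" link_speed="16.0 GT/s PCIe"/></system>'

# Hypothetical baremetal -> VM bus ID mapping.
mismatches = speed_mismatches(BAREMETAL, VM, {"0000:18:00.0": "0000:05:00.0"})
for bare_id, vm_id, bare_speed, vm_speed in mismatches:
    print(f"{bare_id} ({bare_speed}) -> {vm_id} ({vm_speed})")
```

The speeds are compared as opaque strings, so the check does not depend on how a particular kernel formats the value.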
Is there something missing on our side when creating the VM?
I only pass the 8 GPUs and NICs into the VM. Do other devices, such as the PCIe switches or NVSwitches, also need to be passed to the VM so that it reflects the host's NUMA structure and PCIe topology as closely as possible?
Do the NUMA nodes on the VM need to be configured to match those on the baremetal system, with vcpu pinning applied as well?
I would appreciate any clues.
Thanks a lot!