[FeatureRequest] Add domain for HWthreads closest to GPUs #534

JanLJL · 2023-06-27T23:53:18Z

Is your feature request related to a problem? Please describe.
Often, GPUs are not closest to the NUMA domain a human might expect (e.g., GPU 3 is closest to NUMA domain 0). Not every user remembers to run likwid-topology first to get the corresponding NUMA domains for their GPU(s).

Describe the solution you'd like
Add an affinity domain for likwid-pin and likwid-perfctr, e.g., G, for placing HW threads close to a GPU.
For example, pinning 10 HWthreads closest to GPU 1:

likwid-pin -C G1:0-9 ./run_app


stdweird · 2024-06-13T07:04:35Z

I am also interested in CPU pinning in combination with GPU usage. What is the current best practice with likwid-pin?

@JanLJL you mentioned likwid-topology, but what is the proper flow a user should follow? I am also interested in whether likwid-pin supports a hierarchy: if a parent process uses a GPU, make sure the children are also pinned to cores in the same NUMA domain.

A very recent issue we had was people running torchrun with Python code that starts one dataloader+train process plus additional dataloaders. The dataloader+train process is what nvidia-smi reports as using the GPU; the remaining dataloaders are child processes of it. torchrun is really crappy at pinning correctly, so we are looking for a way to "help" it. likwid-pin would be a good candidate for this, but it's unclear how one would invoke it.

TomTheBear · 2024-06-13T10:07:21Z

Hello,

Thanks for increasing priority on this feature request.

The current workflow would be to run likwid-topology to get the NUMA node the GPU is attached to. Then you use likwid-pin -c Mx:y-z (x = NUMA domain ID, y-z = the range of HW threads to use inside that domain).
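
The NUMA lookup in this workflow can also be scripted. The following is a minimal sketch and not part of LIKWID: it assumes a Linux system that exposes the GPU's NUMA node in sysfs under /sys/bus/pci/devices/<PCI address>/numa_node. The helper names gpu_numa_node and likwid_pin_expr and the PCI address in the docstring are hypothetical. The second helper only builds the Mx:y-z expression described above.

```python
from pathlib import Path


def gpu_numa_node(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node of a PCI device, or None if the kernel reports none.

    pci_addr is the sysfs-style address, e.g. "0000:3b:00.0" (placeholder value).
    sysfs_root is a parameter only so the lookup can be pointed at a test directory.
    """
    text = (Path(sysfs_root) / pci_addr / "numa_node").read_text().strip()
    node = int(text)
    # The kernel writes -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


def likwid_pin_expr(numa_node, num_hwthreads):
    """Build 'Mx:y-z' for the first num_hwthreads HW threads of NUMA domain x."""
    if num_hwthreads < 1:
        raise ValueError("num_hwthreads must be at least 1")
    return f"M{numa_node}:0-{num_hwthreads - 1}"
```

For a GPU attached to NUMA domain 1, likwid_pin_expr(1, 10) gives M1:0-9, which would be used as likwid-pin -c M1:0-9 ./run_app. If gpu_numa_node returns None, the system does not report a NUMA node for the device and likwid-topology remains the place to check.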

One big question for this feature request is whether likwid-pin should also force the application to run on the selected GPU(s). I have not found a portable solution to do that yet. The CUDA_VISIBLE_DEVICES environment variable is fine on exclusive systems, but inside e.g. shared-node SLURM jobs, each with a GPU, this approach does not work anymore. Each SLURM job gets CUDA_VISIBLE_DEVICES=0 but under the hood, they are using different GPUs. My guess is that it is enforced through cgroups, but I haven't found out how yet.
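
One way to see which physical GPU sits behind a renumbered index is to compare PCI bus IDs, which identify the device itself rather than its position in CUDA_VISIBLE_DEVICES. The sketch below is only an illustration and does not answer the enforcement question: it parses the output of nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader and converts each bus ID to the lower-case, four-digit-domain form used in sysfs. The function names are hypothetical, and which GPUs nvidia-smi lists inside a cgroup-restricted job is an assumption to verify on the target system.

```python
import subprocess


def normalize_bus_id(bus_id):
    """Convert '00000000:3B:00.0' (nvidia-smi style) to '0000:3b:00.0' (sysfs style)."""
    domain, rest = bus_id.strip().split(":", 1)
    # Keep the last four hex digits of the PCI domain and lower-case everything.
    return f"{domain[-4:].lower()}:{rest.lower()}"


def parse_gpu_bus_ids(csv_text):
    """Map GPU index -> sysfs-style PCI address from 'index, pci.bus_id' CSV lines."""
    mapping = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, bus_id = (field.strip() for field in line.split(",", 1))
        mapping[int(index)] = normalize_bus_id(bus_id)
    return mapping


def query_gpu_bus_ids():
    """Run nvidia-smi and return the index -> PCI address mapping.

    Requires an NVIDIA driver installation; raises if nvidia-smi fails.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_gpu_bus_ids(out)
```

The resulting PCI address can then be looked up in sysfs or matched against the likwid-topology output to find the NUMA domain for the Mx:y-z expression.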

I never tried likwid-pin with PyTorch. There might be some other difficulties coming up (e.g. shepherd processes).

Hierarchies are currently not supported but also not needed. likwid-pin works on single processes, so either this process is using a GPU or not. They would be more interesting for likwid-mpirun, where one MPI process could use a GPU while the others do not. There is currently no way to do that because likwid-mpirun does not yet support the (I call it) colon syntax: mpirun <global opts> <local opts> <exec> <args1> : <local opts> <exec> <args2> : .... With the colon syntax, hierarchies should be doable.

TomTheBear closed this as completed in ce0fd89 Sep 12, 2024
