
[FeatureRequest] Add domain for HWthreads closest to GPUs #534

Closed
JanLJL opened this issue Jun 27, 2023 · 2 comments

Comments

@JanLJL
Contributor

JanLJL commented Jun 27, 2023

Is your feature request related to a problem? Please describe.
Often, GPUs are not closest to the NUMA domain a human might expect (e.g., GPU 3 may be closest to NUMA domain 0). Not every user remembers to run likwid-topology first to find the corresponding NUMA domain(s) for their GPU(s).

Describe the solution you'd like
Add an affinity domain for likwid-pin and likwid-perfctr, e.g., G, for placing HW threads close to a GPU.
For example, to pin 10 HW threads closest to GPU 1:

likwid-pin -C G1:0-9 ./run_app
@stdweird

I am also interested in supporting CPU pinning in combination with GPU usage. What is the current best practice with likwid-pin?

@JanLJL you mentioned likwid-topology, but what is the proper flow a user should follow? I am also interested in whether likwid-pin supports a hierarchy: if a parent process uses a GPU, make sure its children are also pinned to cores in the same NUMA domain.

A very recent issue we had was people running torchrun with Python code doing dataloading and training. The dataload+train process is what nvidia-smi reports as using the GPU; the remaining dataloaders are child processes of that train+dataload process. torchrun does a poor job of pinning correctly, so we are looking for a way to "help" it. likwid-pin would be a good candidate for this, but it is unclear how one would invoke it.

@TomTheBear
Member

Hello,

Thanks for raising the priority of this feature request.

The current workflow would be to run likwid-topology to get the NUMA node the GPU is attached to. Then you use likwid-pin -c Mx:y-z (x = NUMA domain ID, y-z = range of HW threads within that domain).
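As a sketch of that workflow (the NUMA domain ID 3, the thread range, and ./run_app are placeholders; the actual domain ID must be taken from the likwid-topology output):

likwid-topology
likwid-pin -c M3:0-9 ./run_app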

One big question for this feature request is whether likwid-pin should also force the application to run on the selected GPU(s). I have not found a portable solution for that yet. The CUDA_VISIBLE_DEVICES environment variable is fine on exclusive systems, but inside, e.g., shared-node SLURM jobs that each get a GPU, this approach does not work anymore. Each SLURM job gets CUDA_VISIBLE_DEVICES=0, but under the hood they are using different GPUs. My guess is that this is enforced through cgroups, but I haven't figured out how yet.
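For illustration, on an exclusive node the two mechanisms can simply be combined like this (GPU index 1 and NUMA domain 0 are assumptions and have to match the actual topology of the machine):

CUDA_VISIBLE_DEVICES=1 likwid-pin -c M0:0-9 ./run_app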

I never tried likwid-pin with PyTorch. There might be some other difficulties coming up (e.g. shepherd processes).

Hierarchies are currently not supported, but they are also not needed. likwid-pin works on a single process, so either that process uses a GPU or it does not. Hierarchies would be more interesting for likwid-mpirun, where one MPI process could use a GPU while the others do not. There is currently no way to do that because likwid-mpirun does not yet support what I call the colon syntax: mpirun <global opts> <local opts> <exec> <args1> : <local opts> <exec> <args2> : .... With the colon syntax, hierarchies should be doable.
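For reference, this is what such an MPMD launch looks like with Open MPI's colon syntax (executable names and process counts are made up): one rank that could drive the GPU and three CPU-only ranks:

mpirun -np 1 ./gpu_rank : -np 3 ./cpu_ranks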
