Fix gpu_id #59

Merged: fatwir merged 3 commits into scripts_features_rebased from fix_gpu_id on Nov 26, 2024

@fatwir (Collaborator) commented Nov 25, 2024

This Pull Request enhances the GPU selection mechanism based on memory utilization. Specifically, it implements the following changes:

  1. GPU ID Selection:

    • The GPU ID will be selected based on memory utilization. If the memory utilization of a given GPU is less than 20%, the first MIG ID under that GPU will be selected.
  2. Limitations:

    • nvidia-smi does support a MIG option, but I currently do not have the permissions needed to use it.
    • Detailed utilization for MIG-partitioned GPUs can be retrieved with dcgmi, but that tool is not installed, so it cannot be used.
  3. Current Approach:

    • For the time being, a manual search through the output of nvidia-smi -L is performed to select the appropriate GPU IDs.
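
For readers following along, here is a minimal sketch of the selection logic described above. It is not the code merged in this PR. The function names, the exact `--query-gpu` fields, and the fallback behavior (return `None` when no GPU qualifies, skip rows that report non-numeric values) are all assumptions made for illustration.

```python
import re
import subprocess

# Threshold taken from the PR description: memory utilization below 20%.
MEMORY_THRESHOLD = 0.20


def parse_memory_usage(csv_text):
    """Parse 'index, memory.used, memory.total' rows into {gpu_index: used_fraction}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            index, used, total = int(fields[0]), float(fields[1]), float(fields[2])
        except ValueError:
            # Assumption: rows with non-numeric values (e.g. "[N/A]") are skipped.
            continue
        if total > 0:
            usage[index] = used / total
    return usage


def parse_mig_devices(list_text):
    """Map each GPU index to the MIG UUIDs listed under it in `nvidia-smi -L` output."""
    migs = {}
    current = None
    for line in list_text.splitlines():
        gpu_match = re.match(r"GPU (\d+):", line)
        if gpu_match:
            current = int(gpu_match.group(1))
            migs[current] = []
            continue
        mig_match = re.search(r"\(UUID:\s*(MIG-[^)\s]+)\)", line)
        if mig_match and current is not None:
            migs[current].append(mig_match.group(1))
    return migs


def select_gpu_id(usage, migs, threshold=MEMORY_THRESHOLD):
    """Return the first MIG UUID under the first GPU below the threshold, else None."""
    for index in sorted(usage):
        if usage[index] < threshold and migs.get(index):
            return migs[index][0]
    return None


def query_and_select():
    """Hypothetical wrapper: run nvidia-smi twice and apply the selection rule."""
    csv_text = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    list_text = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True,
    ).stdout
    return select_gpu_id(parse_memory_usage(csv_text), parse_mig_devices(list_text))
```

The parsing is kept separate from the `nvidia-smi` calls so the selection rule can be checked against captured output on a machine without GPUs. One plausible use of the returned MIG UUID is to export it through `CUDA_VISIBLE_DEVICES` before launching a job; whether `_setup_gpu` does this is not stated in this thread.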

@vianamp (Collaborator) left a comment

I did not test it.

# Based on the utilization, set the GPU ID


def get_gpu_info():

Wonder if we should use something like this instead: https://github.com/anderskm/gputil

@fatwir (Collaborator, Author) commented Nov 25, 2024

Thank you for this! I believe it would be beneficial to update it for MIG-partitioned GPUs as well. Additionally, I noticed that it uses nvidia-smi --query-gpu to retrieve the statistics, which we are also using. The repo you shared exposes a lot of GPU stats, and I think incorporating that could be useful. I'll talk to Ritvik about this!

@ritvikvasan (Collaborator)

this PR seems to be doing a lot more than this? did you mean to merge into the other nb-> python file branch?

@fatwir (Collaborator, Author) commented Nov 25, 2024

I had forked from this branch and updated the _setup_gpu function in the scripts!

@ritvikvasan (Collaborator)

can you merge into that branch then so it's easier to know what you added?

@ritvikvasan (Collaborator)

I should have clarified here: I didn't mean to drop this PR and just merge your changes, but to base your PR off that branch.

@fatwir fatwir changed the base branch from main to scripts_features_rebased November 26, 2024 00:08
@ritvikvasan (Collaborator) left a comment

Much better, thanks!

@fatwir fatwir merged commit 304d39c into scripts_features_rebased Nov 26, 2024
2 checks passed
@fatwir fatwir deleted the fix_gpu_id branch November 26, 2024 00:19