How should GPU energy use be estimated? #37
Comments
Scaphandre has some discussion and a TODO for GPU measurement. NVIDIA data is available for later-model and datacenter-class GPUs, but not for some desktop models. This data source is reported as available for NVIDIA-based cloud instances on AWS. The data is an integer number of milliwatts, averaged over a one-second interval.
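To illustrate how readings of this shape turn into energy, here is a minimal sketch, assuming each sample is an integer milliwatt value already averaged over its interval. The function names and the sample values are hypothetical and not taken from Scaphandre or any NVIDIA tool.

```python
def energy_joules(samples_mw, interval_s=1.0):
    """Convert averaged power samples (milliwatts) into energy (joules).

    Assumption: each sample is the mean power over `interval_s` seconds,
    so energy per sample is power * interval.
    """
    return sum(samples_mw) * interval_s / 1000.0


def joules_to_wh(joules):
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


if __name__ == "__main__":
    # Three hypothetical one-second readings of 250 W, 300 W and 275 W.
    readings_mw = [250000, 300000, 275000]
    print(energy_joules(readings_mw), "J")
```

Because the samples are already averages, a plain sum is enough; no trapezoidal integration is needed under this assumption.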
Kepler currently supports NVIDIA GPUs (through both NVML and DCGM) and is also working on Intel Gaudi GPU support. We have a recent tutorial on using Kepler to measure LLM energy consumption and evaluate sustainability in terms of tokens per watt.
As @rootfs mentioned, Kepler uses the NVML library to collect both the GPU utilization of each process and the total GPU power consumption. We then distribute the total GPU power consumption among all processes using the GPU, in proportion to their utilization.
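The proportional split described above can be sketched as follows. This is an illustration of the idea only, not Kepler's actual code: the function name, the PID keys and the handling of the zero-utilization case are all assumptions, and the comment does not say how idle GPU power is treated.

```python
def attribute_gpu_power(total_power_w, utilization_by_pid):
    """Split total GPU power across processes by their share of utilization.

    `utilization_by_pid` maps a process ID to its GPU utilization (any
    consistent unit, e.g. percent). Assumption: if no process reports any
    utilization, nothing is attributed.
    """
    total_util = sum(utilization_by_pid.values())
    if total_util <= 0:
        return {pid: 0.0 for pid in utilization_by_pid}
    return {
        pid: total_power_w * util / total_util
        for pid, util in utilization_by_pid.items()
    }


if __name__ == "__main__":
    # Hypothetical reading: 200 W total, two processes at 60% and 20%.
    print(attribute_gpu_power(200.0, {101: 60, 202: 20}))
```

Note that this variant assigns all measured power, including any idle draw, to the active processes; a real tool may choose to account for idle power separately.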
Hello! I've stumbled on this issue from the Scaphandre repository. After trying to extend Scaphandre to support GPUs, I eventually started "from scratch" and designed a new measurement tool, Alumet (though Alumet is not "just" a tool for measuring energy consumption). As the Kepler team mentioned, NVML can report the energy consumption of most NVIDIA GPUs, as well as information on the GPU utilization by different processes, and it works quite well. It's better to measure than to rely on TDP-based estimations anyway. IMO that should be enough to start building some models :)
Link to Alumet added to the Miro. It appears that NVIDIA power monitoring is well understood. The next step is to figure out the interfaces for Intel, AMD, Google TPU, AWS Inferentia, etc.
Outline Action Item Details
We have a reasonable handle on CPU energy use by taking CPU utilization and mapping it to an energy curve driven by the Thermal Design Power (TDP) of a package, which is sometimes the only public data that is available. GPUs are becoming more common and have a higher TDP than CPUs, but we don't have an easy or standard way to measure the utilization of the GPUs in a system. I propose reaching out to contacts at NVIDIA to see if we can find some answers and encourage them to join GSF.
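For reference, the CPU approach described above can be sketched as a piecewise-linear curve that maps utilization to a fraction of TDP. The curve points below are illustrative placeholders, not measured or published values, and the function names are hypothetical.

```python
# (utilization %, fraction of TDP) pairs. Placeholder values for illustration only.
CURVE = [(0.0, 0.10), (50.0, 0.70), (100.0, 1.00)]


def tdp_factor(utilization_pct, curve=CURVE):
    """Linearly interpolate the TDP fraction for a given utilization."""
    # Clamp the input to the range covered by the curve.
    u = min(max(utilization_pct, curve[0][0]), curve[-1][0])
    for (u0, f0), (u1, f1) in zip(curve, curve[1:]):
        if u <= u1:
            return f0 + (f1 - f0) * (u - u0) / (u1 - u0)
    return curve[-1][1]


def estimate_power_w(tdp_w, utilization_pct, curve=CURVE):
    """Estimate package power in watts from TDP and utilization."""
    return tdp_w * tdp_factor(utilization_pct, curve)


if __name__ == "__main__":
    # Hypothetical 100 W TDP package at 25% utilization.
    print(estimate_power_w(100.0, 25.0))
```

The same shape could in principle be applied to a GPU once a utilization signal is available, but that is exactly the open question in this issue.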
Issue dependency with other WGs Groups
No response