Measure GPU consumption #24

Open
bpetit opened this issue Dec 4, 2020 · 21 comments
Labels
enhancement New feature or request good first issue Good for newcomers Hacktoberfest Issues ready to welcome contributors from Hacktoberfest ! help wanted Extra attention is needed

Comments

@bpetit
Contributor

bpetit commented Dec 4, 2020

Problem

Some power-hungry use cases rely on GPUs. It would be great to offer a way to measure their consumption from the infrastructure point of view.

Solution

We can take inspiration from codecarbon, which uses pynvml.
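
For reference, a minimal sketch of what reading power through pynvml could look like. It only illustrates the NVML calls involved and is not the codecarbon or scaphandre implementation. The function name `read_gpu_power_watts` and the optional `nvml` parameter (which allows a stand-in module to be injected on machines without a GPU) are assumptions made for this example.

```python
from typing import List, Optional


def read_gpu_power_watts(nvml=None) -> List[Optional[float]]:
    """Return the current power draw of each NVIDIA GPU in watts.

    Entries are None for devices that do not report power.
    """
    if nvml is None:
        # Imported lazily so this file still loads where pynvml is missing.
        import pynvml as nvml

    nvml.nvmlInit()
    try:
        readings: List[Optional[float]] = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            try:
                # NVML reports power usage in milliwatts.
                milliwatts = nvml.nvmlDeviceGetPowerUsage(handle)
                readings.append(milliwatts / 1000.0)
            except nvml.NVMLError:
                # Raised, for example, when the device does not support
                # power readings.
                readings.append(None)
        return readings
    finally:
        nvml.nvmlShutdown()
```

Returning `None` instead of failing keeps a host with mixed hardware usable: GPUs that expose a sensor are still reported.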

Alternatives

Any other existing library would be worth a look.

Additional context

The idea is to make it easier to collect those metrics from the infrastructure and feed them into metrics pipelines, which may in turn make it easier to expose this impact to cloud providers' machine learning clients.

@bpetit bpetit added the enhancement New feature or request label Dec 4, 2020
@bpetit bpetit added the help wanted Extra attention is needed label Dec 17, 2020
@uggla
Collaborator

uggla commented May 5, 2021

Hello, I did some investigation on this topic.
There is a Rust wrapper for the NVML library here: https://crates.io/crates/nvml-wrapper so getting info from an Nvidia board does not look too complicated.
I have extracted and updated the provided example to read the power usage: https://github.com/uggla/nvml-basic
Unfortunately, I get the following output:

 uggla   main  ~  workspace  rust  nvml-basic  cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`


Your NVIDIA GeForce GTX 1050 is currently sitting at 40 °C with a graphics clock of 139 MHz and a memory
clock of 405 MHz.
Memory usage is 4.92 MB out of an available 2.1 GB.
Right now the device is connected via a PCIe gen 1 x16 interface;
the max your hardware supports is PCIe gen 3 x16.
Power consumption is Not supported.

This device is not on a multi-GPU board.

System CUDA version: 11.3

So I managed to get data from my 1050 board, but power usage is not supported. :(
I have read that this can be a driver limitation, but I suspect it is rather a limitation of my hardware. It would be great if someone could run this short code example on a different GPU before going ahead with the scaphandre implementation.

@demeringo
Contributor

Hi, this is neat !

Your feedback made me curious enough to test nvml-wrapper on an AWS EC2 instance that uses an NVIDIA GPU.

Disclaimer: my knowledge and experience of GPUs and their drivers is absolutely zero. So if you find anything below that does not make sense, please tell me ;-)

EC2 instance

  • g3.4xlarge
  • eu-west-1
  • all default settings
  • using the AWS-provided AMI that comes with the NVIDIA Tesla driver preinstalled: amzn2-ami-graphics-hvm-2.0.20210427.0-x86_64-gp2-e6724620-3ffb-4cc9-9690-c310d8e794ef

First attempt: libnvidia-ml.so not found

It did not work out of the box (complaining about missing libnvidia-ml.so).

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`
Error: LibloadingError(DlOpen { desc: "libnvidia-ml.so: cannot open shared object file: No such file or directory" })

Second attempt: create a symlink to the lib

I did a couple of things to make it work:

  • set the LD_LIBRARY_PATH environment variable (export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64"), but that did not work either
  • created a symlink: ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so

I relaunched and got a measurement:

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`


Your Tesla M60 is currently sitting at 21 °C with a graphics clock of 405 MHz and a memory clock of 324 MHz. 
Memory usage is 0 B out of an available 7.99 GB. 
Right now the device is connected via a PCIe gen 1 x16 interface; the max your hardware supports is PCIe gen 3 x16. 
Power consumption is 14599.

This device is not on a multi-GPU board.

System CUDA version: 11.0

In retrospect, I am not sure that setting LD_LIBRARY_PATH was of any use.
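
A likely explanation, offered as an assumption rather than something verified on that instance: the loader was asked for the exact file name libnvidia-ml.so, and only libnvidia-ml.so.1 existed in /usr/lib64. In that case adding the directory to LD_LIBRARY_PATH cannot help, while the symlink does. The sketch below checks which of the two names the dynamic loader can actually open; the function name `find_loadable_nvml` and its `loader` parameter are hypothetical.

```python
import ctypes
from typing import Callable, Iterable, Optional

# Names taken from the thread: the unversioned name from the error message,
# and the versioned file that the symlink pointed to.
CANDIDATES = ("libnvidia-ml.so", "libnvidia-ml.so.1")


def find_loadable_nvml(
    names: Iterable[str] = CANDIDATES,
    loader: Callable[[str], object] = ctypes.CDLL,
) -> Optional[str]:
    """Return the first library name the dynamic loader can open, else None."""
    for name in names:
        try:
            loader(name)
            return name
        except OSError:
            # The loader could not find or open a file with this exact name.
            continue
    return None
```

If this returns "libnvidia-ml.so.1" but never "libnvidia-ml.so", the missing piece is the unversioned name, not the search path.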

Using nvidia-smi command

While trying to get this to work, I came across the nvidia-smi command (see https://serverfault.com/questions/395455/how-to-check-gpu-usages-on-aws-ec2-gpu-instance).

I tried running nvidia-smi -i 0 -l -q -d POWER, which returned results in the same range (about 14 W idle).
I do not know how the calculation is done, but it displays a measurement summary at a fixed interval, every 5 seconds judging by the timestamps (I include 3 successive outputs below).

nvidia-smi -i 0 -l -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:27 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W


==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:32 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 15.08 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.08 W


==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:37 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.22 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W

I have no idea whether these results are realistic for an idle machine, but I find it very exciting that we are able to probe something out of an AWS GPU instance ;-)

I think it would be interesting to redo the test with some kind of representative workload, and also to verify that it works the same with other providers such as Azure or GCP.
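
To compare the two sources used above: NVML returns power in milliwatts, so the raw 14599 printed by the Rust example corresponds to about 14.6 W, which is in line with the nvidia-smi readings. Below is a small sketch that pulls the "Power Draw" values out of nvidia-smi text output, assuming the layout shown in the logs above. The function name `parse_power_draw` is hypothetical, and other driver versions may label the field differently.

```python
import re
from typing import List, Optional

# Matches lines such as "Power Draw : 13.88 W" or "Power Draw : N/A".
_POWER_DRAW = re.compile(r"^\s*Power Draw\s*:\s*(\S+)", re.MULTILINE)


def parse_power_draw(smi_output: str) -> List[Optional[float]]:
    """Extract every "Power Draw" value in watts; None where it is not numeric."""
    values: List[Optional[float]] = []
    for raw in _POWER_DRAW.findall(smi_output):
        try:
            values.append(float(raw))
        except ValueError:
            values.append(None)  # for example "N/A"
    return values


# Excerpt copied from the first nvidia-smi log in this comment.
SAMPLE = """\
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
"""

nvml_milliwatts = 14599  # raw value printed by the Rust example

print(parse_power_draw(SAMPLE))   # nvidia-smi reading, in watts
print(nvml_milliwatts / 1000.0)   # NVML reading converted to watts
```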

@uggla
Collaborator

uggla commented May 11, 2021

Hello @demeringo,

Thank you very much, this is really helpful. Sorry, I knew about the missing libnvidia-ml.so but forgot to mention it in my previous post.
14 W idle for such a card seems entirely plausible.

As I will not be able to fully test it on my laptop, I will mock the GPU results.
Would you be able to run a test with scaphandre once I have implemented the GPU power reporting? That would be great.

> I think it would be interesting to redo the test with some kind of representative workload, and also to verify that it works the same with other providers such as Azure or GCP.

Absolutely, but I don't think there is any reason for it to differ between providers as long as the hardware is an Nvidia GPU.
Another interesting test would be with multiple GPUs, to see how the library reacts in that case.

@demeringo
Contributor

Yes, this would be perfect. I can set up different public cloud servers for testing for a limited time... but I lack the Rust skills to do the integration... so if you could take it, I would be more than happy to test a branch ;-)

@mindrunner

I am happy to test on a bare metal box with a 1050 Ti (if testing is feasible in production mode)

However, it seems that power draw might not be supported by some cards :(

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 10:24:09 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : N/A
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 52.50 W
        Max Power Limit                   : 75.00 W
    Power Samples
        Duration                          : 18446744073707.55 sec
        Number of Samples                 : 119
        Max                               : 35.50 W
        Min                               : 35.50 W
        Avg                               : 0.00 W
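
A side note on the odd "Duration" value in this log: it is suspiciously close to 2^64 microseconds. One possible reading, which is a guess and not something confirmed in this thread or by NVIDIA, is that a small negative duration was stored in an unsigned 64-bit microsecond counter and wrapped around. The arithmetic can be checked as follows:

```python
from decimal import Decimal

reported_seconds = Decimal("18446744073707.55")  # "Duration" from the log above

# Assumption: the duration is held internally as unsigned 64-bit microseconds.
as_unsigned_us = int(reported_seconds * 1_000_000)

# Reinterpret the same value as a signed 64-bit integer.
if as_unsigned_us >= 2**63:
    as_signed_us = as_unsigned_us - 2**64
else:
    as_signed_us = as_unsigned_us

print(as_signed_us / 1_000_000, "seconds")
```

Under that assumption the duration comes out at roughly minus two seconds (the printed log value is rounded to two decimals), which would fit with the rest of the "Power Samples" block on this card not being meaningful.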

@uggla
Collaborator

uggla commented May 11, 2021

> However, it seems that power draw might not be supported by some cards :(

@mindrunner, yes, we have the same issue: I have a 1050 (not Ti) in my laptop and power draw is not supported. That is why I asked people with different hardware to check.

> on a bare metal box

Is your 1050 Ti a chip embedded in a laptop or soldered onto a motherboard, or a "real" card plugged into the PCI Express bus? I understand it is the last option, but I just want to be sure.

@uggla
Collaborator

uggla commented May 11, 2021

@demeringo ,

> Yes, this would be perfect. I can set up different public cloud servers for testing for a limited time... but I lack the Rust skills to do the integration... so if you could take it, I would be more than happy to test a branch ;-)

Super cool, I'll notify you as soon as I have something usable. I just need to find a bit of spare time to handle it....

@mindrunner

Yeah, laptop cards have a different PM, partly because they usually run alongside an Intel card and so on...

The card in my laptop lets me read the power draw:
(01:00.0 VGA compatible controller: NVIDIA Corporation TU106GLM [Quadro RTX 3000 Mobile / Max-Q] (rev a1))

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 18:44:53 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 12.45 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found

The card I was talking about in my previous post is a "normal" PCIe card:
(02:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1))
It is a PALIT GeForce® GTX 1050 Ti KalmX 4GB passive, just for reference.

@uggla
Collaborator

uggla commented May 11, 2021

@mindrunner, thank you. This is really helpful.
I think the 1050 is entry-level hardware, so maybe that is why there is no power sensor. Or maybe there is one that is disabled by the driver....

Yahoo, you have a laptop with a Quadro chip. It seems to be a high-end laptop. I did not know that laptops could have this kind of chip.

@mindrunner

> Yahoo, you have a laptop with a Quadro chip. It seems to be a high-end laptop. I did not know that laptops could have this kind of chip.

I guess it is pretty high end. DELL Precision 5750, the business brother of the 2020 XPS 17....

Anyway, searching the internet about this issue creates even more confusion. Some say it is a driver issue and that it is supposed to work with an older driver version. Not sure about that. If I figure out more, I will get back in touch here. It would be nice to have GPU power included in my Grafana dashboard, but in my case it is really only eye candy and nothing urgent :D

@itwars

itwars commented May 21, 2021

Hi,
You could perhaps have a look at https://pypi.org/project/pyJoules/, maybe it is similar to pynvml?

@uggla
Collaborator

uggla commented May 21, 2021

Hello @itwars, thank you. In fact, all these solutions rely on Nvidia's NVML library and on the appropriate driver and hardware.
The Rust NVML wrapper library (https://crates.io/crates/nvml-wrapper) works very well, so scaphandre will soon be able to report Nvidia GPU consumption. It might take a bit more time than expected as @bpetit and @PierreRust are currently changing some internals.

@itwars

itwars commented May 21, 2021

Excellent! I'm really excited about having GPU power monitoring for my GPU-powered AI lab. Any chance of having something similar for both AMD and Intel GPUs?

@uggla
Collaborator

uggla commented May 21, 2021

@itwars, it seems that only a subset of Nvidia boards, mostly the high-end ones, support this feature.
Regarding Intel and AMD, I have not done extensive research, but it seems power data is not available. Libraries equivalent to NVML are really limited. The only good news is that the one from AMD is open source, if I remember correctly (which is not the case for NVML).
It sounds like energy management has not really been a priority for GPU suppliers. Hopefully that will change in the near future.

@bpetit bpetit added the Hacktoberfest Issues ready to welcome contributors from Hacktoberfest ! label Oct 5, 2021
@bpetit bpetit added the good first issue Good for newcomers label Nov 3, 2021
@quantumsheep

quantumsheep commented Nov 17, 2022

Hi, is there any news on this issue?

@uggla
Collaborator

uggla commented Nov 17, 2022

@quantumsheep not really.
Do you need this feature ? I would say if someone needs that one I could be motivated to implement it.

@quantumsheep

@uggla We have some servers with multiple GPUs whose electrical consumption we want to measure. We can take some time to implement the feature, but if you can guide us on how to do it we would love it ❤️
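
On the "electrical consumption" part: the readings shown in this thread are power values in watts, so consumption over a period has to be derived by integrating samples over time. A minimal sketch, assuming periodic (timestamp in seconds, power in watts) samples per GPU. The function names are hypothetical and this is not how scaphandre computes it.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def energy_joules(samples: Sequence[Sample]) -> float:
    """Integrate power samples over time with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total


def total_energy_joules(per_gpu: Sequence[Sequence[Sample]]) -> float:
    """Sum the energy of several GPUs that were sampled independently."""
    return sum(energy_joules(samples) for samples in per_gpu)


# "Power Draw" values from the three Tesla M60 logs earlier in the thread,
# whose timestamps are 5 seconds apart.
m60 = [(0.0, 13.88), (5.0, 15.08), (10.0, 13.88)]
print(energy_joules(m60))
```

For those three idle readings this gives about 144.8 J over 10 seconds. Dividing joules by 3600 gives watt-hours.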

@samuelrince

Hey @uggla and @quantumsheep I also need this feature! It would be perfect to have it in Scaphandre directly. Currently I rely on this project utkuozdemir/nvidia_gpu_exporter. But it is built around Prometheus and there is no other way to export data (to my knowledge). In Boavizta/boagent we use the JSON exporter from Scaphandre and would like to keep that workflow for GPU metrics as well.
Happy to help if I can, but I don't think you can count on my Rust programming skills unfortunately 🙃

@uggla
Collaborator

uggla commented Nov 22, 2022

Ok, I need to discuss with @bpetit about his plan for the next release.
I also need to discuss how Benoit wanted to deal with input plugins. I think this is the main difficulty with this issue.
Then I will try to put this issue on the TODO list.

@yuxin1234

@uggla @bpetit Any update on this issue? Thanks @filga

@bpetit bpetit added this to the Release v1.2.0 milestone Jul 25, 2023
@bpetit
Contributor Author

bpetit commented Jul 25, 2023

Hi !

I have a lot to catch up on in this thread, sorry!

@uggla don't hesitate to open a PR on dev. We are not making many internal changes these days, mostly adding new features, so there shouldn't be too many conflicts.

I'll be more than happy to look at your PR soon after next release.

Projects
Status: To do
Development

No branches or pull requests

8 participants