Measure GPU consumption #24
Comments
Hello, I did a couple of investigations on this topic.
So I managed to get data from my 1050 board, but power usage is not supported. :( |
Hi, this is neat! Your feedback made me curious to test nvml-wrapper on an AWS EC2 instance that uses an NVIDIA GPU. Disclaimer: my knowledge and experience of GPUs and their drivers is absolutely zero, so if you find anything below that does not make sense, please tell me ;-)
EC2 instance
First attempt: libnvidia-ml.so not found
It did not work out of the box (complaining about missing
root@ip-172-31-3-186 nvml-basic]# cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/basic`
Error: LibloadingError(DlOpen { desc: "libnvidia-ml.so: cannot open shared object file: No such file or directory" })
Second attempt: create a symlink to the lib
I did a couple of things to make it work
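A minimal sketch of one way to find the library before creating a symlink or changing LD_LIBRARY_PATH. It assumes the driver may have installed only a versioned `libnvidia-ml.so.1` instead of the unversioned name shown in the error. The helper names `find_nvml` and `ld_library_path_dirs` are hypothetical and are not part of nvml-wrapper.

```rust
use std::path::PathBuf;

/// File names to look for. The versioned name is an assumption about how the
/// driver package lays out its files; adjust it for your system.
const NVML_NAMES: [&str; 2] = ["libnvidia-ml.so", "libnvidia-ml.so.1"];

/// Return the first existing NVML library found in the given directories.
pub fn find_nvml(dirs: &[PathBuf]) -> Option<PathBuf> {
    for dir in dirs {
        for name in NVML_NAMES.iter() {
            let candidate = dir.join(name);
            if candidate.exists() {
                return Some(candidate);
            }
        }
    }
    None
}

/// Split an LD_LIBRARY_PATH-style value into directories, skipping empty entries.
pub fn ld_library_path_dirs(value: &str) -> Vec<PathBuf> {
    std::env::split_paths(value)
        .filter(|p| !p.as_os_str().is_empty())
        .collect()
}
```

Calling `find_nvml` with the directories from LD_LIBRARY_PATH plus the system library directories shows whether only the versioned file is present, in which case a symlink with the unversioned name is one possible workaround.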
Relaunched and we have a measure:
In retrospect I am not sure whether setting LD_LIBRARY_PATH was of any use.
Using nvidia-smi command
While trying to make this work I came across the
I tried running nvidia-smi -i 0 -l -q -d POWER
==============NVSMI LOG==============
Timestamp : Mon May 10 22:25:27 2021
Driver Version : 450.119.01
CUDA Version : 11.0
Attached GPUs : 1
GPU 00000000:00:1E.0
Power Readings
Power Management : Supported
Power Draw : 13.88 W
Power Limit : 150.00 W
Default Power Limit : 150.00 W
Enforced Power Limit : 150.00 W
Min Power Limit : 112.50 W
Max Power Limit : 162.00 W
Power Samples
Duration : 40.52 sec
Number of Samples : 119
Max : 14.73 W
Min : 13.39 W
Avg : 14.07 W
==============NVSMI LOG==============
Timestamp : Mon May 10 22:25:32 2021
Driver Version : 450.119.01
CUDA Version : 11.0
Attached GPUs : 1
GPU 00000000:00:1E.0
Power Readings
Power Management : Supported
Power Draw : 15.08 W
Power Limit : 150.00 W
Default Power Limit : 150.00 W
Enforced Power Limit : 150.00 W
Min Power Limit : 112.50 W
Max Power Limit : 162.00 W
Power Samples
Duration : 40.52 sec
Number of Samples : 119
Max : 14.73 W
Min : 13.39 W
Avg : 14.08 W
==============NVSMI LOG==============
Timestamp : Mon May 10 22:25:37 2021
Driver Version : 450.119.01
CUDA Version : 11.0
Attached GPUs : 1
GPU 00000000:00:1E.0
Power Readings
Power Management : Supported
Power Draw : 13.88 W
Power Limit : 150.00 W
Default Power Limit : 150.00 W
Enforced Power Limit : 150.00 W
Min Power Limit : 112.50 W
Max Power Limit : 162.00 W
Power Samples
Duration : 40.22 sec
Number of Samples : 119
Max : 14.73 W
Min : 13.39 W
Avg : 14.07 W
I have no idea whether these results are relevant for an idle machine, but I find it very exciting that we are able to probe something out of an AWS instance using a GPU ;-) I think it would be interesting to redo the test with some kind of representative workload, and also to verify whether it works the same with other providers like Azure or GCP. |
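As a rough sketch, the looping `nvidia-smi -q -d POWER` output above could be turned into numbers with a small parser like the one below. It assumes a single GPU per report and the exact key names seen in this log. `PowerReport` and `parse_power_reports` are hypothetical names, not part of Scaphandre or nvml-wrapper.

```rust
/// Power figures extracted from one `nvidia-smi -q -d POWER` report.
#[derive(Debug, Default, PartialEq)]
pub struct PowerReport {
    pub power_draw_w: Option<f64>,
    pub power_limit_w: Option<f64>,
}

/// Parse a value such as " 13.88 W" into watts.
/// Returns None for anything that does not end in "W" (for example "N/A").
fn parse_watts(value: &str) -> Option<f64> {
    value.trim().strip_suffix('W')?.trim().parse().ok()
}

/// Split a looping nvidia-smi log into one PowerReport per "NVSMI LOG" banner.
/// Assumption: one GPU per report, as in the log above; with several GPUs the
/// last value seen would overwrite the earlier ones.
pub fn parse_power_reports(log: &str) -> Vec<PowerReport> {
    let mut reports = Vec::new();
    let mut current: Option<PowerReport> = None;
    for line in log.lines() {
        let line = line.trim();
        if line.starts_with("====") && line.contains("NVSMI LOG") {
            // A new banner closes the previous report, if any.
            if let Some(done) = current.take() {
                reports.push(done);
            }
            current = Some(PowerReport::default());
            continue;
        }
        if let (Some((key, value)), Some(report)) = (line.split_once(':'), current.as_mut()) {
            // Exact key match, so "Default Power Limit" and similar are ignored.
            match key.trim() {
                "Power Draw" => report.power_draw_w = parse_watts(value),
                "Power Limit" => report.power_limit_w = parse_watts(value),
                _ => {}
            }
        }
    }
    if let Some(done) = current {
        reports.push(done);
    }
    reports
}
```

Parsing text output is fragile compared with calling NVML directly, so this is mainly useful for quick checks on machines where only `nvidia-smi` is at hand.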
Hello @demeringo, thank you very much, this is really helpful. Sorry, I knew about the missing
As I will not be able to fully test it with my laptop, I will mock the GPU results.
Absolutely, but I don't think there is any reason for it to differ between providers as long as it is NVIDIA GPU hardware. |
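Mocking the GPU results could look roughly like the sketch below: a small trait hides the real NVML call, and a mock implementation returns either a fixed reading or a "not supported" error, which is the case reported for the 1050 in this thread. The trait, the error type and every name here are hypothetical. The milliwatt unit follows what NVML reports for power usage.

```rust
/// Error a GPU backend may report. Only the case discussed in this thread is modelled.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuError {
    NotSupported,
}

/// Hypothetical abstraction over "something that can report GPU power".
pub trait GpuPowerSource {
    /// Instantaneous power draw in milliwatts.
    fn power_draw_milliwatts(&self) -> Result<u32, GpuError>;
}

/// Mock backend for machines without a suitable GPU.
pub struct MockGpu {
    reading: Result<u32, GpuError>,
}

impl MockGpu {
    /// A mock that always reports the given power draw.
    pub fn supported(milliwatts: u32) -> Self {
        MockGpu { reading: Ok(milliwatts) }
    }

    /// A mock that behaves like a card without power reporting.
    pub fn unsupported() -> Self {
        MockGpu { reading: Err(GpuError::NotSupported) }
    }
}

impl GpuPowerSource for MockGpu {
    fn power_draw_milliwatts(&self) -> Result<u32, GpuError> {
        self.reading
    }
}

/// Convert a reading to watts, returning None when the card cannot report power.
pub fn power_draw_watts(source: &dyn GpuPowerSource) -> Option<f64> {
    match source.power_draw_milliwatts() {
        Ok(mw) => Some(f64::from(mw) / 1000.0),
        Err(GpuError::NotSupported) => None,
    }
}
```

A real backend would implement the same trait on top of the NVML bindings, so the calling code can be exercised on a laptop with the mock and on a GPU server with the real device.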
Yes, this would be perfect, I can setup different public cloud servers for testing during a limited time... but I lack rust skills to do the integration... so if you could take it I would be more than happy to test a branch ;-) |
I am happy to test on a bare metal box with a 1050ti (if testing is feasible in production mode). However, it seems that power draw might not be supported by some cards :(
|
@mindrunner, yes, we have the same issue: I have a 1050 (not Ti) in my laptop and it is not supported. That's the reason why I asked people with different hardware to check.
Is your 1050Ti an embedded chip in a laptop, soldered onto a motherboard, or a "real" card plugged into the PCI Express bus? I understand that it is the last option, but this is just to be sure. |
Super cool, I'll notify you as soon as I have something usable. I just need to find a bit of spare time to handle it.... |
Yeah, laptop cards have a different PM, also because they are usually driven next to an Intel card and so on... The card in my laptop lets me read the power draw:
The card I was talking about in my previous post is a "normal" PCIe card: |
@mindrunner, thank you. This is really helpful. Yahoo, you have a laptop with a Quadro chip. It seems to be a high-end laptop. I did not know that laptops could have that kind of chip. |
I guess it is pretty high end. DELL Precision 5750, the business brother of the 2020 XPS 17.... Anyway, searching the internet about this issue creates even more confusion. Some say it is a driver issue and that it is supposed to work with an older driver version. Not sure about that. If I figure out more, I will get back in touch here. It would be nice to have the GPU power included in my Grafana dashboard, but in my case it is really only eye candy and nothing urgent :D |
Hi, |
Hello @itwars, thank you. In fact, all these solutions rely on the NVML library from Nvidia, plus the appropriate driver and hardware. |
Excellent! I'm really excited by having GPU power monitoring for my AI GPU powered lab. Any chance to have something similar for both AMD and Intel GPU? |
@itwars, it seems that only a subset of Nvidia boards supports this feature, mostly the high-end ones. |
Hi, is there any news on this issue? |
@quantumsheep not really. |
@uggla We have some servers with multiple GPUs whose electrical consumption we want to measure. We can take some time to implement the feature, and if you can guide us on how to do it we would love it ❤️ |
Hey @uggla and @quantumsheep I also need this feature! It would be perfect to have it in Scaphandre directly. Currently I rely on this project utkuozdemir/nvidia_gpu_exporter. But it is built around Prometheus and there is no other way to export data (to my knowledge). In Boavizta/boagent we use the JSON exporter from Scaphandre and would like to keep that workflow for GPU metrics as well. |
Ok, I need to discuss with @bpetit about his plan for the next release. |
Hi! I have a lot to catch up on in this thread, sorry! @uggla, don't hesitate to open a PR on dev. We are not making many internal changes these days, mostly adding new features, so there shouldn't be too many conflicts. I'll be more than happy to look at your PR soon after the next release. |
Problem
Some power-hungry use cases rely on GPUs. It would be great to offer a way to measure GPU consumption from the infrastructure point of view.
Solution
We can take inspiration from codecarbon by using pynvml.
Alternatives
Any other library existing would be worth a look.
Additional context
The idea is to make it easier to collect those metrics from the infrastructure, and thus to feed metrics pipelines that may make it easier to expose their impact to cloud providers' machine learning clients.