Segfaults with dcgm-exporter 3.3.0 and higher #412
Comments
FYI, dcgm-exporter starts properly when I disable the following in the config:
If I enable any of those three options, we get a segmentation fault.
Please update to the latest version and check whether it still crashes. The specific backtrace you included has been fixed. If there is still a crash, please include the full log and the config that causes it. Thank you.
@glowkey yes it still crashes on
Segfault/logs:
We have not been able to repro this or issue #409 but are working to determine the cause.
@glowkey any updates? This seems related to a DCGM issue, since I can repro with versions past DCGM 3.3.0 both in a container and on a bare-metal VM by installing DCGM + libdcgm on GCP. If I
It seems like DCGM-exporter calls DCGM's 3.3.1
The gdb output was very helpful and we believe we may now understand the root cause. A fix will be available in the next release. Thank you for your help.
What is the version?
3.3.0 and higher
What happened?
We're seeing segfaults in our EKS environment (running IPv6 clusters) when running dcgm-exporter 3.3.0 and higher (DCGM 3.3.3+). We do not see segfaults when running 3.2.0 (DCGM 3.3.0). I have confirmed this happens on every release after 3.3.0-3.2.0.
Our nodes are running a mix of Bottlerocket OS 1.25.0 (aws-k8s-1.29-nvidia) and Bottlerocket OS 1.26.1 (aws-k8s-1.29-nvidia).
The segfault is similar to this (from 3.3.3-3.3.0):
What did you expect to happen?
dcgm-exporter does not crash
What is the GPU model?
AWS NVIDIA A10G Tensor Core GPU running on AWS g5.xlarge instances
What is the environment?
Running in EKS 1.29
How did you deploy the dcgm-exporter and what is the configuration?
Using helm
How to reproduce the issue?
Upgrade to a release beyond 3.2.0
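For anyone triaging which image tags fall in the range reported here, below is a minimal sketch. It assumes tags follow the `<DCGM version>-<dcgm-exporter version>` pattern seen in this report (e.g. `3.3.3-3.3.0`), optionally followed by a distro suffix. The helper names `parse_tag` and `reported_as_crashing` are hypothetical and not part of dcgm-exporter, and the 3.3.0 threshold only reflects what was observed in this issue.

```python
def parse_tag(tag: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Split '<dcgm>-<exporter>[-suffix]' into two numeric version tuples.

    Assumption: the first dash-separated field is the DCGM version and the
    second is the dcgm-exporter version, as in '3.3.3-3.3.0'.
    """
    dcgm, exporter = tag.split("-")[:2]
    return (
        tuple(int(part) for part in dcgm.split(".")),
        tuple(int(part) for part in exporter.split(".")),
    )


def reported_as_crashing(tag: str) -> bool:
    """True if the exporter version is 3.3.0 or higher (the range in this report)."""
    _, exporter = parse_tag(tag)
    return exporter >= (3, 3, 0)


if __name__ == "__main__":
    for tag in ("3.3.0-3.2.0", "3.3.3-3.3.0"):
        print(tag, reported_as_crashing(tag))
```

Tuple comparison is used so that versions compare numerically per component rather than as strings.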
Anything else we need to know?
No response