Running latest DCGM exporter not working on GKE #448

Open
puneetloya opened this issue Feb 2, 2025 · 0 comments
Labels
question Further information is requested

Comments

@puneetloya
Ask your question

Logs:

2025/02/02 10:09:46 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2025/02/02 10:09:46 INFO Starting dcgm-exporter Version=4.0.0-4.0.1
2025/02/02 10:09:46 INFO Attempting to initialize DCGM.
2025/02/02 10:09:46 INFO Initialized DCGM Fields module.
2025/02/02 10:09:46 INFO DCGM successfully initialized!
2025/02/02 10:09:46 INFO Attempting to initialize NVML library.
2025/02/02 10:09:46 INFO NVML provider successfully initialized!
2025/02/02 10:09:46 INFO Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
2025/02/02 10:09:46 INFO Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'
2025/02/02 10:09:46 WARN Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
2025/02/02 10:09:46 WARN Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
2025/02/02 10:09:46 WARN Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
2025/02/02 10:09:46 WARN Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
2025/02/02 10:09:46 WARN Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
2025/02/02 10:09:46 INFO Initializing system entities of type 'GPU'
2025/02/02 10:09:46 INFO Initializing system entities of type 'NvSwitch'
2025/02/02 10:09:46 INFO Not collecting NvSwitch metrics; no switches to monitor
2025/02/02 10:09:46 INFO Initializing system entities of type 'NvLink'
2025/02/02 10:09:46 INFO Not collecting NvLink metrics; no switches to monitor
2025/02/02 10:09:46 INFO Initializing system entities of type 'CPU'
SIGSEGV: segmentation violation

More detailed logs here: https://gist.github.com/puneetloya/2ec191a1ac79b3e23f76036e1a7d70fa
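For anyone triaging a similar crash, here is a minimal sketch of how the last initialization step before the segfault can be pulled out of the exporter output. It only scans text in the format shown in the logs above; the function name `last_entity_before_crash` and the `SAMPLE_LOG` excerpt are illustrative assumptions, not part of dcgm-exporter.

```python
import re

# Excerpt copied from the logs above; a real run would read the full pod log instead.
SAMPLE_LOG = """\
2025/02/02 10:09:46 INFO Initializing system entities of type 'GPU'
2025/02/02 10:09:46 INFO Initializing system entities of type 'NvSwitch'
2025/02/02 10:09:46 INFO Not collecting NvSwitch metrics; no switches to monitor
2025/02/02 10:09:46 INFO Initializing system entities of type 'NvLink'
2025/02/02 10:09:46 INFO Not collecting NvLink metrics; no switches to monitor
2025/02/02 10:09:46 INFO Initializing system entities of type 'CPU'
SIGSEGV: segmentation violation
"""

ENTITY_RE = re.compile(r"Initializing system entities of type '([^']+)'")


def last_entity_before_crash(log_text):
    """Return the last entity type being initialized when SIGSEGV appears.

    Hypothetical helper: returns None if the log contains no SIGSEGV line
    or no entity initialization precedes it.
    """
    last_entity = None
    for line in log_text.splitlines():
        match = ENTITY_RE.search(line)
        if match:
            last_entity = match.group(1)
        elif line.startswith("SIGSEGV"):
            return last_entity
    return None


if __name__ == "__main__":
    print(last_entity_before_crash(SAMPLE_LOG))
```

On the excerpt above this points at the 'CPU' entity step, which matches what I see by eye: the crash follows the CPU initialization line directly.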

My values.yaml:

serviceMonitor:
  enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists

tolerations:
  - operator: Exists

securityContext:
  privileged: true

priorityClassName: dcgm-exporter

extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia

extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true

extraEnv:
- name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
  value: device-name

I tried the steps mentioned in this issue: #385.

Technically, I don't need any CPU metrics, so why does the exporter initialize system entities of type 'CPU'? Maybe my understanding is incomplete. I am using the default ConfigMap provided by the Helm chart and have tried different versions of it, but the result is the same.
