Segmentation fault #450

Open
andrewjamesbrown opened this issue Feb 7, 2025 · 3 comments
Labels: bug (Something isn't working)

andrewjamesbrown commented Feb 7, 2025

What is the version?

4.1.0-4.0.2-ubuntu22.04

What happened?

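dcgm-exporter crashes with a segmentation fault at startup in production. The crash occurs while initializing system entities of type 'CPU', inside the cgo call to dcgmGetCpuHierarchy: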

2025/02/07 22:42:43 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
2025/02/07 22:42:43 INFO Starting dcgm-exporter Version=4.1.0-4.0.2
2025/02/07 22:42:43 INFO Attempting to initialize DCGM.
2025/02/07 22:42:43 INFO Initialized DCGM Fields module.
2025/02/07 22:42:43 INFO DCGM successfully initialized!
2025/02/07 22:42:43 INFO Attempting to initialize NVML library.
2025/02/07 22:42:43 ERROR Cannot init NVML library; err: ERROR_LIBRARY_NOT_FOUND
2025/02/07 22:42:43 INFO NVML provider successfully initialized!
2025/02/07 22:42:43 INFO Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
2025/02/07 22:42:43 WARN Skipping line 19 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 20 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 21 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 22 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 23 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 24 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 25 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 26 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 27 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
2025/02/07 22:42:43 WARN Skipping line 28 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
2025/02/07 22:42:43 INFO Initializing system entities of type 'GPU'
2025/02/07 22:42:43 INFO Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system.
2025/02/07 22:42:43 INFO Initializing system entities of type 'NvSwitch'
2025/02/07 22:42:43 INFO Not collecting NvSwitch metrics; no switches to monitor
2025/02/07 22:42:43 INFO Initializing system entities of type 'NvLink'
2025/02/07 22:42:43 INFO Not collecting NvLink metrics; no switches to monitor
2025/02/07 22:42:43 INFO Initializing system entities of type 'CPU'
SIGSEGV: segmentation violation
PC=0x7f06b03716aa m=0 sigcode=1 addr=0x4
signal arrived during cgo execution
goroutine 1 gp=0xc0000061c0 m=0 mp=0x337fe00 [syscall]:
runtime.cgocall(0x1978a50, 0xc00051e148)
	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc00051e120 sp=0xc00051e0e8 pc=0x4195cb
github.com/NVIDIA/go-dcgm/pkg/dcgm._Cfunc_dcgmGetCpuHierarchy(0x7fffffff, 0xc00037b400)
	_cgo_gotypes.go:1178 +0x4b fp=0xc00051e148 sp=0xc00051e120 pc=0x7f0c8b
github.com/NVIDIA/go-dcgm/pkg/dcgm.GetCpuHierarchy()
	/go/pkg/mod/github.com/!n!v!i!d!i!a/[email protected]/pkg/dcgm/cpu.go:42 +0x6b fp=0xc00051edf8 sp=0xc00051e148 pc=0x7f3beb
github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider.dcgmProvider.GetCpuHierarchy(...)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider/dcgm.go:163
github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider.(*dcgmProvider).GetCpuHierarchy(_)
	<autogenerated>:1 +0x7c fp=0xc00051f028 sp=0xc00051edf8 pc=0x17a381c
github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo.(*Info).initializeCPUInfo(0xc000079908, {0x1, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}})
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo/device_info.go:196 +0x9f fp=0xc00051f3d0 sp=0xc00051f028 pc=0x17a55df
github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo.Initialize({0x1, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x1, {0x0, 0x0, ...}, ...}, ...)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo/device_info.go:108 +0x349 fp=0xc00051f440 sp=0xc00051f3d0 pc=0x17a4969
github.com/NVIDIA/dcgm-exporter/internal/pkg/devicewatchlistmanager.(*WatchListManager).CreateEntityWatchList(0xc0003696c0, 0x7, {0x21455d8, 0x33e0820}, 0x7530)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/devicewatchlistmanager/device_watchlist_manager.go:131 +0x458 fp=0xc00051f788 sp=0xc00051f440 pc=0x17cdff8
github.com/NVIDIA/dcgm-exporter/pkg/cmd.startDeviceWatchListManager(0xc0003af800, 0xc000007c00)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:419 +0x305 fp=0xc00051f878 sp=0xc00051f788 pc=0x19764a5
github.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0xc0000ab700, 0xc00027eee0)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:353 +0x34b fp=0xc00051fa70 sp=0xc00051f878 pc=0x197580b
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:311 +0x5b fp=0xc00051fac0 sp=0xc00051fa70 pc=0x19752bb
github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture({0x215c4e8, 0xc0000cedc0}, 0xc0000e5b78)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:76 +0x1e6 fp=0xc00051fb50 sp=0xc00051fac0 pc=0x1972c46
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0000ab700)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:302 +0x67 fp=0xc00051fba8 sp=0xc00051fb50 pc=0x1975227
github.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000ab700?)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:283 +0x13 fp=0xc00051fbc0 sp=0xc00051fba8 pc=0x1978553
github.com/urfave/cli/v2.(*Command).Run(0xc0000d0f20, 0xc0000ab700, {0xc0001200f0, 0x3, 0x3})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x97d fp=0xc00051fe48 sp=0xc00051fbc0 pc=0x818ffd
github.com/urfave/cli/v2.(*App).RunContext(0xc0001bae00, {0x215c280, 0x33e0820}, {0xc0001200f0, 0x3, 0x3})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x58b fp=0xc00051fea8 sp=0xc00051fe48 pc=0x81588b
github.com/urfave/cli/v2.(*App).Run(0xc0000e5f30?, {0xc0001200f0?, 0x1?, 0x48453a?})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f fp=0xc00051fee8 sp=0xc00051fea8 pc=0x8152af
main.main()
	/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:32 +0x5f fp=0xc00051ff50 sp=0xc00051fee8 pc=0x197867f
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc00051ffe0 sp=0xc00051ff50 pc=0x45185d
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00051ffe8 sp=0xc00051ffe0 pc=0x4848e1
goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:326 +0xb3 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x451b13
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x4848e1
created by runtime.init.6 in goroutine 1
	/usr/local/go/src/runtime/proc.go:314 +0x1a
goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0xc000062070)
	/usr/local/go/src/runtime/mgcsweep.go:318 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x43c33f
runtime.gcenable.gowrap1()
	/usr/local/go/src/runtime/mgc.go:203 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x430c45
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x4848e1
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:203 +0x66
goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x2135618?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x337e820)
	/usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x439ce9
runtime.bgscavenge(0xc000062070)
	/usr/local/go/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x43a299
runtime.gcenable.gowrap2()
	/usr/local/go/src/runtime/mgc.go:204 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x430be5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x4848e1
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:204 +0xa5
goroutine 18 gp=0xc000106380 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x423525?, 0xa8?, 0x1?, 0xc0000061c0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x451c8e
runtime.runfinq()
	/usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x42fc87
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x4848e1
created by runtime.createfing in goroutine 1
	/usr/local/go/src/runtime/mfinal.go:164 +0x3d
goroutine 25 gp=0xc0002ca8c0 m=nil [GC worker (idle)]:
runtime.gopark(0x17779ee59754f?, 0x41b6eb?, 0xf7?, 0xaa?, 0xc0003510e0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000080750 sp=0xc000080730 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000807e0 sp=0xc000080750 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 26 gp=0xc0002cbc00 m=nil [GC worker (idle)]:
runtime.gopark(0x17779ee597992?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000080f50 sp=0xc000080f30 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000080fe0 sp=0xc000080f50 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 27 gp=0xc000007a40 m=nil [IO wait]:
runtime.gopark(0x5?, 0x0?, 0x0?, 0x0?, 0xb?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000087430 sp=0xc000087410 pc=0x451c8e
runtime.netpollblock(0x4d8ad8?, 0x418d66?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc000087468 sp=0xc000087430 pc=0x44a9f7
internal/poll.runtime_pollWait(0x7f06687bee70, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc000087488 sp=0xc000087468 pc=0x47f125
internal/poll.(*pollDesc).wait(0xc0000ad2c0?, 0xc0001b9000?, 0x1)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0000874b0 sp=0xc000087488 pc=0x4f5b07
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000ad2c0, {0xc0001b9000, 0x1000, 0x1000})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000087548 sp=0xc0000874b0 pc=0x4f6dfa
os.(*File).read(...)
	/usr/local/go/src/os/file_posix.go:29
os.(*File).Read(0xc000470760, {0xc0001b9000?, 0x0?, 0x0?})
	/usr/local/go/src/os/file.go:118 +0x52 fp=0xc000087588 sp=0xc000087548 pc=0x502252
bufio.(*Scanner).Scan(0xc0002f5600)
	/usr/local/go/src/bufio/scan.go:219 +0x81e fp=0xc000087660 sp=0xc000087588 pc=0x5607fe
github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture.func2()
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:58 +0x50 fp=0xc0000877e0 sp=0xc000087660 pc=0x1972d50
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x4848e1
created by github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture in goroutine 1
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:57 +0x1d9
goroutine 9 gp=0xc00046e1c0 m=nil [IO wait]:
runtime.gopark(0x9f4ca2b4cf45288b?, 0xdb608c66d780917b?, 0x8b?, 0x28?, 0xb?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc0004636f8 sp=0xc0004636d8 pc=0x451c8e
runtime.netpollblock(0x4d8ad8?, 0x418d66?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc000463730 sp=0xc0004636f8 pc=0x44a9f7
internal/poll.runtime_pollWait(0x7f06687bec80, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc000463750 sp=0xc000463730 pc=0x47f125
internal/poll.(*pollDesc).wait(0xc0002f5880?, 0xc00048e000?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000463778 sp=0xc000463750 pc=0x4f5b07
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002f5880, {0xc00048e000, 0x1a80, 0x1a80})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000463810 sp=0xc000463778 pc=0x4f6dfa
net.(*netFD).Read(0xc0002f5880, {0xc00048e000?, 0x7f06687b72c8?, 0xc00012a780?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000463858 sp=0xc000463810 pc=0x64fd85
net.(*conn).Read(0xc00011e528, {0xc00048e000?, 0xc000463938?, 0x42317b?})
	/usr/local/go/src/net/net.go:185 +0x45 fp=0xc0004638a0 sp=0xc000463858 pc=0x660c45
net.(*TCPConn).Read(0xc00044fa40?, {0xc00048e000?, 0x1a7b?, 0x48e000?})
	<autogenerated>:1 +0x25 fp=0xc0004638d0 sp=0xc0004638a0 pc=0x672ba5
crypto/tls.(*atLeastReader).Read(0xc00012a780, {0xc00048e000?, 0x0?, 0xc00012a780?})
	/usr/local/go/src/crypto/tls/conn.go:806 +0x3b fp=0xc000463918 sp=0xc0004638d0 pc=0x6ae5bb
bytes.(*Buffer).ReadFrom(0xc00044fb30, {0x21414c0, 0xc00012a780})
	/usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc000463970 sp=0xc000463918 pc=0x51ef38
crypto/tls.(*Conn).readFromUntil(0xc00044f888, {0x213fa00, 0xc00011e528}, 0xc000463980?)
	/usr/local/go/src/crypto/tls/conn.go:828 +0xde fp=0xc0004639a8 sp=0xc000463970 pc=0x6ae79e
crypto/tls.(*Conn).readRecordOrCCS(0xc00044f888, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:626 +0x3cf fp=0xc000463c28 sp=0xc0004639a8 pc=0x6ab8af
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc00044f888, {0xc00042f000, 0x1000, 0xc00046e380?})
	/usr/local/go/src/crypto/tls/conn.go:1370 +0x156 fp=0xc000463c98 sp=0xc000463c28 pc=0x6b2156
bufio.(*Reader).Read(0xc00042c360, {0xc00039e3c0, 0x9, 0x32d7dd0?})
	/usr/local/go/src/bufio/bufio.go:241 +0x197 fp=0xc000463cd0 sp=0xc000463c98 pc=0x55ddb7
io.ReadAtLeast({0x213f600, 0xc00042c360}, {0xc00039e3c0, 0x9, 0x9}, 0x9)
	/usr/local/go/src/io/io.go:335 +0x90 fp=0xc000463d18 sp=0xc000463cd0 pc=0x4cda30
io.ReadFull(...)
	/usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc00039e3c0, 0x9, 0xc000463dc0?}, {0x213f600?, 0xc00042c360?})
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x65 fp=0xc000463d68 sp=0xc000463d18 pc=0x9b72e5
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00039e380)
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:501 +0x85 fp=0xc000463e10 sp=0xc000463d68 pc=0x9b7a25
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc000463fa8)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2505 +0xda fp=0xc000463f60 sp=0xc000463e10 pc=0x9cb67a
golang.org/x/net/http2.(*ClientConn).readLoop(0xc000107880)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2381 +0x8b fp=0xc000463fc8 sp=0xc000463f60 pc=0x9ca9eb
golang.org/x/net/http2.(*Transport).newClientConn.gowrap1()
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:912 +0x25 fp=0xc000463fe0 sp=0xc000463fc8 pc=0x9c3265
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000463fe8 sp=0xc000463fe0 pc=0x4848e1
created by golang.org/x/net/http2.(*Transport).newClientConn in goroutine 8
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:912 +0xdfb
rax    0x0
rbx    0x7f0660d2d780
rcx    0x7f06b03a4a1c
rdx    0x1
rdi    0x0
rsi    0x42037120
rbp    0x4
rsp    0x7ffdbff35d30
r8     0x90800
r9     0x42037120
r10    0x0
r11    0x287
r12    0xffffffffffffff78
r13    0x2
r14    0x0
r15    0x7ffdbff35f20
rip    0x7f06b03716aa
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0

What did you expect to happen?

dcgm-exporter starts and runs without a segmentation fault.

What is the GPU model?

NVIDIA-A10G

What is the environment?

EKS 1.30
Container runtime: containerd://1.7.24+bottlerocket
Kubelet: v1.30.8-eks-3c20087
OS: Bottlerocket OS 1.32.0 (aws-k8s-1.30-nvidia)

How did you deploy the dcgm-exporter and what is the configuration?

Deployed using the Helm chart, version 4.0.1.

How to reproduce the issue?

No response

Anything else we need to know?

# nvidia-device-plugin
I0207 23:01:22.081491 2398447 main.go:235] "Starting NVIDIA Device Plugin" version="unknown"
I0207 23:01:22.081540 2398447 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0207 23:01:22.081644 2398447 main.go:245] Starting OS watcher.
I0207 23:01:22.082016 2398447 main.go:260] Starting Plugins.
I0207 23:01:22.082073 2398447 main.go:317] Loading configuration.
I0207 23:01:22.082985 2398447 main.go:342] Updating config with default resource matching patterns.
I0207 23:01:22.083246 2398447 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0207 23:01:22.083298 2398447 main.go:356] Retrieving plugins.
I0207 23:01:22.105568 2398447 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I0207 23:01:22.107445 2398447 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0207 23:01:22.112041 2398447 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

nvidia-smi output from the dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 image (which is mostly stable):

# nvidia-smi
Fri Feb  7 23:03:49 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   35C    P0              93W / 300W |   5872MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

andrewjamesbrown added the bug (Something isn't working) label Feb 7, 2025

andrewjamesbrown (Author) commented

Also seeing the same segmentation fault (again in the cgo call to dcgmGetCpuHierarchy) with 4.0.0-4.0.1:

2025/02/07 22:48:18 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
2025/02/07 22:48:18 INFO Starting dcgm-exporter Version=4.0.0-4.0.1
2025/02/07 22:48:18 INFO Attempting to initialize DCGM.
2025/02/07 22:48:18 INFO Initialized DCGM Fields module.
2025/02/07 22:48:18 INFO DCGM successfully initialized!
2025/02/07 22:48:18 INFO Attempting to initialize NVML library.
2025/02/07 22:48:18 INFO NVML provider successfully initialized!
2025/02/07 22:48:18 INFO Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
2025/02/07 22:48:18 WARN Skipping line 19 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 20 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 21 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 22 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 23 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 24 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 25 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 26 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 27 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
2025/02/07 22:48:18 WARN Skipping line 28 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
2025/02/07 22:48:18 INFO Initializing system entities of type 'GPU'
2025/02/07 22:48:18 INFO Initializing system entities of type 'NvSwitch'
2025/02/07 22:48:18 INFO Not collecting NvSwitch metrics; no switches to monitor
2025/02/07 22:48:18 INFO Initializing system entities of type 'NvLink'
2025/02/07 22:48:18 INFO Not collecting NvLink metrics; no switches to monitor
2025/02/07 22:48:18 INFO Initializing system entities of type 'CPU'
SIGSEGV: segmentation violation
PC=0x7feaa9ee16aa m=7 sigcode=1 addr=0x4
signal arrived during cgo execution
goroutine 1 gp=0xc0000061c0 m=7 mp=0xc0003c4008 [syscall]:
runtime.cgocall(0x1975150, 0xc0005e8148)
	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0005e8120 sp=0xc0005e80e8 pc=0x4195cb
github.com/NVIDIA/go-dcgm/pkg/dcgm._Cfunc_dcgmGetCpuHierarchy(0x7fffffff, 0xc000158a00)
	_cgo_gotypes.go:1178 +0x4b fp=0xc0005e8148 sp=0xc0005e8120 pc=0x7f0c8b
github.com/NVIDIA/go-dcgm/pkg/dcgm.GetCpuHierarchy()
	/go/pkg/mod/github.com/!n!v!i!d!i!a/[email protected]/pkg/dcgm/cpu.go:42 +0x6b fp=0xc0005e8df8 sp=0xc0005e8148 pc=0x7f3beb
github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider.dcgmProvider.GetCpuHierarchy(...)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider/dcgm.go:163
github.com/NVIDIA/dcgm-exporter/internal/pkg/dcgmprovider.(*dcgmProvider).GetCpuHierarchy(_)
	<autogenerated>:1 +0x7c fp=0xc0005e9028 sp=0xc0005e8df8 pc=0x17a1b7c
github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo.(*Info).initializeCPUInfo(0xc000170f08, {0x1, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}})
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo/device_info.go:196 +0x9f fp=0xc0005e93d0 sp=0xc0005e9028 pc=0x17a393f
github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo.Initialize({0x1, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x1, {0x0, 0x0, ...}, ...}, ...)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/deviceinfo/device_info.go:108 +0x349 fp=0xc0005e9440 sp=0xc0005e93d0 pc=0x17a2cc9
github.com/NVIDIA/dcgm-exporter/internal/pkg/devicewatchlistmanager.(*WatchListManager).CreateEntityWatchList(0xc000385450, 0x7, {0x2140538, 0x33d9600}, 0x7530)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/devicewatchlistmanager/device_watchlist_manager.go:131 +0x458 fp=0xc0005e9788 sp=0xc0005e9440 pc=0x17cc358
github.com/NVIDIA/dcgm-exporter/pkg/cmd.startDeviceWatchListManager(0xc00020f020, 0xc000103180)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:412 +0x305 fp=0xc0005e9878 sp=0xc0005e9788 pc=0x1972bc5
github.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0xc0003ae400, 0xc000509800)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:346 +0x34b fp=0xc0005e9a70 sp=0xc0005e9878 pc=0x1971f2b
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:304 +0x5b fp=0xc0005e9ac0 sp=0xc0005e9a70 pc=0x19719db
github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture({0x2157448, 0xc000033cc0}, 0xc000507b78)
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:76 +0x1e6 fp=0xc0005e9b50 sp=0xc0005e9ac0 pc=0x196f406
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0003ae400)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:295 +0x67 fp=0xc0005e9ba8 sp=0xc0005e9b50 pc=0x1971947
github.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0003ae400?)
	/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x13 fp=0xc0005e9bc0 sp=0xc0005e9ba8 pc=0x1974c53
github.com/urfave/cli/v2.(*Command).Run(0xc000034840, 0xc0003ae400, {0xc000052120, 0x3, 0x3})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x97d fp=0xc0005e9e48 sp=0xc0005e9bc0 pc=0x818ffd
github.com/urfave/cli/v2.(*App).RunContext(0xc00011ee00, {0x21571e0, 0x33d9600}, {0xc000052120, 0x3, 0x3})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x58b fp=0xc0005e9ea8 sp=0xc0005e9e48 pc=0x81588b
github.com/urfave/cli/v2.(*App).Run(0xc000507f30?, {0xc000052120?, 0x1?, 0x48453a?})
	/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f fp=0xc0005e9ee8 sp=0xc0005e9ea8 pc=0x8152af
main.main()
	/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:32 +0x5f fp=0xc0005e9f50 sp=0xc0005e9ee8 pc=0x1974d7f
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0005e9fe0 sp=0xc0005e9f50 pc=0x45185d
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005e9fe8 sp=0xc0005e9fe0 pc=0x4848e1
goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:326 +0xb3 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x451b13
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x4848e1
created by runtime.init.6 in goroutine 1
	/usr/local/go/src/runtime/proc.go:314 +0x1a
goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0xc000062070)
	/usr/local/go/src/runtime/mgcsweep.go:318 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x43c33f
runtime.gcenable.gowrap1()
	/usr/local/go/src/runtime/mgc.go:203 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x430c45
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x4848e1
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:203 +0x66
goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x21305b0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x451c8e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x3377600)
	/usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x439ce9
runtime.bgscavenge(0xc000062070)
	/usr/local/go/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x43a299
runtime.gcenable.gowrap2()
	/usr/local/go/src/runtime/mgc.go:204 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x430be5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x4848e1
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:204 +0xa5
goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x1f9cc40?, 0x20?, 0xa0?, 0x2000000020?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x451c8e
runtime.runfinq()
	/usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x42fc87
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x4848e1
created by runtime.createfing in goroutine 1
	/usr/local/go/src/runtime/mfinal.go:164 +0x3d
goroutine 8 gp=0xc0001f5a40 m=nil [GC worker (idle)]:
runtime.gopark(0xc0000867a8?, 0x41b6eb?, 0xf7?, 0xaa?, 0xc00039c5a0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000086750 sp=0xc000086730 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000867e0 sp=0xc000086750 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 38 gp=0xc0003b8000 m=nil [GC worker (idle)]:
runtime.gopark(0x90613d673d3e?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000080750 sp=0xc000080730 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000807e0 sp=0xc000080750 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 9 gp=0xc000362540 m=nil [GC worker (idle)]:
runtime.gopark(0x33db080?, 0x3?, 0xb8?, 0x16?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000086f50 sp=0xc000086f30 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000086fe0 sp=0xc000086f50 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 20 gp=0xc000102380 m=nil [GC worker (idle)]:
runtime.gopark(0x90613d67be6d?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000150750 sp=0xc000150730 pc=0x451c8e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0001507e0 sp=0xc000150750 pc=0x432d25
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0001507e8 sp=0xc0001507e0 pc=0x4848e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c
goroutine 10 gp=0xc000102fc0 m=nil [IO wait]:
runtime.gopark(0x5?, 0x0?, 0x0?, 0x0?, 0xb?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000081430 sp=0xc000081410 pc=0x451c8e
runtime.netpollblock(0x4d8ad8?, 0x418d66?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc000081468 sp=0xc000081430 pc=0x44a9f7
internal/poll.runtime_pollWait(0x7fea61aeb700, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc000081488 sp=0xc000081468 pc=0x47f125
internal/poll.(*pollDesc).wait(0xc00012d8c0?, 0xc0002bf000?, 0x1)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0000814b0 sp=0xc000081488 pc=0x4f5b07
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00012d8c0, {0xc0002bf000, 0x1000, 0x1000})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000081548 sp=0xc0000814b0 pc=0x4f6dfa
os.(*File).read(...)
	/usr/local/go/src/os/file_posix.go:29
os.(*File).Read(0xc000391510, {0xc0002bf000?, 0x0?, 0x0?})
	/usr/local/go/src/os/file.go:118 +0x52 fp=0xc000081588 sp=0xc000081548 pc=0x502252
bufio.(*Scanner).Scan(0xc000311e00)
	/usr/local/go/src/bufio/scan.go:219 +0x81e fp=0xc000081660 sp=0xc000081588 pc=0x5607fe
github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture.func2()
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:58 +0x50 fp=0xc0000817e0 sp=0xc000081660 pc=0x196f510
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x4848e1
created by github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout.Capture in goroutine 1
	/go/src/github.com/NVIDIA/dcgm-exporter/internal/pkg/stdout/capture.go:57 +0x1d9
goroutine 50 gp=0xc000102c40 m=nil [IO wait]:
runtime.gopark(0x573fba71e8357c7c?, 0xd0f8435a63256fe0?, 0x7c?, 0x7c?, 0xb?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc00022f6f8 sp=0xc00022f6d8 pc=0x451c8e
runtime.netpollblock(0x4d8ad8?, 0x418d66?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc00022f730 sp=0xc00022f6f8 pc=0x44a9f7
internal/poll.runtime_pollWait(0x7fea61aeb608, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc00022f750 sp=0xc00022f730 pc=0x47f125
internal/poll.(*pollDesc).wait(0xc0001c6a00?, 0xc000124000?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00022f778 sp=0xc00022f750 pc=0x4f5b07
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001c6a00, {0xc000124000, 0x1980, 0x1980})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc00022f810 sp=0xc00022f778 pc=0x4f6dfa
net.(*netFD).Read(0xc0001c6a00, {0xc000124000?, 0x7fea61225a48?, 0xc000382498?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc00022f858 sp=0xc00022f810 pc=0x64fd85
net.(*conn).Read(0xc000282090, {0xc000124000?, 0xc00022f938?, 0x42317b?})
	/usr/local/go/src/net/net.go:185 +0x45 fp=0xc00022f8a0 sp=0xc00022f858 pc=0x660c45
net.(*TCPConn).Read(0xc000366540?, {0xc000124000?, 0x197b?, 0x124000?})
	<autogenerated>:1 +0x25 fp=0xc00022f8d0 sp=0xc00022f8a0 pc=0x672ba5
crypto/tls.(*atLeastReader).Read(0xc000382498, {0xc000124000?, 0x0?, 0xc000382498?})
	/usr/local/go/src/crypto/tls/conn.go:806 +0x3b fp=0xc00022f918 sp=0xc00022f8d0 pc=0x6ae5bb
bytes.(*Buffer).ReadFrom(0xc000366630, {0x213c420, 0xc000382498})
	/usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc00022f970 sp=0xc00022f918 pc=0x51ef38
crypto/tls.(*Conn).readFromUntil(0xc000366388, {0x213a980, 0xc000282090}, 0xc00022f980?)
	/usr/local/go/src/crypto/tls/conn.go:828 +0xde fp=0xc00022f9a8 sp=0xc00022f970 pc=0x6ae79e
crypto/tls.(*Conn).readRecordOrCCS(0xc000366388, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:626 +0x3cf fp=0xc00022fc28 sp=0xc00022f9a8 pc=0x6ab8af
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc000366388, {0xc000122000, 0x1000, 0xc000102700?})
	/usr/local/go/src/crypto/tls/conn.go:1370 +0x156 fp=0xc00022fc98 sp=0xc00022fc28 pc=0x6b2156
bufio.(*Reader).Read(0xc00035ade0, {0xc0003fe3c0, 0x9, 0x32d0dd0?})
	/usr/local/go/src/bufio/bufio.go:241 +0x197 fp=0xc00022fcd0 sp=0xc00022fc98 pc=0x55ddb7
io.ReadAtLeast({0x213a580, 0xc00035ade0}, {0xc0003fe3c0, 0x9, 0x9}, 0x9)
	/usr/local/go/src/io/io.go:335 +0x90 fp=0xc00022fd18 sp=0xc00022fcd0 pc=0x4cda30
io.ReadFull(...)
	/usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc0003fe3c0, 0x9, 0xc00022fdc0?}, {0x213a580?, 0xc00035ade0?})
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x65 fp=0xc00022fd68 sp=0xc00022fd18 pc=0x9b7245
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003fe380)
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:501 +0x85 fp=0xc00022fe10 sp=0xc00022fd68 pc=0x9b7985
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc00022ffa8)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2319 +0xda fp=0xc00022ff60 sp=0xc00022fe10 pc=0x9ca85a
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0002e7380)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2215 +0x8b fp=0xc00022ffc8 sp=0xc00022ff60 pc=0x9c9e0b
golang.org/x/net/http2.(*Transport).newClientConn.gowrap1()
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:830 +0x25 fp=0xc00022ffe0 sp=0xc00022ffc8 pc=0x9c2b05
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00022ffe8 sp=0xc00022ffe0 pc=0x4848e1
created by golang.org/x/net/http2.(*Transport).newClientConn in goroutine 16
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:830 +0xd5b
rax    0x0
rbx    0x7fea587fa740
rcx    0x7feaa9f14a1c
rdx    0x1
rdi    0x0
rsi    0x7fea48000fc0
rbp    0x4
rsp    0x7fea61211000
r8     0x90800
r9     0x7fea48000fc0
r10    0x0
r11    0x287
r12    0xffffffffffffff78
r13    0x2
r14    0x0
r15    0x7fea612111f0
rip    0x7feaa9ee16aa
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0

@andrewjamesbrown
Author

Our configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
data:
  metrics: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD                                                      ,Prometheus metric type ,help message
    
    # Clocks
    DCGM_FI_DEV_SM_CLOCK                                              ,gauge                  ,SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK                                             ,gauge                  ,Memory clock frequency (in MHz).
    
    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP                                           ,gauge                  ,Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP                                              ,gauge                  ,GPU temperature (in C).
    
    # Power
    DCGM_FI_DEV_POWER_USAGE                                           ,gauge                  ,Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION                              ,counter                ,Total energy consumption since boot (in mJ).
    
    # PCIE
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER                                   ,counter                ,Total number of PCIe retries.
    
    # Utilization (the sample period varies depending on the product)
    DCGM_FI_DEV_GPU_UTIL                                              ,gauge                  ,GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL                                         ,gauge                  ,Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL                                              ,gauge                  ,Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL                                              ,gauge                  ,Decoder utilization (in %).
    
    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS                                            ,gauge                  ,Value of the last XID error encountered.
    
    # Memory usage
    DCGM_FI_DEV_FB_FREE                                               ,gauge                  ,Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED                                               ,gauge                  ,Framebuffer memory used (in MiB).
    
    # NVLink
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL                                ,counter                ,Total number of NVLink bandwidth counters for all lanes.
    
    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS                                   ,gauge                  ,vGPU License status
    
    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS                           ,counter                ,Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS                             ,counter                ,Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE                                     ,gauge                  ,Whether remapping of rows has failed
    
    # DCP metrics
    DCGM_FI_PROF_PCIE_TX_BYTES                                        ,counter                ,The number of bytes of active pcie tx data including both header and payload.
    DCGM_FI_PROF_PCIE_RX_BYTES                                        ,counter                ,The number of bytes of active pcie rx data including both header and payload.
    DCGM_FI_PROF_GR_ENGINE_ACTIVE                                     ,gauge                  ,Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE                                            ,gauge                  ,The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY                                         ,gauge                  ,The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE                                   ,gauge                  ,Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE                                          ,gauge                  ,Ratio of cycles the device memory interface is active sending or receiving data (in %).
    DCGM_FI_PROF_PIPE_FP64_ACTIVE                                     ,gauge                  ,Ratio of cycles the fp64 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP32_ACTIVE                                     ,gauge                  ,Ratio of cycles the fp32 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP16_ACTIVE                                     ,gauge                  ,Ratio of cycles the fp16 pipes are active (in %).
    
    # Datadog additional recommended fields
    # Enabling this makes dcgm-exporter 3.3.0+ crash
    # DCGM_FI_DEV_COUNT                                                 ,counter                ,Number of Devices on the node.
    DCGM_FI_DEV_FAN_SPEED                                             ,gauge                  ,Fan speed for the device in percent 0-100.
    DCGM_FI_DEV_SLOWDOWN_TEMP                                         ,gauge                  ,Slowdown temperature for the device.
    DCGM_FI_DEV_POWER_MGMT_LIMIT                                      ,gauge                  ,Current power limit for the device.
    DCGM_FI_DEV_PSTATE                                                ,gauge                  ,Performance state (P-State) 0-15. 0=highest
    DCGM_FI_DEV_FB_TOTAL                                              ,gauge                  ,
    DCGM_FI_DEV_FB_RESERVED                                           ,gauge                  ,
    DCGM_FI_DEV_FB_USED_PERCENT                                       ,gauge                  ,
    DCGM_FI_DEV_CLOCK_THROTTLE_REASONS                                ,gauge                  ,Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*)
    
    # Enabling this makes dcgm-exporter 3.3.0+ crash
    # DCGM_FI_PROCESS_NAME                                              ,label                  ,The Process Name.
    # Enabling this makes dcgm-exporter 3.3.0+ crash
    # DCGM_FI_CUDA_DRIVER_VERSION                                       ,label                  ,
    DCGM_FI_DEV_NAME                                                  ,label                  ,
    DCGM_FI_DEV_MINOR_NUMBER                                          ,label                  ,
    DCGM_FI_DRIVER_VERSION                                            ,label                  ,
    DCGM_FI_DEV_BRAND                                                 ,label                  ,
    DCGM_FI_DEV_SERIAL                                                ,label                  ,
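
For readers unfamiliar with the metrics format used in the ConfigMap above (`DCGM FIELD, Prometheus metric type, help message`, with `#` starting a comment), here is a minimal Go sketch of how such lines could be read. It only illustrates the format. The `Metric` type and `parseMetrics` function are hypothetical names made up for this example and are not dcgm-exporter's actual parser, which may treat quoting and commas differently.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Metric is a hypothetical type for this sketch; it is not dcgm-exporter's own.
type Metric struct {
	Field string // DCGM field name, e.g. DCGM_FI_DEV_GPU_TEMP
	Type  string // gauge, counter, or label
	Help  string // free-form help text; may be empty
}

// parseMetrics reads lines of the form "FIELD ,type ,help", skipping blank
// lines and lines that start with '#'. This is an assumed reading of the
// format described in the ConfigMap comments, not the exporter's real logic.
func parseMetrics(text string) ([]Metric, error) {
	var out []Metric
	sc := bufio.NewScanner(strings.NewReader(text))
	lineNo := 0
	for sc.Scan() {
		lineNo++
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Split into at most 3 parts so commas in the help text are preserved.
		parts := strings.SplitN(line, ",", 3)
		if len(parts) != 3 {
			return nil, fmt.Errorf("line %d: expected 3 comma-separated fields, got %d", lineNo, len(parts))
		}
		out = append(out, Metric{
			Field: strings.TrimSpace(parts[0]),
			Type:  strings.TrimSpace(parts[1]),
			Help:  strings.TrimSpace(parts[2]),
		})
	}
	return out, sc.Err()
}

func main() {
	sample := `
# Temperature
DCGM_FI_DEV_GPU_TEMP    ,gauge   ,GPU temperature (in C).
# DCGM_FI_DEV_COUNT     ,counter ,Number of Devices on the node.
DCGM_FI_DEV_NAME        ,label   ,
`
	metrics, err := parseMetrics(sample)
	if err != nil {
		panic(err)
	}
	for _, m := range metrics {
		fmt.Printf("%s type=%s help=%q\n", m.Field, m.Type, m.Help)
	}
}
```

Under this reading, the commented-out entries (`DCGM_FI_DEV_COUNT`, `DCGM_FI_PROCESS_NAME`, `DCGM_FI_CUDA_DRIVER_VERSION`) are skipped entirely, and entries with an empty help column still parse as valid lines.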

@andrewjamesbrown
Author

Possibly related to #448?
