Current Behavior
The sparrow's ChecksController method that shuts down the checks one by one fails to shut down all checks and makes the sparrow panic. This only happens if more than one check has been registered with the sparrow.
Expected Behavior
The sparrow should be able to shut down gracefully in case of an unexpected error/signal.
Steps To Reproduce
1. Send a keyboard interrupt (Ctrl+C) to the sparrow while it is running
2. The panic happens once one of the registered checks' Shutdown method gets called
Relevant logs and/or screenshots, environment information, etc.
The logs show the panic happening for the latency check and almost give away the reason behind this weird behavior:
^Ctime=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/cmd/run.go:96 msg="Signal received, shutting down"
time=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:184 msg="Shutting down sparrow gracefully"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:127 msg="Shut down signal received"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/dns/dns.go:92 msg="Context canceled" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=dns error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:82 msg="Context canceled" err="context canceled"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:148 msg="Stopping gitlab reconciliation routine"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=latency error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:86 msg="Error while reconciling targets" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/health/health.go:80 msg="Context canceled" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=health error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
panic: send on closed channel
goroutine 8 [running]:
github.com/caas-team/sparrow/pkg/checks/latency.(*Latency).Shutdown(0x14000416000)
/Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:102 +0x2c
github.com/caas-team/sparrow/pkg/sparrow.(*ChecksController).UnregisterCheck(0x140002788c0, {0x1054300c8, 0x1400033e380}, {0x105435a58, 0x14000416000})
/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:168 +0x144
github.com/caas-team/sparrow/pkg/sparrow.(*ChecksController).Shutdown(0x140002788c0, {0x1054300c8, 0x1400033e380})
/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:87 +0x80
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).shutdown.func1()
/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:192 +0x118
sync.(*Once).doSlow(0x10542fed0?, 0x1059c1ca0?)
/opt/homebrew/opt/go/libexec/src/sync/once.go:76 +0xf8
sync.(*Once).Do(...)
/opt/homebrew/opt/go/libexec/src/sync/once.go:67
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).shutdown(0x1400022c480, {0x105430020, 0x140003f09c0})
/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:183 +0xd0
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run(0x1400022c480, {0x105430020?, 0x140000f2de0?})
/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:127 +0x368
github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1.1()
/Users/puffito/dev/repos/sparrow/cmd/run.go:91 +0x30
created by github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1 in goroutine 1
/Users/puffito/dev/repos/sparrow/cmd/run.go:90 +0x240
Process finished with the exit code 2
Who can address the issue?
No response
Anything else?
The error happens because the controller iterates over the actual slice of Checks that the Iter() function returns. Since checks are removed from the internal checks slice immediately after being unregistered (see the ChecksController's UnregisterCheck method), the slice being iterated over gets modified and we end up trying to delete the same check twice.
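To make the mechanism concrete, here is a minimal, self-contained sketch (not sparrow's actual code, just a stand-in with plain strings for checks) of how removing elements from the very slice being ranged over causes an element to be visited twice:

```go
package main

import "fmt"

func main() {
	checks := []string{"dns", "latency", "health"}

	// The range expression is evaluated once, so the loop keeps the original
	// length and backing array even after elements are removed below.
	for _, c := range checks {
		fmt.Println("unregistering", c)
		checks = remove(checks, c) // mutates the shared backing array
	}
	// Output:
	// unregistering dns
	// unregistering health
	// unregistering health   <- same check visited twice
}

// remove deletes the first occurrence of v, shifting the remaining elements
// left (the same effect as deleting a check from the internal slice).
func remove(s []string, v string) []string {
	for i := range s {
		if s[i] == v {
			return append(s[:i], s[i+1:]...)
		}
	}
	return s
}
```

In the real controller the second visit calls Shutdown on a check whose channel has already been closed, which is the "send on closed channel" panic in the trace above.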
Adding a simple debug log in UnregisterCheck confirms the suspicion. In the logs below, the health check is unregistered twice in a row, which shouldn't happen, since the shutdown is only triggered once (via the sync.Once) in the sparrow and the ChecksController simply iterates over the current checks.
Given the asynchronous nature of how the shutdown of the checks works (sending a signal on a channel), a simple way to deal with this is to return a copy of the checks slice from the Iter() function.
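A rough sketch of that approach, using hypothetical type and field names (checkList, mu, checks) rather than sparrow's actual internals:

```go
package controller

import "sync"

// checkList is a stand-in for the controller's internal check collection;
// the real type, field, and element types in sparrow may differ.
type checkList struct {
	mu     sync.RWMutex
	checks []string // stand-in for the controller's []checks.Check
}

// Iter returns a snapshot copy of the registered checks instead of the
// internal slice, so unregistering a check during iteration only mutates
// the internal slice and never the one the caller is ranging over.
func (c *checkList) Iter() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cp := make([]string, len(c.checks))
	copy(cp, c.checks)
	return cp
}
```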