Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Sparrow's shutdown sends signal on closed channel #244

Open
1 task done
puffitos opened this issue Dec 25, 2024 · 0 comments · May be fixed by #245
Open
1 task done

Bug: Sparrow's shutdown sends signal on closed channel #244

puffitos opened this issue Dec 25, 2024 · 0 comments · May be fixed by #245
Assignees
Labels
area/checks-controller Issues/PRs related to the ChecksController bug Something isn't working

Comments

@puffitos
Copy link
Collaborator

puffitos commented Dec 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The sparow's ChecksController method to shutdown the check's one by one fails to shutdown all checks and makes the sparrow panic. This only happens if more than one check has been registered with the sparrow.

Expected Behavior

The sparrow should be able to shutdown gracefully in case of an unexpected error/ signal.

Steps To Reproduce

  1. Send keyboard interrupt to sparrow while running (Ctrl+C)
  2. panic happens once of the registered check's Shutdown method get's called

Relevant logs and/or screenshots, environment information, etc.

The logs show the panic happening for the latency check and almost give away the reason behind this weird behavior:

^Ctime=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/cmd/run.go:96 msg="Signal received, shutting down"
time=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:184 msg="Shutting down sparrow gracefully"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:127 msg="Shut down signal received"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/dns/dns.go:92 msg="Context canceled" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=dns error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:82 msg="Context canceled" err="context canceled"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:148 msg="Stopping gitlab reconciliation routine"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=latency error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:86 msg="Error while reconciling targets" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/health/health.go:80 msg="Context canceled" err="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=health error="context canceled"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=16:39:31 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=16:39:31 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
time=16:39:31 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=latency
panic: send on closed channel

goroutine 8 [running]:
github.com/caas-team/sparrow/pkg/checks/latency.(*Latency).Shutdown(0x14000416000)
        /Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:102 +0x2c
github.com/caas-team/sparrow/pkg/sparrow.(*ChecksController).UnregisterCheck(0x140002788c0, {0x1054300c8, 0x1400033e380}, {0x105435a58, 0x14000416000})
        /Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:168 +0x144
github.com/caas-team/sparrow/pkg/sparrow.(*ChecksController).Shutdown(0x140002788c0, {0x1054300c8, 0x1400033e380})
        /Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:87 +0x80
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).shutdown.func1()
        /Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:192 +0x118
sync.(*Once).doSlow(0x10542fed0?, 0x1059c1ca0?)
        /opt/homebrew/opt/go/libexec/src/sync/once.go:76 +0xf8
sync.(*Once).Do(...)
        /opt/homebrew/opt/go/libexec/src/sync/once.go:67
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).shutdown(0x1400022c480, {0x105430020, 0x140003f09c0})
        /Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:183 +0xd0
github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).Run(0x1400022c480, {0x105430020?, 0x140000f2de0?})
        /Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:127 +0x368
github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1.1()
        /Users/puffito/dev/repos/sparrow/cmd/run.go:91 +0x30
created by github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1 in goroutine 1
        /Users/puffito/dev/repos/sparrow/cmd/run.go:90 +0x240

Process finished with the exit code 2

Who can address the issue?

No response

Anything else?

The error happens because the controller iterates over the actual slice of Checks, which the Iter() function returns. Since the Checks get immediately removed from the internal checks slice after being unregistered (see the ChecksController UnregisterCheck method), the referenced slice with the checks get's modified and we end up trying to delete the same check twice.

Adding a simple debug log in the UnregisterCheck confirms the suspicion. The healthcheck is being unregistered twice in a row in the logs below, which shouldn't happen, as the shutdown is only called once (with the sync.Once) in the sparrow and the ChecksController just iterates over the current checks.

time=16:53:07 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=16:53:07 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:160 msg="Unregistering check" check=latency
time=16:53:07 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:160 msg="Unregistering check" check=health
time=16:53:07 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:160 msg="Unregistering check" check=health
time=16:53:07 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:164 msg="Could not remove metrics collector from registry" check=health

Given the asynchronous nature of how the shutdown of the check's works (sending a signal in a channel), a simple way to deal with this is to get a copy of the checks slice when calling the Iter() function.

@puffitos puffitos added bug Something isn't working area/checks-controller Issues/PRs related to the ChecksController labels Dec 25, 2024
@puffitos puffitos self-assigned this Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks-controller Issues/PRs related to the ChecksController bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant