fix(controller): Enable dummy metrics server on non-leader workflow controller #11295
Merged: terrytangyuan merged 6 commits into argoproj:master from sakai-ast:add-dummy-metrics-server on Jul 7, 2023
…ontroller Signed-off-by: sakai <[email protected]>
@alexec Do you have a moment to take a look at this one given that you commented and closed the mentioned issue previously?
Signed-off-by: sakai <[email protected]>
Signed-off-by: sakai <[email protected]>
… environment Signed-off-by: sakai <[email protected]>
Signed-off-by: sakai <[email protected]>
Signed-off-by: sakai <[email protected]>
terrytangyuan approved these changes on Jul 7, 2023
terrytangyuan pushed a commit that referenced this pull request on Aug 11, 2023
…ontroller (#11295) Signed-off-by: sakai <[email protected]>
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request on May 9, 2024
…ontroller (argoproj#11295) Signed-off-by: sakai <[email protected]> Signed-off-by: Dillen Padhiar <[email protected]>
MasonM added a commit to MasonM/argo-workflows that referenced this pull request on Oct 9, 2024
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test that's been failing periodically for a while now, I noticed this in the controller logs ([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947 25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in argoproj#11295. Here's the sequence of events that triggers it:

1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5-6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc). The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.

Signed-off-by: Mason Malone <[email protected]>
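The wait-before-rebind idea in the commit message above can be illustrated with a minimal, self-contained sketch. This is not the controller's actual code: `startDummy` and `handOver` are hypothetical names chosen for illustration, and the sketch assumes the dummy server is a plain `net/http` server. It relies on `http.Server.Serve` closing its listener before it returns, so `wg.Wait()` returning means the port is free again.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"sync"
)

// startDummy (hypothetical helper) serves empty 200 responses on ln until
// ctx is cancelled. wg is marked done only after Serve has returned, at
// which point the listener, and therefore the port, has been released.
func startDummy(ctx context.Context, ln net.Listener, wg *sync.WaitGroup) {
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})}
	wg.Add(1)
	go func() {
		defer wg.Done()
		go func() {
			<-ctx.Done()
			// Shutdown closes the listener, which makes Serve return.
			_ = srv.Shutdown(context.Background())
		}()
		_ = srv.Serve(ln)
	}()
}

// handOver (hypothetical helper) simulates steps 2-6: run the dummy server,
// cancel it, wait for it to stop, then bind the same address again.
func handOver(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("dummy listen: %w", err)
	}
	bound := ln.Addr().String() // concrete address, even if addr used port 0

	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	startDummy(ctx, ln, &wg)

	cancel()  // only signals shutdown; it does not wait for it
	wg.Wait() // explicit wait: the dummy server has released the port

	next, err := net.Listen("tcp", bound)
	if err != nil {
		return fmt.Errorf("real listen: %w", err)
	}
	return next.Close()
}

func main() {
	if err := handOver("127.0.0.1:0"); err != nil {
		fmt.Println("handover failed:", err)
		return
	}
	fmt.Println("handover ok")
}
```

Without the `wg.Wait()` call, the second `net.Listen` could run before the dummy server has closed its listener, which is the "address already in use" failure shown in the logs.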
Fixes #10037
Motivation
Our Datadog agents on GKE collect Prometheus metrics from multiple workflow-controllers through the Kubernetes Service. However, the metrics server only starts on the leader workflow-controller, so agents that are routed to a non-leader frequently log "Connection refused" errors.
This PR aims to resolve that by enabling a dummy metrics server on non-leader workflow controllers.
Modifications
I have tried to implement the comment: #8283 (comment)
Verification
I have added tests and verified the functionality locally.