
fix(controller): Enable dummy metrics server on non-leader workflow controller #11295

Merged
6 commits merged into argoproj:master on Jul 7, 2023

Conversation

sakai-ast (Contributor)

Fixes #10037

Motivation

Our Datadog agents on GKE collect Prometheus metrics from multiple workflow-controllers through the Kubernetes Service. However, the metrics server only starts on the leader workflow-controller, so some Datadog agents frequently log "Connection refused" errors.
This PR aims to resolve that issue.

Modifications

I have implemented the suggestion from #8283 (comment): non-leader controllers now run a dummy metrics server (see the sketch below).
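
For context, a minimal sketch of what such a dummy metrics server can look like in Go follows. This is illustrative only: the function name `runDummyMetricsServer`, the address, and the wiring are assumptions, not the code in this PR.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// runDummyMetricsServer serves an empty /metrics payload until ctx is cancelled,
// e.g. when this controller wins the leader election and the real Prometheus
// metrics server takes over the port.
func runDummyMetricsServer(ctx context.Context, addr string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // non-leaders have no metrics to report
	})
	srv := &http.Server{Addr: addr, Handler: mux}

	go func() {
		<-ctx.Done()
		_ = srv.Shutdown(context.Background())
	}()
	_ = srv.ListenAndServe() // returns http.ErrServerClosed after Shutdown
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go runDummyMetricsServer(ctx, ":9090")
	time.Sleep(2 * time.Second) // stand-in for waiting to become leader
	cancel()
}
```

With this in place, scrapers hitting a non-leader controller get an HTTP 200 with no metrics instead of a refused connection.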

Verification

I have added tests and verified the functionality locally.

@sakai-ast marked this pull request as ready for review July 5, 2023 11:18
@terrytangyuan (Member) left a comment

@alexec Do you have a moment to take a look at this one given that you commented and closed the mentioned issue previously?

@terrytangyuan merged commit 137d5f8 into argoproj:master Jul 7, 2023
@sakai-ast deleted the add-dummy-metrics-server branch July 13, 2023 00:23
terrytangyuan pushed a commit that referenced this pull request Aug 11, 2023
@agilgur5 added the area/controller and area/metrics labels Oct 6, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request May 9, 2024
MasonM added a commit to MasonM/argo-workflows that referenced this pull request Oct 9, 2024
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test
that's been failing periodically for a while now, I noticed this in the
controller logs
([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947   25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in
argoproj#11295. Here's
the sequence of events that trigger this:
1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5-6 can happen out of order, because the shutdown
happens after the context is cancelled. Per the docs, "a CancelFunc does
not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc).
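
A standalone illustration of that behaviour (not Argo code; it only shows that `cancel()` returns before the goroutine's cleanup has finished):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		<-ctx.Done()
		// Simulated cleanup work, e.g. an http.Server releasing its listener.
		time.Sleep(100 * time.Millisecond)
		fmt.Println("shutdown finished")
	}()

	cancel()
	// cancel() has already returned, but the goroutine above may still be
	// holding resources (in the controller's case, port 9090).
	fmt.Println("cancel returned")

	time.Sleep(200 * time.Millisecond) // let the goroutine finish before exiting
}
```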

The controller needs to explicitly wait for the dummy metrics server to
shut down properly before starting the Prometheus metrics server.
There are many ways of doing that; this uses a `WaitGroup`, as that's
the simplest approach I could think of.
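
Roughly, the pattern looks like the sketch below. This is an illustration only; the names, port, and placement are placeholders rather than the actual patch.

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// Dummy metrics server for the non-leader phase.
		defer wg.Done() // signals that the listener has actually been released
		srv := &http.Server{Addr: ":9090"}
		go func() {
			<-ctx.Done()
			_ = srv.Shutdown(context.Background())
		}()
		_ = srv.ListenAndServe() // returns ErrServerClosed once Shutdown closes the listener
	}()

	time.Sleep(time.Second) // stand-in for winning the leader election

	cancel()  // ask the dummy server to stop...
	wg.Wait() // ...and, crucially, wait until port 9090 is free again

	// Only now is it safe to bind the real Prometheus metrics server to :9090.
	realSrv := &http.Server{Addr: ":9090"}
	_ = realSrv.ListenAndServe()
}
```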

Signed-off-by: Mason Malone <[email protected]>
Labels
area/controller, area/metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow controller metrics server only starts after leadership election is won
3 participants