
fix(controller): Enable dummy metrics server on non-leader workflow controller #11295

Merged
6 commits merged into argoproj:master on Jul 7, 2023

Conversation

sakai-ast (Contributor)

Fixes #10037

Motivation

Our Datadog agents on GKE collect Prometheus metrics from multiple workflow-controllers through the Kubernetes Service. However, the metrics server only starts on the leader workflow-controller, so some Datadog agents frequently log "Connection refused" errors.
This PR aims to resolve that issue.

Modifications

I have implemented the suggestion from #8283 (comment): non-leader controllers now run a dummy metrics server (see the sketch below).
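
For context, a minimal sketch of what such a dummy metrics server can look like in Go follows. This is illustrative only: the function name `runDummyMetricsServer`, the address, and the wiring are assumptions, not the code in this PR.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// runDummyMetricsServer serves an empty /metrics payload until ctx is cancelled,
// e.g. when this controller wins the leader election and the real Prometheus
// metrics server takes over the port.
func runDummyMetricsServer(ctx context.Context, addr string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // non-leaders have no metrics to report
	})
	srv := &http.Server{Addr: addr, Handler: mux}

	go func() {
		<-ctx.Done()
		_ = srv.Shutdown(context.Background())
	}()
	_ = srv.ListenAndServe() // returns http.ErrServerClosed after Shutdown
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go runDummyMetricsServer(ctx, ":9090")
	time.Sleep(2 * time.Second) // stand-in for waiting to become leader
	cancel()
}
```

With this in place, scrapers hitting a non-leader controller get an HTTP 200 with no metrics instead of a refused connection.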

Verification

I have added tests and verified the functionality locally.

@sakai-ast marked this pull request as ready for review July 5, 2023 11:18
@terrytangyuan (Member) left a comment

@alexec Do you have a moment to take a look at this one given that you commented and closed the mentioned issue previously?

@terrytangyuan merged commit 137d5f8 into argoproj:master Jul 7, 2023
@sakai-ast deleted the add-dummy-metrics-server branch July 13, 2023 00:23
terrytangyuan pushed a commit that referenced this pull request Aug 11, 2023
@agilgur5 added the area/controller and area/metrics labels Oct 6, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request May 9, 2024
MasonM added a commit to MasonM/argo-workflows that referenced this pull request Oct 9, 2024
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test
that's been failing periodically for a while now, I noticed this in the
controller logs
([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947   25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in
argoproj#11295. Here's
the sequence of events that trigger this:
1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5-6 can happen out of order, because the shutdown
happens after the context is cancelled. Per the docs, "a CancelFunc does
not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc).
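
A standalone illustration of that behaviour (not Argo code; it only shows that `cancel()` returns before the goroutine's cleanup has finished):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		<-ctx.Done()
		// Simulated cleanup work, e.g. an http.Server releasing its listener.
		time.Sleep(100 * time.Millisecond)
		fmt.Println("shutdown finished")
	}()

	cancel()
	// cancel() has already returned, but the goroutine above may still be
	// holding resources (in the controller's case, port 9090).
	fmt.Println("cancel returned")

	time.Sleep(200 * time.Millisecond) // let the goroutine finish before exiting
}
```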

The controller needs to explicitly wait for the dummy metrics server to
shut down properly before starting the Prometheus metrics server.
There are many ways of doing that; this uses a `WaitGroup`, as that's
the simplest approach I could think of.
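
Roughly, the pattern looks like the sketch below. This is an illustration only; the names, port, and placement are placeholders rather than the actual patch.

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// Dummy metrics server for the non-leader phase.
		defer wg.Done() // signals that the listener has actually been released
		srv := &http.Server{Addr: ":9090"}
		go func() {
			<-ctx.Done()
			_ = srv.Shutdown(context.Background())
		}()
		_ = srv.ListenAndServe() // returns ErrServerClosed once Shutdown closes the listener
	}()

	time.Sleep(time.Second) // stand-in for winning the leader election

	cancel()  // ask the dummy server to stop...
	wg.Wait() // ...and, crucially, wait until port 9090 is free again

	// Only now is it safe to bind the real Prometheus metrics server to :9090.
	realSrv := &http.Server{Addr: ":9090"}
	_ = realSrv.ListenAndServe()
}
```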

Signed-off-by: Mason Malone <[email protected]>
Labels
area/controller, area/metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow controller metrics server only starts after leadership election is won
3 participants