
Regression in the 5.9.8 Release #304

Closed
rajatvig opened this issue Sep 3, 2024 · 6 comments · Fixed by #311

Comments

@rajatvig

rajatvig commented Sep 3, 2024

Issue Details

After upgrading from 5.9.4 to 5.9.8, we noticed that the metrics for running builds are not updated after the builds complete. This changes our scaling behavior, because the metric calculation we use sums running and scheduled builds for a queue to decide whether enough agents are running. The metric that stays stuck at the same value is buildkite_queues_running_jobs_count.

Setup

We are running unclustered agents and running the agent metrics binary to export metrics to Prometheus.

@DrJosh9000
Contributor

Hi @rajatvig, thanks for raising the issue. There was one change to the Prometheus backend (#288) which may explain the issue you're seeing: all metrics now have a cluster label, where previously they may not have. This could break queries if the label doesn't match or isn't ignored appropriately. Unfortunately the change was necessary to fix a panic.

Can you share the exact PromQL query?
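
To illustrate why an extra label can break a query: PromQL binary operators between two instant vectors pair series one-to-one on their full label sets unless `on(...)` or `ignoring(...)` is given. The Python sketch below imitates that matching rule. The `capacity` series and the `Default` cluster value are hypothetical and appear nowhere in this thread; this is a simplified model, not Prometheus code.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable label set, e.g. labels(queue="test")."""
    return frozenset(kv.items())


def add_vectors(lhs: Vector, rhs: Vector, ignoring: Tuple[str, ...] = ()) -> Vector:
    """Simplified one-to-one matching for `lhs + rhs`.

    Series pair up only when their label sets are equal after the
    ignored label names are dropped; unmatched series are discarded.
    """
    def key(ls: Labels) -> Labels:
        return frozenset((k, v) for k, v in ls if k not in ignoring)

    rhs_by_key = {key(ls): value for ls, value in rhs.items()}
    return {
        key(ls): value + rhs_by_key[key(ls)]
        for ls, value in lhs.items()
        if key(ls) in rhs_by_key
    }


# Hypothetical data: one side carries the new cluster label, the other does not.
running = {labels(queue="test", cluster="Default"): 2.0}
capacity = {labels(queue="test"): 5.0}

no_match = add_vectors(running, capacity)                        # empty result
matched = add_vectors(running, capacity, ignoring=("cluster",))  # one series
```

Without `ignoring`, the two series never pair up and the expression returns nothing, which is the kind of silent breakage described above.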

@rajatvig
Author

rajatvig commented Sep 5, 2024

I did see that PR merged but wasn't able to tie it back to the issue we are seeing. We are not yet running clustered agents.

The full PromQL we use is

100 * (sum(buildkite_queues_running_jobs_count{queue="queue"} + buildkite_queues_scheduled_jobs_count{queue="queue"}) or vector(0))

That gives us a count of running and scheduled jobs, which helps us determine how many agents we need to run. While the buildkite_queues_scheduled_jobs_count metric was fine, the metric buildkite_queues_running_jobs_count did not go to 0 when there were no builds running.
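
A rough Python model of the query above shows how a stale running count feeds straight into the scaling signal. It assumes exactly one series per queue for each metric (no extra labels), and the function name `scaling_signal` is made up for this sketch.

```python
from typing import Dict


def scaling_signal(running: Dict[str, float],
                   scheduled: Dict[str, float],
                   queue: str) -> float:
    """Model of: 100 * (sum(running{queue} + scheduled{queue}) or vector(0)).

    If either series is absent, the `+` matches nothing, `sum` is empty,
    and `or vector(0)` supplies 0.
    """
    if queue not in running or queue not in scheduled:
        return 0.0
    return 100 * (running[queue] + scheduled[queue])


# Expected once all builds have finished: the running count reports 0.
healthy = scaling_signal({"test": 0}, {"test": 0}, "test")

# Reported behaviour: the running count stays at its last non-zero value.
stuck = scaling_signal({"test": 1}, {"test": 0}, "test")

# Running series absent altogether: the fallback yields 0.
absent = scaling_signal({}, {"test": 0}, "test")
```

Under these assumptions, a running count stuck at 1 keeps the signal at 100 even though no work remains, so the agents never scale down.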

@DrJosh9000
Contributor

I see, interesting. The metric being stuck could be related to #296, which removed a well-intended but heavy-handed gauge reset. Is the metric stuck for all queues, or a particular queue? Is it stuck for queues that were deleted?
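
One way a removed gauge reset could produce a stuck value is sketched below in Python. This is a toy model, not the buildkite-agent-metrics code: the `GaugeVec` class and `publish` function are hypothetical, and it assumes (unconfirmed in this thread) that a poll result simply omits a queue that has nothing to report.

```python
from typing import Dict, List


class GaugeVec:
    """Toy labelled gauge: remembers the last value set for each queue."""

    def __init__(self) -> None:
        self._values: Dict[str, float] = {}

    def set(self, queue: str, value: float) -> None:
        self._values[queue] = value

    def reset(self) -> None:
        # Drop every series, including ones the next poll will not mention.
        self._values.clear()

    def collect(self) -> Dict[str, float]:
        return dict(self._values)


def publish(gauge: GaugeVec, polls: List[Dict[str, float]], reset_first: bool) -> Dict[str, float]:
    """Apply each poll result in order and return what a scrape would see."""
    for poll in polls:
        if reset_first:
            gauge.reset()
        for queue, value in poll.items():
            gauge.set(queue, value)
    return gauge.collect()


# The "test" queue reports 2 running jobs, then 1, then drops out of the results.
polls = [{"test": 2}, {"test": 1}, {}]

with_reset = publish(GaugeVec(), polls, reset_first=True)      # series disappears
without_reset = publish(GaugeVec(), polls, reset_first=False)  # last value lingers
```

With the reset, the series vanishes when the queue stops being reported; without it, the last value stays exported until something overwrites or deletes it.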

@rajatvig
Author

It was stuck for queues that were deleted, i.e. no builds were running.

@DrJosh9000
Contributor

Sounds like #305 should fix it - I'll optimistically close this as fixed, please give v5.9.9 a try and feel free to re-open if you see the same issue.

@rajatvig
Author

I just gave 5.9.9 a try and am still seeing similar behaviour. I set up 2 jobs on the test queue; the metric buildkite_queues_running_jobs_count{queue="test"} went to 2 and then to 1, but did not go to 0 or become absent as it did before.
