Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/coordinator: prioritize gomote buildlets for pool quotas #48857

Closed
toothrot opened this issue Oct 8, 2021 · 20 comments
Closed

cmd/coordinator: prioritize gomote buildlets for pool quotas #48857

toothrot opened this issue Oct 8, 2021 · 20 comments
Assignees
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@toothrot
Copy link
Contributor

toothrot commented Oct 8, 2021

When working on the 1.16.9 and 1.17.2 releases (#48842, #48843), a buildlet was pending for the linux-s390x-cross builder for some time, despite seemingly having quotas.

The cause was threefold:

  1. We hard-limit our maximum number of instances at 190 to avoid our external IP address quota
  2. Gomotes are prioritized in the scheduler per-HostType. However, the GCE (and other) pools are unaware that the pending machine is for an interactive (and critical) session.

In the second case, it's a race for quota for which machine gets created across all host types. For unpopular (or release-only builders), it's the worst case scenario, as there will be very few pending s390x-cross builders to be able to prioritize the gomote properly in the scheduler.

We can make the following changes:

  • Stop relying on external IPs for GCE (see https://golang.org/cl/354642 - project was missing NAT egress configuration)
    • This eliminates our narrowest quota, we have many more CPUs that we can never use as we are bottlenecked on IPs
  • Prioritize pool quota for gomotes
  • Possibly reserve some quota for very important things, like security releases.
@toothrot toothrot added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 8, 2021
@toothrot toothrot added this to the Unreleased milestone Oct 8, 2021
@toothrot toothrot self-assigned this Oct 8, 2021
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/354644 mentions this issue: cmd/coordinator,internal/coordinator: add gce pool priority

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418951 mentions this issue: internal/coordinator/pool: add QuotaQueue documentation

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418952 mentions this issue: internal/coordinator/pool: use QuotaQueue for reverse buildlets

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418953 mentions this issue: internal/coordinator: add QuotaQueue to EC2 pool

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419034 mentions this issue: internal/coordinator/pool: make QuotaQueue synchronous

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418950 mentions this issue: internal/coordinator/pool: move queue waiting to QuotaQueue

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418949 mentions this issue: internal/coordinator/pool: manage quotas in queue

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418948 mentions this issue: internal/coordinator: add GCE priority queue

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419035 mentions this issue: internal/coordinator/pool: rename QuotaQueue to queue.Quota

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419036 mentions this issue: internal/coordinator: add GCE priority queue

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419429 mentions this issue: cmd/coordinator,internal/coordinator: add queue UI

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419428 mentions this issue: cmd/coordinator,internal,maintner: prioritize relui builds

gopherbot pushed a commit to golang/build that referenced this issue Jul 27, 2022
This refactors schedLess to use priorities and a sortable key for each
SchedItem. The behavior of the scheduler remains identical with these
changes.

This is in preparation for using a consistent sortable key for scheduler
items in order to prioritize buildlets across different quotas in
buildlet pools.

For golang/go#48857

Change-Id: Ie6a3570d9f5ac2fe27c9ede43672011b3b7ee8fa
Reviewed-on: https://go-review.googlesource.com/c/build/+/354644
Auto-Submit: Jenny Rakoczy <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
gopherbot pushed a commit to golang/build that referenced this issue Jul 27, 2022
Adds a priority queue per-cpu-quota for buildlets. The scheduler
prioritization mechanism was unaware of per-CPU-type quotas. This
replaces it with a queue per-quota implemented instead on the pool.

With this change, this should correctly prioritize gomotes that are
waiting over other builds.

Previously, when waiting on a gomote (such as for a releaselet), other
smaller host types could win the quota race before a larger releaselet
would be created. This was particularly problematic during minor
releases with a number of security fixes, which causes a very large
number of concurrent builds due to all of the backports being committed
shortly before the release.

reverse buildlets: This change replaces waiters and waitChan with a
per-hostType queue for prioritizing remote buildlet requests.

EC2 pool: This replaces quota management in the EC2 pool with
QuotaQueue. Some minor changes were necessary to deal with bookkeeping
across different instance types. A1 metal instances, for example, have a
separate hard-limit that we want to enforce, even though they also
consume the normal CPU quota.

Moved internal/coordinator/log to avoid an import cycle.

Moved SchedItem to internal/coordinator/pool/queue, where it can live
next to the Quota code that relies on it heavily.

For golang/go#48857

Change-Id: Ic4039cf425d47f46e2cc6678b4b578338c653143
Reviewed-on: https://go-review.googlesource.com/c/build/+/419036
Auto-Submit: Jenny Rakoczy <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
gopherbot pushed a commit to golang/build that referenced this issue Jul 28, 2022
This adds a UI for viewing what builds are in each queue.

For golang/go#48857

Change-Id: I025d01510833cc9c0d23556d7f90a62119eb39fb
Reviewed-on: https://go-review.googlesource.com/c/build/+/419429
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
Auto-Submit: Jenny Rakoczy <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419433 mentions this issue: internal/coordinator/pool: fix gce quota message

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Jul 28, 2022
gopherbot pushed a commit to golang/build that referenced this issue Jul 28, 2022
Remove stray queue length call, which wasn't even in the correct spot
for Sprintf.

For golang/go#48857

Change-Id: I98bb33f0b3e51d5abe25d1675554b96be893b1f9
Reviewed-on: https://go-review.googlesource.com/c/build/+/419433
Auto-Submit: Jenny Rakoczy <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/420875 mentions this issue: internal/coordinator: correct dashboard quota string

gopherbot pushed a commit to golang/build that referenced this issue Aug 2, 2022
These were accidentally swapped when changing return argument order.

For golang/go#48857

Change-Id: I0f70d20c9e0cdb23b3d17cc51f9dfb7ce27cc108
Reviewed-on: https://go-review.googlesource.com/c/build/+/420875
Run-TryBot: Jenny Rakoczy <[email protected]>
Auto-Submit: Jenny Rakoczy <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/420797 mentions this issue: internal/coordinator/pool: subtract untracked instances from quotas

gopherbot pushed a commit to golang/build that referenced this issue Aug 3, 2022
Attempt to reduce raciness when we are at our quota limit by subtracting
instances we're not tracking from the available quota, and relying on
our counts to be otherwise correct for GCE.

Reduce the interval for pollQuota to one minute, and only update limits.
pollQuota now queries all instances and accounts for their CPUs
separately from the usage numbers.

Our previous approach was problematic, as pollQuota would continually
clobber our tracking, which may have been causing instances to contend
for the same available quota.

For golang/go#48857

Change-Id: I482c5900567a95e8ded968176dd8851a0aee1631
Reviewed-on: https://go-review.googlesource.com/c/build/+/420797
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Auto-Submit: Jenny Rakoczy <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/422596 mentions this issue: cmd/coordinator,internal/gomote: fix release prioritization

gopherbot pushed a commit to golang/build that referenced this issue Aug 10, 2022
Updates the username matching in the coordinator to determine if a
buildlet is for a release.

Updates golang/go#48857

Change-Id: I0d7197ac2955baab2d8b3eb37bb265cf77a43876
Reviewed-on: https://go-review.googlesource.com/c/build/+/422596
Auto-Submit: Jenny Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/426794 mentions this issue: buildlet,internal/coordinator/pool: retry quota exceeded

gopherbot pushed a commit to golang/build that referenced this issue Aug 30, 2022
Handle QUOTA_EXCEEDED errors when trying to create GCE buildlets. Due to
contention with other processes (GKE, etc.) in our builder project, we
may not have a perfect representation of quota usage. Instead, retry as
long as someone is waiting for a builder.

Also, delete some dead code.

Updates golang/go#48857

Change-Id: Iecf48378bdd46a4c817c09436fc45b4e27744ab6
Reviewed-on: https://go-review.googlesource.com/c/build/+/426794
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Auto-Submit: Jenny Rakoczy <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/426798 mentions this issue: buildlet: copy GCE errors

gopherbot pushed a commit to golang/build that referenced this issue Aug 30, 2022
This was intended to be part of CL 426794.

Updates golang/go#48857

Change-Id: Ic317c7fb633910c969f27bd220e17dba148e8909
Reviewed-on: https://go-review.googlesource.com/c/build/+/426798
Run-TryBot: Jenny Rakoczy <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/428114 mentions this issue: buildlet: log something useful for quota errors

gopherbot pushed a commit to golang/build that referenced this issue Sep 2, 2022
As implemented, it just logs a pointer address.

Updates golang/go#48857

Change-Id: Iba5bc8dbbb7a8eb7b5c110cd34233731e4f534ef
Reviewed-on: https://go-review.googlesource.com/c/build/+/428114
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Run-TryBot: Jenny Rakoczy <[email protected]>
@heschi heschi moved this to Done in Go Release Sep 27, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/435379 mentions this issue: buildlet: include details in GCEError.Error

gopherbot pushed a commit to golang/build that referenced this issue Sep 28, 2022
Some of our logging in the coordinator includes pointers and repetition,
like so:

2022/09/27 19:16:27 failed with errors: [0xc02b50e120]
2022/09/27 19:16:27 Failed to create VM: "failed with errors: [0xc02b50e120]". Retrying after 1 second (attempt: 2).

The GCE error types only have MarshalJSON methods, so use that to show
more information. (This builds on the previous attempt in CL 428114.)
Remove the duplicate log, since the caller can log or otherwise handle
the error.

For golang/go#48857.

Change-Id: I0d31cac833ea617b4d722751e3ffe9a6689aeb87
Reviewed-on: https://go-review.googlesource.com/c/build/+/435379
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Reviewed-by: Jenny Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@golang golang locked and limited conversation to collaborators Sep 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
Archived in project
Development

No branches or pull requests

3 participants