[do not merge] Ensure client and scheduler are resilient to server autoscaling #2277
While profiling distributed build cluster performance, I found that forcing the client to fall back to local compilation is the largest contributor to overall build time. Presently this happens due to at least one bug, but also due to sub-optimal error handling in the client and scheduler.
These issues are amplified when autoscaling sccache-dist servers: the errors happen more frequently and can lead to sub-optimal autoscaling behavior, which leads to more errors, and so on.
So this PR is a collection of fixes for the sccache client, scheduler, and server to better support dist-server autoscaling, as well as general improvements for tracing and debugging distributed compilation across clients, schedulers, and workers.
Adds `job_id` and `server_id` to client logs, and adds `job_id` to scheduler and server logs. This makes it significantly easier to trace build cluster failures across client and server logs.

Nit: Searching through unstructured/ad-hoc log lines is difficult; using something like `structured_logger` instead of `env_logger` would also improve this experience.

## Build cluster configuration
Before diving into these changes, I should describe the architecture of the cluster for which these changes are necessary.
- A single `sccache-dist` scheduler instance, which receives forwarded connections from Traefik.
- An autoscaling group (ASG) of `sccache-dist` servers, which scale in and out based on load, and are associated with one of a fixed pool of ports on the Traefik instance. For example, if the ASG includes up to 10 instances, Traefik will open 10 ports (e.g. 10500-10509) and associate each port with a worker.

Workers are associated with and forwarded traffic from one of Traefik's open ports when they start up, and un-associated with that port when they shut down. When a new worker starts up, it could be associated with any free port, even ports previously associated with a different worker.
Note: While this PR isn't related, this description assumes sccache has been compiled with the changes in #1922, as that's necessary for the workers to report the `public_url` of the API Gateway instead of their private VPC address.

## Certificate handling for server scale in and out
When the server cluster goes through a cycle of scaling out, in, then out again, the new servers may be available at addresses that were previously associated with an old server. This presents a challenge for certificate handling, because the client and scheduler may have cached certificates for the initial instance, and those certs are not valid for communicating with the new instance:
In the initial state, the client and scheduler cached certificates for servers A and B. After scaling in and out again, the client and scheduler attempt and fail to use the certificates generated by server A to communicate with server C. I believe this is because the certificates for A and C both embed `127.0.0.1:10500` as their `SubjectAlternativeName`, and this confuses `reqwest`.

df2e4a1 updates the client to track certificates by `server_id` like the server does, and updates both the client and scheduler to remove the old certificate from the certs map before adding the existing certs to the `reqwest` client builder.

## Scheduler job allocation resiliency
There's a delay between when servers scale in and when the scheduler prunes them from its list of active servers. In this window, the scheduler may attempt to allocate jobs to these servers. When this fails, the current behavior is to return an error to the client, which then falls back to a local compile.
This is sub-optimal for an autoscaling strategy: by rejecting the jobs, the scheduler pushes work back to the clients, where the autoscaler cannot see it.
For example, if the autoscaler scales in from 64 to 32 CPUs, and in the meantime the scheduler rejects the next 32 jobs to compile locally, the autoscaler believes it is in a steady-state rather than recognizing there are 64 units of work to handle.
At best, this leads to delays in scaling up, and at worst it can cause the autoscaler to believe it can continue to scale down.
The best solution is for the scheduler to handle the `alloc_job` failure and attempt to allocate the job to the next-best server candidate, until either the job is allocated or the candidate list is exhausted. This ensures the autoscaler will see the existing instances get busier, and stop scaling in/start scaling out again.

Example of starting a cluster with 3 initial workers, scaling down to 1, then running a distributed compile before the scheduler has pruned the dead servers:
## Client job execution resiliency
It's also possible for a server to be taken offline while it's running jobs for clients. In this scenario the scheduler's `alloc_job` succeeds while the worker is still alive, but the worker is destroyed while the client is waiting on the `run_job` response.

To avoid the expensive local compilation, the client should handle the failure and retry the job on a new server assigned by the scheduler. When combined with the feature described in the previous section, the scheduler should reallocate the job to a live server.
Example client logs when the worker shuts down during `run_job` and the client retries: