Use failure_rate instead of failure count for circuit breaker #18539

amishra-u · 2023-05-30T21:46:39Z

Continuation of #18359
I ran multiple experiments to find the optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described here, even with a healthy remote cache there was a 5-10% circuit trip rate, and we were not getting the best results.

Issues with the failure count:

When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a failure-count configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.
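To illustrate the difference, here is a minimal sketch of a rate-based breaker. It is not the code in `FailureCircuitBreaker.java`: the class name, the `minCallsToComputeRate` guard, and the threshold values are assumptions made for illustration, and the failure window interval is left out to keep the example short.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; names and parameters are hypothetical, not Bazel's API. */
final class FailureRateBreakerSketch {
  enum State { ACCEPT_CALLS, REJECT_CALLS }

  // Assumed knobs: trip when the failure percentage exceeds this threshold...
  private final int failureRateThresholdPercent;
  // ...but only after enough calls have been observed for the rate to be meaningful.
  private final int minCallsToComputeRate;

  private final AtomicInteger successes = new AtomicInteger();
  private final AtomicInteger failures = new AtomicInteger();
  private volatile State state = State.ACCEPT_CALLS;

  FailureRateBreakerSketch(int failureRateThresholdPercent, int minCallsToComputeRate) {
    this.failureRateThresholdPercent = failureRateThresholdPercent;
    this.minCallsToComputeRate = minCallsToComputeRate;
  }

  void recordSuccess() {
    successes.incrementAndGet();
  }

  void recordFailure() {
    int failed = failures.incrementAndGet();
    int total = failed + successes.get();
    if (total < minCallsToComputeRate) {
      return; // Too few samples to judge the rate.
    }
    // The decision depends on the ratio, not on the absolute number of failures,
    // so a fast build with many calls is not penalized for its call volume.
    double ratePercent = 100.0 * failed / total;
    if (ratePercent > failureRateThresholdPercent) {
      state = State.REJECT_CALLS;
    }
  }

  State state() {
    return state;
  }
}
```

With a 10% threshold, 50 failures out of 1000 calls (5%) leave this sketch accepting calls, while 50 failures out of 130 calls trip it. A count-based threshold of, say, 40 failures would have tripped in both cases.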

linzhp · 2023-05-30T23:04:02Z

@coeuvre Can you review?

src/main/java/com/google/devtools/build/lib/remote/options/RemoteOptions.java

...test/java/com/google/devtools/build/lib/remote/circuitbreaker/FailureCircuitBreakerTest.java

src/main/java/com/google/devtools/build/lib/remote/circuitbreaker/FailureCircuitBreaker.java

amishra-u · 2023-05-31T15:34:22Z

@coeuvre I've incorporated the feedback; please review.

Continuation of bazelbuild#18359

I ran multiple experiments to find the optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache. As I described [here](bazelbuild#18359 (comment)), even with a healthy remote cache there was a 5-10% circuit trip rate, and we were not getting the best results.

Issues with the failure count:

1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a failure-count configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528

When the digest size exceeds the maximum digest size configured by the remote cache, an "out_of_range" error is returned. These errors should not be considered API failures for the circuit breaker logic, as they do not indicate any issue with the remote cache service. Similarly, there are other non-retriable errors, such as ALREADY_EXISTS, that should not be treated as server failures. This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED` and `UNKNOWN` are logged as failures.

Related PRs #18359 #18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
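The classification described in this commit message can be sketched as follows. This is an assumption-laden illustration, not the merged change: the `Code` enum is a hypothetical stand-in that lists only the status names mentioned above, and the retriable set contains only the two codes the message names explicitly.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; the enum and method names are hypothetical. */
final class BreakerErrorClassifierSketch {
  // Stand-in for RPC status codes; only the names mentioned in the commit message.
  enum Code { OK, ALREADY_EXISTS, OUT_OF_RANGE, DEADLINE_EXCEEDED, UNKNOWN }

  // Assumption: only these two codes are treated as retriable in this sketch.
  private static final Set<Code> RETRIABLE =
      EnumSet.of(Code.DEADLINE_EXCEEDED, Code.UNKNOWN);

  /** Returns true if the breaker should record the call as a failure. */
  static boolean countsAsFailure(Code code) {
    // Non-retriable errors point at the request (client side), not at the
    // health of the remote cache, so they are recorded as successes.
    return RETRIABLE.contains(code);
  }

  private BreakerErrorClassifierSketch() {}
}
```

Under this sketch, an oversized digest that produces `OUT_OF_RANGE` no longer moves the breaker toward tripping, while a timeout still does.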

amishra-u and others added 3 commits May 30, 2023 14:00

Use failure_rate instead of failure count for circuit breaker

d7c7b67

Merge branch 'bazelbuild:master' into master

c4758a2

indentation

11f401e

amishra-u marked this pull request as ready for review May 30, 2023 22:04

amishra-u requested a review from a team as a code owner May 30, 2023 22:04

github-actions bot added the awaiting-review and team-Remote-Exec labels May 30, 2023

decrease sleep time for slow test runs

3b14ded

coeuvre requested changes May 31, 2023

Minor Update in help

2862e12

coeuvre approved these changes May 31, 2023

coeuvre added the awaiting-PR-merge label and removed the awaiting-review label May 31, 2023

amishra-u mentioned this pull request Jun 1, 2023

[6.3.0] Use failure_rate instead of failure count for circuit breaker #18559

Merged

Merge branch 'bazelbuild:master' into master

445ccc8

amishra-u mentioned this pull request Jun 6, 2023

Minor Update: Add out_of_range to ignored failure list for circuit_breaker #18583

Closed

copybara-service bot closed this in 10fb5f6 Jun 7, 2023

iancha1992 removed the awaiting-PR-merge label Jun 7, 2023

amishra-u mentioned this pull request Jun 8, 2023

Update ignored_error logic for circuit_breaker #18613

Closed

amishra-u mentioned this pull request Jun 13, 2023

[6.3.0] Update ignored_error logic for circuit_breaker #18662

Merged
