Use failure_rate instead of failure count for circuit breaker #18539
Closed
github-actions
bot
added
awaiting-review
PR is awaiting review from an assigned reviewer
team-Remote-Exec
Issues and PRs for the Execution (Remote) team
labels
May 30, 2023
@coeuvre Can you review?
coeuvre
requested changes
May 31, 2023
src/main/java/com/google/devtools/build/lib/remote/options/RemoteOptions.java
...test/java/com/google/devtools/build/lib/remote/circuitbreaker/FailureCircuitBreakerTest.java
src/main/java/com/google/devtools/build/lib/remote/circuitbreaker/FailureCircuitBreaker.java
@coeuvre I incorporated the feedback, please review.
coeuvre
approved these changes
May 31, 2023
coeuvre
added
awaiting-PR-merge
PR has been approved by a reviewer and is ready to be merge internally
and removed
awaiting-review
PR is awaiting review from an assigned reviewer
labels
May 31, 2023
iancha1992
removed
the
awaiting-PR-merge
PR has been approved by a reviewer and is ready to be merge internally
label
Jun 7, 2023
amishra-u
added a commit
to amishra-u/bazel
that referenced
this pull request
Jun 7, 2023
Continuation of bazelbuild#18359

I ran multiple experiments to find the optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache. As I described [here](bazelbuild#18359 (comment)), even with a healthy remote cache there was a 5-10% circuit trip rate and we were not getting the best results.

Issues related to the failure count:
1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure than other calls, are batched immediately after an action's build completes. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a failure-count configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
copybara-service bot
pushed a commit
that referenced
this pull request
Jun 13, 2023
When the digest size exceeds the maximum digest size configured by the remote cache, an "out_of_range" error is returned. These errors should not be considered API failures for the circuit breaker logic, as they do not indicate any issue with the remote cache service. Similarly, there are other non-retriable errors, such as ALREADY_EXISTS, that should not be treated as server failures.

This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED`, `UNKNOWN` etc. are logged as failures.

Related PRs #18359 #18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
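The classification described in this commit message can be sketched roughly as follows. This is only an illustration, not the code from the commit: the class `CircuitBreakerOutcomeSketch`, its local `Code` enum, and the method `countsAsFailure` are hypothetical names, and the retriable set contains only the two codes the message names explicitly (the message's "etc" is left open).

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: decides which call outcomes the circuit breaker counts as failures. */
final class CircuitBreakerOutcomeSketch {
  /** Local stand-in for status codes; only the codes named in the commit message, plus OK. */
  enum Code { OK, OUT_OF_RANGE, ALREADY_EXISTS, DEADLINE_EXCEEDED, UNKNOWN }

  // Assumption: only the retriable codes listed explicitly in the message.
  private static final Set<Code> RETRIABLE =
      EnumSet.of(Code.DEADLINE_EXCEEDED, Code.UNKNOWN);

  /** Retriable errors count as failures; everything else is recorded as a success. */
  static boolean countsAsFailure(Code code) {
    return RETRIABLE.contains(code);
  }

  private CircuitBreakerOutcomeSketch() {}
}
```

With this split, an oversized digest (`OUT_OF_RANGE`) or an `ALREADY_EXISTS` response leaves the failure rate untouched, while a `DEADLINE_EXCEEDED` raises it.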
traversaro
pushed a commit
to traversaro/bazel
that referenced
this pull request
Jun 24, 2023
traversaro
pushed a commit
to traversaro/bazel
that referenced
this pull request
Jun 24, 2023
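Continuation of #18359
I ran multiple experiments to find the optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described here, even with a healthy remote cache there was a 5-10% circuit trip rate and we were not getting the best results.
Issues related to the failure count:
1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure than other calls, are batched immediately after an action's build completes. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.
Finding a failure-count configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.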
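To make the count-versus-rate argument concrete, here is a minimal sketch of a rate-based breaker. It is not the `FailureCircuitBreaker` from this PR: the class name `FailureRateBreakerSketch`, the `minCallsToEvaluate` guard, and the manual `resetWindow()` call are assumptions made for illustration, and the real implementation handles the window interval itself.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a breaker that trips on failure rate rather than failure count. */
final class FailureRateBreakerSketch {
  enum State { ACCEPT_CALLS, REJECT_CALLS }

  private final int failureRateThresholdPercent; // e.g. 10 means "more than 10% failed"
  private final int minCallsToEvaluate;          // assumption: ignore the rate on tiny samples
  private final AtomicInteger successes = new AtomicInteger();
  private final AtomicInteger failures = new AtomicInteger();
  private volatile State state = State.ACCEPT_CALLS;

  FailureRateBreakerSketch(int failureRateThresholdPercent, int minCallsToEvaluate) {
    this.failureRateThresholdPercent = failureRateThresholdPercent;
    this.minCallsToEvaluate = minCallsToEvaluate;
  }

  void recordSuccess() {
    successes.incrementAndGet();
  }

  void recordFailure() {
    int failed = failures.incrementAndGet();
    int total = failed + successes.get();
    // Trip only when enough calls were seen and the failure share exceeds the threshold.
    if (total >= minCallsToEvaluate
        && failed * 100L > (long) failureRateThresholdPercent * total) {
      state = State.REJECT_CALLS;
    }
  }

  /** Assumption: the caller invokes this at the end of each window interval. */
  void resetWindow() {
    successes.set(0);
    failures.set(0);
    state = State.ACCEPT_CALLS;
  }

  State state() {
    return state;
  }

  int failureCount() {
    return failures.get();
  }
}
```

With a 10% threshold, a fast build that makes 1000 calls with 50 evenly spread failures (5%) keeps the circuit closed, while a slow build that makes 120 calls with 30 failures (25%) trips it. The healthy build has the higher absolute failure count, which is why no single count threshold separates the two cases.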