SOLR-15056: add circuit breaker for CPU, fix load circuit breaker #96
Thanks for all the suggestions. I'll do some work on this and come back with a new PR.
I am confused by your second point -- I thought your initial suggestion was that the threshold should be unbounded because the load average metric is unbounded (which is the case right now). So what exactly are you proposing to check, please? I agree that checking that the value is greater than zero is valuable, but did you have anything else in mind?
For the CPU and memory utilization circuit breakers, the valid range is 0.0 to 1.0. For the load average circuit breaker, the threshold needs to be greater than 0.0. Right now, the config could be "-42" and it would be accepted.
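A minimal sketch of the range checks described in this comment. The class and method names are hypothetical and are not Solr's API; the ranges are taken from the comment as written (utilization as a fraction from 0.0 to 1.0, load average strictly positive with no upper bound).

```java
// Hypothetical helper, not part of Solr: illustrates fail-early threshold validation.
final class ThresholdCheck {

    /** Utilization thresholds (CPU, memory) are assumed to be fractions in [0.0, 1.0]. */
    static double requireUtilization(double threshold) {
        if (Double.isNaN(threshold) || threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException(
                "Utilization threshold must be in [0.0, 1.0], got " + threshold);
        }
        return threshold;
    }

    /** Load average is unbounded above, so only a positive value is required. */
    static double requireLoadAverage(double threshold) {
        if (Double.isNaN(threshold) || threshold <= 0.0) {
            throw new IllegalArgumentException(
                "Load average threshold must be greater than 0.0, got " + threshold);
        }
        return threshold;
    }
}
```

With checks like these, a configured value of -42 would be rejected at startup instead of being silently accepted.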
@wrunderwood , @atris , it's a shame that this fine PR was abandoned due to bike shedding. Perhaps 24 months distance can help put it in a new light and get this ball rolling again? I'm also keen on CB on the update side of Solr.
Hi all. I'm reviving this PR. I am going to push a large commit to the PR branch, bringing it up to date with the latest changes related to pluggable breakers. Wrt the split between Wrt backward compatibility for those upgrading from 8.x or 9.x, I'm switching the deprecated Note that this PR may conflict somewhat with #1871 which is still not merged. We'll handle that once one of them is merged.
Will soon need a rebase and force-push to make crave tests happy. But will wait a bit for more review comments, and so that it is easier to view the changes since 2 years ago...
@wrunderwood I pushed a few fixes related to my own recent changes. But I'll leave you in the driver's seat for this PR again, if you want to.
@janhoy Thanks so much for picking this up. I'll take a look at the compromise. This is a tricky decision, changing behavior to match the documentation. Is it a breaking change or not? How many people read the code and figured out it wasn't CPU? Anybody? Our clusters with 300+ shards were configured assuming it was CPU usage (max 100%) until I explained otherwise. I'd like to add a note to the docs about using circuit breakers in a sharded system, because they multiply failures. For example, with 4 shards, if 10% of search requests are short-circuited on all nodes, the end user will see about a 1/3 failure rate. In a sharded system, it is probably worth enabling partial results to avoid that. A future feature would be an option to only check the circuit breakers on the initial external request, not on the distributed requests to shards. That has advantages (no partial failures) and disadvantages (can't reject the portion of the load which is intra-cluster).
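The "about 1/3" figure above can be reproduced with a short calculation. This is a sketch under two assumptions that the comment implies but does not state: each shard rejects independently, and a request fails as a whole if any one shard sub-request is rejected (no partial results). The class name is hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration of how per-shard rejections compound in a distributed query.
final class ShardFailureRate {

    /** Probability that at least one of `shards` independent sub-requests is rejected. */
    static double overallFailureRate(double perShardRejectRate, int shards) {
        return 1.0 - Math.pow(1.0 - perShardRejectRate, shards);
    }

    public static void main(String[] args) {
        // 4 shards, 10% rejected on each: 1 - 0.9^4 = 0.3439
        System.out.println(String.format(Locale.ROOT, "%.4f", overallFailureRate(0.10, 4)));
    }
}
```

With 4 shards and a 10% rejection rate per shard, the result is 1 - 0.9^4 = 0.3439, roughly one request in three.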
Note that the "pluggable" CBs are not yet released, so if we can get this fix in the same release (9.4), users will only ever have seen this version. Wrt earlier users of
Keen to get this in, as it blocks a few other related MRs, have you concluded on your back-compat concerns?
Any news on this @wrunderwood ? If there's no more feedback during this week I'll complete and merge early next week.
Just updated the docs. When circuit breakers are enabled for update requests, how does that work for replicas? Can an NRT replica get out of sync by rejecting a request? It should be OK for TLOG and PULL replicas. If there is an issue for NRT replicas, I'll document that.
I'm a bit confused about 1.0 vs 100%. I believe the return value of the new CPU breaker is wrong, and that we don't catch it because we have no real test, just FakeCPUCircuitBreaker tests.
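To make the 1.0 vs 100% concern concrete, here is a sketch of the kind of unit mismatch being described. It assumes, purely for illustration, that the underlying metric reports a fraction between 0.0 and 1.0 while the configured threshold is a percentage; the thread does not confirm which units the actual code uses. All names are hypothetical.

```java
// Hypothetical illustration of a fraction-vs-percent mismatch; not Solr's implementation.
final class CpuUnits {

    /** Converts a utilization fraction (0.0 to 1.0) into a percentage (0 to 100). */
    static double toPercent(double fraction) {
        return fraction * 100.0;
    }

    /** Correct comparison: both sides are expressed in percent. */
    static boolean isTripped(double cpuFraction, double thresholdPercent) {
        return toPercent(cpuFraction) > thresholdPercent;
    }

    /** Buggy comparison: a fraction compared to a percent threshold almost never trips. */
    static boolean isTrippedNaive(double cpuFraction, double thresholdPercent) {
        return cpuFraction > thresholdPercent;
    }
}
```

A test that feeds a realistic metric value (for example 0.8 against a threshold of 75) would separate the two versions, which a test that stubs the metric in the threshold's own units would not.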
This will be handled by #1930, won't need to handle it here
@wrunderwood Are you planning to incorporate latest review feedback and prepare for merge? I'm trying to get this landed before 9.4.
I think I've fixed the things called out, but I might have missed something. I don't use GitHub very often. Let me know if I left something unresolved. It all looks good to me, thanks for the improvements over my patch.
Thanks. There are two "unresolved conversations" in the PR that you may want to address.
I think that is everything.
Thanks. I'll add the fail-early check from wrunderwood#1, rebase on main and force-push.
Co-authored-by: Jan Høydahl <[email protected]> (cherry picked from commit 51c1a78)
SOLR-15056 add circuit breaker for CPU utilization, use accurate name for load average circuit breaker, update docs and tests
Description
The current CPU load circuit breaker is based on load average instead of CPU utilization. Rename that to an accurate name and create a new circuit breaker based on CPU utilization. Improve documentation for all circuit breakers by linking to the JMX metric they are based on.
Solution
Use existing metrics framework to get a CPU utilization metric. Not all JVMs provide that. Add unit test and documentation.
Rename existing CPU circuit breaker to LoadAverageCircuitBreaker, rename unit test.
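As a rough illustration of why load average is not CPU utilization, the sketch below reads the system load average through the standard `java.lang.management.OperatingSystemMXBean`. It does not use Solr's metrics framework, which the PR relies on; the class name and the `perCore` helper are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical probe; Solr itself reads these values through its metrics framework.
final class LoadAverageProbe {

    /** Load average divided by core count; can exceed 1.0 because load average is unbounded. */
    static double perCore(double loadAverage, int cores) {
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform does not provide it
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else {
            System.out.println("load average = " + load + ", per core = " + perCore(load, cores));
        }
    }
}
```

A load average of 8 on a 4-core machine gives 2.0 per core, a value that has no equivalent on a 0 to 100% utilization scale, which is why a threshold chosen for one metric is misleading when applied to the other.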
Tests
Unit tests based on existing test. I have not run a system test (run up system load during a load benchmark).