Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-15056: add circuit breaker for CPU, fix load circuit breaker #96

Merged
merged 4 commits into from
Sep 20, 2023

Conversation

wrunderwood
Copy link
Contributor

@wrunderwood wrunderwood commented Apr 27, 2021

SOLR-15056 add circuit breaker for CPU utilization, use accurate name for load average circuit breaker, update docs and tests

Description

The current CPU load circuit breaker is based on load average instead of CPU utilization. Rename that to an accurate name and create a new circuit breaker based on CPU utilization. Improved documentation for all circuit breakers by linking to the JMX metric they are based on.

Solution

Use existing metrics framework to get a CPU utilization metric. Not all JVMs provide that. Add unit test and documentation.

Rename existing CPU circuit breaker to LoadAverageCircuitBreaker, rename unit test.

Tests

Unit tests based on existing test. I have not run a system test (run up system load during a load benchmark).

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@dsmiley dsmiley requested a review from atris May 1, 2021 02:30
atris

This comment was marked as resolved.

@wrunderwood

This comment was marked as resolved.

@atris

This comment was marked as resolved.

@cpoerschke

This comment was marked as outdated.

@wrunderwood

This comment was marked as outdated.

@atris

This comment was marked as outdated.

@wrunderwood

This comment was marked as outdated.

@janhoy

This comment was marked as outdated.

janhoy

This comment was marked as outdated.

@wrunderwood

This comment was marked as outdated.

atris

This comment was marked as outdated.

@wrunderwood
Copy link
Contributor Author

Thanks for all the suggestions. I'll do some work on this and come back with a new PR.

  • Use SolrCoreAware.
  • Check input threshold values, the only reason this is usable is that they aren't checked.

@atris
Copy link
Contributor

atris commented Jun 11, 2021

Thanks for all the suggestions. I'll do some work on this and come back with a new PR.

  • Use SolrCoreAware.
  • Check input threshold values, the only reason this is usable is that they aren't checked.

I am confused by your second point -- I thought your initial suggestion was that the threshold should be unbounded because the load average metric is unbounded (which is the case right now).

So what exactly are you proposing to check, please? I agree that checking that the value is greater than zero is valuable, but did you have anything else in mind?

@wrunderwood
Copy link
Contributor Author

For the CPU and memory utilization circuit breakers, the valid range is 0.0 to 1.0. For the load average circuit breaker, the threshold needs to be greater than 0.0. Right now, the config could be "-42" and it would be accepted.

@janhoy
Copy link
Contributor

janhoy commented Jun 2, 2023

@wrunderwood , @atris , it's a shame that this fine PR was abandoned due to bike shedding. Perhaps 24 months distance can help put it in a new light and get this ball rolling again? I'm also keen on CB on the update side of Solr.

@wrunderwood

This comment was marked as outdated.

@janhoy
Copy link
Contributor

janhoy commented Aug 30, 2023

Hi all. I'm reviving this PR. I am going to push a large commit to the PR branch, bringing it up to date with the latest changes related to pluggable breakers.

Wrt the split between LoadAverageCircuitBreaker and CPUCircuitBreaker, I have stated before that I support @wrunderwood and @sigram that they should be separate. The difference is clearly documented in the ref guide.

Wrt backward compatibility for those upgrading from 8.x or 9.x, I'm switching the deprecated CircuitBreakerManager plugin to create an instance of LoadAverageCircuitBreaker when cpuThreshold is configured. That way there is no surprise for old users, while new users can make an active choice when picking a plugin. I am not adding support for the new cpu breaker in CircuitBreakerManager. I hope this is a good compromise.

Note that this PR may conflict somewhat with #1871 which is still not merged. We'll handle that once one of them are merged.

@janhoy
Copy link
Contributor

janhoy commented Aug 30, 2023

Will soon need a rebase and force-push to make crave tests happy. But will wait a bit for more review comments, and so that it is easier to view the changes since 2 years ago...

@janhoy
Copy link
Contributor

janhoy commented Aug 30, 2023

@wrunderwood I pushed a few fixes related to my own recent changes. But I'll leave you in the drivers seat for this PR again, if you want to.

@wrunderwood
Copy link
Contributor Author

@janhoy Thanks so much for picking this up.

I'll take a look at the compromise. This is a tricky decision, changing behavior to match the documentation. Is it a breaking change or not? How many people read the code and figured out it wasn't CPU? Anybody? Our clusters with 300+ shards were configured assuming it was CPU usage (max 100%) until I explained otherwise.

I'd like to add a note to the docs about using circuit breakers in a sharded system, because they multiply failures. For example, with 4 shards, if 10% of search requests are short-circuited on all nodes, the end user will see about a 1/3 failure rate. In a sharded system, it is probably worth enabling partial results to avoid that.

A future feature would be an option to only check the circuit breakers on the initial external request, not on the distributed requests to shards. That has advantages (no partial failures) and disadvantages (can't reject the portion of the load which is intra-cluster).

@janhoy
Copy link
Contributor

janhoy commented Aug 30, 2023

This is a tricky decision, changing behavior to match the documentation. Is it a breaking change or not?

Note that the "pluggable" CBs are not yet released, so if we can get this fix in the same release (9.4), users will only ever have seen this version. Wrt earlier users of CircuitBreakerManager, they have a choice to switch to the new CpuLoad plugin if that's what they want. We can/should I now highlight this in UpgradeNotes for 9.4 in the ref guide..

@janhoy
Copy link
Contributor

janhoy commented Sep 4, 2023

Keen to get this in, as it blocks a few other related MRs, have you concluded on your back-compat concerns?

@janhoy
Copy link
Contributor

janhoy commented Sep 12, 2023

Any news on this @wrunderwood ? If there's no more feedback during this week I'll complete and merge early next week.

@wrunderwood
Copy link
Contributor Author

Just updated the docs.

When circuit breakers are enabled for update requests, how does that work for replicas? Can an NRT replica get out of sync by rejecting a request? It should be OK for TLOG and PULL replicas. If there is an issue for NRT replicas, I'll document that.

Copy link
Contributor

@janhoy janhoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about 1.0 vs 100%. I believe the return value of the new CPU breaker is wrong, and that we don't catch it by not having any real test, just FakeCPUCircutiBreaker tests.

@janhoy
Copy link
Contributor

janhoy commented Sep 19, 2023

Just updated the docs.

When circuit breakers are enabled for update requests, how does that work for replicas? Can an NRT replica get out of sync by rejecting a request? It should be OK for TLOG and PULL replicas. If there is an issue for NRT replicas, I'll document that.

This will be handled by #1930, won't need to handle it here

@janhoy
Copy link
Contributor

janhoy commented Sep 19, 2023

@wrunderwood Are you planning to incorporate latest review feedback and prepare for merge? I try to get this landed before 9.4.

@wrunderwood
Copy link
Contributor Author

I think I've fixed the things called out, but I might have missed something. I don't use GitHub very often.

Let me know if I left something unresolved. It all looks good to me, thanks of the improvements over my patch.

@janhoy
Copy link
Contributor

janhoy commented Sep 19, 2023

Let me know if I left something unresolved. It all looks good to me, thanks of the improvements over my patch.

Thanks. There are two "unresolved conversations" in the PR that you may want to address.

@wrunderwood
Copy link
Contributor Author

I think that is everything.

@janhoy
Copy link
Contributor

janhoy commented Sep 20, 2023

Thanks. I'll add the fail-early check from wrunderwood#1, rebase on main and force-push.

@janhoy janhoy merged commit 51c1a78 into apache:main Sep 20, 2023
1 of 2 checks passed
janhoy pushed a commit that referenced this pull request Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants