Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor degrade hierarchy with new circuit breaker mechanism and improve strategy #1490

Merged
merged 3 commits into from
Jul 30, 2020

Conversation

sczyh30
Copy link
Member

@sczyh30 sczyh30 commented May 18, 2020

Describe what this PR does / why we need it

Refactor legacy degrade hierarchy with new circuit breaker mechanism and improve strategy.

Does this pull request fix one issue?

Resolves #1421, #1032, #951, #154, #308, #56

Describe how you did it

  • Add CircuitBreaker abstraction (with half-open state) and add circuit breaker state change event observer support. Now Sentinel follows the canonical circuit breaker pattern (with some improvements).
  • Make statistics of each rule dependent (to support arbitrary statistic interval).
  • Improve RTT-based circuit breaking strategy (avg RT → slow request ratio).
  • Add simple "probe" mechanism (aka. half-open).
  • Refactor mechanism of metric recording and state change handling for circuit breakers: record RT and error when requests have completed (i.e. onExit, based on Refactor the mechanism of recording error (on completed) #1420).

Describe how to verify it

Run the test cases and demo.

Special notes for reviews

This PR contains internal breaking changes (and some behavioral changes for RT-based circuit breaking).

@sczyh30 sczyh30 added area/circuit-breaking Issues or PRs related to circuit breaking kind/feature Category issues or prs related to feature request. kind/refactor Issue related to functional refactoring. labels May 18, 2020
@sczyh30 sczyh30 added this to the 1.8.0 milestone May 18, 2020
@sczyh30 sczyh30 added the to-review To review label May 18, 2020
@sczyh30 sczyh30 requested review from linlinisme, cdfive, jasonjoo2010 and a team May 18, 2020 15:28
@sczyh30 sczyh30 marked this pull request as ready for review May 18, 2020 15:32
@codecov-commenter
Copy link

codecov-commenter commented May 19, 2020

Codecov Report

Merging #1490 into master will increase coverage by 1.11%.
The diff coverage is 75.41%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1490      +/-   ##
============================================
+ Coverage     43.96%   45.08%   +1.11%     
- Complexity     1719     1786      +67     
============================================
  Files           376      382       +6     
  Lines         10660    10854     +194     
  Branches       1418     1443      +25     
============================================
+ Hits           4687     4893     +206     
+ Misses         5408     5373      -35     
- Partials        565      588      +23     
Impacted Files Coverage Δ Complexity Δ
...degrade/circuitbreaker/CircuitBreakerStrategy.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ntinel/slots/block/degrade/DegradeRuleManager.java 46.08% <55.88%> (+24.21%) 16.00 <15.00> (+7.00)
...degrade/circuitbreaker/AbstractCircuitBreaker.java 71.42% <71.42%> (ø) 19.00 <19.00> (?)
.../csp/sentinel/slots/block/degrade/DegradeSlot.java 83.87% <81.48%> (-16.13%) 10.00 <9.00> (+7.00) ⬇️
.../csp/sentinel/slots/block/degrade/DegradeRule.java 86.84% <87.50%> (+15.04%) 19.00 <7.00> (-5.00) ⬆️
...ade/circuitbreaker/ResponseTimeCircuitBreaker.java 88.88% <88.88%> (ø) 11.00 <11.00> (?)
...egrade/circuitbreaker/ExceptionCircuitBreaker.java 89.47% <89.47%> (ø) 10.00 <10.00> (?)
.../degrade/circuitbreaker/EventObserverRegistry.java 90.90% <90.90%> (ø) 5.00 <5.00> (?)
...s/block/degrade/circuitbreaker/CircuitBreaker.java 100.00% <100.00%> (ø) 0.00 <0.00> (?)
...tinel/slots/block/flow/param/ParamFlowChecker.java 53.02% <0.00%> (-2.69%) 28.00% <0.00%> (-1.00%)
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46076b3...4c043ed. Read the comment docs.

Comment on lines 242 to -201
for (DegradeRule rule : list) {
if (!isValidRule(rule)) {
RecordLog.warn(
"[DegradeRuleManager] Ignoring invalid degrade rule when loading new rules: " + rule);
RecordLog.warn("[DegradeRuleManager] Ignoring invalid rule when loading new rules: " + rule);
continue;
}

if (StringUtil.isBlank(rule.getLimitApp())) {
rule.setLimitApp(RuleConstant.LIMIT_APP_DEFAULT);
}

String identity = rule.getResource();
Set<DegradeRule> ruleSet = newRuleMap.get(identity);
if (ruleSet == null) {
ruleSet = new HashSet<>();
newRuleMap.put(identity, ruleSet);
CircuitBreaker cb = getExistingSameCbOrNew(rule);
if (cb == null) {
RecordLog.warn("[DegradeRuleManager] Unknown circuit breaking strategy, ignoring: " + rule);
continue;
}
ruleSet.add(rule);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside getExistingSameCbOrNew method, the design of reuse the circuit breaker if the rule remains unchanged.is smart. I tried with DegradeRuleManager.loadRules in different places, it works. A small question is that, if using DegradeRuleManager.loadRules(rules) first time, and rules have two same DegradeRule , there will be two CircuitBreaker created, since in the first time, the static circuitBreakers variable is null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be improved.

return;
}

if (curEntry.getBlockError() == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This judgement seems can be removed since Line#63.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's here in case the blockError was modified.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really concurrent? And the time lapse is really tiny here.

@CodingSinger
Copy link
Contributor

where is the proposal for this pr?

@sczyh30 sczyh30 requested a review from zhaoyuguang May 22, 2020 07:39
@sczyh30
Copy link
Member Author

sczyh30 commented Jun 5, 2020

where is the proposal for this pr?

I'll link the proposal here soon.

@sczyh30 sczyh30 added the size/XXL Indicate a PR that changes 1000+ lines. label Jun 5, 2020
Comment on lines +64 to +74
public boolean tryPass() {
// Template implementation.
if (currentState.get() == State.CLOSED) {
return true;
}
if (currentState.get() == State.OPEN) {
// For half-open state we allow a request for trial.
return retryTimeoutArrived() && fromOpenToHalfOpen();
}
return false;
}
Copy link
Contributor

@wavesZh wavesZh Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If thare are two breakers and curState is open, one breaker tryPass and change state to half open, but other breaker tryPass return false, it is right that no request be allowed. but the below requests will not be allowed because two breakers tryPass can't both be true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sczyh30 Any idea for the scene?Was I wrong?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok for this scenario.

B1(open) & B2(open) -> no request at all
B1(open -> half) & B2(open) -> no request at all, but B1 may transform to closed leaving B2 alone, that is:
B1(closed) & B2(open)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, i cannot find how to change half-open to close.
Method fromHalfOpenToClose is invoked in DegradeSlot#exist when no BlockingException. But if B1(open -> half) & B2(open), the follow requests check by CircuitBreaker#tryPass must return false then throw DegradeException.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requests is blocked by tryPass() in abstract parent and the state of breakers are maintained in different implementations when requests complete (post-request).

So they are not collided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public boolean tryPass() {
        // Template implementation.
        if (currentState.get() == State.CLOSED) {
            return true;
        }
        if (currentState.get() == State.OPEN) {
            // For half-open state we allow a request for trial.
            return retryTimeoutArrived() && fromOpenToHalfOpen();
        }
        return false;
    }

我已经自我怀疑了🤨😢 是我理解错了吗??

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

```java
public boolean tryPass() {
        // Template implementation.
        if (currentState.get() == State.CLOSED) {
            return true;
        }
        if (currentState.get() == State.OPEN) {
            // For half-open state we allow a request for trial.
            return retryTimeoutArrived() && fromOpenToHalfOpen();
        }
        return false;
    }

我已经自我怀疑了🤨😢 是我理解错了吗??

emmm.........仔细想了下,你考虑的问题的确存在,而且貌似还存在其它限流规则的问题,因为这里并没有判断是不是DegradeException,待我下午拿Test测一下

Copy link
Collaborator

@jasonjoo2010 jasonjoo2010 Jul 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sczyh30 Hi, Eric

Here're some boundaries which lead to new issues while we should discuss more how to make the recovery logic better:

  1. Multiple degrading rules for same resource(Especially after introducing prefix/postfix matching)
  2. Another rule after degrading rule in half-open state(Though there's no such kind of rule but it's possible in future)

These will lead to the incorrect workflow:

entry: R0(open -> half) -> R1(block)
exit: XXXBlockException -> skip the detection for stalling half state

For straightforward patch i think the onComplete method is intentioned to be call after a successfully transforming into half. We should make the exiting logic more simple and clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasonjoo2010 @wavesZh Yes, it's a fatal bug indeed and should be resolved immediately. However we need a better temporary workaround for this (for the half-open case). I'll merge this PR first and we may discuss it in a new issue. It's a common problem for all rules that have their own metrics and rely on onComplete.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1638

@CLAassistant
Copy link

CLAassistant commented Jun 12, 2020

CLA assistant check
All committers have signed the CLA.

return;
}

if (curEntry.getBlockError() == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really concurrent? And the time lapse is really tiny here.

*
* @since 1.7.0
*/
private int rtSlowRequestAmount = RuleConstant.DEGRADE_DEFAULT_SLOW_REQUEST_AMOUNT;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a missing property in new implementation. Is that acceptable in compatibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could bring breaking changes (and break the SEMVER spec) indeed. In convention, we should mark it deprecated and remove it until 2.x. But actually it's rarely used and often regarded as a "hidden" attribute (not even appeared in the dashboard)...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right. But i just afraid the users who integrating manually suffering broken compatibility.😂 We can make decision here whether doing it carefully this time. Decide and record it and it would be ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO we could drop it (the legacy bad design) and add notes in relevant documents :)

@Override
public void exit(Context context, ResourceWrapper resourceWrapper, int count, Object... args) {
fireExit(context, resourceWrapper, count, args);
public void exit(Context context, ResourceWrapper r, int count, Object... args) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here's a problem that if something went wrong which may cause RT down to more than 10s (100ms for normal) . Our breaker will cut the traffic after 10s. Maybe we could improve it (or as a new issue and do it in future) that unfinished entries could be recorded as statIntervalMs when slidingCounter is reset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's necessary to improve the scenarios regarding slow in-flight requests. See #1405

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh they are not the same issue i think.

How to calculate a more reasonable RT is one thing while how to make a more reasonable degradation is quite another thing. The primary reason is that the calculating window is separated from general RT, isn't?

And sure we can improve it later but it's more important than making it more reasonable calculating RT. Maybe another time window is required to make it under tracing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could open a new issue to discuss it.

…rove strategy

* Add `CircuitBreaker` abstraction (with half-open state) and add circuit breaker state change event observer support.
* Improve circuit breaking strategy (avg RT → slow request ratio) and make statistics of each rule dependent (to support arbitrary statistic interval).
* Add simple "trial" mechanism (aka. half-open).
* Refactor mechanism of metric recording and state change handling for circuit breakers: record RT and error when requests have completed (i.e. `onExit`, based on alibaba#1420).

Signed-off-by: Eric Zhao <[email protected]>
@sczyh30
Copy link
Member Author

sczyh30 commented Jul 28, 2020

@jasonjoo2010 I've updated the PR, please check.

Note: please use "rebase and merge" for this PR instead of "squash and merge".

Copy link
Collaborator

@jasonjoo2010 jasonjoo2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok for me now. @wavesZh @cdfive Any further problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/circuit-breaking Issues or PRs related to circuit breaking kind/feature Category issues or prs related to feature request. kind/refactor Issue related to functional refactoring. size/XXL Indicate a PR that changes 1000+ lines.
Projects
None yet
7 participants