fix: sealing: Stop recovery attempts after fault #8014
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #8014      +/-   ##
==========================================
- Coverage   39.25%   39.16%   -0.09%
==========================================
  Files         660      660
  Lines       71436    71459      +23
==========================================
- Hits        28042    27988      -54
- Misses      38569    38640      +71
- Partials     4825     4831       +6
==========================================
Continue to review full report at Codecov.
+1 on the fun alternate idea: that ancient TODO about exponential backoff would address this particular bug too. We really should do a TODO batch sometime soon, haha, all our TODOs make so much sense to do! What's in this PR makes sense to me, though I think for #8011 we should also at least add the following for v1.14.0:
I'm not enthusiastic about adding another patch to cover this edge case. It piles on more complexity and an expensive API call for every sector deal match. It would be much better to focus our efforts on reassigning pieces after the upgrade is aborted or precommit fails.
This I like; I'll add it to the PR.
Related Issues
#8011
Proposed Changes
This addresses the major concern behind #8011 -- too many resources being used when retrying submission of a replica update message after a fault.
The best solution available now is to check for the fault condition after the replica update submission has failed.
It is understandable to want to stop including faulty sectors at deal inclusion, as the title of #8011 suggests, but that is weaker: a fault could happen immediately after the last deal was included, and the scheduler would end up in the same bad state.
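A minimal sketch of that check, assuming the sealing failure handler can reach a node API shaped like lotus's StateMinerFaults; the faultAPI interface and sectorIsFaulty helper are hypothetical names for illustration, not necessarily the symbols this PR adds:

```go
package sealing

import (
	"context"
	"fmt"

	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/go-bitfield"
	"github.com/filecoin-project/go-state-types/abi"

	"github.com/filecoin-project/lotus/chain/types"
)

// faultAPI is the narrow slice of node API this sketch needs.
// StateMinerFaults is a real lotus FullNode method; the interface
// name here is illustrative.
type faultAPI interface {
	StateMinerFaults(ctx context.Context, maddr address.Address, tsk types.TipSetKey) (bitfield.BitField, error)
}

// sectorIsFaulty reports whether the given sector currently appears in the
// miner's fault set, queried at the current head (types.EmptyTSK).
func sectorIsFaulty(ctx context.Context, api faultAPI, maddr address.Address, sid abi.SectorNumber) (bool, error) {
	faults, err := api.StateMinerFaults(ctx, maddr, types.EmptyTSK)
	if err != nil {
		return false, fmt.Errorf("checking miner faults: %w", err)
	}
	return faults.IsSet(uint64(sid))
}
```

If this returns true after a failed submission, the failure handler could abort the upgrade instead of queueing another retry, which is the behaviour the PR title describes.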
The fully correct solution, which we should work towards in the future, is to interrupt the scheduler with an FSM event triggered by listening to the chain for faults (and expiration extensions, etc.) as they happen. But that is too much refactoring for right now.
Fun alternate idea: this ancient TODO about exponential backoff would also address this particular bug; see the sketch below.
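A rough sketch of what such a backoff could look like; the function name and constants are illustrative, not values taken from the TODO or from this PR:

```go
package sealing

import "time"

// retryDelay returns an exponentially growing wait before the next
// submission attempt, capped so a persistently failing (e.g. faulted)
// sector cannot keep hammering the node with retries.
func retryDelay(attempt int) time.Duration {
	const (
		base     = 1 * time.Minute
		maxDelay = 4 * time.Hour
	)
	d := base << uint(attempt)   // 1m, 2m, 4m, 8m, ...
	if d <= 0 || d > maxDelay {  // d <= 0 guards against shift overflow
		return maxDelay
	}
	return d
}
```

The cap matters: without it, a sector that has genuinely faulted would keep retrying indefinitely, which is the resource drain #8011 describes.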
Additional Info
Checklist
Before you mark the PR ready for review, please make sure that:
The PR title is in the form of <PR type>: <area>: <change being made>
example: fix: mempool: Introduce a cache for valid signatures
PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, perf, refactor, revert, style, test
area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet, deps