fix: sealing: Stop recovery attempts after fault #8014

Merged
merged 4 commits into from Feb 8, 2022
Conversation

ZenGround0 (Contributor) commented Feb 1, 2022

Related Issues

#8011

Proposed Changes

This addresses the major concern behind #8011 -- too many resources being used when retrying submission of a replica update message after a fault.

The best solution available now is to check for the fault condition after submit replica update has failed.
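
For illustration, a minimal sketch of that check; ChainAPI, ActiveSectors, and afterSubmitReplicaUpdateFailed are made-up names standing in for the real storage-sealing plumbing, not the code in this PR:

```go
package sketch

import (
	"context"
	"fmt"
)

// SectorNumber and ChainAPI are stand-ins for the real storage-sealing types.
type SectorNumber uint64

type ChainAPI interface {
	// ActiveSectors returns the set of sectors that are live (not faulted or
	// expired) for the given miner address.
	ActiveSectors(ctx context.Context, miner string) (map[SectorNumber]bool, error)
}

// afterSubmitReplicaUpdateFailed sketches the fix: after a failed replica
// update submission, check whether the sector is still active and only retry
// if it is; a faulted sector aborts the upgrade instead of looping forever.
func afterSubmitReplicaUpdateFailed(ctx context.Context, api ChainAPI, miner string, sector SectorNumber) (retry bool, err error) {
	active, err := api.ActiveSectors(ctx, miner)
	if err != nil {
		return false, fmt.Errorf("checking active sectors: %w", err)
	}
	if !active[sector] {
		// The sector was faulted (or expired) after the upgrade started:
		// stop recovery attempts rather than burning resources on retries.
		return false, nil
	}
	return true, nil
}
```

The point is simply that a fault discovered here ends the upgrade rather than feeding another retry.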

It is understandable to want to stop assigning deals to faulty sectors, as the title of #8011 suggests, but that is a weaker fix: a fault could happen immediately after the last deal was included, leaving the scheduler in the same bad state.

The fully correct approach, which we should work towards in the future, is to interrupt the scheduler with an FSM event triggered by listening to the chain for faults (and expirations, extensions, etc.). But that is too much refactoring for right now.
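
For a rough idea of what that could look like (every name here is hypothetical; the real FSM lives in extern/storage-sealing):

```go
package sketch

import (
	"context"
	"log"
)

// SectorFaulted is a made-up FSM event meaning "this sector was faulted on chain".
type SectorFaulted struct{ Number uint64 }

// FaultNotifier is an assumed chain-watching interface that reports faulted
// sector numbers as new tipsets arrive.
type FaultNotifier interface {
	Faults(ctx context.Context) (<-chan uint64, error)
}

// watchFaults forwards on-chain fault notifications into the sealing FSM as
// events, so the state machine can react immediately instead of discovering
// the fault on a failed retry.
func watchFaults(ctx context.Context, notifier FaultNotifier, send func(sector uint64, event interface{}) error) error {
	faults, err := notifier.Faults(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case num, ok := <-faults:
			if !ok {
				return nil
			}
			if err := send(num, SectorFaulted{Number: num}); err != nil {
				log.Printf("sending fault event for sector %d: %v", num, err)
			}
		}
	}
}
```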

Fun alternate idea: this ancient todo about exponential backoff would address this particular bug too.
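
For reference, the backoff idea amounts to something like this (a generic sketch, not the code the todo refers to):

```go
package sketch

import (
	"context"
	"time"
)

// retryWithBackoff retries submit with exponentially growing delays instead of
// a fixed interval, capped at a maximum number of attempts, so a permanently
// failing sector stops consuming resources quickly.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration, submit func() error) error {
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = submit(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2 // double the wait after each failure
	}
	return err
}
```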

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • All commits have a clear commit message.
  • The PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, perf, refactor, revert, style, test
    • area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet, deps
  • This PR has tests for new functionality or change in behaviour
  • If new user-facing features are introduced, clear usage guidelines and / or documentation updates should be included in https://lotus.filecoin.io or Discussion Tutorials.
  • CI is green

@ZenGround0 ZenGround0 requested a review from a team as a code owner February 1, 2022 06:18
@ZenGround0 ZenGround0 changed the title Stop recovery attempts after fault fix: Stop recovery attempts after fault Feb 1, 2022
@ZenGround0 ZenGround0 changed the title fix: Stop recovery attempts after fault fix: sealing: Stop recovery attempts after fault Feb 1, 2022
codecov bot commented Feb 1, 2022

Codecov Report

Merging #8014 (47ffcee) into master (51643ca) will decrease coverage by 0.08%.
The diff coverage is 38.70%.

❗ Current head 47ffcee differs from pull request most recent head 1ab2744. Consider uploading reports for the commit 1ab2744 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8014      +/-   ##
==========================================
- Coverage   39.25%   39.16%   -0.09%     
==========================================
  Files         660      660              
  Lines       71436    71459      +23     
==========================================
- Hits        28042    27988      -54     
- Misses      38569    38640      +71     
- Partials     4825     4831       +6     
Impacted Files Coverage Δ
extern/storage-sealing/states_failed.go 3.64% <14.28%> (+0.25%) ⬆️
extern/storage-sealing/states_replica_update.go 39.07% <18.18%> (-0.22%) ⬇️
extern/storage-sealing/upgrade_queue.go 34.02% <69.23%> (+1.41%) ⬆️
markets/loggers/loggers.go 88.88% <0.00%> (-11.12%) ⬇️
chain/actors/builtin/miner/diff.go 48.52% <0.00%> (-10.30%) ⬇️
chain/events/observer.go 71.64% <0.00%> (-6.72%) ⬇️
chain/sync_manager.go 67.08% <0.00%> (-5.60%) ⬇️
chain/exchange/peer_tracker.go 66.66% <0.00%> (-4.31%) ⬇️
storage/wdpost_sched.go 77.45% <0.00%> (-3.93%) ⬇️
chain/store/store.go 63.00% <0.00%> (-2.50%) ⬇️
... and 14 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51643ca...1ab2744. Read the comment docs.

jennijuju (Member) commented Feb 2, 2022

+1 on the fun alternate idea: "this ancient todo about exponential backoff would address this particular bug too." We really should do a TODO batch sometime soon, hahah, all our TODOs make so much sense to do!

What's in this PR makes sense to me.

Though I think for #8011 we should also at least add the following for v1.14.0:

  • don't add pieces to an inactive sector and find a good sector instead: this can prevent a good amount of data transfer work + PSD from going to waste (since we don't reassign pieces atm)
  • do a sector activeness check before PRU2 to save GPU work

ZenGround0 (Contributor, Author) commented
> don't add pieces to an inactive sector and find a good sector instead: this can prevent a good amount of data transfer work + PSD from going to waste (since we don't reassign pieces atm)

I'm not enthusiastic about adding another patch to cover this edge case. It piles more complexity and an expensive API call onto every sector-deal match. It would be much better to focus efforts on reassigning pieces after the upgrade is aborted or precommit fails.

> do a sector activeness check before PRU2 to save GPU work

This I like; I'll add it to the PR.
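
As a rough sketch of what that guard could look like (the function and callback names are hypothetical, not what this PR adds):

```go
package sketch

import (
	"context"
	"fmt"
)

// guardProveReplicaUpdate2 confirms the sector is still active on chain right
// before the GPU-heavy ProveReplicaUpdate2 step and aborts the upgrade
// otherwise. Both callbacks are stand-ins for the real sealing FSM plumbing.
func guardProveReplicaUpdate2(ctx context.Context, sectorActive func(context.Context) (bool, error), prove func(context.Context) error) error {
	ok, err := sectorActive(ctx)
	if err != nil {
		return fmt.Errorf("checking sector activeness: %w", err)
	}
	if !ok {
		return fmt.Errorf("sector no longer active, skipping ProveReplicaUpdate2")
	}
	return prove(ctx)
}
```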

Review thread on extern/storage-sealing/upgrade_queue.go (outdated, resolved)