fix: sealing: Stop recovery attempts after fault #8014

Merged
merged 4 commits into from Feb 8, 2022
Conversation

ZenGround0 (Contributor) commented Feb 1, 2022

Related Issues

#8011

Proposed Changes

This addresses the major concern behind #8011 -- too many resources being used when retrying submission of a replica update message after a fault.

The best solution available now is to check for the fault condition after submit replica update has failed.
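
For illustration, a minimal sketch of that check; ChainAPI, ActiveSectors, and afterSubmitReplicaUpdateFailed are made-up names standing in for the real storage-sealing plumbing, not the code in this PR:

```go
package sketch

import (
	"context"
	"fmt"
)

// SectorNumber and ChainAPI are stand-ins for the real storage-sealing types.
type SectorNumber uint64

type ChainAPI interface {
	// ActiveSectors returns the set of sectors that are live (not faulted or
	// expired) for the given miner address.
	ActiveSectors(ctx context.Context, miner string) (map[SectorNumber]bool, error)
}

// afterSubmitReplicaUpdateFailed sketches the fix: after a failed replica
// update submission, check whether the sector is still active and only retry
// if it is; a faulted sector aborts the upgrade instead of looping forever.
func afterSubmitReplicaUpdateFailed(ctx context.Context, api ChainAPI, miner string, sector SectorNumber) (retry bool, err error) {
	active, err := api.ActiveSectors(ctx, miner)
	if err != nil {
		return false, fmt.Errorf("checking active sectors: %w", err)
	}
	if !active[sector] {
		// The sector was faulted (or expired) after the upgrade started:
		// stop recovery attempts rather than burning resources on retries.
		return false, nil
	}
	return true, nil
}
```

The point is simply that a fault discovered here ends the upgrade rather than feeding another retry.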

It is understandable to want to stop assigning deals to faulty sectors, as the title of #8011 suggests, but that is a weaker fix: a fault could happen immediately after the last deal was included, leaving the scheduler in the same bad state.

The fully correct approach, which we should work towards in the future, is to interrupt the scheduler with an FSM event triggered by listening to the chain for faults (and expirations, extensions, etc.). But that is too much refactoring for right now.
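
For a rough idea of what that could look like (every name here is hypothetical; the real FSM lives in extern/storage-sealing):

```go
package sketch

import (
	"context"
	"log"
)

// SectorFaulted is a made-up FSM event meaning "this sector was faulted on chain".
type SectorFaulted struct{ Number uint64 }

// FaultNotifier is an assumed chain-watching interface that reports faulted
// sector numbers as new tipsets arrive.
type FaultNotifier interface {
	Faults(ctx context.Context) (<-chan uint64, error)
}

// watchFaults forwards on-chain fault notifications into the sealing FSM as
// events, so the state machine can react immediately instead of discovering
// the fault on a failed retry.
func watchFaults(ctx context.Context, notifier FaultNotifier, send func(sector uint64, event interface{}) error) error {
	faults, err := notifier.Faults(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case num, ok := <-faults:
			if !ok {
				return nil
			}
			if err := send(num, SectorFaulted{Number: num}); err != nil {
				log.Printf("sending fault event for sector %d: %v", num, err)
			}
		}
	}
}
```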

Fun alternate idea: this ancient todo about exponential backoff would address this particular bug too.
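
For reference, the backoff idea amounts to something like this (a generic sketch, not the code the todo refers to):

```go
package sketch

import (
	"context"
	"time"
)

// retryWithBackoff retries submit with exponentially growing delays instead of
// a fixed interval, capped at a maximum number of attempts, so a permanently
// failing sector stops consuming resources quickly.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration, submit func() error) error {
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = submit(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2 // double the wait after each failure
	}
	return err
}
```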

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • All commits have a clear commit message.
  • The PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, perf, refactor, revert, style, test
    • area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet, deps
  • This PR has tests for new functionality or change in behaviour
  • If new user-facing features are introduced, clear usage guidelines and / or documentation updates should be included in https://lotus.filecoin.io or Discussion Tutorials.
  • CI is green

@ZenGround0 ZenGround0 requested a review from a team as a code owner February 1, 2022 06:18
@ZenGround0 ZenGround0 changed the title Stop recovery attempts after fault fix: Stop recovery attempts after fault Feb 1, 2022
@ZenGround0 ZenGround0 changed the title fix: Stop recovery attempts after fault fix: sealing: Stop recovery attempts after fault Feb 1, 2022
codecov bot commented Feb 1, 2022

Codecov Report

Merging #8014 (47ffcee) into master (51643ca) will decrease coverage by 0.08%.
The diff coverage is 38.70%.

❗ Current head 47ffcee differs from pull request most recent head 1ab2744. Consider uploading reports for the commit 1ab2744 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8014      +/-   ##
==========================================
- Coverage   39.25%   39.16%   -0.09%     
==========================================
  Files         660      660              
  Lines       71436    71459      +23     
==========================================
- Hits        28042    27988      -54     
- Misses      38569    38640      +71     
- Partials     4825     4831       +6     
Impacted Files Coverage Δ
extern/storage-sealing/states_failed.go 3.64% <14.28%> (+0.25%) ⬆️
extern/storage-sealing/states_replica_update.go 39.07% <18.18%> (-0.22%) ⬇️
extern/storage-sealing/upgrade_queue.go 34.02% <69.23%> (+1.41%) ⬆️
markets/loggers/loggers.go 88.88% <0.00%> (-11.12%) ⬇️
chain/actors/builtin/miner/diff.go 48.52% <0.00%> (-10.30%) ⬇️
chain/events/observer.go 71.64% <0.00%> (-6.72%) ⬇️
chain/sync_manager.go 67.08% <0.00%> (-5.60%) ⬇️
chain/exchange/peer_tracker.go 66.66% <0.00%> (-4.31%) ⬇️
storage/wdpost_sched.go 77.45% <0.00%> (-3.93%) ⬇️
chain/store/store.go 63.00% <0.00%> (-2.50%) ⬇️
... and 14 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51643ca...1ab2744. Read the comment docs.

jennijuju (Member) commented Feb 2, 2022

+1 on the fun alternate idea: "this ancient todo about exponential backoff would address this particular bug too." We really should do a TODO batch sometime soon, hahah, all our TODOs make so much sense to do!

What's in this PR makes sense to me.

Though I think for #8011 we should also at least add the following for v1.14.0:

  • don't add pieces to an inactive sector and find a good sector instead: this can prevent a good amount of data transfer work + PSD from going to waste (since we don't reassign pieces atm)
  • do a sector activeness check before PRU2 to save GPU work

ZenGround0 (Contributor, Author) commented
> don't add pieces to an inactive sector and find a good sector instead: this can prevent a good amount of data transfer work + PSD from going to waste (since we don't reassign pieces atm)

I'm not enthusiastic about adding another patch to cover this edge case. It piles more complexity and an expensive API call onto every sector-deal match. It would be much better to focus efforts on reassigning pieces after the upgrade is aborted or precommit fails.

> do a sector activeness check before PRU2 to save GPU work

This I like; I'll add it to the PR.
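
As a rough sketch of what that guard could look like (the function and callback names are hypothetical, not what this PR adds):

```go
package sketch

import (
	"context"
	"fmt"
)

// guardProveReplicaUpdate2 confirms the sector is still active on chain right
// before the GPU-heavy ProveReplicaUpdate2 step and aborts the upgrade
// otherwise. Both callbacks are stand-ins for the real sealing FSM plumbing.
func guardProveReplicaUpdate2(ctx context.Context, sectorActive func(context.Context) (bool, error), prove func(context.Context) error) error {
	ok, err := sectorActive(ctx)
	if err != nil {
		return fmt.Errorf("checking sector activeness: %w", err)
	}
	if !ok {
		return fmt.Errorf("sector no longer active, skipping ProveReplicaUpdate2")
	}
	return prove(ctx)
}
```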

Review thread on extern/storage-sealing/upgrade_queue.go (outdated, resolved)