
Snap-up sector stalled in FinalizeReplicaUpdate, causing WindowPoSt error and 100% loss of power in partition #8148

Closed
benjaminh83 opened this issue Feb 18, 2022 · 3 comments
Labels: P1 (P1: Must be resolved), SnapDeals

Comments

@benjaminh83

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus-related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.14.0-rc7+calibnet+git.c217ccf54+api1.3.0
Local: lotus-miner version 1.14.0-rc7+calibnet+git.c217ccf54

Describe the Bug

I made a snap-up sector that hit an error while the updated replica was being transferred back to the miner. This could happen for various reasons, but the point is that it did not succeed and the sector (sector 3) is now stuck in FinalizeReplicaUpdate.

$ lotus-miner sectors list
ID  State                  OnChain  Active  Expiration                    Deals  DealWeight  
0   Proving                YES      NO      2223840 (in 1 year 23 weeks)  CC                 
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
1   Proving                YES      NO      2225996 (in 1 year 23 weeks)  1      2.703GiB    
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
2   Proving                YES      NO      2228336 (in 1 year 23 weeks)  1      2.7GiB      
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
3   FinalizeReplicaUpdate  YES      NO      2229509 (in 1 year 23 weeks)  1      5.426GiB    
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
4   Proving                YES      NO      2250128 (in 1 year 24 weeks)  CC                 
  RecoveryTimeout: 824246 (in 5 weeks 6 days)

It is impossible to make any changes to the sector state, or even to abandon the snap-up:

$ lotus-miner sectors abort-upgrade 3
ERROR: normal shutdown of state machine
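
For completeness, the sector's event history can also be dumped with the status subcommand (the --log flag should be available in this release; output omitted here):

$ lotus-miner sectors status --log 3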

The result of this failed snap-up sector is a WindowPoSt failure, and it is a failure I cannot seem to escape!

I find this rather concerning, as it could mean faulting a full partition. Good thing I'm on the calibration network, because my miner has currently lost all of its power and is getting slashed.
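
The per-sector provability check can also be run by hand for the affected deadline (deadline index 0 in the logs below); this assumes the proving check subcommand available in this release:

$ lotus-miner proving check 0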

Logging Information

2022-02-18T05:29:32.042+0100    WARN    storageminer    storage/wdpost_run.go:227       Checked sectors {"checked": 5, "good": 5}
2022-02-18T05:29:32.043+0100    INFO    storageminer    storage/wdpost_run.go:626       running window post     {"chain-random": "udN22rXlJ97KH9kgjhe7J8eI4nOY/wISOsmXMi9SnkY=", "deadline": {"CurrentEpoch":703259,"PeriodStart":703227,"Index":0,"Open":703227,"Close":703287,"Challenge":703207,"FaultCutoff":703157,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "703259", "skipped": 0}
2022-02-18T05:29:32.043 INFO filcrypto::proofs::api > generate_window_post: start
2022-02-18T05:29:32.043 INFO filcrypto::proofs::api > generate_window_post: finish
2022-02-18T05:29:32.043+0100    INFO    storageminer    storage/wdpost_run.go:646       computing window post   {"batch": 0, "elapsed": 0.000314858}
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:648       error generating window post: could not read from path="p_aux"
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:105       runPoStCycle failed: running window post failed:
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:49        Got err running window post failed:
2022-02-18T05:29:32.043+0100    WARN    storageminer    storage/wdpost_changehandler.go:255     Aborted window post Proving (Deadline: &{CurrentEpoch:703259 PeriodStart:703227 Index:0 Open:703227 Close:703287 Challenge:703207 FaultCutoff:703157 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})
2022-02-18T05:29:36.006+0100    INFO    miner   miner/miner.go:479      completed mineOne       {"tookMilliseconds": 2, "forRound": 703260, "baseEpoch": 703259, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 0, "lookbackEpochs": 900, "networkPowerAtLookback": "2419166099275776", "minerPowerAtLookback": "0", "isEligible": false, "isWinner": false, "error": null}
2022-02-18T05:29:59.698+0100    WARN    storageminer    storage/wdpost_run.go:227       Checked sectors {"checked": 5, "good": 5}
2022-02-18T05:29:59.698+0100    INFO    storageminer    storage/wdpost_run.go:626       running window post     {"chain-random": "udN22rXlJ97KH9kgjhe7J8eI4nOY/wISOsmXMi9SnkY=", "deadline": {"CurrentEpoch":703260,"PeriodStart":703227,"Index":0,"Open":703227,"Close":703287,"Challenge":703207,"FaultCutoff":703157,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "703260", "skipped": 0}
2022-02-18T05:29:59.699 INFO filcrypto::proofs::api > generate_window_post: start
2022-02-18T05:29:59.699 INFO filcrypto::proofs::api > generate_window_post: finish
2022-02-18T05:29:59.699+0100    INFO    storageminer    storage/wdpost_run.go:646       computing window post   {"batch": 0, "elapsed": 0.000332559}
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:648       error generating window post: could not read from path="p_aux"
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:105       runPoStCycle failed: running window post failed:
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:49        Got err running window post failed:
2022-02-18T05:29:59.699+0100    WARN    storageminer    storage/wdpost_changehandler.go:255     Aborted window post Proving (Deadline: &{CurrentEpoch:703260 PeriodStart:703227 Index:0 Open:703227 Close:703287 Challenge:703207 FaultCutoff:703157 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})

Repo Steps

  1. Mark a CC sector for snap-up
  2. Self-deal some smaller deal into it (rough commands sketched below)
  3. Seal the sector on a separate server, with a lotus-worker doing RU, PR1 and PR2
  4. Not something you can reproduce exactly, but screw up the process (I think there was an error in the transfer of the data back to the miner) so the sector gets stuck in the FinalizeReplicaUpdate state
  5. Watch WindowPoSt fail.
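
A rough sketch of steps 1-2; I believe mark-for-upgrade is the subcommand used for SnapDeals in this release, and the file path and deal parameters here are hypothetical, so the exact commands and flags may differ:

$ lotus-miner sectors mark-for-upgrade 3
$ lotus client import ~/testdata/small-file.bin
$ lotus client deal    # interactive mode, pointing the deal at my own miner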
@jennijuju added the P1 (P1: Must be resolved) and SnapDeals labels and removed the kind/bug (Kind: Bug) label on Feb 18, 2022
@ZenGround0 (Contributor) commented Feb 18, 2022

@benjaminh83 if you could answer some questions, it would help me make progress on this. They are probably going to seem pretty basic since this is my first lotus-miner debugging session, so thanks for bearing with me.

  1. I'm not sure I understand your configuration. Is the "separate server" a completely different lotus-miner process, as opposed to a separate worker? Or is this rather happening on a worker on a different server?
  2. A little bit of background to catch me up. Is it not the case that the window post scheduler can access all sectors in storage on all worker machines? I'm a bit confused as to why the updated replica needed to be transferred to "the miner" in the first place. As far as I understand the FinalizeReplicaUpdate step moves the updated replica into long term storage away from the miner.
  3. Quoting you: "A failure I cannot seem to escape! I think this is rather concerning, as this could mean faulting a full partition." Do I understand correctly that the post scheduler usually detects corrupted data and automatically faults sectors, and that in this case that is not happening? So instead of posting with one sector marked as faulty, your post scheduler just doesn't post at all, and therefore you could lose the whole partition?

  4. If I've got the above right, here is the solution I am imagining: the post scheduler handles the case where a replica update cannot be reached, marks that sector as faulty, and completes the post over the partition with that sector faulted (see the sketch below).
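
To make point 4 concrete, here is a minimal sketch (not the actual lotus code) of the behaviour I have in mind: probe each sector's replica files before generating the WindowPoSt and skip unreadable ones instead of failing the whole partition. The SectorRef type and the paths are hypothetical.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// SectorRef is a hypothetical stand-in for a sector number plus its on-disk location.
type SectorRef struct {
	Number uint64
	Path   string // directory holding p_aux, the replica, etc.
}

// checkProvable splits sectors into ones whose replica metadata is readable
// and ones that should be declared faulty/skipped for this deadline.
func checkProvable(sectors []SectorRef) (good, skipped []SectorRef) {
	for _, s := range sectors {
		// The WindowPoSt error in this issue was `could not read from path="p_aux"`,
		// so probing p_aux is the minimal readability check.
		if _, err := os.Stat(filepath.Join(s.Path, "p_aux")); err != nil {
			skipped = append(skipped, s)
			continue
		}
		good = append(good, s)
	}
	return good, skipped
}

func main() {
	sectors := []SectorRef{
		{Number: 2, Path: "/mnt/store/update-cache/s-t031131-2"},
		{Number: 3, Path: "/mnt/store/update-cache/s-t031131-3"}, // in this issue, sector 3's files are missing on the miner
	}
	good, skipped := checkProvable(sectors)
	fmt.Printf("proving %d sectors, skipping %d as faulty\n", len(good), len(skipped))
	// The real scheduler would then declare the skipped sectors faulty and
	// run the partition PoSt over the remaining good sectors only.
}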

@benjaminh83 (Author) commented:

@ZenGround0 No problem. It's not always easy for us to get all the details right, so it totally makes sense to go over what we have been doing :) I will try to answer each of your questions here:

  1. It is happening on a worker on a different server. My calib test system is 2 servers:
    SERVER1: EPYC, 256GiB mem, Tesla T4, 144TiB (Long Term Storage). Running: daemon, lotus-miner (monolith)
    SERVER2: EPYC, 256GiB mem, 2xRTX2080ti, 3.84TiB (Scratch). Running lotus-worker (AP,PC1,PC2) and lotus-worker (C2,RU,PR1,PR2)

  2. Yes, the last step would be to move the data to long-term storage. But my long-term storage is not set up as NFS and mounted on the worker (SERVER2), so moving the data to long-term storage has to go through the lotus-miner as a GET request and then be saved on the long-term storage attached to the miner.

  3. Yes, @magik6k knows this part very well, but I believe this is the routine where individual sectors should be checked and marked as faulty, so they do not get included in the aggregation of the partition WindowPoSt:
    2022-02-17T05:43:00.723+0100 WARN storageminer storage/wdpost_run.go:227 Checked sectors {"checked": 4, "good": 4}
    In this example sector 3 was actually not good (quite broken) and it ended up in endless retries of WindowPoSt. And yes: if one sector is broken (corrupted) and it is not caught here, then the WindowPoSt process either crashes, or the proof sent on chain could be faulty and get rejected (I think).

  4. Yes, that would normally be the case: all individual incomplete sectors should be marked "not good" and excluded from the partition's proving.

  5. Unless I need to nuke my calib miner, I would also need some fix to hard-remove this faulty sector 3; it is kind of not going away... :)
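
A possible last resort (a hedged guess, assuming the sectors remove subcommand in this release) would be a forced removal; it forfeits the sector's power and collateral, so it is only something I would try on calibnet:

$ lotus-miner sectors remove --really-do-it 3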

Other important info (lotus-miner on port 2345, lotus-workers on ports 3456 and 3457):

$ lotus-miner storage list sectors
Storage                               Sector  Type          State                  Primary  Path use  URLs                                                       
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  0       sealed        Proving                         store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-0        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  0       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-0         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       unsealed      Proving                         store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-1      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-1         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       update        Proving                         store     http://192.168.XX.XX:2345/remote/update/s-t031131-1        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       update-cache  Proving                         store     http://192.168.XX.XX:2345/remote/update-cache/s-t031131-1  
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       unsealed      Proving                         store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-2      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-2         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       update        Proving                         store     http://192.168.XX.XX:2345/remote/update/s-t031131-2        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       update-cache  Proving                         store     http://192.168.XX.XX:2345/remote/update-cache/s-t031131-2  
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       unsealed      FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-3      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       sealed        FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-3        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       cache         FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/cache/s-t031131-3         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  4       sealed        Proving                         store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-4        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  4       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-4         
153c7166-c103-4df6-9e65-a2888cb86df7  3       unsealed      FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/unsealed/s-t031131-3      
153c7166-c103-4df6-9e65-a2888cb86df7  3       sealed        FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/sealed/s-t031131-3        
153c7166-c103-4df6-9e65-a2888cb86df7  3       cache         FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/cache/s-t031131-3         
153c7166-c103-4df6-9e65-a2888cb86df7  3       update        FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/update/s-t031131-3        
153c7166-c103-4df6-9e65-a2888cb86df7  3       update-cache  FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/update-cache/s-t031131-3  
21064779-0f18-4de7-a82b-fbdcaaa7d21e  1       unsealed      Proving                primary  seal      http://192.168.XX.45:3456/remote/unsealed/s-t031131-1      
21064779-0f18-4de7-a82b-fbdcaaa7d21e  2       unsealed      Proving                primary  seal      http://192.168.XX.45:3456/remote/unsealed/s-t031131-2  

So the data is still located on the lotus-worker... This is the storage layout:

4c8fcabd-fdb8-48f6-9055-5d0f06b51ede:
        [###################################               ] 103.9 TiB/144.9 TiB 71%
        Unsealed: 3; Sealed: 3; Caches: 5; Reserved: 0 B
        Weight: 10; Use: Store
        Local: /mnt/md126/teststorage/
        URL: http://192.168.XX.XX:2345/remote

153c7166-c103-4df6-9e65-a2888cb86df7:
        [#######                                           ] 519.2 GiB/3.437 TiB 14%
        Unsealed: 1; Sealed: 1; Caches: 1; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://192.168.XX.45:3457/remote (latency: 900µs)

21064779-0f18-4de7-a82b-fbdcaaa7d21e:
        [#######                                           ] 519.2 GiB/3.437 TiB 14%
        Unsealed: 2; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://192.168.XX.45:3456/remote (latency: 1.1ms)

c756abae-2635-4763-aa27-cad5b553966b:
        [#####                                             ] 214.1 GiB/1.833 TiB 11%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        Local: /home/acemgt/.lotusminer
        URL: http://192.168.XX.XX:2345/remote

@ZenGround0 (Contributor) commented:
This should be fixed by #8177; please reopen if you encounter a similar problem.
