
Snap-up sector stalled in FinalizeReplicaUpdate, causing WindowPoSt error and 100% loss of power in partition #8148

Closed
benjaminh83 opened this issue Feb 18, 2022 · 3 comments
Labels: P1 (P1: Must be resolved), SnapDeals

Comments

@benjaminh83

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus-related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.14.0-rc7+calibnet+git.c217ccf54+api1.3.0
Local: lotus-miner version 1.14.0-rc7+calibnet+git.c217ccf54

Describe the Bug

I made a snap-up sector that hit an error while the updated replica was being transferred back to the miner. This could happen for various reasons, but the point is that it did not succeed and the sector (sector 3) is now stuck in FinalizeReplicaUpdate.

$ lotus-miner sectors list
ID  State                  OnChain  Active  Expiration                    Deals  DealWeight  
0   Proving                YES      NO      2223840 (in 1 year 23 weeks)  CC                 
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
1   Proving                YES      NO      2225996 (in 1 year 23 weeks)  1      2.703GiB    
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
2   Proving                YES      NO      2228336 (in 1 year 23 weeks)  1      2.7GiB      
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
3   FinalizeReplicaUpdate  YES      NO      2229509 (in 1 year 23 weeks)  1      5.426GiB    
  RecoveryTimeout: 821366 (in 5 weeks 5 days)
4   Proving                YES      NO      2250128 (in 1 year 24 weeks)  CC                 
  RecoveryTimeout: 824246 (in 5 weeks 6 days)

It is impossible to make any changes to the sector state, or even to abandon the snap-up:

$ lotus-miner sectors abort-upgrade 3
ERROR: normal shutdown of state machine
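
For completeness, the sector's event history can also be dumped with the status subcommand (the --log flag should be available in this release; output omitted here):

$ lotus-miner sectors status --log 3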

The result of this failed snap-up sector is a WindowPoSt failure, and it is a failure I cannot seem to escape!

I find this rather concerning, as it could mean faulting a full partition. Good thing I'm on the calibration network, because my miner has currently lost all of its power and is getting slashed.
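
The per-sector provability check can also be run by hand for the affected deadline (deadline index 0 in the logs below); this assumes the proving check subcommand available in this release:

$ lotus-miner proving check 0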

Logging Information

2022-02-18T05:29:32.042+0100    WARN    storageminer    storage/wdpost_run.go:227       Checked sectors {"checked": 5, "good": 5}
2022-02-18T05:29:32.043+0100    INFO    storageminer    storage/wdpost_run.go:626       running window post     {"chain-random": "udN22rXlJ97KH9kgjhe7J8eI4nOY/wISOsmXMi9SnkY=", "deadline": {"CurrentEpoch":703259,"PeriodStart":703227,"Index":0,"Open":703227,"Close":703287,"Challenge":703207,"FaultCutoff":703157,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "703259", "skipped": 0}
2022-02-18T05:29:32.043 INFO filcrypto::proofs::api > generate_window_post: start
2022-02-18T05:29:32.043 INFO filcrypto::proofs::api > generate_window_post: finish
2022-02-18T05:29:32.043+0100    INFO    storageminer    storage/wdpost_run.go:646       computing window post   {"batch": 0, "elapsed": 0.000314858}
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:648       error generating window post: could not read from path="p_aux"
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:105       runPoStCycle failed: running window post failed:
2022-02-18T05:29:32.043+0100    ERROR   storageminer    storage/wdpost_run.go:49        Got err running window post failed:
2022-02-18T05:29:32.043+0100    WARN    storageminer    storage/wdpost_changehandler.go:255     Aborted window post Proving (Deadline: &{CurrentEpoch:703259 PeriodStart:703227 Index:0 Open:703227 Close:703287 Challenge:703207 FaultCutoff:703157 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})
2022-02-18T05:29:36.006+0100    INFO    miner   miner/miner.go:479      completed mineOne       {"tookMilliseconds": 2, "forRound": 703260, "baseEpoch": 703259, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 0, "lookbackEpochs": 900, "networkPowerAtLookback": "2419166099275776", "minerPowerAtLookback": "0", "isEligible": false, "isWinner": false, "error": null}
2022-02-18T05:29:59.698+0100    WARN    storageminer    storage/wdpost_run.go:227       Checked sectors {"checked": 5, "good": 5}
2022-02-18T05:29:59.698+0100    INFO    storageminer    storage/wdpost_run.go:626       running window post     {"chain-random": "udN22rXlJ97KH9kgjhe7J8eI4nOY/wISOsmXMi9SnkY=", "deadline": {"CurrentEpoch":703260,"PeriodStart":703227,"Index":0,"Open":703227,"Close":703287,"Challenge":703207,"FaultCutoff":703157,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "703260", "skipped": 0}
2022-02-18T05:29:59.699 INFO filcrypto::proofs::api > generate_window_post: start
2022-02-18T05:29:59.699 INFO filcrypto::proofs::api > generate_window_post: finish
2022-02-18T05:29:59.699+0100    INFO    storageminer    storage/wdpost_run.go:646       computing window post   {"batch": 0, "elapsed": 0.000332559}
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:648       error generating window post: could not read from path="p_aux"
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:105       runPoStCycle failed: running window post failed:
2022-02-18T05:29:59.699+0100    ERROR   storageminer    storage/wdpost_run.go:49        Got err running window post failed:
2022-02-18T05:29:59.699+0100    WARN    storageminer    storage/wdpost_changehandler.go:255     Aborted window post Proving (Deadline: &{CurrentEpoch:703260 PeriodStart:703227 Index:0 Open:703227 Close:703287 Challenge:703207 FaultCutoff:703157 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})

Repo Steps

  1. Mark a CC sector for snap-up
  2. Self-deal some smaller deal into it (rough commands sketched below)
  3. Seal the sector on a separate server, with a lotus-worker doing RU, PR1 and PR2
  4. Not something you can reproduce exactly, but screw up the process (I think there was an error in the transfer of the data back to the miner) so the sector gets stuck in the FinalizeReplicaUpdate state
  5. Watch WindowPoSt fail.
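
A rough sketch of steps 1-2; I believe mark-for-upgrade is the subcommand used for SnapDeals in this release, and the file path and deal parameters here are hypothetical, so the exact commands and flags may differ:

$ lotus-miner sectors mark-for-upgrade 3
$ lotus client import ~/testdata/small-file.bin
$ lotus client deal    # interactive mode, pointing the deal at my own miner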
@jennijuju added the P1 (P1: Must be resolved) and SnapDeals labels and removed the kind/bug (Kind: Bug) label on Feb 18, 2022
@ZenGround0 (Contributor) commented Feb 18, 2022

@benjaminh83 if you could answer some questions, it would help me make progress on this. They are probably going to seem pretty basic since this is my first lotus-miner debugging session, so thanks for bearing with me.

  1. I'm not sure I understand your configuration. Is the "separate server" a completely different lotus-miner process, as opposed to a separate worker? Or is this rather happening on a worker on a different server?
  2. A little bit of background to catch me up. Is it not the case that the window post scheduler can access all sectors in storage on all worker machines? I'm a bit confused as to why the updated replica needed to be transferred to "the miner" in the first place. As far as I understand the FinalizeReplicaUpdate step moves the updated replica into long term storage away from the miner.
  3. Quoting you: "A failure I cannot seem to escape! I think this is rather concerning, as this could mean faulting a full partition." Do I understand correctly that the post scheduler usually detects corrupted data and automatically faults sectors, and that in this case that is not happening? So instead of posting with one sector marked as faulty, your post scheduler just doesn't post at all, and therefore you could lose the whole partition?

  4. If I've got the above right, here is the solution I am imagining: the post scheduler handles the case where a replica update cannot be reached, marks that sector as faulty, and completes the post over the partition with that sector faulted (see the sketch below).
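
To make point 4 concrete, here is a minimal sketch (not the actual lotus code) of the behaviour I have in mind: probe each sector's replica files before generating the WindowPoSt and skip unreadable ones instead of failing the whole partition. The SectorRef type and the paths are hypothetical.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// SectorRef is a hypothetical stand-in for a sector number plus its on-disk location.
type SectorRef struct {
	Number uint64
	Path   string // directory holding p_aux, the replica, etc.
}

// checkProvable splits sectors into ones whose replica metadata is readable
// and ones that should be declared faulty/skipped for this deadline.
func checkProvable(sectors []SectorRef) (good, skipped []SectorRef) {
	for _, s := range sectors {
		// The WindowPoSt error in this issue was `could not read from path="p_aux"`,
		// so probing p_aux is the minimal readability check.
		if _, err := os.Stat(filepath.Join(s.Path, "p_aux")); err != nil {
			skipped = append(skipped, s)
			continue
		}
		good = append(good, s)
	}
	return good, skipped
}

func main() {
	sectors := []SectorRef{
		{Number: 2, Path: "/mnt/store/update-cache/s-t031131-2"},
		{Number: 3, Path: "/mnt/store/update-cache/s-t031131-3"}, // in this issue, sector 3's files are missing on the miner
	}
	good, skipped := checkProvable(sectors)
	fmt.Printf("proving %d sectors, skipping %d as faulty\n", len(good), len(skipped))
	// The real scheduler would then declare the skipped sectors faulty and
	// run the partition PoSt over the remaining good sectors only.
}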

@benjaminh83 (Author) commented:

@ZenGround0 No problem. It's not always easy for us to get all the details right, so it totally makes sense to go over what we have been doing :) I will try to answer each of your questions here:

  1. It is happening on a worker on a different server. My calib test system is 2 servers:
    SERVER1: EPYC, 256GiB mem, Tesla T4, 144TiB (Long Term Storage). Running: daemon, lotus-miner (monolith)
    SERVER2: EPYC, 256GiB mem, 2xRTX2080ti, 3.84TiB (Scratch). Running lotus-worker (AP,PC1,PC2) and lotus-worker (C2,RU,PR1,PR2)

  2. Yes, the last step would be to move the data to long-term storage. But my long-term storage is not set up as NFS and mounted on the worker (SERVER2), so moving the data to long-term storage has to go through the lotus-miner as a GET request and then be saved on the long-term storage attached to the miner.

  3. Yes, @magik6k knows this part very well, but I believe this is the routine where individual sectors should be checked and marked as faulty, so they do not get included in the aggregation of the partition WindowPoSt:
    2022-02-17T05:43:00.723+0100 WARN storageminer storage/wdpost_run.go:227 Checked sectors {"checked": 4, "good": 4}
    In this example sector 3 was actually not good (quite broken) and it ended up in endless retries of WindowPoSt. And yes: if one sector is broken (corrupted) and it is not caught here, then the WindowPoSt process either crashes, or the proof sent on chain could be faulty and get rejected (I think).

  4. Yes, that would normally be the case: all individual incomplete sectors should be marked "not good" and excluded from the partition's proving.

  5. Unless I need to nuke my calib miner, I would also need some fix to hard-remove this faulty sector 3; it is kind of not going away... :)
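
A possible last resort (a hedged guess, assuming the sectors remove subcommand in this release) would be a forced removal; it forfeits the sector's power and collateral, so it is only something I would try on calibnet:

$ lotus-miner sectors remove --really-do-it 3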

Other important info (lotus-miner on port 2345, lotus-workers on ports 3456 and 3457):

$ lotus-miner storage list sectors
Storage                               Sector  Type          State                  Primary  Path use  URLs                                                       
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  0       sealed        Proving                         store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-0        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  0       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-0         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       unsealed      Proving                         store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-1      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-1         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       update        Proving                         store     http://192.168.XX.XX:2345/remote/update/s-t031131-1        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  1       update-cache  Proving                         store     http://192.168.XX.XX:2345/remote/update-cache/s-t031131-1  
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       unsealed      Proving                         store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-2      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-2         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       update        Proving                         store     http://192.168.XX.XX:2345/remote/update/s-t031131-2        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  2       update-cache  Proving                         store     http://192.168.XX.XX:2345/remote/update-cache/s-t031131-2  
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       unsealed      FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/unsealed/s-t031131-3      
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       sealed        FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-3        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  3       cache         FinalizeReplicaUpdate           store     http://192.168.XX.XX:2345/remote/cache/s-t031131-3         
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  4       sealed        Proving                         store     http://192.168.XX.XX:2345/remote/sealed/s-t031131-4        
4c8fcabd-fdb8-48f6-9055-5d0f06b51ede  4       cache         Proving                         store     http://192.168.XX.XX:2345/remote/cache/s-t031131-4         
153c7166-c103-4df6-9e65-a2888cb86df7  3       unsealed      FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/unsealed/s-t031131-3      
153c7166-c103-4df6-9e65-a2888cb86df7  3       sealed        FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/sealed/s-t031131-3        
153c7166-c103-4df6-9e65-a2888cb86df7  3       cache         FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/cache/s-t031131-3         
153c7166-c103-4df6-9e65-a2888cb86df7  3       update        FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/update/s-t031131-3        
153c7166-c103-4df6-9e65-a2888cb86df7  3       update-cache  FinalizeReplicaUpdate  primary  seal      http://192.168.XX.45:3457/remote/update-cache/s-t031131-3  
21064779-0f18-4de7-a82b-fbdcaaa7d21e  1       unsealed      Proving                primary  seal      http://192.168.XX.45:3456/remote/unsealed/s-t031131-1      
21064779-0f18-4de7-a82b-fbdcaaa7d21e  2       unsealed      Proving                primary  seal      http://192.168.XX.45:3456/remote/unsealed/s-t031131-2  

So the data is still located on the lotus-worker... This is the storage layout:

4c8fcabd-fdb8-48f6-9055-5d0f06b51ede:
        [###################################               ] 103.9 TiB/144.9 TiB 71%
        Unsealed: 3; Sealed: 3; Caches: 5; Reserved: 0 B
        Weight: 10; Use: Store
        Local: /mnt/md126/teststorage/
        URL: http://192.168.XX.XX:2345/remote

153c7166-c103-4df6-9e65-a2888cb86df7:
        [#######                                           ] 519.2 GiB/3.437 TiB 14%
        Unsealed: 1; Sealed: 1; Caches: 1; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://192.168.XX.45:3457/remote (latency: 900µs)

21064779-0f18-4de7-a82b-fbdcaaa7d21e:
        [#######                                           ] 519.2 GiB/3.437 TiB 14%
        Unsealed: 2; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        URL: http://192.168.XX.45:3456/remote (latency: 1.1ms)

c756abae-2635-4763-aa27-cad5b553966b:
        [#####                                             ] 214.1 GiB/1.833 TiB 11%
        Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
        Weight: 10; Use: Seal 
        Local: /home/acemgt/.lotusminer
        URL: http://192.168.XX.XX:2345/remote

@ZenGround0 (Contributor) commented:
This should be fixed by #8177; please reopen if you encounter a similar problem.
