When creating a snapshot, ledger tool should check the blockstore has the desired slot #26475

Closed
buffalu opened this issue Jul 7, 2022 · 13 comments

Comments

@buffalu
Contributor

buffalu commented Jul 7, 2022

Problem

The feedback loop for determining whether the blockstore has a desired slot is super slow. Takes ~15 minutes to find out that you don't have a desired slot when trying to create a snapshot.

Proposed Solution

When blockstore is loaded, check to make sure the desired slot you'd like to replay to is present

@steviez
Contributor

steviez commented Jul 7, 2022

@buffalu - Which version of solana-ledger-tool were you using? The following PR makes it so we traverse from slots to their children to see whether the desired snapshot slot is reachable.
#25632

If you are using a version that has this commit, then it is possible that there is a bug or some case that @apfitzge and I didn't consider with that check. If so, we'd appreciate some more details as this PR was in response to us also being frustrated with the exact same issue you're hitting and wanting to fail fast
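For readers following along, the kind of fail-fast check described above can be sketched roughly as follows. This is a minimal illustration, not the code from #25632: the `children` map is a hypothetical stand-in for the parent/child links the blockstore keeps per slot, and `is_slot_reachable` is a made-up name.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type Slot = u64;

/// Returns true if `target` can be reached from `start` by following
/// parent -> child links. `children` is a hypothetical in-memory stand-in
/// for the per-slot metadata stored in a blockstore.
fn is_slot_reachable(children: &HashMap<Slot, Vec<Slot>>, start: Slot, target: Slot) -> bool {
    if start == target {
        return true;
    }
    let mut queue = VecDeque::from([start]);
    let mut visited = HashSet::from([start]);
    while let Some(slot) = queue.pop_front() {
        // A slot with no entry has no known children, so the walk stops there.
        if let Some(kids) = children.get(&slot) {
            for &child in kids {
                if child == target {
                    return true;
                }
                // Children always have higher slot numbers than their parent,
                // so anything past the target cannot lead back to it.
                if child < target && visited.insert(child) {
                    queue.push_back(child);
                }
            }
        }
    }
    false
}
```

Because the walk only touches slot metadata, a check like this can run before any snapshot is unpacked or any block is replayed.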

@buffalu
Contributor Author

buffalu commented Jul 7, 2022

i think it might be another issue which i mentioned in discord. pasting the full comment here:

$ cat bounds.txt
Ledger has data for 414702 slots 138670243 to 139104028
  with 370127 rooted slots from 138670243 to 139103980
  and 42 slots past the last root
$ RUST_LOG=info solana-ledger-tool -l /mnt/disks/disk2 create-snapshot 138671999
... tons of logs
[2022-07-07T16:42:11.862188957Z INFO  solana_ledger::blockstore_processor] ledger processed in 17 ms, 429 µs and 285 ns. root slot is 138670016, 1 bank: 138670016
Error: Slot 138671999 is not available

so the first snapshot at the top level is snapshot-138670016-....tar.zst.
the first one in hourly is snapshot-138680309-....tar.zst.
the ledger is supposedly 138670243 to 139104028.

time ->
-----snapshot 138670016 --------- ledger start 138670243 ----- snapshot 138680309

is the problem that this snapshot has a gap from the top level snapshot to the start of ledger? you basically need to start at the snapshot that's above the minimum ledger start slot?

@buffalu
Contributor Author

buffalu commented Jul 7, 2022

this was downloaded from ny bucket 5 snapshot 138670016.

perhaps the issue is that there's no snapshot that lines up with the beginning of ledger start, so you can only start processing from the first snapshot you have?

@steviez
Contributor

steviez commented Jul 7, 2022

which i mentioned in discord

Thanks for transcribing - I prefer GH once we hone in on a specific problem.

is the problem that this snapshot has a gap from the top level snapshot to the start of ledger?

Yes

you basically need to start at the snapshot that's above the minimum ledger start slot?

Yes, a full snapshot at slot S will give you the full account state at slot S. If you want to know the account state for slot S + n, you need the blocks in [Sc, S + n] where Sc is S's direct child; S isn't actually replayed.

Also, the folder you included is for epoch 321; the slot range is [138_672_000, 139_103_999]. So, that first snapshot is technically outside of the epoch bounds.
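The epoch bounds quoted above can be reproduced with a little arithmetic. The sketch below assumes a fixed epoch length of 432,000 slots with no warmup period; the constant and function names are illustrative and not taken from the Solana codebase.

```rust
type Slot = u64;

/// Assumption: every epoch is exactly 432,000 slots long (no warmup epochs).
const SLOTS_PER_EPOCH: u64 = 432_000;

/// First and last slot (inclusive) of the given epoch.
fn epoch_slot_range(epoch: u64) -> (Slot, Slot) {
    let first = epoch * SLOTS_PER_EPOCH;
    (first, first + SLOTS_PER_EPOCH - 1)
}

/// Epoch that contains the given slot.
fn epoch_of(slot: Slot) -> u64 {
    slot / SLOTS_PER_EPOCH
}
```

Under that assumption, `epoch_slot_range(321)` gives `(138_672_000, 139_103_999)`, and both the snapshot slot 138670016 and the requested slot 138671999 fall in epoch 320.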

@steviez
Contributor

steviez commented Jul 7, 2022

Also, you probably already know this, but just in case someone less familiar is reading along - the snapshots in the hourly subdirectory won't be found by solana-ledger-tool. That is, if you pass --snapshot-archive-path <SOME_DIR> to solana-ledger-tool, snapshots in <SOME_DIR>/sub_directory/ won't be found.
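To illustrate what a non-recursive lookup means in practice, here is a small standalone sketch. It is not how solana-ledger-tool discovers archives; `top_level_files` is a hypothetical helper that simply shows a directory scan that does not descend into subdirectories.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists regular files directly inside `dir`. Subdirectories (such as an
/// `hourly` folder) are skipped rather than searched.
fn top_level_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            files.push(entry.path());
        }
    }
    // read_dir gives no ordering guarantee, so sort for stable output.
    files.sort();
    Ok(files)
}
```

With a scan like this, an archive placed in `<SOME_DIR>/hourly/` is invisible until it is moved or copied up into `<SOME_DIR>` itself.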

@buffalu
Contributor Author

buffalu commented Jul 7, 2022

i didn't know that, thanks for the heads up

@steviez
Contributor

steviez commented Jul 7, 2022

I'd have to dig further to confirm, but I believe we put them in the hourly subdirectory to avoid having the automatic snapshot retention policy nuke them on warehouse nodes.

Takes ~15 minutes to find out that you don't have a desired slot when trying to create a snapshot.

Bringing things back to actionable items, did you check which version of solana-ledger-tool you were using? That check I previously mentioned happens before snapshot is unpacked, so there should be minimal time wasted if we don't have the proper blockstore data to advance from existing snapshot slot to desired snapshot creation slot

@buffalu
Contributor Author

buffalu commented Jul 7, 2022

i was running 1.10.29.

is the issue that blockstore had those slots so that passed, but it didn't have the snapshot so it just loaded what it had?

@steviez
Contributor

steviez commented Jul 7, 2022

i was running 1.10.29.

Ahh, the check was added to master recently, and wasn't backported. So only in master / v1.11.

@steviez
Contributor

steviez commented Jul 8, 2022

is the issue that blockstore had those slots so that passed, but it didn't have the snapshot so it just loaded what it had?

To provide more detail, that error comes from here:

solana/ledger-tool/src/main.rs

Lines 2515 to 2523 in d9eee72

Ok((bank_forks, starting_snapshot_hashes)) => {
    let mut bank = bank_forks
        .read()
        .unwrap()
        .get(snapshot_slot)
        .unwrap_or_else(|| {
            eprintln!("Error: Slot {} is not available", snapshot_slot);
            exit(1);
        });

This is after bank_forks are loaded; processing stopped much earlier because you didn't have the blockstore slots to proceed from loaded snapshot slot. From here, it follows that the bank for desired snapshot creation slot doesn't exist either.

This error could probably be better; I'll look at tweaking the message to make it clearer.

@buffalu
Contributor Author

buffalu commented Jul 8, 2022

right, it takes ~15 minutes to get there even if you don't have that snapshot. would be nice to catch it earlier

@steviez
Contributor

steviez commented Jul 8, 2022

That check I previously mentioned happens before snapshot is unpacked, so there should be minimal time wasted

the check was added to master recently, and wasn't backported. So only in master / v1.11.

We catch it earlier in master. If you want to try it out yourself, that'd be cool so we can know for sure. If not, I feel fairly confident and would lean towards closing this issue

@steviez
Contributor

steviez commented Jul 11, 2022

Closing this - #25632 added a check in v1.11 that verifies appropriate slots exist prior to creating snapshot
