-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Ensure remove_bank_snapshot success #31415
Ensure remove_bank_snapshot success #31415
Conversation
Codecov Report
@@ Coverage Diff @@
## master #31415 +/- ##
=======================================
Coverage 81.4% 81.4%
=======================================
Files 731 731
Lines 208491 208531 +40
=======================================
+ Hits 169778 169822 +44
+ Misses 38713 38709 -4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small nit, otherwise looks good to me
runtime/src/snapshot_utils.rs
Outdated
Err(e) => { | ||
warn!( | ||
"{}", | ||
format!("read_link error {} on {}", e, symlink.display()) | ||
); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err(e) => { | |
warn!( | |
"{}", | |
format!("read_link error {} on {}", e, symlink.display()) | |
); | |
} | |
} | |
Err(e) => warn!("read_link error {} on {}", e, symlink.display()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated the code.
"Unable to read bank snapshot for slot {}: {}. Removing the dir", | ||
slot, err | ||
); | ||
remove_bank_snapshot(slot, &bank_snapshots_dir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change of behavior to the function does make me wince... The function claims to return the bank snapshots within a directory, yet now it'll also destructively modify the directory too.
Which caller(s) need this new functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_bank_snapshots() does the work of scanning the directories under the top bank_snapshots_dir and return the vector of snapshots (BankSnapshotInfo) for the snapshot dirs not returning any error. This is called in purge_old_bank_snapshots() and do_get_highest_bank_snapshot()
The previous behavior is that for any snapshot dirs which returns error, it generates an error! log and leaves that bad dir on disk.
This new behavior is to always remove any snapshot dir whenever reading from it returns an error, failing to return a valid BankSnapshotInfo.
I think this is the clean way, ensuring there is no any leftover bad dir on the disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your concern is that the function name get_bank_snapshots() is misleading, doing cleaning work in additional to getting. I agree on that. The real clean way could be as below
Have a SnapshotTopDir struct.
It has the member functions
fn scan_snapshots_dir(): does scanning and cleaning, and building a vector of BankSnapshotInfos.
fn get_snapshots(): return the built vector of BankSnapshotInfos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm trying to understand is what is the call stack that requires this new behavior.
I understand that purge_old_bank_snapshots()
and do_get_highest_bank_snapshot()
are the immediate callers of get_bank_snapshots()
, but they don't fail if there's a bad directory. So it's some other caller further up the chain. That's what I'd like to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no other caller. The software process does not fail if there is a bad directory. It is just cleaner not to leave a bad directory on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually in one case the software process could fail without this.
Suppose the account paths are in the memory mounted file system, taking up the memory space. The software process is killed when add_bank_snpashot() is almost done, but not done yet to have the completion file flag set up, so it leaves an incomplete snapshot dir on disk with its account fles taking much memory space. Then, without this cleanup step, loading the new bank could fail due to Out of Memory (OOM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in Add remove_incomplete_bank_snapshot_dir (#31131), the handling on incomplete snapshot dir was added. This is an improved, more solid version of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is useful to cleanup bad bank snapshot directories. Based on this discussion, my understanding is that this change is not required for this PR, which is about remove_bank_snapshot()
. A separate PR that introduces a fn to purge invalid files/directories within a given bank snapshot dir feel appropriate.
However, if there's a reason why this change needs to be in this PR, then I am still missing something and do not fully understand. If that's the case, can you help explain more, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PR "Add remove_incomplete_bank_snapshot_dir" (#31131), the first commit aaafef7 was to have a separate function to to purge invalid snapshot dirs.
That scans through the snapshots dir and delete the invalid ones.
I reverted that change, and opted to delete the dir in-place inside get_bank_snapshots() because that directory scanning part is duplicated logic, same to the scanning part in get_bank_snapshots. I think it is much simpler to examine a snapshot dir in one place, looking at the scanning result for once, either accept it and construct a BankSnapshotInfo for it, or reject it and cleanup its disk space.
PR 31131 still leaves a few issues.
- The completion check is not done as the 1st check. An incomplete dir caused some other IO failures before reaching the completion check, so the directory was not removed. It is later fixed by " Check incomplete file flag first #31409"
- The handling of incompletion is to call fs::remove_dir_all(bank_snapshot_dir) to remove the dir, leaving the accounts files orphaned. The clean_orphaned_account_snapshot_dirs call at the next process start time will clean up them. It is fine when they are using the disk space. But in the case memory-mounted account-paths, this can cause OOM.
The reason of calling remove_dir_all instead of remove_bank_snapshot is that I was not sure if remove_bank_snapshot is good for a snapshot dir in bad state. I am afraid it may just raise some IO errors and quit, leaving the dir unremoved. This PR is to address that concern, make sure it can remove the dir regardless of the state of the dir. Then I can call it on any bad dir and be sure the dir is cleanly removed. - Among the errors from new_from_dir, only the incompletion error is handled. The rest only result in an error! message. Although they are not likely to happen, it is logically incomplete. Just in case, some such error does happen, that disk space (or memory space) is not freely immediately. This is also fixed in this PR.
Problem
fn remove_bank_snapshot may fail with an IO error, causing an error! log, and leave the snapshot dir on disk. This is unsafe, may cause uncertain subsequent behaviors, and leak the disk space.
Summary of Changes
Make this function not return an IO error. In unlikely file system error cases, it panic via expect(). Otherwise, it succeeds with that dir cleanly removed and the corresponding account_path/snapshot/<slot> removed.
Add a test function to verify the removing result for different cases;
Call it to remove the dir whenever any error is returned from BankSnapshotInfo::new_from_dir().
Fixes #