Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Ensure remove_bank_snapshot success #31415

Conversation

xiangzhu70
Copy link
Contributor

@xiangzhu70 xiangzhu70 commented Apr 29, 2023

Problem

fn remove_bank_snapshot may fail with an IO error, causing an error! log, and leave the snapshot dir on disk. This is unsafe, may cause uncertain subsequent behaviors, and leak the disk space.

Summary of Changes

Make this function not return an IO error. In unlikely file system error cases, it panic via expect(). Otherwise, it succeeds with that dir cleanly removed and the corresponding account_path/snapshot/<slot> removed.

Add a test function to verify the removing result for different cases;

Call it to remove the dir whenever any error is returned from BankSnapshotInfo::new_from_dir().

Fixes #

@codecov
Copy link

codecov bot commented Apr 29, 2023

Codecov Report

Merging #31415 (08cf1b9) into master (aa7baaf) will increase coverage by 0.0%.
The diff coverage is 91.0%.

@@           Coverage Diff           @@
##           master   #31415   +/-   ##
=======================================
  Coverage    81.4%    81.4%           
=======================================
  Files         731      731           
  Lines      208491   208531   +40     
=======================================
+ Hits       169778   169822   +44     
+ Misses      38713    38709    -4     

@xiangzhu70 xiangzhu70 marked this pull request as ready for review April 29, 2023 03:46
@xiangzhu70 xiangzhu70 self-assigned this Apr 29, 2023
@xiangzhu70 xiangzhu70 requested a review from brooksprumo May 1, 2023 16:18
Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nit, otherwise looks good to me

Comment on lines 1364 to 1370
Err(e) => {
warn!(
"{}",
format!("read_link error {} on {}", e, symlink.display())
);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Err(e) => {
warn!(
"{}",
format!("read_link error {} on {}", e, symlink.display())
);
}
}
Err(e) => warn!("read_link error {} on {}", e, symlink.display()),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated the code.

"Unable to read bank snapshot for slot {}: {}. Removing the dir",
slot, err
);
remove_bank_snapshot(slot, &bank_snapshots_dir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change of behavior to the function does make me wince... The function claims to return the bank snapshots within a directory, yet now it'll also destructively modify the directory too.

Which caller(s) need this new functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_bank_snapshots() does the work of scanning the directories under the top bank_snapshots_dir and return the vector of snapshots (BankSnapshotInfo) for the snapshot dirs not returning any error. This is called in purge_old_bank_snapshots() and do_get_highest_bank_snapshot()

The previous behavior is that for any snapshot dirs which returns error, it generates an error! log and leaves that bad dir on disk.

This new behavior is to always remove any snapshot dir whenever reading from it returns an error, failing to return a valid BankSnapshotInfo.

I think this is the clean way, ensuring there is no any leftover bad dir on the disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your concern is that the function name get_bank_snapshots() is misleading, doing cleaning work in additional to getting. I agree on that. The real clean way could be as below
Have a SnapshotTopDir struct.
It has the member functions
fn scan_snapshots_dir(): does scanning and cleaning, and building a vector of BankSnapshotInfos.
fn get_snapshots(): return the built vector of BankSnapshotInfos.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to understand is what is the call stack that requires this new behavior.

I understand that purge_old_bank_snapshots() and do_get_highest_bank_snapshot() are the immediate callers of get_bank_snapshots(), but they don't fail if there's a bad directory. So it's some other caller further up the chain. That's what I'd like to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no other caller. The software process does not fail if there is a bad directory. It is just cleaner not to leave a bad directory on disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in one case the software process could fail without this.

Suppose the account paths are in the memory mounted file system, taking up the memory space. The software process is killed when add_bank_snpashot() is almost done, but not done yet to have the completion file flag set up, so it leaves an incomplete snapshot dir on disk with its account fles taking much memory space. Then, without this cleanup step, loading the new bank could fail due to Out of Memory (OOM).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back in Add remove_incomplete_bank_snapshot_dir (#31131), the handling on incomplete snapshot dir was added. This is an improved, more solid version of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is useful to cleanup bad bank snapshot directories. Based on this discussion, my understanding is that this change is not required for this PR, which is about remove_bank_snapshot(). A separate PR that introduces a fn to purge invalid files/directories within a given bank snapshot dir feel appropriate.

However, if there's a reason why this change needs to be in this PR, then I am still missing something and do not fully understand. If that's the case, can you help explain more, please?

Copy link
Contributor Author

@xiangzhu70 xiangzhu70 May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR "Add remove_incomplete_bank_snapshot_dir" (#31131), the first commit aaafef7 was to have a separate function to to purge invalid snapshot dirs.
That scans through the snapshots dir and delete the invalid ones.

I reverted that change, and opted to delete the dir in-place inside get_bank_snapshots() because that directory scanning part is duplicated logic, same to the scanning part in get_bank_snapshots. I think it is much simpler to examine a snapshot dir in one place, looking at the scanning result for once, either accept it and construct a BankSnapshotInfo for it, or reject it and cleanup its disk space.

PR 31131 still leaves a few issues.

  1. The completion check is not done as the 1st check. An incomplete dir caused some other IO failures before reaching the completion check, so the directory was not removed. It is later fixed by " Check incomplete file flag first #31409"
  2. The handling of incompletion is to call fs::remove_dir_all(bank_snapshot_dir) to remove the dir, leaving the accounts files orphaned. The clean_orphaned_account_snapshot_dirs call at the next process start time will clean up them. It is fine when they are using the disk space. But in the case memory-mounted account-paths, this can cause OOM.
    The reason of calling remove_dir_all instead of remove_bank_snapshot is that I was not sure if remove_bank_snapshot is good for a snapshot dir in bad state. I am afraid it may just raise some IO errors and quit, leaving the dir unremoved. This PR is to address that concern, make sure it can remove the dir regardless of the state of the dir. Then I can call it on any bad dir and be sure the dir is cleanly removed.
  3. Among the errors from new_from_dir, only the incompletion error is handled. The rest only result in an error! message. Although they are not likely to happen, it is logically incomplete. Just in case, some such error does happen, that disk space (or memory space) is not freely immediately. This is also fixed in this PR.

@xiangzhu70 xiangzhu70 requested review from apfitzge and brooksprumo May 1, 2023 18:42
@mergify mergify bot added community Community contribution need:merge-assist labels May 4, 2023
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 19, 2023
@github-actions github-actions bot closed this May 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
community Community contribution need:merge-assist stale [bot only] Added to stale content; results in auto-close after a week.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants