-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1: logstore: sync sideloaded storage directories #117295
Conversation
This commit removes the dirCreated field of DiskSideloadStorage, because it is only used in tests, and is reduntant (directory existence check already does the job). Epic: none Release note: none
Epic: none Release note: none
A couple of things to address in the future: sideloaded files removal should happen strictly after a state machine sync; sideloaded files and directories should be cleaned up on startup because their removal is not always durable. Epic: none Release note: none
This commit adds `TestSideloadStorageSync` which demonstrates that the sideloaded log storage can lose files and directories upon system crash. This is due to the fact that the whole directory hierarchy is not properly synced when the directories and files are created. A typical sideloaded storage file (entry 123 at term 44 for range r1234) looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`. Only existence of auxiliary directory is persisted upon its creation, by syncing the <data-dir> when Pebble initializes the store. All other directories (sideloading, r1XXX, r1234) are not persisted upon creation. Epic: none Release note: none
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
db2de03
to
68e7642
Compare
@erikgrinaker Note that this had to be modified to be backported. Most of the diff is test-only. The only non-test change is in the |
68e7642
to
d87cf8b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but needs TL approval and a post in #db-backports-point-releases too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)
pkg/kv/kvserver/logstore/sideload_disk.go
line 272 at r5 (raw file):
// mkdirAllAndSyncParents creates the given directory and all its missing // parents if any. For every newly created directly, it syncs the corresponding
s/directly/directory/
The sideloaded log storage does not sync the hierarchy of directories it creates. This can potentially lead to full or partial loss of its sub-directories in case of a system crash or power off. After this commit, every time sideloaded storage creates a directory, it syncs its parent so that the reference is durable. Epic: none Release note: none
d87cf8b
to
e75bd74
Compare
Backport 5/6 commits from #114616.
/cc @cockroachdb/release
This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.
Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees the files content and metadata is synced, this still does not guarantee that the references to these files are durable. For example, Linux man page for
fsync
1 says:It means that these files can be lost after a system crash of power off. This leads to issues because:
Pebble WAL syncs are not atomic with the sideloaded files syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft will try to load these entries to apply and/or send to other replicas.
The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.
This PR fixes the above issues, by syncing parents of all the directories and files that the sideloaded storage creates.
Part of #114411
Epic: none
Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (
AddSSTable
) that is e.g. used byRESTORE
. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.Release justification: critical bug fix
Footnotes
https://man7.org/linux/man-pages/man2/fsync.2.html ↩