-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1: logstore: sync sideloaded storage directories #117383
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The sideloaded storage fsyncs the files that it creates. Even though this guarantees durability of the files content and metadata, this still does not guarantee that the references to these files are durable. For example, Linux man page for fsync [^1] says: ``` Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed. ``` It means that these files can be lost after a system crash of power off. This leads to issues: 1. The storage syncs are not atomic with the sideloaded files syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft will try to load these entries to apply and/or send to other replicas. 2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum. This commit fixes the above issues, by syncing the parent directory after writing sideloaded entry files. The natural point for this is MaybeSideloadEntries on the handleRaftReady path. [^1]: https://man7.org/linux/man-pages/man2/fsync.2.html Epic: none Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable) that is e.g. used by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a coordinated power-off/crash across multiple nodes this could lead to an unrecoverable loss of quorum.
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
blathers-crl
bot
added
the
backport
Label PR's that are backports to older release branches
label
Jan 5, 2024
@erikgrinaker @nvanbenschoten The #117295 backport accidentally did not include the last commit, so I'm sending it in this follow-up. |
erikgrinaker
approved these changes
Jan 5, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport 1/6 commits from #114616. This has been backported in #117295 but the last commit was accidentally omitted, so we backport the last commit here.
/cc @cockroachdb/release
This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.
Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees the files content and metadata is synced, this still does not guarantee that the references to these files are durable. For example, Linux man page for
fsync
1 says:It means that these files can be lost after a system crash of power off. This leads to issues because:
Pebble WAL syncs are not atomic with the sideloaded files syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft will try to load these entries to apply and/or send to other replicas.
The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.
This PR fixes the above issues, by syncing parents of all the directories and files that the sideloaded storage creates.
Part of #114411
Epic: none
Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (
AddSSTable
) that is e.g. used byRESTORE
. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.Release justification: critical bug fix
Footnotes
https://man7.org/linux/man-pages/man2/fsync.2.html ↩