release-23.2: logstore: sync sideloaded storage directories #115709
Merged
This commit removes the dirCreated field of DiskSideloadStorage, because it is only used in tests and is redundant (the directory existence check already does the job). Epic: none Release note: none
Epic: none Release note: none
A couple of things to address in the future: sideloaded files removal should happen strictly after a state machine sync; sideloaded files and directories should be cleaned up on startup because their removal is not always durable. Epic: none Release note: none
The sideloaded storage fsyncs the files that it creates. Even though this guarantees durability of the files' content and metadata, it does not guarantee that the references to these files are durable. For example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk. For that an
explicit fsync() on a file descriptor for the directory is also needed.
```

It means that these files can be lost after a system crash or power off. This leads to issues:

1. The storage syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file itself is not yet synced. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.

2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.

This commit fixes the above issues by syncing the parent directory after writing sideloaded entry files. The natural point for this is MaybeSideloadEntries on the handleRaftReady path.

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable) that is e.g. used by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a coordinated power-off/crash across multiple nodes this could lead to an unrecoverable loss of quorum.
This commit adds `TestSideloadStorageSync` which demonstrates that the sideloaded log storage can lose files and directories upon system crash. This is because the directory hierarchy is not fully synced when the directories and files are created. A typical sideloaded storage file (entry 123 at term 44 for range r1234) looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`. Only the existence of the `auxiliary` directory is persisted upon its creation, by syncing the `<data-dir>` when Pebble initializes the store. All other directories (`sideloading`, `r1XXX`, `r1234`) are not persisted upon creation. Epic: none Release note: none
The sideloaded log storage does not sync the hierarchy of directories it creates. This can potentially lead to full or partial loss of its sub-directories in case of a system crash or power off. After this commit, every time sideloaded storage creates a directory, it syncs its parent so that the reference is durable. Epic: none Release note: none
blathers-crl bot force-pushed the blathers/backport-release-23.2-114616 branch from 1ad14d0 to f8af9ff (December 6, 2023 17:42)
blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Dec 6, 2023
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
blathers-crl bot added the backport label (Label PR's that are backports to older release branches) on Dec 6, 2023
blathers-crl bot requested review from erikgrinaker, nvanbenschoten, pav-kv, RaduBerinde and a team on December 6, 2023 17:42
jbowens approved these changes on Dec 6, 2023
erikgrinaker approved these changes on Dec 7, 2023
Note that this was merged after the 23.2.0-rc branch was cut.
Created #115841 for the RC too.
Backport 6/6 commits from #114616 on behalf of @pavelkalinnikov.
/cc @cockroachdb/release
This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.
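Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the files' content and metadata are synced, it does not guarantee that the references to these files are durable. For example, the Linux man page for `fsync` [^1] says:

> Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

It means that these files can be lost after a system crash or power off. This leads to issues because: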
1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file itself is not yet synced. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.
2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.
This PR fixes the above issues, by syncing parents of all the directories and files that the sideloaded storage creates.
Part of #114411
Epic: none
Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (`AddSSTable`) that is e.g. used by `RESTORE`. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.

Release justification: critical bug fix
[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html