Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: logstore: sync sideloaded storage directories #117295

Merged
merged 5 commits into from
Jan 5, 2024

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Jan 3, 2024

Backport 5/6 commits from #114616.

/cc @cockroachdb/release


This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.

Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees the files content and metadata is synced, this still does not guarantee that the references to these files are durable. For example, Linux man page for fsync 1 says:

Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.

It means that these files can be lost after a system crash of power off. This leads to issues because:

  1. Pebble WAL syncs are not atomic with the sideloaded files syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft will try to load these entries to apply and/or send to other replicas.

  2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.

This PR fixes the above issues, by syncing parents of all the directories and files that the sideloaded storage creates.

Part of #114411

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable) that is e.g. used by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.

Release justification: critical bug fix

Footnotes

  1. https://man7.org/linux/man-pages/man2/fsync.2.html

pav-kv added 4 commits January 3, 2024 20:51
This commit removes the dirCreated field of DiskSideloadStorage, because
it is only used in tests, and is reduntant (directory existence check
already does the job).

Epic: none
Release note: none
A couple of things to address in the future: sideloaded files removal
should happen strictly after a state machine sync; sideloaded files and
directories should be cleaned up on startup because their removal is not
always durable.

Epic: none
Release note: none
This commit adds `TestSideloadStorageSync` which demonstrates that the
sideloaded log storage can lose files and directories upon system crash.
This is due to the fact that the whole directory hierarchy is not
properly synced when the directories and files are created.

A typical sideloaded storage file (entry 123 at term 44 for range r1234)
looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`.

Only existence of auxiliary directory is persisted upon its creation, by
syncing the <data-dir> when Pebble initializes the store. All other
directories (sideloading, r1XXX, r1234) are not persisted upon creation.

Epic: none
Release note: none
@pav-kv pav-kv requested a review from a team January 3, 2024 21:22
@pav-kv pav-kv requested a review from a team as a code owner January 3, 2024 21:22
Copy link

blathers-crl bot commented Jan 3, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL and one additional
    TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Jan 3, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv pav-kv force-pushed the backport23.1-114616 branch from db2de03 to 68e7642 Compare January 3, 2024 21:29
@pav-kv pav-kv requested a review from erikgrinaker January 3, 2024 21:30
@pav-kv
Copy link
Collaborator Author

pav-kv commented Jan 3, 2024

@erikgrinaker Note that this had to be modified to be backported. Most of the diff is test-only.

The only non-test change is in the mkdirAllAndSyncParents implementation (last commit). In <= 23.1 the storage engine wasn't implementing vfs.FS and didn't have path helpers, so I swapped in filepath.Dir instead of vfs.FS.PathDir(). This should be ok given that the rest of the code in sideload_disk.go uses other filepath helpers. Tests pass.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but needs TL approval and a post in #db-backports-point-releases too.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/logstore/sideload_disk.go line 272 at r5 (raw file):

// mkdirAllAndSyncParents creates the given directory and all its missing
// parents if any. For every newly created directly, it syncs the corresponding

s/directly/directory/

The sideloaded log storage does not sync the hierarchy of directories it
creates. This can potentially lead to full or partial loss of its
sub-directories in case of a system crash or power off.

After this commit, every time sideloaded storage creates a directory, it
syncs its parent so that the reference is durable.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the backport23.1-114616 branch from d87cf8b to e75bd74 Compare January 4, 2024 22:24
@pav-kv pav-kv merged commit 3983bce into cockroachdb:release-23.1 Jan 5, 2024
6 checks passed
@pav-kv pav-kv deleted the backport23.1-114616 branch January 5, 2024 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants