release-23.1: logstore: sync sideloaded storage directories #117383

Merged 1 commit on Jan 5, 2024

pav-kv (Collaborator) commented Jan 5, 2024

Backport 1/6 commits from #114616. This has been backported in #117295 but the last commit was accidentally omitted, so we backport the last commit here.

/cc @cockroachdb/release


This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.

Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the files' content and metadata are synced, it does not guarantee that the directory entries referencing these files are durable. For example, the Linux man page for fsync [^1] says:

> Calling fsync() does not necessarily ensure that the entry in the
> directory containing the file has also reached disk.  For that an
> explicit fsync() on a file descriptor for the directory is also
> needed.

It means that these files can be lost after a system crash or power-off. This leads to issues because:

  1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file itself is not yet durable. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.

  2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.

This PR fixes the above issues by syncing the parents of all the directories and files that the sideloaded storage creates.

Part of #114411

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable) that is e.g. used by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.

Release justification: critical bug fix

Footnotes

  1. https://man7.org/linux/man-pages/man2/fsync.2.html

The sideloaded storage fsyncs the files that it creates. Even though
this guarantees durability of the files' content and metadata, it still
does not guarantee that the references to these files are durable. For
example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.
```

It means that these files can be lost after a system crash or power-off.
This leads to issues:

1. The storage syncs are not atomic with the sideloaded file syncs. It
   is thus possible that raft log metadata "references" a sideloaded
   file and gets synced, but the file itself is not yet durable. A power
   off/on at this point leads to an internal inconsistency, and can
   result in a crash loop when raft tries to load these entries to apply
   and/or send to other replicas.

2. The durability of entry files is used as a pre-condition to sending
   raft messages that trigger committing these entries. A coordinated
   power off/on on a majority of replicas can thus lead to losing
   committed entries and unrecoverable loss-of-quorum.

This commit fixes the above issues, by syncing the parent directory
after writing sideloaded entry files. The natural point for this is
MaybeSideloadEntries on the handleRaftReady path.

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log
storage, caused by incorrect syncing of filesystem metadata. It was
possible to lose writes of a particular kind (AddSSTable) that is e.g.
used by RESTORE. This loss was possible only under power-off or OS crash
conditions. As a result, CRDB could enter a crash loop on restart. In
the worst case of a coordinated power-off/crash across multiple nodes
this could lead to an unrecoverable loss of quorum.

pav-kv requested review from erikgrinaker, nvanbenschoten and a team January 5, 2024 13:22

blathers-crl bot commented Jan 5, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

blathers-crl bot added the backport label Jan 5, 2024

cockroach-teamcity (Member) commented

This change is Reviewable

pav-kv (Collaborator, Author) commented Jan 5, 2024

@erikgrinaker @nvanbenschoten The #117295 backport accidentally did not include the last commit, so I'm sending it in this follow-up.

pav-kv merged commit 9b148cb into cockroachdb:release-23.1 Jan 5, 2024
6 checks passed
pav-kv deleted the backport23.1-114616 branch January 5, 2024 17:20