
KAFKA-15572: Race condition between log dir roll and log rename dir #14543

Closed


@ctrlaltluc ctrlaltluc commented Oct 13, 2023

Description

This PR fixes a race condition between:

  • the log rename dir logic, which can be called during alter replica log dirs or during log dir delete
  • the log flush dir logic, which is called to force fsync when new segments are rolled

This PR supersedes a previous fix in #14280. That PR fixed a similar race condition (only between log flush and log delete) by swallowing NoSuchFileException, to avoid the log dir going offline. That was a correct fix for the race condition between log flush and log delete, but it is not enough to fix the race condition between log flush and log rename (after the dir flush fails, we can lose messages from new segments if the broker fails).
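For reference, the #14280 approach looks roughly like the sketch below (illustrative only; names and structure are approximate, not the exact Utils code): the directory fsync swallows NoSuchFileException so that a concurrent rename or delete does not take the whole log dir offline.

    import java.nio.channels.FileChannel
    import java.nio.file.{NoSuchFileException, Path, StandardOpenOption}

    // Illustrative sketch of the #14280 approach; not the exact Kafka Utils code.
    object FlushSketch {
      // Flush the directory metadata to disk, but ignore the case where the directory
      // was concurrently renamed or deleted, so the log dir is not marked offline.
      def flushDirIfExists(dir: Path): Unit = {
        try {
          val channel = FileChannel.open(dir, StandardOpenOption.READ)
          try channel.force(true) finally channel.close()
        } catch {
          case _: NoSuchFileException => // dir disappeared under us; skip the fsync
        }
      }
    }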

Since both log delete and log alter reach the race condition with log flush through the log rename dir path, this PR fixes the race condition for both by synchronizing log flush and log rename dir on the same lock in UnifiedLog. In more detail:

  1. the call to localLog.flush was moved under the synchronized block
  2. the call to Utils.flushDirIfExists was replaced with a call to Utils.flushDir, since swallowing NoSuchFileException is no longer required once the race condition is addressed by 1
  3. Utils.flushDirIfExists is removed, since it is no longer used
  4. the unit test simulating a concurrent dir rename is removed, since that scenario can no longer happen after addressing 1

I decided to lock the entire localLog.flush call instead of separating the segment flush from the dir flush (in LocalLog) and locking only the dir flush (in UnifiedLog, with the segment flush returning a boolean indicating whether new segments were flushed), because that split made the logic lose cohesion without much added benefit.
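Since the fix is to synchronize flush and renameDir on the same lock, the pattern can be illustrated with the minimal standalone sketch below (a hypothetical class, not the actual UnifiedLog/LocalLog code): a rename can no longer slide in between flushing the segments and fsyncing the log dir.

    import java.nio.file.{Files, Path, StandardCopyOption}

    // Hypothetical minimal class illustrating the locking pattern of this PR;
    // not the real UnifiedLog/LocalLog.
    class LogDirHolder(initialDir: Path) {
      private val lock = new Object
      private var dir: Path = initialDir

      // The whole flush runs under the lock, so `dir` cannot be renamed mid-flush.
      def flush(fsync: Path => Unit): Unit = lock synchronized {
        fsync(dir)
      }

      // Rename takes the same lock, so it waits for any in-flight flush to finish.
      def renameDir(target: Path): Unit = lock synchronized {
        Files.move(dir, target, StandardCopyOption.ATOMIC_MOVE)
        dir = target
      }
    }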

For details on the race condition, including code references, please see the description of https://issues.apache.org/jira/browse/KAFKA-15572 and comments.

Testing

This fix was tested by deploying trunk + the patch from this PR to one of our staging clusters and running alter replica log dirs on 1.5 TB of data across 33863 replica log dirs.
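For context, the alter replica log dirs operation that exercises the rename path can be triggered via the admin API roughly as follows (bootstrap server, topic, broker id, and target path are placeholders):

    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.admin.{Admin, AdminClientConfig}
    import org.apache.kafka.common.TopicPartitionReplica

    // Illustrative example: ask broker 1 to move the replica of my-topic-0 to another log dir.
    object AlterReplicaLogDirsExample {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        val admin = Admin.create(props)
        try {
          val assignment = Map(
            new TopicPartitionReplica("my-topic", 0, 1) -> "/mnt/kafka-logs-2"
          ).asJava
          admin.alterReplicaLogDirs(assignment).all().get()
        } finally admin.close()
      }
    }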

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@divijvaidya
Contributor

@ocadaruma, would you like to review this one?

@divijvaidya divijvaidya self-assigned this Oct 13, 2023
@divijvaidya divijvaidya added the storage Pull requests that target the storage module label Oct 13, 2023
@ocadaruma
Contributor

Hmm, yeah, let me review.

@divijvaidya
Copy link
Contributor

The fix in this PR has a serious performance impact, since the partition lock is the bottleneck for single-partition throughput in Kafka; hence, this decision should not be made lightly.

To understand the problem correctly, in terms of concurrency:

1. If renaming happens before flushing, then the flush will fail with file not found (because it still has a reference to the old directory). The renamed directory will not be flushed here, but will eventually be flushed in the next scheduled flush() call.

2. If renaming happens after flushing, then we might have a renamed folder which hasn't been flushed yet. It will be flushed in the next flush() call.

@ctrlaltluc Is your primary concern that the "eventual" flush() of the renamed directory will decrease durability, since messages will be lost if the broker fails?

@ctrlaltluc
Author

ctrlaltluc commented Oct 13, 2023

The fix in this PR has a serious performance impact, since the partition lock is the bottleneck for single-partition throughput in Kafka; hence, this decision should not be made lightly.

To understand the problem correctly, in terms of concurrency:

1. If renaming happens before flushing, then the flush will fail with file not found (because it still has a reference to the old directory). The renamed directory will not be flushed here, but will eventually be flushed in the next scheduled flush() call.

2. If renaming happens after flushing, then we might have a renamed folder which hasn't been flushed yet. It will be flushed in the next flush() call.

@ctrlaltluc Is your primary concern that the "eventual" flush() of the renamed directory will decrease durability, since messages will be lost if the broker fails?

@divijvaidya your understanding is correct.

My primary concern is that, if the directory flush is skipped and the broker fails before the next flush, any new segment is lost. Flushing the directory is required to sync the directory's inode and data, which contain the reference to the new segment's inode. Flushing only the segment would sync the new segment's data and inode, but the directory data would not yet have any reference to it (so the segment would be inaccessible after a crash). This is my understanding from the explanation in db3e5e2.

The edge case described there can still happen if we wait until the next flush.
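To make this concrete, the sketch below (illustrative only, not Kafka code) shows why a newly created file such as a rolled segment needs both its own fsync and an fsync of the parent directory to survive an OS-level crash: without the directory fsync, the directory entry that references the new file's inode may not yet be on disk.

    import java.io.FileOutputStream
    import java.nio.channels.FileChannel
    import java.nio.file.{Path, StandardOpenOption}

    // Illustrative sketch (not Kafka code): durably creating a new file, e.g. a rolled segment.
    object DurableCreateSketch {
      def makeNewFileDurable(newFile: Path): Unit = {
        val out = new FileOutputStream(newFile.toFile)
        try {
          out.write(Array[Byte](1, 2, 3)) // the file's data
          out.getFD.sync()                // fsync the file: its data and inode reach disk
        } finally out.close()

        // fsync the parent directory: persists the directory entry pointing at the new file.
        // Opening a directory for read may fail on some platforms (e.g. Windows), so
        // production code needs to handle that case.
        val dirChannel = FileChannel.open(newFile.getParent, StandardOpenOption.READ)
        try dirChannel.force(true) finally dirChannel.close()
      }
    }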

Later edit:
Regarding the performance concerns, we can separate segment flushes from dir flushes (I tested this option too in our staging clusters). The current LocalLog.flush could become LocalLog.flushSegments, which returns true if new segments were flushed (and does not require the lock); the logic in UnifiedLog.flush can then call LocalLog.flushDir, which only flushes the dir if there were new segments (this requires the lock, but the synchronized section is minimal). Like below:

        val flushedNewSegments = localLog.flushSegments(flushOffset)
        lock synchronized {
          // acquire lock before flushing dir, as dir can be concurrently renamed in renameDir
          if (flushedNewSegments) {
            // we flush log dir for crash consistency, if new segments were flushed
            localLog.flushDir()
          }
          localLog.markFlushed(newRecoveryPoint)
        }

This (i.e. UnifiedLog.flush) is the only current usage of LocalLog.flush, so it is doable.

@ocadaruma
Contributor

ocadaruma commented Oct 13, 2023

@ctrlaltluc
As @divijvaidya pointed out, flushing (i.e. calling fsync) under the UnifiedLog#lock could be a serious performance issue, especially when the disk's latency is high (e.g. when using HDDs or the disk is overloaded); several patches have been proposed regarding this (#13782, #14242).

if the broker fails before the next flush

To be precise, the condition for data loss is "the broker server fails at the OS/hardware (≠ process) level before the change is written to the disk by the OS", which is considered fairly rare and does not cause complete data loss (i.e. data lost from all replicas) if we deploy the Kafka cluster properly (i.e. locate replicas in different failure domains).

Also, even if we flush the directory, unless we flush the segment on every message append (which is not a common practice in Kafka), data loss could still happen on server failure, so relying on replication rather than fsync for data durability is Kafka's design decision, in my understanding (as Jack Vanlightly recently summarized).

Given that, I'm not sure if we should fsync inside the lock at the cost of performance impact.

@ctrlaltluc
Author

ctrlaltluc commented Oct 13, 2023

Given that, I'm not sure if we should fsync inside the lock at the cost of performance impact.

@ocadaruma thanks for your reply!

You are correct, it protects against failure at the OS level, not the Kafka process level. I agree it is rare, and if replication was successful the data loss is not complete. If this is a conscious design decision, that sounds good to me. I was not familiar with Jack Vanlightly's post (very nice explanation!), although I coincidentally read this other post just a few days back. I should subscribe to the RSS feed. Thanks!

In conclusion, I have no issue dropping this PR and closing the ticket as solved by swallowing NoSuchFileException. The most important part of the fix for us is avoiding the whole log dir going offline, which is covered.

If there are no other replies or concerns by Monday, I will close this PR and mark the JIRA ticket as solved by https://issues.apache.org/jira/browse/KAFKA-15391.

@ctrlaltluc
Author

Closing this PR per the discussion above, as the fix was done in #14280 by catching NoSuchFileException.

@ctrlaltluc ctrlaltluc closed this Oct 16, 2023
@ctrlaltluc ctrlaltluc deleted the KAFKA-15572 branch October 16, 2023 08:00