
KAFKA-15572: Race condition between log dir roll and log rename dir #14543

Closed


@ctrlaltluc ctrlaltluc commented Oct 13, 2023

Description

This PR fixes a race condition between:

  • the log rename dir logic, which can be called during alter replica log dirs or during log dir delete
  • the log flush dir logic, which is called to force fsync when new segments are rolled

This PR supersedes a previous fix in #14280. That PR fixed a similar race condition (only between log flush and log delete) by swallowing NoSuchFileException, to avoid the log dir going offline. That was a correct fix for the race condition between log flush and log delete, but it is not enough to fix the race condition between log flush and log rename (after the dir flush fails, we can lose messages from new segments if the broker fails).
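For reference, the #14280 approach looks roughly like the sketch below (illustrative only; names and structure are approximate, not the exact Utils code): the directory fsync swallows NoSuchFileException so that a concurrent rename or delete does not take the whole log dir offline.

    import java.nio.channels.FileChannel
    import java.nio.file.{NoSuchFileException, Path, StandardOpenOption}

    // Illustrative sketch of the #14280 approach; not the exact Kafka Utils code.
    object FlushSketch {
      // Flush the directory metadata to disk, but ignore the case where the directory
      // was concurrently renamed or deleted, so the log dir is not marked offline.
      def flushDirIfExists(dir: Path): Unit = {
        try {
          val channel = FileChannel.open(dir, StandardOpenOption.READ)
          try channel.force(true) finally channel.close()
        } catch {
          case _: NoSuchFileException => // dir disappeared under us; skip the fsync
        }
      }
    }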

Since both log delete and log alter reach the race condition with log flush through the log rename dir path, this PR fixes the race condition for both by synchronizing log flush and log rename dir on the same lock in UnifiedLog. In more detail:

  1. the call to localLog.flush was moved under the synchronized block
  2. the call to Utils.flushDirIfExists was replaced with a call to Utils.flushDir, since swallowing NoSuchFileException is no longer required once the race condition is addressed by 1
  3. Utils.flushDirIfExists is removed, since it is no longer used
  4. the unit test simulating a concurrent dir rename is removed, since that scenario can no longer happen after addressing 1

I decided to lock the entire localLog.flush call instead of separating the segment flush from the dir flush (in LocalLog) and locking only the dir flush (in UnifiedLog, with the segment flush returning a boolean indicating whether new segments were flushed), because that split made the logic lose cohesion without much added benefit.
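Since the fix is to synchronize flush and renameDir on the same lock, the pattern can be illustrated with the minimal standalone sketch below (a hypothetical class, not the actual UnifiedLog/LocalLog code): a rename can no longer slide in between flushing the segments and fsyncing the log dir.

    import java.nio.file.{Files, Path, StandardCopyOption}

    // Hypothetical minimal class illustrating the locking pattern of this PR;
    // not the real UnifiedLog/LocalLog.
    class LogDirHolder(initialDir: Path) {
      private val lock = new Object
      private var dir: Path = initialDir

      // The whole flush runs under the lock, so `dir` cannot be renamed mid-flush.
      def flush(fsync: Path => Unit): Unit = lock synchronized {
        fsync(dir)
      }

      // Rename takes the same lock, so it waits for any in-flight flush to finish.
      def renameDir(target: Path): Unit = lock synchronized {
        Files.move(dir, target, StandardCopyOption.ATOMIC_MOVE)
        dir = target
      }
    }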

For details on the race condition, including code references, please see the description of https://issues.apache.org/jira/browse/KAFKA-15572 and comments.

Testing

This fix was tested by deploying trunk + the patch from this PR to one of our staging clusters and running alter replica log dirs on 1.5 TB of data across 33863 replica log dirs.
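For context, the alter replica log dirs operation that exercises the rename path can be triggered via the admin API roughly as follows (bootstrap server, topic, broker id, and target path are placeholders):

    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.admin.{Admin, AdminClientConfig}
    import org.apache.kafka.common.TopicPartitionReplica

    // Illustrative example: ask broker 1 to move the replica of my-topic-0 to another log dir.
    object AlterReplicaLogDirsExample {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        val admin = Admin.create(props)
        try {
          val assignment = Map(
            new TopicPartitionReplica("my-topic", 0, 1) -> "/mnt/kafka-logs-2"
          ).asJava
          admin.alterReplicaLogDirs(assignment).all().get()
        } finally admin.close()
      }
    }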

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@divijvaidya
Contributor

@ocadaruma, would you like to review this one?

@divijvaidya divijvaidya self-assigned this Oct 13, 2023
@divijvaidya divijvaidya added the storage Pull requests that target the storage module label Oct 13, 2023
@ocadaruma
Contributor

Hmm, yeah, let me review.

@divijvaidya
Copy link
Contributor

The fix in this PR has a serious performance impact, since the partition lock is the bottleneck for single-partition throughput in Kafka; hence, this decision should not be made lightly.

To understand the problem correctly, in terms of concurrency:

1. If renaming happens before flushing, then the flush will fail with file not found (because it still has a reference to the old directory). The renamed directory will not be flushed here, but will eventually be flushed in the next scheduled flush() call.

2. If renaming happens after flushing, then we might have a renamed folder which hasn't been flushed yet. It will be flushed in the next flush() call.

@ctrlaltluc Is your primary concern that the "eventual" flush() of the renamed directory will decrease durability, since messages will be lost if the broker fails?

@ctrlaltluc
Author

ctrlaltluc commented Oct 13, 2023

The fix in this PR has a serious performance impact, since the partition lock is the bottleneck for single-partition throughput in Kafka; hence, this decision should not be made lightly.

To understand the problem correctly, in terms of concurrency:

1. If renaming happens before flushing, then the flush will fail with file not found (because it still has a reference to the old directory). The renamed directory will not be flushed here, but will eventually be flushed in the next scheduled flush() call.

2. If renaming happens after flushing, then we might have a renamed folder which hasn't been flushed yet. It will be flushed in the next flush() call.

@ctrlaltluc Is your primary concern that the "eventual" flush() of the renamed directory will decrease durability, since messages will be lost if the broker fails?

@divijvaidya your understanding is correct.

My primary concern is that, if the directory flush is skipped and the broker fails before the next flush, any new segment is lost. Flushing the directory is required to sync the directory's inode and data, which contain the reference to the new segment's inode. Flushing only the segment would sync the new segment's data and inode, but the directory data would not yet have any reference to it (so the segment would be inaccessible after a crash). This is my understanding from the explanation in db3e5e2.

The edge case described there can still happen if we wait until the next flush.
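To make this concrete, the sketch below (illustrative only, not Kafka code) shows why a newly created file such as a rolled segment needs both its own fsync and an fsync of the parent directory to survive an OS-level crash: without the directory fsync, the directory entry that references the new file's inode may not yet be on disk.

    import java.io.FileOutputStream
    import java.nio.channels.FileChannel
    import java.nio.file.{Path, StandardOpenOption}

    // Illustrative sketch (not Kafka code): durably creating a new file, e.g. a rolled segment.
    object DurableCreateSketch {
      def makeNewFileDurable(newFile: Path): Unit = {
        val out = new FileOutputStream(newFile.toFile)
        try {
          out.write(Array[Byte](1, 2, 3)) // the file's data
          out.getFD.sync()                // fsync the file: its data and inode reach disk
        } finally out.close()

        // fsync the parent directory: persists the directory entry pointing at the new file.
        // Opening a directory for read may fail on some platforms (e.g. Windows), so
        // production code needs to handle that case.
        val dirChannel = FileChannel.open(newFile.getParent, StandardOpenOption.READ)
        try dirChannel.force(true) finally dirChannel.close()
      }
    }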

Later edit:
Regarding the performance concerns, we can separate segment flushes from dir flushes (I tested this option too in our staging clusters). The current LocalLog.flush could become LocalLog.flushSegments, which returns true if new segments were flushed (and does not require the lock); the logic in UnifiedLog.flush can then call LocalLog.flushDir, which only flushes the dir if there were new segments (this requires the lock, but the synchronized section is minimal). Like below:

        val flushedNewSegments = localLog.flushSegments(flushOffset)
        lock synchronized {
          // acquire lock before flushing dir, as dir can be concurrently renamed in renameDir
          if (flushedNewSegments) {
            // we flush log dir for crash consistency, if new segments were flushed
            localLog.flushDir()
          }
          localLog.markFlushed(newRecoveryPoint)
        }

This (i.e. UnifiedLog.flush) is the only current usage of LocalLog.flush, so it is doable.

@ocadaruma
Contributor

ocadaruma commented Oct 13, 2023

@ctrlaltluc
As @divijvaidya pointed out, flushing (i.e. calling fsync) under the UnifiedLog#lock could be a serious performance issue, especially when the disk's latency is high (e.g. when using HDDs or the disk is overloaded); several patches have been proposed regarding this (#13782, #14242).

if the broker fails before the next flush

To be precise, the condition for data loss is "the broker server fails at the OS/hardware (≠ process) level before the change is written to the disk by the OS", which is considered fairly rare and does not cause complete data loss (i.e. data lost from all replicas) if we deploy the Kafka cluster properly (i.e. locate replicas in different failure domains).

Also, even if we flush the directory, unless we flush the segment on every message append (which is not a common practice in Kafka), data loss could still happen on server failure, so relying on replication rather than fsync for data durability is Kafka's design decision, in my understanding (as Jack Vanlightly recently summarized).

Given that, I'm not sure if we should fsync inside the lock at the cost of performance impact.

@ctrlaltluc
Author

ctrlaltluc commented Oct 13, 2023

Given that, I'm not sure if we should fsync inside the lock at the cost of performance impact.

@ocadaruma thanks for your reply!

You are correct, it protects against failure at the OS level, not the Kafka process level. I agree it is rare, and if replication was successful the data loss is not complete. If this is a conscious design decision, that sounds good to me. I was not familiar with Jack Vanlightly's post (very nice explanation!), although I coincidentally read this other post just a few days back. I should subscribe to the RSS feed. Thanks!

In conclusion, I have no issue dropping this PR and closing the ticket as solved by swallowing NoSuchFileException. The most important part of the fix for us is avoiding the whole log dir going offline, which is covered.

If there are no other replies or concerns by Monday, I will close this PR and mark the JIRA ticket as solved by https://issues.apache.org/jira/browse/KAFKA-15391.

@ctrlaltluc
Author

Closing this PR per the discussion above, as the fix was done in #14280 by catching NoSuchFileException.

@ctrlaltluc ctrlaltluc closed this Oct 16, 2023
@ctrlaltluc ctrlaltluc deleted the KAFKA-15572 branch October 16, 2023 08:00