KAFKA-15572: Race condition between log dir roll and log rename dir #14543
Conversation
@ocadaruma, would you like to review this one?
Hmm, yeah let me review
The fix in this PR has a serious performance impact, since the partition lock is the bottleneck for single-partition throughput in Kafka; hence, this decision is not to be made lightly. To understand the problem correctly, in terms of concurrency: 1) if renaming happens before flushing, then the flush will fail with file not found (because it holds a reference to the old directory). The renamed directory will not be flushed here, but will eventually be flushed in the next scheduled flush() call. 2) If renaming happens after flushing, then we might have a renamed folder which hasn't been flushed yet. It will be flushed in the next flush() call. @ctrlaltluc Is your primary concern that the "eventual" flush() of the renamed directory will decrease durability, since messages will be lost if the broker fails?
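For illustration (not from the PR), here is a minimal Scala sketch of interleaving 1 above, with made-up paths: once the rename has happened, a flush that still holds the old directory path fails with NoSuchFileException, and the renamed directory is only fsynced on the next flush().

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}

// Toy repro of interleaving 1: the flushing thread still holds the old directory
// path, the rename wins the race, and the directory fsync fails with
// NoSuchFileException. Paths are illustrative, not Kafka code.
val oldDir = Files.createTempDirectory("topic-0")
val newDir = oldDir.resolveSibling(oldDir.getFileName.toString + ".renamed")

Files.move(oldDir, newDir)                                    // renameDir wins the race
FileChannel.open(oldDir, StandardOpenOption.READ).force(true) // flush loses: throws NoSuchFileException
```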
@divijvaidya your understanding is correct. My primary concern is that, if the directory flush is ignored and the broker fails before the next flush, any new segment is lost. Flushing the directory is required for synchronizing the directory inode and data, which contain the reference to the new segment's inode. Flushing only the segment would sync just the new segment's data and inode, but the directory data would not have any reference to it (thus it would be inaccessible). This is my understanding (which sounds correct to me) from the explanation in db3e5e2. The edge case described there can still happen if we wait until the next flush. LE:
This (i.e.
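As a side note on the mechanism described above, here is a minimal, hypothetical sketch (plain NIO, not Kafka code) of why the directory itself needs an fsync: the directory holds the entry pointing at the new segment's inode, so flushing only the segment file does not make the file reachable after a crash.

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

// Hypothetical helper mirroring a "flush the directory" step: opening the directory
// and forcing it persists its entries (e.g. the name -> inode mapping of a new segment).
def fsync(path: Path, isDir: Boolean): Unit = {
  val opts = if (isDir) StandardOpenOption.READ else StandardOpenOption.WRITE
  val ch = FileChannel.open(path, opts)
  try ch.force(true) finally ch.close()
}

val dir     = Files.createTempDirectory("kafka-logs-sketch")
val segment = Files.createFile(dir.resolve("00000000000000000042.log"))

fsync(segment, isDir = false) // segment data + inode are durable
fsync(dir, isDir = true)      // directory entry referencing the segment is durable
```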
@ctrlaltluc
To be precise, the condition for data loss is "the broker fails at the OS/hardware (≠ process) level before the change is written to the disk by the OS", which is considered to be fairly rare and doesn't cause complete data loss (i.e. data lost from all replicas) if we deploy the Kafka cluster properly (i.e. locate replicas in different failure domains). Also, even if we flush the directory, unless we flush the segment on every message append (which is not common practice in Kafka), data loss could still happen on server failure, so relying on replication rather than fsync for data durability is Kafka's design decision, in my understanding (as Jack Vanlightly recently summarized). Given that, I'm not sure if we should fsync inside the lock at the cost of the performance impact.
@ocadaruma thanks for your reply! You are correct, it is about failure at the OS level, not the Kafka process level. I agree it is rare, and if replication was successful, the data loss is not complete. If this is a conscious design decision, sounds good to me. I was not familiar with Jack Vanlightly's post (very nice explanation!), although I coincidentally read this other post just a few days back. I should subscribe to the RSS. Thanks! Concluding, I have no issue dropping the PR and closing the ticket as being solved by swallowing NoSuchFileException. If there are no other replies or concerns until Monday, I will close this PR and link the JIRA ticket as being solved by https://issues.apache.org/jira/browse/KAFKA-15391.
Closing this PR as per the discussion above, as the fix was done in #14280 by catching NoSuchFileException.
Description
This PR fixes a race condition between log flush (`localLog.flush`) and log rename dir (used by both log delete and log alter replica log dirs).

This PR overwrites a previous fix in #14280. That PR fixed a similar race condition (only between log flush and log delete) by swallowing `NoSuchFileException`, to avoid the log dir becoming offline. That was a correct fix for the race condition between log flush and log delete, but it is not enough to fix the race condition between log flush and log rename (after the dir flush fails, we can lose messages from new segments if the broker fails).

Since both log delete and log alter reached the race condition with log flush through log rename dir, this PR fixes the race condition for both by synchronizing log flush and log rename dir on the same lock in `UnifiedLog`. In more detail:

1. `localLog.flush` was moved under the synchronized block.
2. The call to `Utils.flushDirIfExists` was replaced with a call to `Utils.flushDir`, since swallowing `NoSuchFileException` is no longer required once the race condition is addressed by 1.
3. `Utils.flushDirIfExists` is removed, since it is no longer used.

I decided to lock the entire `localLog.flush` call instead of separating the segment flush part from the dir flush part (in `LocalLog`) and locking only the dir flush (in `UnifiedLog`, if the call to segment flush returned a boolean indicating new segments were flushed), as the logic was losing cohesion without much added benefit.

For details on the race condition, including code references, please see the description of https://issues.apache.org/jira/browse/KAFKA-15572 and its comments.
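For illustration only (hypothetical class and names, not the actual `UnifiedLog` code), the locking shape described above looks roughly like this: flush and renameDir synchronize on the same lock, so a rename can no longer slip in between flushing the segments and fsyncing the (now stale) directory path.

```scala
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}

// Hypothetical sketch of the locking shape: both operations take the same lock.
class LogLike(initialDir: File) {
  @volatile private var dir: File = initialDir
  private val lock = new Object

  def flush(): Unit = lock.synchronized {
    // ... flush segment data here ...
    val ch = FileChannel.open(dir.toPath, StandardOpenOption.READ)
    try ch.force(true) finally ch.close() // fsync the directory; `dir` cannot change concurrently
  }

  def renameDir(newDir: File): Unit = lock.synchronized {
    Files.move(dir.toPath, newDir.toPath)
    dir = newDir
  }
}
```

This is also the shape that raised the throughput concern in the conversation above, since segment flushes now happen while holding the lock.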
Testing
This fix was tested by deploying trunk + patch of this PR to one of our staging clusters and running alter replica log dir on 1.5TB of data across 33863 replica log dirs.