
KAFKA-15046: Get rid of unnecessary fsyncs inside UnifiedLog.lock to stabilize performance #14242

Merged
junrao merged 7 commits into apache:trunk from kafka-15046 on Nov 29, 2023

Conversation

@ocadaruma (Contributor) commented on Aug 18, 2023

JIRA ticket: https://issues.apache.org/jira/browse/KAFKA-15046

  • Any blocking operation performed while holding UnifiedLog.lock can lead to serious performance (even availability) issues, yet there are currently several paths that call fsync(2) inside the lock
    • While the lock is held, all subsequent produce requests against the partition may block
    • On bad disk performance this easily makes all request handlers busy
    • Even worse, when a disk experiences tens of seconds of glitches (not rare on spinning drives), the broker becomes unable to process any requests without being fenced from the cluster (i.e. a "zombie"-like state)
  • This PR removes 4 cases of essentially unnecessary fsync(2) calls performed under the lock:
    • (1) ProducerStateManager.takeSnapshot at UnifiedLog.roll
      • The fsync(2) call is moved to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point); a sketch of this ordering follows this list
      • Since the snapshot is still guaranteed to be flushed before the recovery point is incremented, this change shouldn't cause any problem
    • (2) ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log segment deletion
      • This method internally calls Utils.atomicMoveWithFallback with needFlushParentDir = true, which calls fsync
      • It now calls Utils.atomicMoveWithFallback with needFlushParentDir = false, which is consistent with index file deletion (index file deletion doesn't flush the parent dir either)
      • This change shouldn't cause problems either
    • (3) LeaderEpochFileCache.truncateFromStart when incrementing the log start offset
      • This path is called from deleteRecords on request-handler threads
      • We don't actually need fsync(2) here either
      • On unclean shutdown a few leader epochs might remain in the file, but they will be handled by LogLoader on start-up, so this is not a problem
    • (4) LeaderEpochFileCache.truncateFromEnd as part of log truncation
      • Likewise, we don't need fsync(2) here, since any epochs left untruncated after an unclean shutdown will be handled during the log loading procedure
  • Please refer to the JIRA ticket for further details and the performance experiment results

To check that these changes don't cause problems, the consistency expectation table below is helpful:

| No | File | Consistency expectation | Description | Note |
|----|------|-------------------------|-------------|------|
| 1 | ProducerStateSnapshot | Snapshot files on disk before the recovery point should be consistent with the log segments | - On restart after an unclean shutdown, Kafka skips the snapshot recovery procedure before the recovery point.<br>- If the snapshot content before the recovery point is not consistent with the log, it causes problems such as idempotency violations due to missing producer state. | Inconsistency after the recovery point is acceptable, because it will be recovered to a consistent state during the log loading procedure. |
| 2 | ProducerStateSnapshot | Deleted snapshot files on disk should be eventually consistent with the log segments | - On log segment deletion for any reason (e.g. retention, topic deletion), the corresponding snapshot files are deleted.<br>- Even if the broker crashes due to a power failure before the files are deleted from the actual disk, they should eventually be deleted from the disk. | |
| 3 | LeaderEpochCheckpoint | Every leader epoch entry (i.e. epoch and start offset) in the log segments must also exist in the leader-epoch checkpoint file on disk | - If some epoch entries are missing from the checkpoint file, Kafka may restore a stale leader epoch cache on restart after a power failure.<br>- A stale cache returns wrong entries when the leader epoch cache is read (e.g. in list-offsets request handling). | Surplus entries (prefixes or suffixes) in the on-disk checkpoint file are acceptable, because even on restart after a power failure they will be truncated during the log loading procedure. |

We can confirm that the changes are valid based on the above table as follows:

  • Change (1): fsync for ProducerStateManager.takeSnapshot is moved to the async scheduler thread
    • This preserves consistency expectation No. 1, since the recovery point is still incremented only after the fsync has been performed.
  • Change (2): ProducerStateManager.removeAndMarkSnapshotForDeletion no longer flushes the parent dir
    • This preserves consistency expectation No. 2, since the snapshot files will eventually be deleted even after a power failure.
      • Scenario A: snapshot files are renamed to -deleted for segment deletion by log retention, but the broker crashes due to power failure before the rename is persisted to disk.
        • In this case, some snapshot files' names would revert to having the -deleted suffix stripped, and ProducerStateManager would load these snapshot files unnecessarily.
        • However, since the producer state is truncated based on the log segments during the log loading procedure, this isn't a problem.
      • Scenario B: snapshot files are renamed to -deleted for topic deletion, but the broker crashes due to power failure before the rename is persisted to disk.
        • Again, the snapshot files' names would revert to having the -deleted suffix stripped.
        • However, on topic deletion the parent log dir has already been renamed to -delete and fsynced, so the reverted snapshot file names aren't a problem: the parent log dirs will be deleted anyway once the topic deletion procedure resumes.
  • Change (3): LeaderEpochFileCache.truncateFromStart doesn't call fsync
    • This preserves consistency expectation No. 3, since we still call fsync in LeaderEpochFileCache.assign (see the checkpoint-write sketch after this list).
  • Change (4): LeaderEpochFileCache.truncateFromEnd doesn't call fsync
    • Same explanation as Change (3)
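
As a rough illustration of what skipping fsync means for the leader-epoch checkpoint in changes (3) and (4), here is a minimal Java sketch. It imitates the write-to-temp-then-swap pattern visible in the CheckpointFile snippet discussed later in this thread, but the class name, entry format, and fallback logic are simplified assumptions, not Kafka's actual implementation.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical, simplified checkpoint writer showing what skipping fsync means:
// entries are always written and swapped into place (readers see the new content
// via the page cache), but the fsync is only issued when sync == true.
public class SyncOptionalCheckpointSketch {
    public static void write(Path checkpoint, List<String> entries, boolean sync) throws IOException {
        Path temp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(temp.toFile());
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String entry : entries) {
                writer.write(entry);
                writer.newLine();
            }
            writer.flush();
            if (sync) {
                out.getFD().sync(); // durable even across power failure
            }
            // With sync == false the content still reaches the file (page cache),
            // but may be rolled back by a power failure; log loading reconciles that.
        }
        try {
            Files.move(temp, checkpoint, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            Files.move(temp, checkpoint, StandardCopyOption.REPLACE_EXISTING); // non-atomic fallback
        }
    }
}
```

The point is that with sync == false the new content still reaches the file and is visible to readers via the page cache; only durability across a power failure is deferred.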

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ocadaruma (Contributor, Author): Could you take a look? @showuon

@divijvaidya divijvaidya added the storage Pull requests that target the storage module label Aug 21, 2023
@ijuma ijuma requested review from junrao and lbradstreet August 28, 2023 13:23
@showuon showuon self-requested a review August 29, 2023 08:50
@showuon (Contributor) commented on Sep 4, 2023

@ocadaruma, thanks for the improvement! Some high-level questions:

  1. Although you've added comments in the JIRA, it would be better to also add your analysis and what/why you've changed to the PR description.
  2. Moving the fsync call to the scheduler thread in the takeSnapshot case makes sense to me, since we have all the info in the in-memory cache, and log recovery can rebuild the snapshot after an unclean shutdown.
  3. For removeAndMarkSnapshotForDeletion, I didn't see this fix; could you explain it?
  4. For LeaderEpochFileCache#truncateXXX, I agree that as long as the in-memory cache is up to date, it should be fine. We can always recover from the logs after an unclean shutdown.
  5. Nit: could we make a smaller code change in the PR for easier review? That is, we could create an overloaded method that takes one more parameter (boolean sync) and delegate the original method implementation to the new method (defaulting to true). Then the only places we need to change are the ones where we want to disable the sync flush, which would make the PR much clearer IMO.

Thank you.

@ocadaruma (Contributor, Author) commented on Sep 4, 2023

@showuon
Thank you for your review.

  1. Got it. I revised the PR description to include the analysis.
  2. (Re: removeAndMarkSnapshotForDeletion) The actual change is in SnapshotFile.java#renameTo, which is called from removeAndMarkSnapshotForDeletion.

> We can always recover from the logs after an unclean shutdown.

Yes. To be precise, though, removing the fsync on LeaderEpochFileCache's truncation doesn't cause extra recovery even on unclean shutdown, IMO. The reasons:

  • Since we still fsync in LeaderEpochFileCache#assign, we can still ensure that all necessary leader epochs are in the leader-epoch cache file.
  • Even when a truncation is not flushed (so "should-be-truncated" epochs may be left in the epoch file after an unclean shutdown), the log-loading procedure will truncate the epoch file as necessary (based on the log start/end offsets). That is a fairly lightweight operation compared to recovering from the log.

  (Re: the overloaded method) I intentionally didn't create an overloaded method because I was a bit worried that the default (fsync: true) method might be used casually in future code changes, even in places where fsync isn't necessary.

@divijvaidya (Contributor):

Hey @ocadaruma,
This is an important change. Despite my low engagement on it so far, I do believe this change is critical, so let's think about it carefully.

I would like to discuss the changes individually.

Let's start with (1)

The side effect of moving the producer snapshot flush to an async thread is this: earlier, it was guaranteed that the producer snapshot was present and consistent by the time the segment (and friends) were flushed. If for some reason the producer snapshot fsync failed, we would not have scheduled a flush for the segment and friends. But now, since we are flushing the snapshot asynchronously and quietly, it is possible to have the segment and indexes on disk without a producer snapshot.

This is OK for a server restart, because on restart we will rebuild the snapshot by scanning the last few segments.
This is OK even if the server doesn't restart, because we will be using the in-memory producer state and may take another snapshot associated with the next segment later.

To summarize: Kafka does not expect the producer snapshot on disk to be strongly consistent with the rest of the files, such as the log segments and transaction index. @jolshan (as an expert on the trx index and producer snapshot), do you agree with this statement?

If we agree on this, then (1) is a safe change IMO.

@divijvaidya divijvaidya self-assigned this Oct 25, 2023
@ocadaruma (Contributor, Author) commented on Oct 25, 2023

@divijvaidya Thank you for your review.

> But now, since we are flushing the snapshot asynchronously and quietly

While reading your comment, I realized that "quietly" could be a problem, so we might need to change the producer-state flush to throw IOException on failure. ("async" is still not a problem, though.)

If we ignore a producer-state-flush failure here, the recovery point might be incremented even with a stale on-disk producer state snapshot. So, after a restart following a power failure, the broker might restore stale producer state without rebuilding it (since the recovery point was incremented), which could cause idempotency issues.

I'll update the PR after Justine's comment.

@divijvaidya (Contributor):

> If we ignore a producer-state-flush failure here, the recovery point might be incremented even with a stale on-disk producer state snapshot. So, after a restart following a power failure, the broker might restore stale producer state without rebuilding it (since the recovery point was incremented), which could cause idempotency issues.

Great point. May I suggest that we document the consistency expectations between the producer snapshot and the segment data on disk? From what you mentioned, it sounds like: "Kafka expects the producer snapshot to be strongly consistent with the segment data on disk before the recovery checkpoint but not after it. The inconsistency after the checkpoint is acceptable because ..."

We can then verify these expectations with experts such as Justine and Jun, and based on that make a decision about quietly vs. async, etc. The documentation will also help future contributors reason about the code base. Initially you can put the documentation in the description of this PR itself; later we can find a home for it in the Kafka website docs.

We need to do the same exercise for the other files you are changing in this PR.

@junrao (Contributor) left a comment:

@ocadaruma : Thanks for the PR. Left a few comments.

@divijvaidya : Yes, your understanding on (1) is correct.

updateHighWatermarkWithLogEndOffset()
// Schedule an asynchronous flush of the old segment
scheduler.scheduleOnce("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
scheduler.scheduleOnce("flush-log", () => {
maybeSnapshot.ifPresent(f => Utils.flushFileQuietly(f.toPath, "producer-snapshot"))
Contributor:
If we fail to flush the snapshot, it seems that we should propagate the IOException to logDirFailureChannel like in flushUptoOffsetExclusive. Otherwise, we could be skipping the recovery of producer state when we should.

Contributor (author):

Yeah, I also noticed that. I'll fix
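
A rough sketch of the propagation pattern discussed in this thread: the async flush failure is reported to a failure channel instead of being swallowed, so the recovery point is never advanced past an unflushed snapshot. The DirFailureChannel interface below is a hypothetical stand-in for Kafka's log-dir failure handling (LogDirFailureChannel); its exact API is not reproduced here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for Kafka's log-dir failure handling (LogDirFailureChannel);
// the real API is not reproduced here.
interface DirFailureChannel {
    void markOffline(String logDir, String message, IOException cause);
}

public class PropagateFlushFailureSketch {
    // Flush the snapshot on the async thread, but do NOT swallow a failure:
    // report it so the recovery point is never advanced past an unflushed snapshot.
    public static void flushSnapshot(Path snapshot, String logDir, DirFailureChannel channel) {
        try (FileChannel ch = FileChannel.open(snapshot, StandardOpenOption.WRITE)) {
            ch.force(true);
        } catch (IOException e) {
            channel.markOffline(logDir, "Failed to flush producer state snapshot " + snapshot, e);
            throw new UncheckedIOException(e);
        }
    }
}
```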

// then causing ISR shrink or high produce response time degradation in remote scope on high fsync latency.
// - Even when stale epochs remained in LeaderEpoch file due to the unclean shutdown, it will be handled by
// another truncateFromEnd call on log loading procedure so it won't be a problem
flush(false);
Contributor:

It's kind of weird to call flush with sync = false since the only thing that flush does is to sync. Could we just avoid calling flush?

Contributor (author):

flush(false) still writes the epoch entries to the file (just without fsync).

If we don't call flush here, some entries will remain in the file even after the log truncation.

I'm guessing this realistically wouldn't be a problem, at least in the current implementation (since after a truncation on the follower, LeaderEpochFileCache#assign is called on log append, which flushes the in-memory epoch entries to the file anyway), but we should still write to the file here IMO.

Contributor:

If we don't call flush here, some entries will remain in the file even after the log truncation.

That's true. But if we write the new content to the file without flushing, it seems that those old entries could still exist in the file?

Contributor (author):

> those old entries could still exist in the file

Yes. To be precise, the content on the device (not the file) could still be old. As long as we read the file in the usual way (i.e. not through O_DIRECT), we see the latest data.

Staleness on the device arises only when the server experiences a power failure before the OS flushes the page cache.
In that case the content could indeed be rolled back to the old state.

But it won't be a problem, because the leader-epoch file will be truncated again to match the log file during the loading procedure anyway (this is the case mentioned in (3) in the PR description).
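
A minimal, self-contained Java illustration of the point above, using a throwaway temp file: content written without fsync is immediately visible to readers through the page cache, and only an explicit fsync (e.g. FileChannel.force) makes it durable across a power failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Demonstrates the point above: a write that has not been fsynced is immediately
// visible to readers (served from the page cache), but only an fsync makes it
// survive a power failure. The file here is a throwaway temp file.
public class PageCacheVisibilitySketch {
    public static void main(String[] args) throws IOException {
        Path epochFile = Files.createTempFile("leader-epoch-checkpoint", ".tmp");

        // Write without any fsync: the data reaches the page cache only.
        Files.writeString(epochFile, "0 0\n1 42\n", StandardCharsets.UTF_8);

        // A normal read sees the latest content right away ...
        System.out.println(Files.readString(epochFile, StandardCharsets.UTF_8));

        // ... but until an fsync (e.g. FileChannel.force(true)) is issued, a power
        // failure could roll the on-disk content back to its previous state.
        Files.delete(epochFile);
    }
}
```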

Contributor:

So, it sounds like that you agree that there is little value to call flush without sync. Should we remove the call then?

Contributor (author):

I took another look at the code and found that writing to the file (without fsync) is necessary.

The key question is whether any code path reloads the leader-epoch cache from the file.
I found that it's possible, so not writing could be a problem in the following scenario:

  • (1) AlterReplicaDir is initiated.
  • (2) Truncation happens on the futureLog.
    • LeaderEpochFileCache.truncateFromEnd is called, but the result isn't written to the file.
  • (3) The future log catches up and renameDir is called.
    • This reloads the leader-epoch cache from the file, which is stale.
    • A wrong leader epoch may then be returned (e.g. for a list-offsets request).

So we should still write to the file, even without fsync.

Contributor:

Thanks, @ocadaruma. Good point! So, we could still write to the file without flushing. The name flush implies that it fsyncs to disks. How about renaming it to sth like writeToFile?

Contributor (author):

That sounds good. I'll fix it like that.

@@ -152,7 +152,7 @@ private List<EpochEntry> removeWhileMatching(Iterator<Map.Entry<Integer, EpochEn
}

public LeaderEpochFileCache cloneWithLeaderEpochCheckpoint(LeaderEpochCheckpoint leaderEpochCheckpoint) {
flushTo(leaderEpochCheckpoint);
flushTo(leaderEpochCheckpoint, true);
Contributor:

cloneWithLeaderEpochCheckpoint seems no longer used. Could we just remove it?

/**
* Take a snapshot at the current end offset if one does not already exist, then return the snapshot file if taken.
*/
public Optional<File> takeSnapshot(boolean sync) throws IOException {
Contributor:

ProducerStateManager.truncateFullyAndReloadSnapshots removes all snapshot files and then calls loadSnapshots(), which should return empty. I am wondering what happens if we have a pending async snapshot flush and the flush runs after the underlying file has been deleted by ProducerStateManager.truncateFullyAndReloadSnapshots. Will that cause the file to be recreated, or will it get an IOException? The former would be bad since the content won't be correct. For the latter, it would be useful to distinguish it from a real disk I/O error to avoid unnecessarily crashing the broker.

@ocadaruma (Contributor, Author) commented on Oct 27, 2023:

Thanks, that's a good point. I had overlooked that snapshot files can be cleaned up upon receiving OffsetMovedToRemoteStorage.

In this case, if the async flush is performed against a non-existent file, it would throw an IOException, so we should catch it and ignore it if it's a NoSuchFileException.
(File creation is still done on the original thread, so it shouldn't conflict with truncateFullyAndReloadSnapshots; only the fsync is moved to the async thread.)

I'll fix that
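
A rough sketch of the fix described above, assuming the async job re-opens the snapshot by path before fsyncing, so a concurrent deletion surfaces as NoSuchFileException; the helper name is hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Rough sketch of the intended fix: the async job fsyncs the snapshot only if the
// file still exists, treating a concurrent deletion (NoSuchFileException) as benign
// while letting real I/O errors propagate. The helper name is hypothetical.
public class FlushIfExistsSketch {
    public static void flushIfExists(Path snapshot) throws IOException {
        try (FileChannel ch = FileChannel.open(snapshot, StandardOpenOption.READ)) {
            ch.force(true);
        } catch (NoSuchFileException e) {
            // The snapshot was deleted (e.g. by truncateFullyAndReloadSnapshots)
            // before the async flush ran; there is nothing left to make durable.
        }
        // Any other IOException propagates, since it may indicate a real disk failure.
    }
}
```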

@jolshan (Member) commented on Oct 25, 2023

Thanks -- just catching up with the discussion. Just to clarify, when we say:

> This is OK for a server restart, because on restart we will rebuild the snapshot by scanning the last few segments.

in the restart case we may take a slight performance hit on startup, since we may have to scan more segments.
And yeah, we should definitely not update the recovery point until the flush has completed successfully.

> If we ignore a producer-state-flush failure here, the recovery point might be incremented even with a stale on-disk producer state snapshot. So, after a restart following a power failure, the broker might restore stale producer state without rebuilding it (since the recovery point was incremented), which could cause idempotency issues.

@ocadaruma ocadaruma requested a review from junrao November 3, 2023 07:14
@ocadaruma (Contributor, Author):

@divijvaidya Hi, thanks for your suggestion. I updated the PR description to include consistency expectations and the analysis of the validity of the changes.

@junrao (Contributor) left a comment:

@ocadaruma : Thanks for the updated PR. A few more comments.

updateHighWatermarkWithLogEndOffset()
// Schedule an asynchronous flush of the old segment
scheduler.scheduleOnce("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
scheduler.scheduleOnce("flush-log", () => {
maybeSnapshot.ifPresent(f => {
Contributor:

Could we get rid of {?

@@ -63,7 +63,7 @@ public File file() {
public void renameTo(String newSuffix) throws IOException {
File renamed = new File(Utils.replaceSuffix(file.getPath(), "", newSuffix));
try {
Utils.atomicMoveWithFallback(file.toPath(), renamed.toPath());
Utils.atomicMoveWithFallback(file.toPath(), renamed.toPath(), false);
Contributor:

This works since it's ok to lose a file to be deleted. Perhaps it's better to rename the method to sth like renameToDelete so that it's clear that this is not a generic method for arbitrary renaming.


Utils.flushFileIfExists(f.toPath)
}
})
flushUptoOffsetExclusive(newSegment.baseOffset)
Contributor:

Is it possible to add a test to verify that the recovery point is only advanced after the producer state has been flushed to disk?

@ocadaruma ocadaruma requested a review from junrao November 18, 2023 20:15
@junrao (Contributor) left a comment:

@ocadaruma : Thanks for the updated PR. Just a couple of minor comments.

@@ -60,10 +60,10 @@ public File file() {
return file;
}

public void renameTo(String newSuffix) throws IOException {
public void renameToDelete(String newSuffix) throws IOException {
Contributor:

Could we remove newSuffix since it's always DELETED_FILE_SUFFIX?


@ocadaruma ocadaruma requested a review from junrao November 21, 2023 23:28
@junrao (Contributor) left a comment:

@ocadaruma : Thanks for the updated PR. Just a minor comment.

lock.readLock().lock();
try {
leaderEpochCheckpoint.write(epochs.values());
this.checkpoint.write(epochs.values(), sync);
Contributor:

Do we need this?

@ocadaruma ocadaruma requested a review from junrao November 22, 2023 00:02
@junrao (Contributor) left a comment:

@ocadaruma : Thanks for the updated PR. The code LGTM. One of the builds failed. You could trigger a rebuild by closing the PR, waiting for 30 seconds, and reopening it.

@ocadaruma (Contributor, Author):

closing once to rebuild

@ocadaruma ocadaruma closed this Nov 22, 2023
@ocadaruma ocadaruma reopened this Nov 22, 2023
@ocadaruma ocadaruma closed this Nov 23, 2023
@ocadaruma ocadaruma reopened this Nov 23, 2023
@ocadaruma ocadaruma closed this Nov 23, 2023
@ocadaruma ocadaruma reopened this Nov 23, 2023
@ocadaruma (Contributor, Author) commented on Nov 24, 2023

Hmm, I couldn't get an all-green build even after several runs, due to flakiness unrelated to this change.
Finally got it all green.

@ocadaruma ocadaruma closed this Nov 26, 2023
@ocadaruma ocadaruma reopened this Nov 26, 2023
@ocadaruma ocadaruma requested a review from junrao November 26, 2023 22:46
@junrao (Contributor) commented on Nov 27, 2023

@ocadaruma : Thanks for rerunning the tests. The latest run still has 21 test failures. Are they related to the PR?

@ocadaruma (Contributor, Author):

@junrao Oh, I misread it as all green because I only checked the pipeline view; I should have checked the tests view.

I checked: none of the failures seem related to this change, and they appear to be due to flakiness, since every failed test still succeeded on at least one JDK build.

@junrao (Contributor) commented on Nov 28, 2023

@ocadaruma : Thanks for looking into the failed tests. If those are unrelated to this PR, it would be useful to file JIRAs for any flaky tests not already tracked. Also, could you resolve the conflict?

@junrao (Contributor) left a comment:
@ocadaruma : Thanks for triaging the failed tests. LGTM

@junrao junrao merged commit d71d063 into apache:trunk Nov 29, 2023
1 check failed
@ocadaruma ocadaruma deleted the kafka-15046 branch November 29, 2023 21:39
ex172000 pushed a commit to ex172000/kafka that referenced this pull request on Dec 15, 2023
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request on Feb 15, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request on Feb 16, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request on Apr 5, 2024
@junrao (Contributor) left a comment:

@ocadaruma : Added another comment related to this PR.

synchronized (lock) {
// write to temp file and then swap with the existing file
try (FileOutputStream fileOutputStream = new FileOutputStream(tempPath.toFile());
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOutputStream, StandardCharsets.UTF_8))) {
CheckpointWriteBuffer<T> checkpointWriteBuffer = new CheckpointWriteBuffer<>(writer, version, formatter);
checkpointWriteBuffer.write(entries);
writer.flush();
fileOutputStream.getFD().sync();
if (sync) {
fileOutputStream.getFD().sync();
Contributor:

@ocadaruma : I realized a potential issue with this change. The issue is that if sync is false, we don't force a flush to disk. However, the OS could flush partial content of the leader epoch file. If the broker has a hard failure, the leader epoch file could be corrupted. In the recovery path, since we always expect the leader epoch file to be well-formed, a corrupted leader epoch file will fail the recovery.

Contributor (author):

@junrao Hmm, that's true. Thanks for pointing that out.
I created a ticket for this and assigned it to myself: https://issues.apache.org/jira/browse/KAFKA-16541

Contributor:

Thanks, @ocadaruma !
