KAFKA-15046: Get rid of unnecessary fsyncs inside UnifiedLog.lock to stabilize performance #14242
Conversation
Could you take a look? @showuon
@ocadaruma, thanks for the improvement! Some high-level questions:
Thank you.
@showuon Yes. However, precisely, removing fsync on
Hey @ocadaruma, I would like to discuss the changes individually. Let's start with (1). The side effect of moving the producer snapshot flush to an async thread is that, while earlier it was guaranteed that the producer snapshot is present and consistent when the segment (and others) are flushed — if for some reason the producer fsync failed, we would not have scheduled a flush for the segment and friends — now, since we are flushing the snapshot async & quietly, it is possible that we have the segment and indexes on disk but no producer snapshot. This is OK for server restart, because on restart we will rebuild the snapshot by scanning the last few segments. To summarize, Kafka does not expect the producer snapshot on disk to be strongly consistent with the rest of the files such as the log segment and transaction index. @jolshan (as an expert on the txn index and producer snapshot), do you agree with this statement? If we agree on this, then (1) is a safe change IMO.
@divijvaidya Thank you for your review.
While reading your comment, I realized that "quietly" could be a problem, so we might need to change producer-state flushing to throw an IOException in case of failure ("async" still isn't a problem, though). If we ignore a producer-state-flush failure here, the recovery point might be incremented even with a stale on-disk producer state snapshot. So, in case of a restart after power failure, the broker might restore stale producer state without rebuilding it (since the recovery point was incremented), which could cause idempotency issues. I'll update the PR after Justine's comment.
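For context, the ordering constraint being discussed can be sketched as follows. This is only an illustration of the invariant (flush the producer snapshot before advancing the recovery point, and let IOException propagate rather than swallowing it), not the actual UnifiedLog code; the names flushUpTo and recoveryPoint are illustrative.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

// Illustrative sketch of the ordering constraint, not the actual UnifiedLog code.
class FlushJobSketch {
    private volatile long recoveryPoint = 0L;

    // Runs on the scheduler thread as part of a hypothetical "flush-log" job.
    void flushUpTo(long offsetExclusive, Optional<Path> producerSnapshot) throws IOException {
        // 1. Flush the producer snapshot first; let any IOException propagate so the
        //    recovery point is never advanced past a stale on-disk snapshot.
        if (producerSnapshot.isPresent()) {
            try (FileChannel ch = FileChannel.open(producerSnapshot.get(), StandardOpenOption.READ)) {
                ch.force(true);
            }
        }
        // 2. Flush segments/indexes up to offsetExclusive (omitted here).
        // 3. Only after both flushes succeed, advance the recovery point.
        recoveryPoint = offsetExclusive;
    }
}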
Great point. May I suggest that we document the consistency expectations of the producer snapshot with the segment on disk. From what you mentioned, it sounds like "Kafka expects the producer snapshot to be strongly consistent with the segment data on disk before the recovery checkpoint, but doesn't expect that after the checkpoint. The inconsistency after the checkpoint is acceptable because...". We can verify those expectations with experts such as Justine and Jun. Based on that, we can make a decision on quietly vs. async etc. The documentation will also help future contributors reason about the code base. Initially, you can put the documentation in the description of this PR itself, and later we can find a home for it in the Kafka website docs. We need to do the same exercise for the other files that you are changing in this PR.
@ocadaruma : Thanks for the PR. Left a few comments.
@divijvaidya : Yes, your understanding on (1) is correct.
updateHighWatermarkWithLogEndOffset()
// Schedule an asynchronous flush of the old segment
scheduler.scheduleOnce("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
scheduler.scheduleOnce("flush-log", () => {
  maybeSnapshot.ifPresent(f => Utils.flushFileQuietly(f.toPath, "producer-snapshot"))
If we fail to flush the snapshot, it seems that we should propagate the IOException to logDirFailureChannel, like in flushUptoOffsetExclusive. Otherwise, we could skip recovering the producer state when we should.
Yeah, I also noticed that. I'll fix it.
// then causing ISR shrink or high produce response time degradation in remote scope on high fsync latency.
// - Even when stale epochs remained in LeaderEpoch file due to the unclean shutdown, it will be handled by
//   another truncateFromEnd call on log loading procedure so it won't be a problem
flush(false);
It's kind of weird to call flush with sync = false since the only thing that flush does is to sync. Could we just avoid calling flush?
flush(false) still writes the epoch entries to the file (but without fsync).
If we don't call flush here, some entries will remain in the file even after the log truncation.
I'm guessing it wouldn't be a problem realistically, at least in the current implementation (since, on log truncation on the follower, LeaderEpochFileCache#assign will be called on log append, which flushes in-memory epoch entries to the file anyway), but we should still write to the file here IMO.
If we don't call flush here, some entries will remain in the file even after the log truncation.
That's true. But if we write the new content to the file without flushing, it seems that those old entries could still exist in the file?
those old entries could still exist in the file
Yeah; precisely, the content on the device (not the file) could still be old. As long as we read the file in the usual way (i.e. not through O_DIRECT), we can see the latest data.
The staleness on the device arises only when the server experiences a power failure before the OS flushes the page cache.
In that case, the content could indeed be rolled back to the old state.
But it won't be a problem, because the leader-epoch file will be truncated again to match the log file during the loading procedure anyway (this is case (3) in the PR description).
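The page-cache vs. device distinction described above can be demonstrated with a small standalone example; force(true) is the Java-level counterpart of fsync(2). This is just an illustration, not Kafka code.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteVsFsync {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("leader-epoch-sketch", ".checkpoint");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("0 42\n".getBytes(StandardCharsets.UTF_8)));
            // Any reader going through the filesystem (page cache) already sees the new
            // content, even though it may not yet be on the physical device.
            System.out.println(Files.readString(file));
            // Only force(true) (fsync) guarantees the bytes survive a power failure.
            ch.force(true);
        }
    }
}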
So it sounds like you agree that there is little value in calling flush without sync. Should we remove the call then?
I took another look at the code and found that flushing to the file (without fsync) is necessary.
The point here is whether there's any code path that reloads the leader-epoch cache from the file.
I found it's possible, so not flushing could be a problem in the scenario below:
- (1) AlterReplicaDir is initiated
- (2) Truncation happens on the futureLog
  - LeaderEpochFileCache.truncateFromEnd is called, but it isn't flushed to the file
- (3) The future log catches up and renameDir is called
  - This reloads the leader-epoch cache from the file, which is stale
  - Then a wrong leader epoch may be returned (e.g. for a list-offsets request)
So we should still flush to the file even without fsync.
Thanks, @ocadaruma. Good point! So we could still write to the file without flushing. The name flush implies that it fsyncs to disk. How about renaming it to something like writeToFile?
That sounds good. I'll fix it like that.
@@ -152,7 +152,7 @@ private List<EpochEntry> removeWhileMatching(Iterator<Map.Entry<Integer, EpochEntry>>
}

public LeaderEpochFileCache cloneWithLeaderEpochCheckpoint(LeaderEpochCheckpoint leaderEpochCheckpoint) {
    flushTo(leaderEpochCheckpoint);
    flushTo(leaderEpochCheckpoint, true);
cloneWithLeaderEpochCheckpoint seems no longer used. Could we just remove it?
/**
 * Take a snapshot at the current end offset if one does not already exist, then return the snapshot file if taken.
 */
public Optional<File> takeSnapshot(boolean sync) throws IOException {
ProducerStateManager.truncateFullyAndReloadSnapshots removes all snapshot files and then calls loadSnapshots(), which should return empty. I am wondering what happens if we have a pending async snapshot flush and the flush is called after the underlying file is deleted because of ProducerStateManager.truncateFullyAndReloadSnapshots. Will that cause the file to be recreated, or will it get an IOException? The former would be bad since the content won't be correct. For the latter, it would be useful to distinguish it from a real disk IO error to avoid unnecessarily crashing the broker.
Thanks, that's a good point. I overlooked that snapshot files would be cleaned up upon receiving OffsetMovedToRemoteStorage.
In this case, if the async flush is performed against a non-existent file, it would throw an IOException, so we should catch it and ignore it if it's a NoSuchFileException.
(Since file creation is still done in the original thread, it shouldn't conflict with truncateFullyAndReloadSnapshots. Only the fsync is moved to the async thread.)
I'll fix that.
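A minimal sketch of the "flush only if the file still exists" behavior described here, assuming the snapshot may have been deleted by a concurrent cleanup; this is a simplified stand-in, not the exact Utils implementation.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FsyncHelpers {
    // Fsync the file at `path`, ignoring the case where it was already deleted
    // (e.g. by a concurrent snapshot cleanup). Other IOExceptions still propagate,
    // so real disk errors are not masked.
    static void flushFileIfExists(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            channel.force(true);
        } catch (NoSuchFileException e) {
            // The file was deleted before the async flush ran; nothing to do.
        }
    }
}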
Thanks -- just catching up with the discussion. Just to clarify, when we say:
In the restart case, we may take a slight performance hit on startup since we may have to scan more segments.
@divijvaidya Hi, thanks for your suggestion. I updated the PR description to include the consistency expectations and an analysis of the validity of the changes.
@ocadaruma : Thanks for the updated PR. A few more comments.
updateHighWatermarkWithLogEndOffset()
// Schedule an asynchronous flush of the old segment
scheduler.scheduleOnce("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
scheduler.scheduleOnce("flush-log", () => {
  maybeSnapshot.ifPresent(f => {
Could we get rid of {?
@@ -63,7 +63,7 @@ public File file() {
public void renameTo(String newSuffix) throws IOException {
    File renamed = new File(Utils.replaceSuffix(file.getPath(), "", newSuffix));
    try {
        Utils.atomicMoveWithFallback(file.toPath(), renamed.toPath());
        Utils.atomicMoveWithFallback(file.toPath(), renamed.toPath(), false);
This works since it's OK to lose a file that is about to be deleted. Perhaps it's better to rename the method to something like renameToDelete so that it's clear that this is not a generic method for arbitrary renaming.
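For reference, the difference between flushing and not flushing the parent directory after an atomic rename can be sketched as below. This is a simplified stand-in for Utils.atomicMoveWithFallback, and opening a directory for fsync this way is Linux-oriented (it does not work on Windows).

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class AtomicMoveSketch {
    // Atomically rename `source` to `target`; optionally fsync the parent directory so the
    // rename itself survives a power failure. Skipping the parent flush is acceptable when
    // losing the rename is harmless, e.g. a file that is about to be deleted anyway.
    static void atomicMove(Path source, Path target, boolean flushParentDir) throws IOException {
        Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        if (flushParentDir) {
            try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
                dir.force(true); // fsync(2) on the directory so the rename is durable
            }
        }
    }
}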
    Utils.flushFileIfExists(f.toPath)
  }
})
flushUptoOffsetExclusive(newSegment.baseOffset)
Is it possible to add a test to verify that the recovery point is only advanced after the producer state has been flushed to disk?
@ocadaruma : Thanks for the updated PR. Just a couple of minor comments.
@@ -60,10 +60,10 @@ public File file() {
    return file;
}

public void renameTo(String newSuffix) throws IOException {
public void renameToDelete(String newSuffix) throws IOException {
Could we remove newSuffix since it's always DELETED_FILE_SUFFIX?
@ocadaruma : Thanks for the updated PR. Just a minor comment.
lock.readLock().lock();
try {
    leaderEpochCheckpoint.write(epochs.values());
    this.checkpoint.write(epochs.values(), sync);
Do we need this?
@ocadaruma : Thanks for the updated PR. The code LGTM. One of the builds failed. You could trigger a rebuild by closing the PR, waiting for 30 secs and reopening it.
closing once to rebuild
@ocadaruma : Thanks for rerunning the tests. The latest run still has 21 test failures. Are they related to the PR?
@junrao Oh, I misinterpreted the result as all green because I only checked the pipeline view, but I should have checked the tests view. I checked: none of the failures seem related to this change, and they are due to flakiness, since every failed test still succeeded on at least one JDK build.
@ocadaruma : Thanks for looking into the failed tests. If those are unrelated to this PR, it would be useful to file JIRAs for any flaky tests not already tracked. Also, could you resolve the conflict?
@junrao Thank you for the suggestion. I also created tickets for the flaky tests that don't already have a corresponding JIRA ticket.
@ocadaruma : Thanks for triaging the failed tests. LGTM
…stabilize performance (apache#14242)

While any blocking operation under the UnifiedLog.lock could lead to serious performance (even availability) issues, currently there are several paths that call fsync(2) inside the lock. While the lock is held, all subsequent produces against the partition may block. This easily causes all request handlers to become busy on bad disk performance. Even worse, when a disk experiences tens of seconds of glitch (not rare with spinning drives), the broker becomes unable to process any requests while remaining unfenced from the cluster (i.e. a "zombie"-like status).

This PR gets rid of 4 cases of essentially-unnecessary fsync(2) calls performed under the lock:

(1) ProducerStateManager.takeSnapshot at UnifiedLog.roll. The fsync(2) call is moved to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point). Since it's still ensured that the snapshot is flushed before incrementing the recovery point, this change shouldn't cause any problem.

(2) ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log segment deletion. This method calls Utils.atomicMoveWithFallback with needFlushParentDir = true internally, which calls fsync. It now calls Utils.atomicMoveWithFallback with needFlushParentDir = false (consistent with index file deletion, which also doesn't flush the parent dir). This change shouldn't cause problems either.

(3) LeaderEpochFileCache.truncateFromStart when incrementing log-start-offset. This path is called from deleteRecords on request-handler threads. We don't actually need fsync(2) here either. On unclean shutdown, a few leader epochs might remain in the file, but that will be handled by LogLoader on start-up, so it's not a problem.

(4) LeaderEpochFileCache.truncateFromEnd as part of log truncation. Likewise, we don't need fsync(2) here, since any epochs that are untruncated on unclean shutdown will be handled by the log loading procedure.

Reviewers: Luke Chen <[email protected]>, Divij Vaidya <[email protected]>, Justine Olshan <[email protected]>, Jun Rao <[email protected]>
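As a rough illustration of the pattern the PR applies (not the actual UnifiedLog code): perform the in-memory update and the file write while holding the lock, and move the fsync(2) to a background scheduler so request handlers never block on disk latency. The class and method names below are made up for the sketch.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative pattern only: write under the lock, fsync on a scheduler thread.
class AsyncFsyncPattern {
    private final ReentrantLock lock = new ReentrantLock();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void rollSegment(Path snapshotFile) {
        lock.lock();
        try {
            // ... write the snapshot / checkpoint content here (cheap, page-cache only) ...
        } finally {
            lock.unlock(); // release before any blocking disk I/O
        }
        // fsync happens outside the lock, so a slow disk cannot stall request handlers.
        scheduler.execute(() -> {
            try (FileChannel ch = FileChannel.open(snapshotFile, StandardOpenOption.READ)) {
                ch.force(true);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}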
@ocadaruma : Added another comment related to this PR.
synchronized (lock) {
    // write to temp file and then swap with the existing file
    try (FileOutputStream fileOutputStream = new FileOutputStream(tempPath.toFile());
         BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOutputStream, StandardCharsets.UTF_8))) {
        CheckpointWriteBuffer<T> checkpointWriteBuffer = new CheckpointWriteBuffer<>(writer, version, formatter);
        checkpointWriteBuffer.write(entries);
        writer.flush();
        fileOutputStream.getFD().sync();
        if (sync) {
            fileOutputStream.getFD().sync();
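Reduced to a standalone sketch, the checkpoint write path shown in the diff above does the following: write to a temp file, flush the buffered writer, fsync only when sync is true, then atomically swap the temp file in. This is a simplification, not the actual CheckpointFile class.

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

final class CheckpointWriteSketch {
    // Write entries to a temp file and atomically swap it in; fsync only when `sync` is true.
    static void write(Path target, List<String> lines, boolean sync) throws IOException {
        Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(temp.toFile());
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
            writer.flush();         // push data from the buffer to the OS page cache
            if (sync) {
                out.getFD().sync(); // fsync(2): make the temp file durable before the swap
            }
        }
        // Atomic rename: readers see either the old or the new checkpoint, never a partial one.
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}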
@ocadaruma : I realized a potential issue with this change. The issue is that if sync is false, we don't force a flush to disk. However, the OS could flush partial content of the leader epoch file. If the broker has a hard failure, the leader epoch file could be corrupted. In the recovery path, since we always expect the leader epoch file to be well-formed, a corrupted leader epoch file will fail the recovery.
@junrao Hmm, that's true. Thanks for pointing it out.
I created a ticket for this and assigned it to myself: https://issues.apache.org/jira/browse/KAFKA-16541
Thanks, @ocadaruma !
JIRA ticket: https://issues.apache.org/jira/browse/KAFKA-15046
For segment deletion, the PR changes Utils.atomicMoveWithFallback from needFlushParentDir = true (which calls fsync internally) to needFlushParentDir = false; this is consistent with index file deletion, which also doesn't flush the parent dir. To check that these changes don't cause a problem, the following consistency expectations are helpful:

- Producer snapshot: if the snapshot content before the recovery point is not consistent with the log, it will cause a problem such as an idempotency violation due to the missing producer state.
- Deleted files: even when the broker crashes by power failure before the files are deleted from the actual disk, they should eventually be deleted from the disk.
- Leader epoch cache: a stale cache will return a wrong entry when reading the leader epoch cache (e.g. in list-offsets request handling).

We can confirm the changes are valid based on the above expectations:

- If a snapshot file is renamed to -deleted for segment deletion by log retention, but the broker crashes by power failure before the rename is persisted to the disk, the file may reappear with the -deleted suffix stripped; ProducerStateManager would then load these snapshot files unnecessarily.
- If a snapshot file is renamed to -deleted for topic deletion, but the broker crashes by power failure before the rename is persisted to the disk, the file may likewise reappear with the -deleted suffix stripped. Since the parent log dir is renamed to -delete and that rename is fsynced anyway, the revert of the snapshot file wouldn't be a problem; the parent log dirs will be deleted after the topic deletion procedure resumes.