[CELEBORN-1472] Reduce CongestionController#userBufferStatuses call times. #2583

Closed
wants to merge 4 commits into from

@leixm leixm commented Jun 20, 2024

What changes were proposed in this pull request?

Reduce CongestionController#userBufferStatuses call times.

Why are the changes needed?

When the sort-based shuffle writer is used, the number of PushMergedData requests increases, which makes CongestionController#produceBytes take up a lot of CPU time.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

leixm commented Jun 20, 2024

[image: WechatIMG89]

leixm commented Jun 20, 2024

@AngersZhuuuu @pan3793 @waitinfuture PTAL.

@@ -1282,6 +1282,8 @@ class PushDataHandler(val workerSource: WorkerSource) extends BaseMessageHandler
fileWriter.decrementPendingWrites()
}
}

updateBytesProduced(fileWriters.head, body.readableBytes())

This seems to change the behavior: in handlePushMergedData, the PartitionDataWriters in fileWriters are different objects, and body contains data for all of them. IIUC, we need to update each writer with its own part of the body, instead of updating only the first one with the whole body.
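
The concern above can be illustrated with a minimal sketch. It assumes a merged body is laid out as consecutive slices, one per writer, described by an array of start offsets. `MergedPushAccounting`, `Writer`, `creditHeadOnly`, and `creditEach` are hypothetical names chosen for illustration; they are not Celeborn's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Celeborn code: contrasts crediting the whole merged
// body to the first writer with crediting each writer for its own slice.
final class MergedPushAccounting {
  // Hypothetical stand-in for a per-partition writer that tracks produced bytes.
  static final class Writer {
    long bytesProduced;
  }

  static List<Writer> newWriters(int n) {
    List<Writer> writers = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      writers.add(new Writer());
    }
    return writers;
  }

  // Credits the whole body to the first writer only (the behavior questioned above).
  static void creditHeadOnly(List<Writer> writers, int bodyLength) {
    writers.get(0).bytesProduced += bodyLength;
  }

  // Credits each writer with its own slice; offsets[i] is where writer i's slice
  // starts, and the last slice ends at bodyLength (an assumed layout).
  static void creditEach(List<Writer> writers, int[] offsets, int bodyLength) {
    for (int i = 0; i < writers.size(); i++) {
      int end = (i == writers.size() - 1) ? bodyLength : offsets[i + 1];
      writers.get(i).bytesProduced += end - offsets[i];
    }
  }

  public static void main(String[] args) {
    List<Writer> writers = newWriters(3);
    creditEach(writers, new int[] {0, 40, 70}, 100);
    for (Writer w : writers) {
      System.out.println(w.bytesProduced);
    }
  }
}
```

Both variants credit the same total number of bytes; they differ only in which writer objects receive the credit, which matters whenever per-writer state is derived from it.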

You are right. I will change a different place instead: it seems there is no need to call userBufferStatuses.computeIfAbsent every time.
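
The idea of avoiding a map lookup on every call can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `Controller`, `UserBufferInfo`, `Writer`, the `lookups` counter, and the single-argument `updateInfo(long)` are hypothetical stand-ins, and the real classes may have different signatures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not Celeborn code: resolve the per-user entry once when
// the writer is created instead of calling computeIfAbsent on every write.
final class CachedUserBufferSketch {
  // Simplified stand-in for per-user accounting state.
  static final class UserBufferInfo {
    private final LongAdder bytes = new LongAdder();

    void updateInfo(long numBytes) {
      bytes.add(numBytes);
    }

    long total() {
      return bytes.sum();
    }
  }

  static final class Controller {
    private final ConcurrentHashMap<String, UserBufferInfo> userBufferStatuses =
        new ConcurrentHashMap<>();
    // Counts map lookups so the two approaches can be compared.
    final AtomicInteger lookups = new AtomicInteger();

    UserBufferInfo getUserBuffer(String user) {
      lookups.incrementAndGet();
      return userBufferStatuses.computeIfAbsent(user, u -> new UserBufferInfo());
    }

    // Per-call variant: one map lookup for every produced chunk.
    void produceBytes(String user, long numBytes) {
      getUserBuffer(user).updateInfo(numBytes);
    }
  }

  // Cached variant: one lookup in the constructor, none on the write path.
  static final class Writer {
    private final UserBufferInfo userBufferInfo;

    Writer(Controller controller, String user) {
      this.userBufferInfo = controller.getUserBuffer(user);
    }

    void write(long numBytes) {
      userBufferInfo.updateInfo(numBytes);
    }
  }

  public static void main(String[] args) {
    Controller controller = new Controller();
    Writer writer = new Writer(controller, "user-a");
    for (int i = 0; i < 1000; i++) {
      writer.write(10);
    }
    System.out.println(controller.lookups.get());
  }
}
```

One design caveat of this sketch: a cached reference keeps pointing at the same entry even if the controller later removes or replaces it in the map, so the approach assumes the entry stays valid for the writer's lifetime.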

leixm commented Jun 21, 2024

Stress testing shows that this PR reduces the CPU usage of produceBytes from 25.07% to 16.58%.
Before this PR: [image: kk1]

After this PR: [image: kk2]

@leixm leixm changed the title [CELEBORN-1472] Reduce CongestionController#produceBytes call times. [CELEBORN-1472] Reduce CongestionController#produceBytes cpu tie Jun 21, 2024
@leixm leixm changed the title [CELEBORN-1472] Reduce CongestionController#produceBytes cpu tie [CELEBORN-1472] Reduce CongestionController#produceBytes cpu time Jun 21, 2024
@AngersZhuuuu

Please update the PR title and description.

@leixm leixm changed the title [CELEBORN-1472] Reduce CongestionController#produceBytes cpu time [CELEBORN-1472] Reduce CongestionController#userBufferStatuses call times. Jun 24, 2024
leixm commented Jun 24, 2024

@AngersZhuuuu @waitinfuture PTAL.

@AngersZhuuuu AngersZhuuuu left a comment

LGTM

codecov bot commented Jun 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 39.98%. Comparing base (0298cfb) to head (60b660d).
Report is 22 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2583      +/-   ##
==========================================
- Coverage   40.49%   39.98%   -0.50%     
==========================================
  Files         222      233      +11     
  Lines       14289    14695     +406     
  Branches     1291     1338      +47     
==========================================
+ Hits         5785     5875      +90     
- Misses       8173     8487     +314     
- Partials      331      333       +2     

@@ -153,6 +156,10 @@ public PartitionDataWriter(
this.mapIdBitMap = new RoaringBitmap();
}
takeBuffer();
CongestionController congestionController = CongestionController.instance();
if (!isMemoryShuffleFile.get() && congestionController != null) {
userBufferInfo = congestionController.getUserBuffer(getDiskFileInfo().getUserIdentifier());

IMO the CongestionController should control all file write operations, including memory files?

I keep the original code logic unchanged. In the original logic, CongestionController does not manage memory files. Maybe I can correct this logic in the next PR?

> I keep the original code logic unchanged. In the original logic, CongestionController does not manage memory files. Maybe I can correct this logic in the next PR?

LGTM

congestionController ->
congestionController.produceBytes(diskFileInfo.getUserIdentifier(), numBytes));
if (userBufferInfo != null) {
userBufferInfo.updateInfo(

ditto

@RexXiong RexXiong closed this in d362d9f Jun 25, 2024
RexXiong pushed a commit that referenced this pull request Jun 25, 2024
…imes

### What changes were proposed in this pull request?
Reduce CongestionController#userBufferStatuses call times.

### Why are the changes needed?
When the sort-based shuffle writer is used, the number of PushMergedData requests increases, which makes CongestionController#produceBytes take up a lot of CPU time.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #2583 from leixm/issue_1472.

Authored-by: Xianming Lei <[email protected]>
Signed-off-by: Shuang <[email protected]>
(cherry picked from commit d362d9f)
Signed-off-by: Shuang <[email protected]>
@RexXiong

Merged to main (v0.6.0) and branch-0.5 (v0.5.1).

leixm commented Jun 25, 2024

Thank you for your review. @AngersZhuuuu @RexXiong
