[CELEBORN-1660] Cache available workers and only count the available workers device free capacity #2827

turboFei · 2024-10-18T23:21:49Z

What changes were proposed in this pull request?

Cache the available workers.
Only count the device free capacity of the available workers.
Place metrics_AvailableWorkerCount_Value in the overall part and metrics_WorkerCount_Value in the Master part.

Why are the changes needed?

Cache the available workers to avoid looping over all workers every time the available set is needed.
Give an accurate device capacity overview that does not include excluded workers, decommissioning workers, etc.
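
The second point can be illustrated with a minimal sketch. It is not the code from this PR: `WorkerView`, its fields, and `DeviceCapacity.availableFreeCapacity` are hypothetical names introduced here, and the assumption is that a worker counts as available when it is neither excluded nor decommissioning.

```java
import java.util.List;

// Hypothetical, simplified view of a worker (an assumption for illustration,
// not Celeborn's actual worker class).
final class WorkerView {
    final String id;
    final long deviceFreeCapacity; // free bytes reported by the worker
    final boolean excluded;
    final boolean decommissioning;

    WorkerView(String id, long deviceFreeCapacity, boolean excluded, boolean decommissioning) {
        this.id = id;
        this.deviceFreeCapacity = deviceFreeCapacity;
        this.excluded = excluded;
        this.decommissioning = decommissioning;
    }

    // Assumed definition of "available": not excluded and not decommissioning.
    boolean isAvailable() {
        return !excluded && !decommissioning;
    }
}

final class DeviceCapacity {
    // Sums free capacity over available workers only, so excluded or
    // decommissioning workers do not inflate the reported total.
    static long availableFreeCapacity(List<WorkerView> workers) {
        long total = 0L;
        for (WorkerView w : workers) {
            if (w.isAvailable()) {
                total += w.deviceFreeCapacity;
            }
        }
        return total;
    }
}
```

With three workers reporting 100, 50 and 25 free bytes, where the second is excluded and the third is decommissioning, only the first contributes to the total.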

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala

RexXiong · 2024-10-22T03:17:22Z

I think we can cache the available workers (and manually update the cache when an exclude/shutdown event occurs). The current process of computing the list of available workers can be resource-intensive, especially in large clusters.
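
A rough sketch of this suggestion, under stated assumptions: `AvailableWorkerCache` and its method names are hypothetical and do not come from the Celeborn code base. The idea is that the cache is adjusted whenever a worker's state changes, so that reads do not need to scan every worker.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical event-driven cache of available worker ids (illustration only).
final class AvailableWorkerCache {
    private final Set<String> registered = new HashSet<>();
    private final Set<String> unavailable = new HashSet<>(); // excluded, shutting down, etc.
    private final Set<String> available = new HashSet<>();   // the cached result

    synchronized void register(String id) {
        registered.add(id);
        refresh(id);
    }

    // Called on an exclude/shutdown/decommission event for one worker.
    synchronized void markUnavailable(String id) {
        unavailable.add(id);
        refresh(id);
    }

    synchronized void markRecovered(String id) {
        unavailable.remove(id);
        refresh(id);
    }

    synchronized void remove(String id) {
        registered.remove(id);
        unavailable.remove(id);
        available.remove(id);
    }

    // Re-evaluates a single worker instead of recomputing the whole set.
    private void refresh(String id) {
        if (registered.contains(id) && !unavailable.contains(id)) {
            available.add(id);
        } else {
            available.remove(id);
        }
    }

    // Constant-time read; no loop over all workers.
    synchronized int availableCount() {
        return available.size();
    }

    // Returns a snapshot copy so callers cannot mutate the cache.
    synchronized Set<String> availableWorkers() {
        return new HashSet<>(available);
    }
}
```

The trade-off is that every code path that changes a worker's state has to update the cache; a missed update leaves the cache stale, which is the cost of avoiding the full recomputation.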

turboFei · 2024-10-22T07:39:51Z

I think we can cache the available workers (and manually update the cache when an exclude/shutdown event occurs). The current process of computing the list of available workers can be resource-intensive, especially in large clusters.

Before this, how about using a map instead of a set for workers?
#2839
cc @RexXiong

turboFei · 2024-11-04T00:59:20Z

This PR is ready. cc @FMX @RexXiong @zaynt4606

zaynt4606

LGTM, #2856 can be covered by this PR.

turboFei · 2024-11-06T03:43:38Z

Will rebase the code.

turboFei · 2024-11-07T22:00:13Z

Hi @FMX and @RexXiong, could you help review this PR when you get a chance? Thanks very much.

turboFei changed the title ~~Only count the available workers device capacity~~ [CELEBORN-1660] Only count the available workers device capacity Oct 18, 2024

turboFei commented Oct 18, 2024

master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala (outdated)

turboFei force-pushed the device_free branch from e9be8c9 to 67789d3 Compare October 19, 2024 02:36

turboFei changed the title ~~[CELEBORN-1660] Only count the available workers device capacity~~ [CELEBORN-1660] Only count the available workers device free capacity Oct 19, 2024

turboFei requested review from FMX and RexXiong October 19, 2024 02:44

FMX reviewed Oct 22, 2024

master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala (outdated)

turboFei force-pushed the device_free branch 2 times, most recently from b5f209b to ebd53c5 Compare November 1, 2024 08:35

turboFei marked this pull request as draft November 1, 2024 08:58

turboFei force-pushed the device_free branch from ebd53c5 to aa8e468 Compare November 1, 2024 09:04

turboFei marked this pull request as ready for review November 1, 2024 09:07

turboFei marked this pull request as draft November 1, 2024 09:16

turboFei marked this pull request as ready for review November 1, 2024 18:20

turboFei force-pushed the device_free branch 2 times, most recently from 1563458 to 4a1202c Compare November 1, 2024 18:23

turboFei changed the title ~~[CELEBORN-1660] Only count the available workers device free capacity~~ [CELEBORN-1660] Cache available workers and only count the available workers device free capacity Nov 1, 2024

turboFei mentioned this pull request Nov 4, 2024

[CELEBORN-1636][FOLLOWUP] Cache available workers in Master #2856

Closed

zaynt4606 approved these changes Nov 4, 2024

[CELEBORN-1660] Only count the available workers device free capacity

469b027

save check ut doc handle app heart beat UT UT master metrics description

turboFei force-pushed the device_free branch from 222d974 to 469b027 Compare November 6, 2024 03:48

turboFei requested review from SteNicholas and FMX November 6, 2024 07:56
