[CELEBORN-1660] Cache available workers and only count the available workers device free capacity #2827
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
I think we can cache the available workers (manually updating the cache when a worker is excluded, shut down, etc.). The current process of computing the list of available workers can be resource-intensive, especially in large clusters.
Before this, how about using a map instead of a set for workers?
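To make the caching idea above concrete, here is a minimal sketch, not the actual Celeborn implementation. The class `AvailableWorkerCache`, its method names, and the use of plain string worker ids are all hypothetical and chosen only for illustration. The point is that the available set is updated at the moment a worker is registered, excluded, or shut down, so readers never have to loop over every worker.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names come from Celeborn itself.
class AvailableWorkerCache {
    private final Set<String> workers = new HashSet<>();    // all registered workers
    private final Set<String> excluded = new HashSet<>();   // excluded workers
    private final Set<String> shutdown = new HashSet<>();   // workers that are shutting down
    private final Set<String> available = new HashSet<>();  // cached result

    public synchronized void register(String id) {
        workers.add(id);
        refresh(id);
    }

    public synchronized void exclude(String id) {
        excluded.add(id);
        refresh(id);
    }

    public synchronized void restore(String id) {
        excluded.remove(id);
        refresh(id);
    }

    public synchronized void markShutdown(String id) {
        shutdown.add(id);
        refresh(id);
    }

    public synchronized void remove(String id) {
        workers.remove(id);
        excluded.remove(id);
        shutdown.remove(id);
        available.remove(id);
    }

    // Re-evaluate a single worker instead of recomputing the whole list.
    private void refresh(String id) {
        boolean ok = workers.contains(id)
            && !excluded.contains(id)
            && !shutdown.contains(id);
        if (ok) {
            available.add(id);
        } else {
            available.remove(id);
        }
    }

    // Returns a snapshot copy so callers can iterate without holding the lock.
    public synchronized Set<String> availableWorkers() {
        return Collections.unmodifiableSet(new HashSet<>(available));
    }
}
```

Each state change touches only the affected worker, so the cost of a read no longer grows with the number of state checks per worker. Coarse `synchronized` methods are used here only to keep the sketch short.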
This PR is ready. cc @FMX @RexXiong @zaynt4606
LGTM, #2856 can be covered by this PR.
Will rebase the code.
save check ut doc handle app heart beat UT UT master metrics description
What changes were proposed in this pull request?
Master part
Why are the changes needed?
Cache the available workers so that they no longer have to be computed by frequently looping over all workers.
Provide an accurate device capacity overview that does not include excluded workers, decommissioning workers, etc.
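As a rough illustration of the capacity point, the sketch below sums free device capacity over available workers only. The `Worker` type, its fields, and `availableFreeCapacity` are hypothetical names invented for this example; they are not taken from the Celeborn code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: types and names are illustrative only.
class CapacityOverview {
    static final class Worker {
        final String id;
        final long freeBytes;     // free capacity reported for the worker's devices
        final boolean available;  // false if excluded, decommissioning, etc.

        Worker(String id, long freeBytes, boolean available) {
            this.id = id;
            this.freeBytes = freeBytes;
            this.available = available;
        }
    }

    // Sum free capacity, skipping workers that are not available.
    static long availableFreeCapacity(List<Worker> workers) {
        long total = 0L;
        for (Worker w : workers) {
            if (w.available) {
                total += w.freeBytes;
            }
        }
        return total;
    }
}
```

With workers reporting 100, 200, and 50 free bytes, where the second one is excluded, the overview reports 150 rather than 350.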
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.