-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#2299: Do not deploy LB when average load is smaller than estimated load balancing cost #2333
#2299: Do not deploy LB when average load is smaller than estimated load balancing cost #2333
Conversation
Pipelines resultsPR tests (gcc-12, ubuntu, mpich, verbose) Build for fe210d6 (2024-08-06 20:51:37 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose, kokkos) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-9, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-13, alpine, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-12, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-10, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-11, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-13, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (intel icpx, ubuntu, mpich, verbose) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (clang-14, ubuntu, mpich, verbose) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose) Build for 2777ffb (2024-09-06 10:05:27 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for 2777ffb (2024-09-06 10:05:27 UTC)
|
@thearusable |
@cz4rs The target branch of this PR is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR needs to target develop
fe210d6
to
0e0ba9c
Compare
@cz4rs @thearusable Not related to this PR but most recent (on develop) spack workflow failed with that message (https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=60702&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=1364e02f-4e1e-51cf-2821-fd252a3c0e34&l=10) |
@JacobDomagala I will take a look on that |
f6ffd20
to
b8b2fc1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
d0a88e1
to
aae763f
Compare
*/ | ||
double getCollectiveEpochCost() const { | ||
// 100 ns | ||
return 0.0000001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ideally would use std chrono types and literals. Barring that, at least use scientific notation and specify the units.
Also, I think 100ns is on the low side, but it really doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed it to std::chrono::nanoseconds
.
@@ -160,7 +160,8 @@ void GreedyLB::loadStats() { | |||
bool should_lb = false; | |||
this_load_begin = this_load; | |||
|
|||
if (avg_load > 0.0000000001) { | |||
// Use an estimated load-balancing cost on average rank load to load-balance | |||
if (avg_load > getCollectiveEpochCost()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came here to note that I think calculation like this should use max load, rather than average. Otherwise, if a large-scale job has maximal imbalance of a very fine-grain problem, averaging might mask it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifflander What should we use in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the discussion on the meeting, we should use the max
load instead of avg
load and move the logic for checking that to BaseLB
to be used only by the real load balancers.
dfdaf61
to
e987173
Compare
57667db
to
e02a04a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
e02a04a
to
2777ffb
Compare
@lifflander PR rebased. Ready to merge. |
Closes #2299