#2299: Do not deploy LB when average load is smaller than estimated load balancing cost #2333

thearusable · 2024-08-06T14:55:35Z

github-actions · 2024-08-06T17:12:34Z

Pipelines results

PR tests (gcc-12, ubuntu, mpich, verbose)

Build for fe210d6 (2024-08-06 20:51:37 UTC)

Build failed for unknown reason. Check build logs

Build log

PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-9, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, alpine, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-12, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-10, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-11, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (intel icpx, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-14, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153

==> And there is more. Read log. <==

Build log

PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log

PR tests (intel icpc, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhi

==> And there is more. Read log. <==

Build log

cz4rs · 2024-08-06T17:15:05Z

@thearusable invalid project name "DARMA-tasking/vt": must consist only of lowercase alphanumeric characters, hyphens, and underscores as well as start with a letter or number strikes again 🤷

thearusable · 2024-08-06T17:48:10Z

@cz4rs The target branch of this PR is 2201-implement-memory-aware-temperedlb-in-vt-rebased, which is over two months old. This likely explains the error.

nlslatt

PR needs to target develop

JacobDomagala · 2024-08-07T12:14:22Z

@cz4rs @thearusable Not related to this PR, but the most recent spack workflow (on develop) failed with that message (https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=60702&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=1364e02f-4e1e-51cf-2821-fd252a3c0e34&l=10)

thearusable · 2024-08-07T12:25:17Z

@JacobDomagala I will take a look at that.

cz4rs

Looks good to me.

PhilMiller · 2024-08-27T14:34:42Z

src/vt/vrt/collection/balance/baselb/baselb.h

+   */
+  double getCollectiveEpochCost() const {
+    // 100 ns
+    return 0.0000001;


Ideally, this would use std::chrono types and literals. Barring that, at least use scientific notation and specify the units.

Also, I think 100ns is on the low side, but it really doesn't matter.

Ok, I changed it to std::chrono::nanoseconds.
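
For readers following the review, here is a minimal sketch of what a chrono-based version of the cost could look like. It is not the code merged in this PR: `collectiveEpochCost`, `toSeconds`, and `exceedsEpochCost` are hypothetical names, and only the 100 ns value is taken from the diff above. It assumes loads are tracked in seconds as `double`.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical free-function stand-in for the reviewed method. The 100 ns
// value comes from the diff; everything else is assumed for illustration.
constexpr std::chrono::nanoseconds collectiveEpochCost() {
  return 100ns; // the unit is carried by the type, not by a comment
}

// Convert any chrono duration to seconds stored in a double.
template <typename Rep, typename Period>
constexpr double toSeconds(std::chrono::duration<Rep, Period> d) {
  return std::chrono::duration<double>(d).count();
}

// True when a load (assumed to be in seconds) exceeds the estimated cost.
inline bool exceedsEpochCost(double load_seconds) {
  assert(load_seconds >= 0.0 && "loads are expected to be non-negative");
  return load_seconds > toSeconds(collectiveEpochCost());
}
```

With this shape, the literal `100ns` replaces `0.0000001`, and the conversion to seconds happens in one place instead of being implied by a comment.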

PhilMiller · 2024-08-27T14:36:43Z

src/vt/vrt/collection/balance/greedylb/greedylb.cc

@@ -160,7 +160,8 @@ void GreedyLB::loadStats() {
  bool should_lb = false;
  this_load_begin = this_load;

-  if (avg_load > 0.0000000001) {
+  // Use an estimated load-balancing cost on average rank load to load-balance
+  if (avg_load > getCollectiveEpochCost()) {


I came here to note that I think a calculation like this should use the max load rather than the average. Otherwise, if a large-scale job has maximal imbalance on a very fine-grained problem, averaging might mask it.

@lifflander What should we use in this case?

After the discussion at the meeting: we should use the max load instead of the avg load, and move the logic for checking that into BaseLB so that it is used only by the real load balancers.
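
The masking effect described in this thread can be sketched with a small, self-contained example. This is an illustration under stated assumptions, not the BaseLB logic from the PR: `worthBalancingByAvg`, `worthBalancingByMax`, and `estimatedLbCost` are hypothetical names, loads are per-rank values in seconds, and the 100 ns cost is the value from the diff discussed earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <vector>

// Assumed cost estimate (100 ns), expressed in seconds as a double.
inline double estimatedLbCost() {
  return std::chrono::duration<double>(std::chrono::nanoseconds{100}).count();
}

// Decision based on the average rank load (the original condition).
inline bool worthBalancingByAvg(std::vector<double> const& loads) {
  assert(!loads.empty());
  double const sum = std::accumulate(loads.begin(), loads.end(), 0.0);
  double const avg = sum / static_cast<double>(loads.size());
  return avg > estimatedLbCost();
}

// Decision based on the maximum rank load (the approach agreed on above).
inline bool worthBalancingByMax(std::vector<double> const& loads) {
  assert(!loads.empty());
  double const max = *std::max_element(loads.begin(), loads.end());
  return max > estimatedLbCost();
}
```

As a usage example, take 1024 ranks where a single rank carries 50 µs of work and the rest are idle. The average is roughly 49 ns, below the 100 ns cost, so the average-based check skips load balancing, while the max-based check (50 µs) still triggers it.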

lifflander

Looks good to me!

thearusable · 2024-09-06T10:31:44Z

@lifflander PR rebased. Ready to merge.

thearusable requested review from lifflander, cz4rs, JacobDomagala, ppebay and nlslatt August 6, 2024 14:56

thearusable self-assigned this Aug 6, 2024

thearusable marked this pull request as ready for review August 6, 2024 14:56

thearusable marked this pull request as draft August 6, 2024 16:33

nlslatt previously requested changes Aug 6, 2024

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from fe210d6 to 0e0ba9c Compare August 7, 2024 11:20

thearusable changed the base branch from 2201-implement-memory-aware-temperedlb-in-vt-rebased to develop August 7, 2024 11:21

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch 3 times, most recently from f6ffd20 to b8b2fc1 Compare August 9, 2024 13:34

thearusable marked this pull request as ready for review August 9, 2024 13:40

thearusable changed the title ~~#2299: Change minimal load for triggering LB to minimal modeled object load divided by number of nodes.~~ #2299: Do not deploy LB when average load is smaller then estimated load balancing cost Aug 9, 2024

thearusable changed the title ~~#2299: Do not deploy LB when average load is smaller then estimated load balancing cost~~ #2299: Do not deploy LB when average load is smaller than estimated load balancing cost Aug 9, 2024

thearusable requested a review from nlslatt August 9, 2024 13:45

cz4rs approved these changes Aug 27, 2024

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from d0a88e1 to aae763f Compare August 27, 2024 14:13

PhilMiller reviewed Aug 27, 2024

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from dfdaf61 to e987173 Compare August 28, 2024 13:53

thearusable marked this pull request as draft August 30, 2024 12:40

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from 57667db to e02a04a Compare September 4, 2024 17:46

thearusable marked this pull request as ready for review September 4, 2024 17:47

thearusable requested a review from cz4rs September 5, 2024 17:59

lifflander approved these changes Sep 5, 2024

thearusable added 6 commits September 6, 2024 12:05

#2299: Use square root of LoadType epsilon for minimal bound

0f491cf

#2299: Switch condition for lb to minimal object load divided by number of nodes

a338ac3

#2299: Add method with hardcoded load balancing cost

215d87d

#2299: Make getCollectiveEpochCost() a protected method

51670af

#2299: Replace double with std::chrono

dad3b19

#2299: Move logic for checking max load to BaseLB

2777ffb

thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from e02a04a to 2777ffb Compare September 6, 2024 10:05

nlslatt merged commit 7f294c3 into develop Sep 6, 2024
26 checks passed
