#2299: Do not deploy LB when average load is smaller than estimated load balancing cost #2333

Merged
nlslatt merged 6 commits into develop from 2299-dont-deploy-lb-with-to-small-average-load
Sep 6, 2024

Conversation

thearusable
Contributor

Closes #2299

@thearusable thearusable self-assigned this Aug 6, 2024
@thearusable thearusable marked this pull request as ready for review August 6, 2024 14:56
@thearusable thearusable marked this pull request as draft August 6, 2024 16:33

github-actions bot commented Aug 6, 2024

Pipelines results

PR tests (gcc-12, ubuntu, mpich, verbose)

Build for fe210d6 (2024-08-06 20:51:37 UTC)

Build failed for unknown reason. Check build logs


Build log


PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, alpine, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153

==> And there is more. Read log. <==

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for 2777ffb (2024-09-06 10:05:27 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhi

==> And there is more. Read log. <==

Build log


@cz4rs
Contributor

cz4rs commented Aug 6, 2024

@thearusable invalid project name "DARMA-tasking/vt": must consist only of lowercase alphanumeric characters, hyphens, and underscores as well as start with a letter or number strikes again 🤷

@thearusable
Contributor Author

> @thearusable invalid project name "DARMA-tasking/vt": must consist only of lowercase alphanumeric characters, hyphens, and underscores as well as start with a letter or number strikes again 🤷

@cz4rs The target branch of this PR is 2201-implement-memory-aware-temperedlb-in-vt-rebased, which is over two months old. This likely explains the error.

nlslatt
nlslatt previously requested changes Aug 6, 2024
Collaborator

@nlslatt nlslatt left a comment

PR needs to target develop

@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from fe210d6 to 0e0ba9c Compare August 7, 2024 11:20
@thearusable thearusable changed the base branch from 2201-implement-memory-aware-temperedlb-in-vt-rebased to develop August 7, 2024 11:21
@JacobDomagala
Contributor

> > @thearusable invalid project name "DARMA-tasking/vt": must consist only of lowercase alphanumeric characters, hyphens, and underscores as well as start with a letter or number strikes again 🤷
>
> @cz4rs The target branch of this PR is 2201-implement-memory-aware-temperedlb-in-vt-rebased, which is over two months old. This likely explains the error.

@cz4rs @thearusable Not related to this PR but most recent (on develop) spack workflow failed with that message (https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=60702&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=1364e02f-4e1e-51cf-2821-fd252a3c0e34&l=10)

@thearusable
Contributor Author

> > > @thearusable invalid project name "DARMA-tasking/vt": must consist only of lowercase alphanumeric characters, hyphens, and underscores as well as start with a letter or number strikes again 🤷
> >
> > @cz4rs The target branch of this PR is 2201-implement-memory-aware-temperedlb-in-vt-rebased, which is over two months old. This likely explains the error.
>
> @cz4rs @thearusable Not related to this PR but most recent (on develop) spack workflow failed with that message (https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=60702&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=1364e02f-4e1e-51cf-2821-fd252a3c0e34&l=10)

@JacobDomagala I will take a look at that.

@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch 3 times, most recently from f6ffd20 to b8b2fc1 Compare August 9, 2024 13:34
@thearusable thearusable marked this pull request as ready for review August 9, 2024 13:40
@thearusable thearusable changed the title #2299: Change minimal load for triggering LB to minimal modeled object load divided by number of nodes. #2299: Do not deploy LB when average load is smaller then estimated load balancing cost Aug 9, 2024
@thearusable thearusable changed the title #2299: Do not deploy LB when average load is smaller then estimated load balancing cost #2299: Do not deploy LB when average load is smaller than estimated load balancing cost Aug 9, 2024
@thearusable thearusable requested a review from nlslatt August 9, 2024 13:45
Contributor

@cz4rs cz4rs left a comment

Looks good to me.

@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from d0a88e1 to aae763f Compare August 27, 2024 14:13
*/
double getCollectiveEpochCost() const {
// 100 ns
return 0.0000001;
Member

This ideally would use std chrono types and literals. Barring that, at least use scientific notation and specify the units.

Also, I think 100ns is on the low side, but it really doesn't matter.

Contributor Author

Ok, I changed it to std::chrono::nanoseconds.
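A minimal sketch of what the `std::chrono` suggestion above could look like, written as free functions rather than the actual member in vt. The name `getCollectiveEpochCost` is reused from the diff; `toSeconds` is a hypothetical helper, and the code that was actually merged may differ.

```cpp
#include <chrono>

// Estimated cost of one collective epoch as a typed duration.
// Assumption: a free constexpr function; the real getter is a member function.
constexpr std::chrono::nanoseconds getCollectiveEpochCost() {
  using namespace std::chrono_literals;
  return 100ns; // the unit is part of the type, so no "// 100 ns" comment is needed
}

// Hypothetical helper: convert the typed cost to seconds as a double,
// for comparison against loads that are stored as double seconds.
constexpr double toSeconds(std::chrono::nanoseconds d) {
  return std::chrono::duration<double>(d).count();
}
```

With this shape, a comparison such as `load > toSeconds(getCollectiveEpochCost())` keeps the unit conversion in one place instead of relying on a bare literal like `0.0000001`.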

@@ -160,7 +160,8 @@ void GreedyLB::loadStats() {
   bool should_lb = false;
   this_load_begin = this_load;
 
-  if (avg_load > 0.0000000001) {
+  // Use an estimated load-balancing cost on average rank load to load-balance
+  if (avg_load > getCollectiveEpochCost()) {
Member

I came here to note that I think a calculation like this should use the max load rather than the average. Otherwise, if a large-scale job has maximal imbalance on a very fine-grained problem, averaging might mask it.

Contributor Author

@lifflander What should we use in this case?

Contributor Author

After the discussion at the meeting: we should use the max load instead of the average load, and move that check into BaseLB so that it is used only by the real load balancers.
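To illustrate why averaging can mask imbalance, here is a small worked sketch under assumed numbers: 1000 ranks, one of which carries all of the load (50 µs), compared against the 100 ns cost estimate discussed above. The helper `exceedsCost` and all constants are hypothetical and are not part of vt.

```cpp
// Hypothetical helper, not vt's API: load balancing is considered worthwhile
// only if the given load (in seconds) exceeds the estimated LB cost (in seconds).
constexpr bool exceedsCost(double load_sec, double cost_sec) {
  return load_sec > cost_sec;
}

// Assumed scenario: maximal imbalance on a very fine-grained problem.
constexpr double kCostSec = 1e-7;        // 100 ns estimated LB cost
constexpr int kNumRanks = 1000;          // assumed job size
constexpr double kHotRankLoadSec = 5e-5; // one rank holds all the load; the rest are idle

constexpr double kAvgLoadSec = kHotRankLoadSec / kNumRanks; // 5e-8 s, below the cost
constexpr double kMaxLoadSec = kHotRankLoadSec;             // 5e-5 s, well above the cost

// The average-based check skips LB even though one rank is badly overloaded,
// while the max-based check triggers it.
constexpr bool kAvgCheckRunsLB = exceedsCost(kAvgLoadSec, kCostSec);
constexpr bool kMaxCheckRunsLB = exceedsCost(kMaxLoadSec, kCostSec);
```

In this scenario the average load is diluted by the 999 idle ranks, so only the max-based comparison reflects the imbalance that load balancing could fix.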

@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from dfdaf61 to e987173 Compare August 28, 2024 13:53
@thearusable thearusable marked this pull request as draft August 30, 2024 12:40
@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from 57667db to e02a04a Compare September 4, 2024 17:46
@thearusable thearusable marked this pull request as ready for review September 4, 2024 17:47
@thearusable thearusable requested a review from cz4rs September 5, 2024 17:59
Collaborator

@lifflander lifflander left a comment

Looks good to me!

@thearusable thearusable force-pushed the 2299-dont-deploy-lb-with-to-small-average-load branch from e02a04a to 2777ffb Compare September 6, 2024 10:05
@thearusable
Contributor Author

@lifflander PR rebased. Ready to merge.

@nlslatt nlslatt merged commit 7f294c3 into develop Sep 6, 2024
26 checks passed
Development

Successfully merging this pull request may close these issues.

Do not deploy LB when average load is too small
6 participants