#1934: Add parameter to control minimal retention of historical LB data #1996

Open
wants to merge 82 commits into
base: develop

Conversation

thearusable
Contributor

Closes #1934

@thearusable thearusable requested a review from cz4rs October 17, 2022 12:59
@github-actions

github-actions bot commented Oct 17, 2022

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for a8ff97d (2024-01-30 17:22:21 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-3.9, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-5, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-6, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 4e93971 (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-5.0, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 4e93971 (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for dec4a3e (2023-04-04 10:47:19 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 10.1, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, alpine, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich)

Build for 0b5ee2e (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich)

Build for a8ff97d (2024-01-30 17:22:21 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, json schema test)

Build for 4e93971 (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhi

==> And there is more. Read log. <==

Build log


PR tests (nvidia cuda 11.2, ubuntu, mpich)

Build for dec4a3e (2023-04-04 10:47:19 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153

==> And there is more. Read log. <==

Build log


PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 4be442c (2024-06-12 13:05:01 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich, verbose)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich, verbose)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for 0b2489f (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log


@thearusable thearusable self-assigned this Oct 24, 2022
@thearusable thearusable force-pushed the 1934-add-config-parameter-for-lb-data-retention branch from 84321c3 to 1fa5810 Compare October 25, 2022 13:50
@thearusable thearusable marked this pull request as ready for review October 25, 2022 14:08
@codecov

codecov bot commented Oct 25, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.93%. Comparing base (7be56f2) to head (d074668).
Report is 495 commits behind head on develop.

Current head d074668 differs from pull request most recent head c6d265f

Please upload reports for the commit c6d265f to get more accurate results.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1996      +/-   ##
===========================================
- Coverage    85.48%   84.93%   -0.55%     
===========================================
  Files          722      723       +1     
  Lines        25907    25662     -245     
===========================================
- Hits         22146    21796     -350     
- Misses        3761     3866     +105     
Files Coverage Δ
src/vt/configs/arguments/app_config.h 100.00% <ø> (ø)
src/vt/configs/arguments/args.cc 94.57% <100.00%> (ø)
.../vt/vrt/collection/balance/lb_invoke/lb_manager.cc 80.96% <100.00%> (+0.05%) ⬆️
...c/vt/vrt/collection/balance/lb_invoke/lb_manager.h 100.00% <100.00%> (ø)
src/vt/vrt/collection/balance/node_lb_data.cc 84.45% <100.00%> (+0.10%) ⬆️
src/vt/vrt/collection/balance/node_lb_data.h 100.00% <100.00%> (ø)
tests/unit/collection/test_lb_data_retention.cc 100.00% <100.00%> (ø)

... and 271 files with indirect coverage changes

@PhilMiller
Member

Other than the minor interface thing I noted, I'd approve this.

@thearusable thearusable force-pushed the 1934-add-config-parameter-for-lb-data-retention branch from 181ce43 to 1422bbc Compare October 27, 2022 13:32
cz4rs
cz4rs previously approved these changes Oct 27, 2022
Contributor

@cz4rs cz4rs left a comment

Looks good to me.

Member

@PhilMiller PhilMiller left a comment

This looks ready to go.

Could you think through whether there are any weird concerns if a new model gets set and that changes the number of phases retained? In particular, I can think of the possibility that we keep some stale data forever if a new model demands fewer phases than its predecessor.

@thearusable
Contributor Author

@PhilMiller Sorry for the delay, I missed your message.

From what I understand, the scenario you described is the only one that could produce odd behavior. LB data is cleaned after each phase, so cleanup happens frequently; if more data is stored than the model needs, it will not be retained for long.
The only way to retain data that is no longer needed is to use the configuration LBType::NoLB: in that case the LB does not run, so the data is never removed.

@thearusable thearusable force-pushed the 1934-add-config-parameter-for-lb-data-retention branch from 1422bbc to 0b5ee2e Compare November 11, 2022 13:45
Member

@PhilMiller PhilMiller left a comment

See my latest comment about covering the edge case

@PhilMiller
Member

The reason I see the potential for retention is that the cleaning removes data from a particular past phase at a fixed offset from the current phase. If the model's requested retention falls, then there will be a range of phases whose data gets skipped over.
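
The skipped range described above can be reproduced with a small standalone sketch. This is not vt's code: `PhaseData`, `cleanupFixedOffset`, `cleanupRange`, and `simulate` are hypothetical names, and the retention values (5 phases, dropping to 2 at phase 6) are made up for illustration. It only contrasts erasing one phase at a fixed offset with erasing everything older than the retention window.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for per-phase LB data; not vt's actual container.
using PhaseData = std::map<std::size_t, std::string>;

// Fixed-offset cleanup: erases only the single phase that just fell out of
// the retention window (current - retain).
inline void cleanupFixedOffset(
  PhaseData& data, std::size_t current, std::size_t retain
) {
  if (current >= retain) {
    data.erase(current - retain);
  }
}

// Range cleanup: erases every phase older than the retention window, so a
// shrinking window cannot leave entries behind.
inline void cleanupRange(
  PhaseData& data, std::size_t current, std::size_t retain
) {
  if (current >= retain) {
    // Keys strictly below (current - retain + 1) are outside the window.
    data.erase(data.begin(), data.lower_bound(current - retain + 1));
  }
}

// Runs phases 0..9, lowering the requested retention from 5 to 2 at phase 6
// (an assumed model change), and returns how many phases remain stored.
template <typename Cleanup>
std::size_t simulate(Cleanup cleanup) {
  PhaseData data;
  for (std::size_t phase = 0; phase < 10; ++phase) {
    data[phase] = "lb data";
    std::size_t const retain = phase < 6 ? 5 : 2;
    cleanup(data, phase, retain);
  }
  return data.size();
}
```

Under these assumptions, `simulate(cleanupFixedOffset)` ends with five stored phases (1, 2, 3, 8, 9): phases 1 to 3 are never visited again once the offset shrinks. `simulate(cleanupRange)` ends with only phases 8 and 9.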

@thearusable thearusable force-pushed the 1934-add-config-parameter-for-lb-data-retention branch from 0b5ee2e to eee6783 Compare December 6, 2022 10:38
…the containers rather than clearing their contents.
@thearusable thearusable force-pushed the 1934-add-config-parameter-for-lb-data-retention branch from 87f126c to ddc3c57 Compare September 23, 2024 20:15
proxy_new.broadcastCollective<TestCol::colHandler>();
});
// Go to the next phase.
vt::thePhase()->nextPhaseCollective();
Contributor Author

If we reset the current phase to 0 in restoreFromFileInPlace, this line causes vt to abort, because the phase in LBData no longer matches the phase in the stats message:

vtAssert(lb_data.getPhase() == msg->getPhase(), "Phases must match");

@lifflander / @nlslatt - Do we need to reset the current phase to 0 in restoreFromFileInPlace?

Contributor Author

Based on my recent testing, the resetPhase method appears to be essential for VT to function properly.
My previous tests succeeded only because I did not fully recreate VT, so PhaseManager, NodeLBData, and other VT modules were preserved.
When we create a fresh instance and restore the collection from the checkpoint, every module starts executing from phase 0.

As a result, restoring the collection from the checkpoint retains only one phase of history, even if multiple phases were serialized: the current phase is reset to 0, which makes the other phases inaccessible.
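
One reading of the two comments above, as a toy model: the restored LB data carries a phase, the runtime has its own current phase, and the "Phases must match" check compares the two. `ToyLBData`, `restore`, and `phasesMatch` are hypothetical names that do not exist in vt, and the checkpoint phase 3 in the usage note is arbitrary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the phase carried by restored LB data.
struct ToyLBData {
  std::size_t phase;
};

// Models restoring from a checkpoint taken at `checkpoint_phase`.
// With `reset_phase` set, the restored data restarts at phase 0.
inline ToyLBData restore(std::size_t checkpoint_phase, bool reset_phase) {
  return ToyLBData{reset_phase ? std::size_t{0} : checkpoint_phase};
}

// Models the condition checked by the "Phases must match" assertion.
inline bool phasesMatch(ToyLBData const& lb, std::size_t runtime_phase) {
  return lb.phase == runtime_phase;
}
```

In this model, a fresh runtime (current phase 0) matches only when the phase is reset: `phasesMatch(restore(3, true), 0)` holds and `phasesMatch(restore(3, false), 0)` does not. A runtime that was preserved at phase 3 shows the opposite: `phasesMatch(restore(3, true), 3)` fails, which is consistent with the earlier tests passing without the reset.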

@thearusable thearusable requested a review from nlslatt September 26, 2024 18:00
Development

Successfully merging this pull request may close these issues.

Add configuration for minimum phase of historical LB data to retain
5 participants