Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1835: fix remaining issues in RawData::getModeledComm #1842

Merged
merged 8 commits into from
Jun 21, 2022

Conversation

cz4rs
Copy link
Contributor

@cz4rs cz4rs commented Jun 6, 2022

fixes #1835
fixes #1836
fixes #1837

  • fix subphases handling
  • add per message and per byte weights
  • move the implementation into a separate load model

Note: these were tiny issues (created per comment), so it's convenient to fix them with a single PR.

TODO:

  • add unit test for the new model (see test_model_comm_overhead.nompi.cc)
  • discussion: should we avoid adding a new model and simply replace the logic in CommOverhead? (no)

@github-actions
Copy link

github-actions bot commented Jun 6, 2022

Pipelines results

PR tests (gcc-6, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-5, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-3.9, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-5.0, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 10.1, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, alpine, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for 05cc300

Compilation - successful

Testing - passed

Build log


@codecov
Copy link

codecov bot commented Jun 6, 2022

Codecov Report

Merging #1842 (05cc300) into develop (a525788) will increase coverage by 0.05%.
The diff coverage is 90.54%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1842      +/-   ##
===========================================
+ Coverage    84.33%   84.39%   +0.05%     
===========================================
  Files          758      760       +2     
  Lines        26715    26776      +61     
===========================================
+ Hits         22531    22598      +67     
+ Misses        4184     4178       -6     
Impacted Files Coverage Δ
src/vt/vrt/collection/balance/model/raw_data.cc 100.00% <ø> (+28.88%) ⬆️
src/vt/vrt/collection/balance/model/raw_data.h 100.00% <ø> (ø)
.../vrt/collection/balance/model/weighted_messages.cc 84.21% <84.21%> (ø)
...t/collection/test_model_weighted_messages.nompi.cc 91.66% <91.66%> (ø)
...t/vrt/collection/balance/model/weighted_messages.h 100.00% <100.00%> (ø)
src/vt/scheduler/base_unit.cc 87.50% <0.00%> (-1.39%) ⬇️
src/vt/utils/mpi_limits/mpi_max_tag.cc 92.85% <0.00%> (-0.48%) ⬇️
src/vt/runnable/runnable.cc 70.73% <0.00%> (-0.36%) ⬇️
src/vt/pool/pool.cc 87.50% <0.00%> (-0.13%) ⬇️
...it/collection/test_model_select_subphases.nompi.cc 93.44% <0.00%> (-0.11%) ⬇️
... and 11 more

// observer pointer to the underlying comm data
std::unordered_map<PhaseType, CommMapType> const* proc_comm_;

TimeType per_msg_weight_ = 0.001;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up for discussion what the default values should be.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppebay Does LBAF use a per-message weight of zero (meaning it only considers bytes, not the number of messages)?

Copy link
Contributor Author

@cz4rs cz4rs Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going with following defaults:

 per_msg_weight_  = 0.0;
 per_byte_weight_ = 1.0;

to make the behavior consistent with current LBAF implementation. This means TemperedWMin's beta parameter is acting as per byte coefficient.


namespace vt { namespace vrt { namespace collection { namespace balance {

struct CommModel : public ComposedModel {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: how should this be called?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something reflecting that the model is an affine combination of the number of messages and their size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with WeightedMessages.

@cz4rs cz4rs force-pushed the 1835-subphases-handling-modeled-comm branch 4 times, most recently from 9283f7b to b34f45e Compare June 8, 2022 15:27
@cz4rs cz4rs marked this pull request as ready for review June 8, 2022 15:47
@cz4rs cz4rs force-pushed the 1835-subphases-handling-modeled-comm branch from b34f45e to 25d4c28 Compare June 15, 2022 16:13
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. It could be made a little more rigorous by making the different subphases not all have the same load but I'm okay with it as is. I think there are some long lines that should be wrapped.

@cz4rs cz4rs force-pushed the 1835-subphases-handling-modeled-comm branch from 781c7dc to 6d6be22 Compare June 20, 2022 09:00
@cz4rs cz4rs force-pushed the 1835-subphases-handling-modeled-comm branch from 6d6be22 to 05cc300 Compare June 20, 2022 09:03
@cz4rs
Copy link
Contributor Author

cz4rs commented Jun 20, 2022

This looks pretty good. It could be made a little more rigorous by making the different subphases not all have the same load but I'm okay with it as is. I think there are some long lines that should be wrapped.

Rebased on top of develop + minor formatting fixes (05cc300).

@nlslatt nlslatt merged commit 7d7013f into develop Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants