Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#991: LoadModel: Add API to calculate how many phases of past load data need to be kept #995

Merged
merged 13 commits into from
Sep 1, 2020

Conversation

PhilMiller
Copy link
Member

@PhilMiller PhilMiller commented Aug 18, 2020

Fixes: #991

@PhilMiller
Copy link
Member Author

I haven't figured out where to integrate this with the rest of the LB instrumentation and invocation infrastructure yet.

@PhilMiller PhilMiller requested a review from nlslatt August 18, 2020 22:47
@codecov
Copy link

codecov bot commented Aug 18, 2020

Codecov Report

Merging #995 into develop will increase coverage by 0.02%.
The diff coverage is 85.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #995      +/-   ##
===========================================
+ Coverage    77.62%   77.65%   +0.02%     
===========================================
  Files          666      667       +1     
  Lines        25552    25613      +61     
===========================================
+ Hits         19834    19889      +55     
- Misses        5718     5724       +6     
Impacted Files Coverage Δ
...c/vt/vrt/collection/balance/model/composed_model.h 100.00% <ø> (ø)
...rc/vt/vrt/collection/balance/model/linear_model.cc 80.00% <0.00%> (-12.31%) ⬇️
src/vt/vrt/collection/balance/model/linear_model.h 100.00% <ø> (ø)
src/vt/vrt/collection/balance/model/load_model.h 100.00% <ø> (ø)
...t/vrt/collection/balance/model/naive_persistence.h 100.00% <ø> (ø)
...c/vt/vrt/collection/balance/model/per_collection.h 100.00% <ø> (ø)
...lection/balance/model/persistence_median_last_n.cc 84.21% <0.00%> (-9.91%) ⬇️
...llection/balance/model/persistence_median_last_n.h 100.00% <ø> (ø)
src/vt/vrt/collection/balance/model/raw_data.h 100.00% <ø> (+14.28%) ⬆️
src/vt/vrt/collection/balance/node_stats.h 100.00% <ø> (ø)
... and 16 more

Copy link
Collaborator

@lifflander lifflander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the part of this that actually calls getNumPastPhasesNeeded from NodeStats. I assume that's coming.

@lifflander
Copy link
Collaborator

@PhilMiller A couple notes:

  • NodeStats::clearStats does not clear subphase data
  • I think we need to also remove the history from ElemStats. It also contains a redundant copy that completely unnecessary.

@PhilMiller
Copy link
Member Author

So, ElemStats is going to migrate with each object, while NodeStats records whatever was in place in a given phase. If we're concerned about the redundancy, NodeStats would seem to be the natural choice to flatten into an aggregated accessor for the individual ElemStats

@PhilMiller
Copy link
Member Author

And yeah, the omission in clearStats was a mistake

* \return How many phases of past load statistics will be needed to
* satisfy the requested history
*/
virtual int getNumPastPhasesNeeded(int look_back = 0) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force caller to choose value?

Has "How many phases into the past.." yet most usages are max(other, look_back). Or perhaps "..the minimum number of phases.." or so?

Copy link
Contributor

@pnstickne pnstickne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful to have an overall test capture.

@PhilMiller PhilMiller force-pushed the 991-limit-nodestats-by-model-needs branch from 096fcbd to 8fdf45b Compare August 28, 2020 20:58
@PhilMiller PhilMiller marked this pull request as draft August 28, 2020 21:06
@PhilMiller
Copy link
Member Author

I'd like to see #1001 merged before this. Its testing was important in getting some of this right, and it carries some bug fixes of its own.

@PhilMiller
Copy link
Member Author

Right now, this still only addresses NodeStats and not ElemStats. Half the leakage handled, at least.

@PhilMiller PhilMiller force-pushed the 991-limit-nodestats-by-model-needs branch from 1169bd7 to 0e91408 Compare August 31, 2020 17:37
@PhilMiller
Copy link
Member Author

There's a limitation this change will impose on load balancing usage. The LB interval can't be shorter than the minimum look-back range that the models define, or the NodeStats structure on a process p won't have all necessary data for any objects that have immigrated to p in the last LB run

@PhilMiller PhilMiller force-pushed the 991-limit-nodestats-by-model-needs branch from 20c9288 to 72e0bfb Compare August 31, 2020 19:26
@PhilMiller PhilMiller marked this pull request as ready for review August 31, 2020 21:26
Copy link
Contributor

@JacobDomagala JacobDomagala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@PhilMiller
Copy link
Member Author

There's a limitation this change will impose on load balancing usage. The LB interval can't be shorter than the minimum look-back range that the models define, or the NodeStats structure on a process p won't have all necessary data for any objects that have immigrated to p in the last LB run

This is not a problem for the default configuration, using NaivePersistence, because its look-back range of 1 will always be satisfied.

Copy link
Collaborator

@lifflander lifflander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks like a definite improvement to me.

The only thing lacking is incremental LB stat file output so the stats aren't lost from clearing them outside of the live window.

@PhilMiller PhilMiller force-pushed the 991-limit-nodestats-by-model-needs branch from e4c1f97 to 7b2aa6b Compare August 31, 2020 23:00
Copy link
Collaborator

Codacy Here is an overview of what got changed by this pull request:

Clones added
============
- tests/unit/collection/test_model_raw_data.nompi.cc  3
         

Clones removed
==============
+ tests/unit/collection/test_model_persistence_median_last_n.nompi.cc  -1
+ tests/unit/collection/test_model_linear_model.nompi.cc  -1
         

See the complete overview on Codacy

@PhilMiller PhilMiller merged commit 2fc7655 into develop Sep 1, 2020
@PhilMiller PhilMiller deleted the 991-limit-nodestats-by-model-needs branch September 1, 2020 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Limit unbounded growing memory usage of LB stats
4 participants