Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ElementStats unordered_map capacity keeps growing #1304

Closed
lifflander opened this issue Mar 5, 2021 · 8 comments
Closed

ElementStats unordered_map capacity keeps growing #1304

lifflander opened this issue Mar 5, 2021 · 8 comments
Assignees

Comments

@lifflander
Copy link
Collaborator

What Needs to be Done?

Here's the temporary solution to this problem that stops memory growth from occurring in bvh. I'm don't think this is a good solution to the problem---just a hack to get bvh working.

Maybe we should have a serialize overload that doesn't save the capacity?

diff --git a/src/vt/vrt/collection/balance/elm_stats.cc b/src/vt/vrt/collection/balance/elm_stats.cc
index 96ff0181c..f99663346 100644
--- a/src/vt/vrt/collection/balance/elm_stats.cc
+++ b/src/vt/vrt/collection/balance/elm_stats.cc
@@ -220,6 +220,11 @@ void ElementStats::releaseStatsFromUnneededPhases(PhaseType phase, unsigned int
     phase_comm_.erase(phase - look_back);
     subphase_comm_.erase(phase - look_back);
   }
+
+  phase_timings_.rehash(0);
+  subphase_timings_.rehash(0);
+  phase_comm_.rehash(0);
+  subphase_comm_.rehash(0);
 }

@cz4rs @PhilMiller @nmm0

@lifflander
Copy link
Collaborator Author

Here are the heaptrack images without the patch. With the patch above in place the memory usage is constant around 10 MiB total.

Screen Shot 2021-03-05 at 1 15 21 PM

Screen Shot 2021-03-05 at 1 15 33 PM

@cz4rs
Copy link
Contributor

cz4rs commented Mar 6, 2021

Seeing rehash(0) "fixing" the issue, I would expect that the bucket count is increasing uncontrollably during serdes - but I couldn't produce such behavior with unit tests (in checkpoint). Both memory consumption and bucket_count remain stable when doing insertion / removal + serialization / deserialization.
No proper fix so far, I'll continue the work on this on Monday.

@lifflander
Copy link
Collaborator Author

Sounds good---no rush

@PhilMiller
Copy link
Member

Given the findings in #1302, we should probably just leave off the rehash functionality in general in serialize(unordered_map) for now. The Intel ICE is coming from template code instantiated by that call.

@cz4rs
Copy link
Contributor

cz4rs commented Mar 8, 2021

Issue created and PR posted merged - this removes the capacity preserving code (until a solution that fixes the memory growth and doesn't crash icc is found).

@lifflander
Copy link
Collaborator Author

@cz4rs @PhilMiller Anyone opposed to closing this now?

@cz4rs
Copy link
Contributor

cz4rs commented Mar 8, 2021

@cz4rs @PhilMiller Anyone opposed to closing this now?

Fine with me 👍 Should I create an issue in checkpoint to re-implement capacity handling eventually? (there is also #1302 which could be used to keep track)

@lifflander
Copy link
Collaborator Author

@cz4rs @PhilMiller Anyone opposed to closing this now?

Fine with me 👍 Should I create an issue in checkpoint to re-implement capacity handling eventually? (there is also #1302 which could be used to keep track)

Yes, I think that's the best way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants