
stats: symbolize strings in HeapStatData and ThreadLocalStore #4281

Closed
wants to merge 9 commits into from

Conversation

@ambuc (Contributor) commented Aug 28, 2018

Description: On the heels of #3927, and in pursuit of lower heap consumption for Envoy (#3585), this implementation uses the new Stats::SymbolTable library to symbolize std::string instances within HeapStatData and ThreadLocalStore.
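For intuition, here is a minimal sketch of the interning idea (hypothetical class and member names; the real interface is the Stats::SymbolTable library referenced above): repeated '.'-delimited tokens of a stat name are stored once, and each stat keeps a short vector of symbols rather than a full std::string.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch (not the Stats::SymbolTable interface itself): each
// '.'-delimited token of a stat name is interned once, so a name like
// "cluster.foo.upstream_rq_total" is held as three small Symbol values
// instead of one heap-allocated std::string per stat.
using Symbol = uint32_t;

class SimpleSymbolTable {
public:
  std::vector<Symbol> encode(const std::string& name) {
    std::vector<Symbol> out;
    std::istringstream stream(name);
    std::string token;
    while (std::getline(stream, token, '.')) {
      auto it = encode_map_.find(token);
      if (it == encode_map_.end()) {
        it = encode_map_.emplace(token, next_symbol_).first;
        decode_map_.emplace(next_symbol_, token);
        ++next_symbol_;
      }
      out.push_back(it->second);
    }
    return out;
  }

  std::string decode(const std::vector<Symbol>& symbols) const {
    std::string out;
    for (size_t i = 0; i < symbols.size(); ++i) {
      if (i > 0) {
        out += '.';
      }
      out += decode_map_.at(symbols[i]);
    }
    return out;
  }

private:
  std::unordered_map<std::string, Symbol> encode_map_;
  std::unordered_map<Symbol, std::string> decode_map_;
  Symbol next_symbol_ = 0;
};
```

In this toy version, encoding "cluster.foo.upstream_rq_total" and "cluster.bar.upstream_rq_total" shares the "cluster" and "upstream_rq_total" tokens, which is where the heap savings in the table below come from.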

Preliminary heap allocation improvements here:

| clusters (#) | short names: current impl, allocated on heap (b) | short names: w/ symbol table, allocated on heap (b) | long names: current impl, allocated on heap (b) | long names: w/ symbol table, allocated on heap (b) | improvement, short (%) | improvement, long (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 5242880 | 5242880 | 5242880 | 5242880 | 0.00% | -0.00% |
| 100 | 13631488 | 13631488 | 15728640 | 14680064 | 0.00% | -6.67% |
| 1000 | 95420416 | 94371840 | 112197632 | 103809024 | -1.10% | -7.48% |
| 10000 | 897556480 | 881827840 | 1064288256 | 974110720 | -1.75% | -8.47% |

NB: For these results, build with -c opt and then query /stats for server.memory_heap_size. A PR adding a query for this value that does not require generating all stats is here: #4361. All runs are against envoy-static with --disable-hot-restart. "short" cluster names are ~10 chars long, and "long" cluster names are ~60 chars long.

Risk Level: Medium -- this affects only non-hot-restart binaries, so there won't be any issues with restarting against a different internal memory representation. But any PR that changes internal memory representation carries some risk.
Testing: Since this is a drop-in replacement for HeapStatData / ThreadLocalStore, there are no new tests for these classes.
Docs Changes: N/A
Release Notes: N/A, no user-facing changes.
Fixes #3585

@dnoe (Contributor) commented Aug 28, 2018

@mrice32 can you take an initial pass at this? Thanks

@ambuc (Author) commented Aug 28, 2018

Looks like asan/tsan caught a mutex issue inside thread_local_store.cc -- fixing now.

@mrice32 (Member) left a comment

Nice work! Some initial comments.

Review comments were left on:

- source/common/common/hash.h
- source/common/stats/heap_stat_data.h
- source/common/stats/symbol_table_impl.h
- source/common/stats/thread_local_store.cc
Signed-off-by: James Buckland <[email protected]>
};

using StatNamePtr = std::unique_ptr<StatName>;

struct StatNameHash_ {
A reviewer (Contributor) commented:

Why the trailing underscore here? And on StatNameCompare_?

@ambuc (Author) replied:

I must have thought it was idiomatic for some reason. Will change.
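For reference, the rename in question might look something like this (illustrative stub types only, not the final code in this PR):

```cpp
#include <cstddef>

// Minimal stand-in for the real StatName class (the actual one lives in
// symbol_table_impl.h); shown only so the functors below are self-contained.
class StatName {
public:
  size_t hash() const { return 0; /* placeholder */ }
  bool operator==(const StatName&) const { return true; /* placeholder */ }
};

// Illustrative renaming discussed above: same functors, no trailing underscore.
struct StatNameHash {
  size_t operator()(const StatName* a) const { return a->hash(); }
};

struct StatNameCompare {
  bool operator()(const StatName* a, const StatName* b) const { return *a == *b; }
};
```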

This has the benefit of avoiding a dynamic_cast inside operator==()s.

Signed-off-by: James Buckland <[email protected]>
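Roughly the pattern that commit message refers to, as a hedged sketch with made-up member names: defining operator== on the concrete StatName type so equality is a plain data comparison and no dynamic_cast through a base interface is needed.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch only (member names made up): a concrete StatName that owns its
// encoded symbols. Because operator== is defined on the concrete type,
// equality is a plain vector comparison, with no dynamic_cast needed to
// recover the derived type from an interface pointer.
class StatName {
public:
  explicit StatName(std::vector<uint32_t> symbols) : symbols_(std::move(symbols)) {}

  bool operator==(const StatName& other) const { return symbols_ == other.symbols_; }

private:
  std::vector<uint32_t> symbols_;
};
```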
@ambuc (Author) commented Aug 30, 2018

@ all, I'm continuing to profile this for both heap allocation and time performance. This won't be ready to review (let alone merge) until I'm confident it doesn't trade memory for slowness.

@@ -104,21 +104,21 @@ class SymbolTable {

   // Stores the symbol to be used at next insertion. This should exist ahead of insertion time so
   // that if insertion succeeds, the value written is the correct one.
-  Symbol next_symbol_ = 0;
+  Symbol next_symbol_ = 0 GUARDED_BY(lock_);
@ambuc (Author) commented:

Symbol next_symbol_(0) GUARDED_BY(lock_) ?

@ambuc (Author) replied:

I'm going to make this explicit in a constructor, easier than fighting a macro.
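A sketch of that option, assuming the member and lock names from the diff above (the class name here is made up, and ABSL_GUARDED_BY stands in for Envoy's GUARDED_BY macro):

```cpp
#include <cstdint>

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

// Sketch of the option chosen in the comment above: initialize next_symbol_
// in the constructor so the member declaration only carries the thread-safety
// annotation, rather than combining an in-class initializer with the macro.
class SymbolTableSketch {
public:
  SymbolTableSketch() : next_symbol_(0) {}

private:
  using Symbol = uint32_t;

  absl::Mutex lock_;
  // Stores the symbol to be used at the next insertion.
  Symbol next_symbol_ ABSL_GUARDED_BY(lock_);
};
```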

// the destructors of the individual stat objects, which are not protected by locks.
// A locally held symbol table which encodes stat names as StatNamePtrs and decodes StatNamePtrs
// back into strings.
SymbolTable table_;
A reviewer (Contributor) commented:

Add commentary here that table_ itself is not protected by mutex_ and guarantees thread safety on its own.
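Something along these lines, with illustrative wording (not the comment text that was actually added):

```cpp
// A locally held symbol table which encodes stat names as StatNamePtrs and
// decodes StatNamePtrs back into strings. Note that table_ is deliberately NOT
// guarded by mutex_: SymbolTable performs its own internal locking, so it is
// safe to use from the destructors of individual stat objects, which run
// without holding mutex_.
SymbolTable table_;
```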

return (a->key() == b->key());
}
struct HeapStatCompare {
bool operator()(const HeapStatData* a, const HeapStatData* b) const { return (*a == *b); }
A reviewer (Contributor) commented:

nit: don't need parens around *a == *b

struct StatNamePtrCompare {
bool operator()(const StatName* a, const StatName* b) const {
// This extracts the underlying statnames.
return (*a == *b);
A reviewer (Contributor) commented:

nit: extra parens

struct StatNameUniquePtrCompare {
bool operator()(const StatNamePtr& a, const StatNamePtr& b) const {
// This extracts the underlying statnames.
return (*a == *b);
A reviewer (Contributor) commented:

ditto

@PiotrSikora (Contributor) commented:

Following up on the question I asked in #4190 -- how exactly are you measuring memory usage, and what concurrency setting are you using in your tests? The 70x jump in memory usage cannot be attributed solely to the increase in the number of clusters.

@ambuc (Author) commented Sep 5, 2018

@PiotrSikora I'm running bazel build //source/exe:envoy-static -c opt and then bazel-bin/source/exe/envoy-static --disable-hot-restart --config-path <path>, where <path> points to a file like https://gist.github.com/ambuc/50181fa5f64485041696730c67e9f29f, except that I have variants containing 10, 100, 1000, and 10000 clusters, with both short ("cluster" n) and long ("cluster.abcdefghijkl.abcdefghijklm.abcdefghijklmn.cluster" n) names. I'm not passing any explicit concurrency arguments when running these tests.

@PiotrSikora (Contributor) commented:

@ambuc but how are you checking memory usage? Using RSS value from the ps ux output or using server.memory_allocated (i.e. curl -s 127.0.0.1:9901/stats | grep server.memory_allocated)?

As for the concurrency, you can check it via curl -s 127.0.0.1:9901/stats | grep server.concurrency, which should shed some light on your numbers.

@jmarantz (Contributor) commented Sep 5, 2018

Per Piotr's comment, can you edit the description in this PR to include a pointer to the "how to run memory profiling" thing that you wrote, right above or below the table?

@jmarantz (Contributor) commented Sep 5, 2018

Also you pinged me a table showing the CPU consumption impact of this PR. Can you add that to the description as well?

@ambuc (Author) commented Sep 5, 2018

@PiotrSikora Ah, I see. I'm not doing either of those -- I was compiling with tcmalloc (as in https://github.com/envoyproxy/envoy/blob/master/bazel/PPROF.md) and using the high-water mark that pprof reported. I'll run this again with server.memory_allocated.

@ambuc (Author) commented Sep 5, 2018

@PiotrSikora I'm not convinced that using server.memory_allocated is a particularly good way of inspecting memory usage, since querying it actively consumes memory that was not in use before the call. Check out https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L524-L535 for a by-inspection proof of this -- for large configs, the creation of all_stats requires a nontrivial amount of storage. I will perform this analysis via ps ux instead.

@ all, is this a "bug"? Should I open an issue to fix this behavior? It would be nice for a memory-monitoring feature not to significantly change the memory usage it reports on.

@PiotrSikora (Contributor) commented Sep 5, 2018

@ambuc yeah, I'm aware of this (and I had a patch that emits memory stats to logs upon receiving SIGUSR2), but the memory overhead of generating /stats is negligible in the grand scheme of things, and is much lower than the overhead of tcmalloc free lists, so it was still much more accurate in my tests than RSS from ps ux.

@jmarantz (Contributor) commented Sep 6, 2018

I suspect that in the context of this benchmark which generates large numbers of long-named clusters, the overhead of generating all /stats is likely to be significant, as it makes a map of fully elaborated strings for all stat names, thus defeating the purpose of storing them as a vector of shared symbol references in steady-state.

I think if we add a query-param to /stats to just emit a single stat without elaborating all of them into a map, the results would more accurately reflect the value of what James has done.
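A hedged sketch of that idea -- the handler shape and names here are hypothetical and are not the interface from #4361: accept a filter and materialize one elaborated name at a time instead of building a map of every stat name up front.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: each handle elaborates its symbolized name only when
// asked, so emitting a single filtered stat never holds a map of every
// fully-elaborated stat name in memory at once.
struct StatHandle {
  std::function<std::string()> name;   // decodes the symbolized name on demand
  std::function<uint64_t()> value;
};

std::string statsOutput(const std::vector<StatHandle>& stats, const std::string& filter) {
  std::string out;
  for (const StatHandle& stat : stats) {
    const std::string name = stat.name();  // one name materialized at a time
    if (filter.empty() || name == filter) {
      out += name + ": " + std::to_string(stat.value()) + "\n";
    }
  }
  return out;
}
```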

@ambuc (Author) commented Sep 6, 2018

@jmarantz @PiotrSikora I have a PR here, #4361, which lets you inspect Memory::Stats::totalCurrentlyAllocated() and Memory::Stats::totalCurrentlyReserved() without needing to generate all stats. Using that patch, I've updated the table at the top of the page -- it looks like an 8% improvement in the worst-case scenario, and it is never worse.

(Also, all these trials are run with concurrency 12.)

@stale (bot) commented Sep 13, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

The stale bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Sep 13, 2018.
@ambuc (Author) commented Sep 19, 2018

Per discussion in #4196, I am going to wait on this change, which might lock us into a stat implementation unnecessarily. Closing for now.

@ambuc ambuc closed this Sep 19, 2018
jmarantz added a commit to jmarantz/envoy that referenced this pull request Sep 23, 2018
Signed-off-by: Joshua Marantz <[email protected]>
Labels: stale (stalebot believes this issue/PR has not been touched recently)
Projects: None yet
5 participants