-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: integrate real symbol table into stats system #4980
stats: integrate real symbol table into stats system #4980
Conversation
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
@ambuc FYI |
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Got some encouraging perf results to motivate this large complicated scary PR :) Memory usage is cut by 50% in HeapStatsThreadLocalStoreTest.Memory*:
Speed tests also look like they doubled as well...but will confirm after a couple of corroborating runs. |
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Siege results (with a modified siege.py to include contention) tell an interesting story. I was expecting a perf improvement, but it was about a wash. But now that contentions are tracked explicitly we can see that in the current state of this PR, mutex contention increases significantly.
More work is required on the PR to remove the contentions, by symbolizing stat names on startup rather than composing strings on every request to lookup a stat. Or better yet, hold onto the stat objects themselves when possible. |
Microbenchmark shows significant improvement. This differs from the siege results in that there will be no contention in the single-threaded benchmark, and of course it measures only the stat lookups and does not include all the other work needed to proxy a few K of lorem ipsum. old:
new
This is more than a 2x improvement in stats overhead (assuming we are looking up stats from symbolized names rather than from strings). |
Signed-off-by: Joshua Marantz <[email protected]>
… it in, and preconstruct strings. Signed-off-by: Joshua Marantz <[email protected]>
…_view rather than std::string. Signed-off-by: Joshua Marantz <[email protected]>
…ars. Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
…, to avoid plumbing one common SymbolTable through the object graph. Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
…ile-system not per-file, and add histogramming of symbol-table encodes. Signed-off-by: Joshua Marantz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
tsan tests failures due to timeouts, one of which is eds_speed_test_benchmark_test. On my workstation, with tsan: //test/common/upstream:eds_speed_test_benchmark_test PASSED in 156.6s So I'm trying to re-run that; I guess maybe I've trigged that already but it will run after 'format'? |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s), but failed to run 2 pipeline(s). |
Signed-off-by: Joshua Marantz <[email protected]>
@jmarantz sorry can you merge master again to fix the conflict? The TSAN flake should be fixed. /wait |
I think I have another TSAN timeout issue as well, which I am hoping will be mitigated by #11768 but let's see what happens after the merge. |
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
…bing it. Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
@mattklein123 this looks like it passes now; I had to de-flake the http2 integration test with a wait-for-counter-ge, and increase the test-size for ads_integration_test. I suspect that tsan overhead disproportionately slows down the tests with real symbol table due to all the locks being taken, regardless of whether they contend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, LGTM with a small merge issue I think.
/wait
@@ -107,6 +107,9 @@ New Features | |||
* metrics service: added added :ref:`API version <envoy_v3_api_field_config.metrics.v3.MetricsServiceConfig.transport_api_version>` to explicitly set the version of gRPC service endpoint and message to be used. | |||
* network filters: added a :ref:`postgres proxy filter <config_network_filters_postgres_proxy>`. | |||
* network filters: added a :ref:`rocketmq proxy filter <config_network_filters_rocketmq_proxy>`. | |||
* performance: stats symbol table implementation (enabled by default; to disable it, add | |||
`--use-fake-symbol-table 1` to the command-line arguments when starting Envoy). | |||
* prometheus stats: fix the sort order of output lines to comply with the standard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge issue?
Signed-off-by: Joshua Marantz <[email protected]>
Signed-off-by: Joshua Marantz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Signed-off-by: Joshua Marantz <[email protected]>
Thanks for all the help pushing that through @mattklein123! This has to be close a the record for longest PR issue->merge time delta. |
Description: Switches the default from using fake symbol tables to using real symbol tables.
Risk Level: medium -- every effort was made to eliminate risk of contention by symbolizing stat names during initialization phases. However, some code paths may have evaded this. Using real symbol tables does appear to be significantly faster. For example, a microbenchmark for the HTTP response code stats shows approximately 40% speed improvement, in addition to a significant reduction in per-cluster memory usage.
Testing: //test/...
Docs Changes: should update stats.md to reflect the default change.
Release Notes: will need those still.
Fixes: #3585, #4196