This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

feat: update the way to get heap profile #433

Merged
merged 16 commits on Apr 28, 2020

Conversation

hycdong (Contributor) commented Apr 3, 2020

What this pull request solves

As issue #426 and pprof_remote_server_doc show, there are two ways to get a heap profile.

We used to use GetHeapSample. This interface requires the server to be started with TCMALLOC_SAMPLE_PARAMETER > 0, which is usually configured as zero in production clusters. If we want to get the heap profile of an instance, we have to restart the server.

Besides, I read the source code of pprof and found that the heap profile obtained from GetHeapSample is not complete. If we set TCMALLOC_SAMPLE_PARAMETER > 0, pprof samples some objects while they are being allocated and creates stack trace structures for them. When GetHeapSample is called, pprof creates a heap profile from the currently sampled objects. As a result, the heap profile is incomplete, and maintaining those stack traces adds extra cost.

This pull request updates the way we get the heap profile to the way Doris uses it:

Call HeapProfilerStart(filename) (from heap-profiler.h), continue to do work, and then, some number of seconds later, call GetHeapProfile() (followed by HeapProfilerStop()). The server can send the output of GetHeapProfile back as the HTTP response to pprof.
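For illustration, here is a minimal sketch of that flow using gperftools' heap-profiler.h; the file prefix and the 10-second window are placeholders, not values taken from this pull request:

```cpp
#include <unistd.h>   // sleep
#include <cstdlib>    // free
#include <string>
#include <gperftools/heap-profiler.h>

// Sketch only: profile the whole process for a fixed window, then return the
// text profile so the HTTP handler can send it back to pprof.
void get_heap_profile_sketch(std::string &body)
{
    HeapProfilerStart("/tmp/pegasus_heap_profile");  // prefix for intermediate dump files
    sleep(10);                                       // keep serving requests while profiling
    char *profile = GetHeapProfile();                // heap profile accumulated so far
    body.assign(profile);                            // becomes the HTTP response body
    free(profile);                                   // the returned buffer is owned by the caller
    HeapProfilerStop();
}
```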

The difference between those two approaches is just as a gperftools issue (gperftools/gperftools#783) says:

heap sampling will tell you roughly which objects are in use by program at the time when sample is taken. And heap profiler will tell you which pieces of code allocated (and possibly freed) how much memory.

Besides, some pprof actions should not execute concurrently, so this pull request adds a restriction for them. Just like Doris does (pprof_actions.cpp), we should not fetch the heap profile, CPU profile, and heap growth at the same time.
There are different ways to enforce this restriction: we can use a lock or an atomic flag.
I tried adding a lock at the top of the handler and sending heap profile requests concurrently; one of them gets the profile successfully, and the other fails because of a timeout. However, the Pegasus server still starts profiling again after the first request releases the lock, so the server ends up profiling twice. Since fetching the heap profile can affect performance, this is not the best solution.
As a result, I use an atomic flag: if the server is already executing a pprof action, it prints a log and returns an error immediately (a sketch of this guard follows the error output below). The error message looks like this:

> go tool pprof -symbolize=none --seconds=5 http://<ip>:<port>/pprof/heap
Fetching profile over HTTP from http://<ip>:<port>/pprof/heap?seconds=5
Please wait... (5s)
http://<ip>:<port>/pprof/heap: server response: 500 Internal Server Error
failed to fetch any source profiles
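A minimal sketch of the flag-based guard described above, assuming a std::atomic_bool; the handler name, log call, and return value are illustrative placeholders, not the actual code in pprof_http_service.cpp:

```cpp
#include <atomic>
#include <cstdio>

// Sketch only: a single flag serializes all pprof actions (heap, cpu, heap growth).
static std::atomic_bool s_in_pprof_action{false};

bool heap_handler_sketch()
{
    bool expected = false;
    // If another pprof action is running, fail fast instead of queueing.
    if (!s_in_pprof_action.compare_exchange_strong(expected, true)) {
        std::fprintf(stderr,
                     "node is already executing pprof action, please wait and retry\n");
        return false;  // the real handler maps this to 500 Internal Server Error
    }
    // ... collect the heap profile here ...
    s_in_pprof_action.store(false);  // release the flag for later requests
    return true;
}
```

Unlike the lock-based attempt, a concurrent request fails immediately instead of waiting and triggering a second profiling run.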

Command Sample

pprof -symbolize=none --seconds=10 http://<host>:<port>/pprof/heap
Fetching the heap profile of a replica server will heavily affect read/write performance, so please set seconds as short as possible.
Set the symbolize option to none, set PPROF_BINARY_PATH to the Pegasus binary dir, and analyze the profile locally, for example: pprof -symbolize=local <profile_path> or pprof -http=localhost:3000 -symbolize=local <profile_path>

Performance Comparison

| normal case | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 194 | 10299 | - | - | 965 | 4152 |
| (2) run: 3 clients*15 threads | 1:3 | 214 | 14032 | 218 | 714 | 1344 | 6499 |
| (3) run: 3 clients*30 threads | 30:1 | 117 | 86461 | 331 | 878 | 730 | 1545 |
| (4) run: 3 clients*30 threads | 1:0 | 194 | 108425 | 273 | 866 | - | - |

Heap profile for 300 seconds while starting ycsb

| profile replica | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 400 | 4997 | - | - | 1994 | 17887 |
| (2) run: 3 clients*15 threads | 1:3 | 441 | 6798 | 252 | 731 | 2814 | 32228 |
| (3) run: 3 clients*30 threads | 30:1 | 344 | 29043 | 971 | 5945 | 2721 | 12727 |
| (4) run: 3 clients*30 threads | 1:0 | 318 | 66001 | 451 | 3438 | - | - |

| profile meta | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 193 | 10350 | - | - | 960 | 4021 |
| (2) run: 3 clients*15 threads | 1:3 | 221 | 13591 | 215 | 633 | 1390 | 7392 |
| (3) run: 3 clients*30 threads | 30:1 | 121 | 82837 | 345 | 802 | 729 | 1491 |
| (4) run: 3 clients*30 threads | 1:0 | 201 | 97591 | 304 | 877 | - | - |

As the results show, getting a replica server heap profile strongly affects read/write throughput and latency. If the target machine belongs to a latency-sensitive cluster, you SHOULD NOT get a heap profile. If the cluster is less latency-sensitive, you can profile for several seconds, keeping the window as short as possible and preferably no longer than 1 minute.
Getting a meta server heap profile has little impact on throughput and latency. We still recommend setting the time interval shorter than 1 minute.

Check List

  • Need to update document

@hycdong hycdong marked this pull request as ready for review April 17, 2020 07:57
src/dist/http/pprof_http_service.cpp (outdated)
src/dist/http/pprof_http_service.cpp (outdated)
src/dist/http/pprof_http_service.cpp

get_heap_profile(resp.body);
resp.status_code = http_status_code::ok;
resp.body = profile;
Member

Has the problem that the response is too large for the HTTP server to handle been solved?

Contributor Author

I have read the source code and found that GetHeapProfile returns a buffer of at most 1 MB.

// We use buffers of this size in DoGetHeapProfile.
static const int kProfileBufferSize = 1 << 20;

During my tests I observed that the intermediate profile files never exceed 1 MB, which confirms this buffer size limit.
Besides, I have roughly read the Google pprof source code and found that the pprof toolset collects several heap profile files, merges them into one, and then serializes and compresses the result. I suppose this is how the pprof toolset obtains the whole profile.

Comment on lines +468 to +469
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

Contributor Author

I understand that filling the error message into the response body is meant to tell the client why the request failed. However, I tried this and found that pprof only prints the HTTP status code when it meets an error, like below:

> go tool pprof -symbolize=none --seconds=5 http://<ip>:<port>/pprof/heap
Fetching profile over HTTP from http://<ip>:<port>/pprof/heap?seconds=5
Please wait... (5s)
http://<ip>:<port>/pprof/heap: server response: 500 Internal Server Error
failed to fetch any source profiles

I think we should print the error log on the server side; the HTTP status code is enough for the client.

Comment on lines +425 to +426
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

Comment on lines +328 to +329
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

4 participants