feat: update the way to get heap profile #433
Conversation
```cpp
get_heap_profile(resp.body);
resp.status_code = http_status_code::ok;
resp.body = profile;
```
Has the problem been solved where the response is too large for the HTTP server to handle?
I have read the source code and found that GetHeapProfile returns buffers of at most 1 MB:

```cpp
// We use buffers of this size in DoGetHeapProfile.
static const int kProfileBufferSize = 1 << 20;
```

During my tests, I observed that some intermediate profile files never exceed 1 MB, which confirms the size limit.
Besides, I skimmed the google pprof source code and found that the pprof toolset collects several heap profile files, combines them into one, then serializes and compresses the result. I suppose that is how the pprof toolset obtains the whole profile.
```cpp
dwarn_f("node is already executing pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
```
Suggested change:

```diff
-dwarn_f("node is already executing pprof action, please wait and retry");
+resp.body = "node is already executing pprof action, please wait and retry";
 resp.status_code = http_status_code::internal_server_error;
```
I think filling the error message into the response body is meant to tell the client why the request failed. However, I tried this and found that pprof only prints the HTTP status code on error, like below:

```
> go tool pprof -symbolize=none --seconds=5 http://<ip>:<port>/pprof/heap
Fetching profile over HTTP from http://<ip>:<port>/pprof/heap?seconds=5
Please wait... (5s)
http://<ip>:<port>/pprof/heap: server response: 500 Internal Server Error
failed to fetch any source profiles
```

I think we should print an error log on the server side; the HTTP status code is enough for the client.
What does this pull request solve
As issue #426 and pprof_remote_server_doc show, there are two ways to get a heap profile.
We used to use `GetHeapSample`. This interface requires the server to be started with `TCMALLOC_SAMPLE_PARAMETER > 0`, which is usually configured as zero in production clusters, so getting a heap profile of an instance meant restarting the server. Besides, after reading the pprof source code, I found that the heap profile obtained from `GetHeapSample` is not complete. When `TCMALLOC_SAMPLE_PARAMETER > 0`, some objects are sampled at allocation time and a stack-trace structure is created for each of them; when `GetHeapSample` is called, a heap profile is built from only the currently sampled objects. As a result the profile is incomplete, and maintaining those stack traces adds extra cost. This pull request updates the way we get the heap profile to the way Doris uses it:
The difference between those two functions is exactly what gperftools issue gperftools/gperftools#783 describes:
Besides, some pprof actions should not execute concurrently, so this pull request adds a restriction, just like Doris does (pprof_actions.cpp): we should not collect a heap profile, a CPU profile, and heap growth at the same time.
There are different ways to enforce this; we can use a lock or an atomic flag.
I tried adding a lock at the top of the handler and sending get-heap-profile requests concurrently: one of them got the profile successfully, while the other failed with a timeout. However, the pegasus server would continue profiling after the first request released the lock, so the server collected the profile twice. Since fetching a heap profile can affect performance, that is not the best solution.
As a result, I use an atomic flag: if the server is already executing a pprof action, it prints a log and returns an error immediately.
Command Sample
```
pprof -symbolize=none --seconds=10 http://<host>:<port>/pprof/heap
```

Fetching the heap profile of a replica server will heavily affect read/write performance, so please set `seconds` as short as possible. Set the symbolize option to none, set `PPROF_BINARY_PATH` to the pegasus binary dir, and analyze the profile locally, like:

```
pprof -symbolize=local <profile_path>
```

or

```
pprof -http=localhost:3000 -symbolize=local <profile_path>
```
Performance Comparison
Heap profile for 300 seconds while starting YCSB
As the result shows, getting a replica server heap profile strongly affects read/write throughput and latency. If the target machine belongs to a latency-sensitive cluster, you SHOULD NOT get a heap profile. If the cluster is less latency-sensitive, you can profile for several seconds; keep the duration as short as possible, preferably no longer than 1 minute.
Getting a meta server heap profile has little impact on throughput and latency. We still recommend setting the time interval to less than 1 minute.
Check List