This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

feat: update the way to get heap profile #433

Merged
merged 16 commits on Apr 28, 2020

Conversation

hycdong (Contributor) commented Apr 3, 2020

What this pull request solves

As issue #426 and pprof_remote_server_doc show, there are two ways to get a heap profile.

We used to use GetHeapSample. This interface requires the server to be started with TCMALLOC_SAMPLE_PARAMETER > 0, which is usually configured as zero in production clusters. If we want to get the heap profile of an instance, we have to restart the server.

Besides, I read the source code of pprof and found that the heap profile obtained from GetHeapSample is not complete. If we set TCMALLOC_SAMPLE_PARAMETER > 0, pprof samples some objects while they are being allocated and creates stack trace structures for them. When GetHeapSample is called, pprof creates a heap profile from the currently sampled objects. As a result, the heap profile is incomplete, and maintaining those stack traces adds extra cost.

This pull request updates the way we get the heap profile to the way Doris uses it:

Call HeapProfilerStart(filename) (from heap-profiler.h), continue to do work, and then, some number of seconds later, call GetHeapProfile() (followed by HeapProfilerStop()). The server can send the output of GetHeapProfile back as the HTTP response to pprof.
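For illustration, here is a minimal sketch of that flow using gperftools' heap-profiler.h; the file prefix and the 10-second window are placeholders, not values taken from this pull request:

```cpp
#include <unistd.h>   // sleep
#include <cstdlib>    // free
#include <string>
#include <gperftools/heap-profiler.h>

// Sketch only: profile the whole process for a fixed window, then return the
// text profile so the HTTP handler can send it back to pprof.
void get_heap_profile_sketch(std::string &body)
{
    HeapProfilerStart("/tmp/pegasus_heap_profile");  // prefix for intermediate dump files
    sleep(10);                                       // keep serving requests while profiling
    char *profile = GetHeapProfile();                // heap profile accumulated so far
    body.assign(profile);                            // becomes the HTTP response body
    free(profile);                                   // the returned buffer is owned by the caller
    HeapProfilerStop();
}
```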

The difference between those two approaches is just as a gperftools issue (gperftools/gperftools#783) says:

heap sampling will tell you roughly which objects are in use by program at the time when sample is taken. And heap profiler will tell you which pieces of code allocated (and possibly freed) how much memory.

Besides, some pprof actions should not execute concurrently, so this pull request adds a restriction for them. Just like Doris does (pprof_actions.cpp), we should not fetch the heap profile, CPU profile, and heap growth at the same time.
There are different ways to enforce this restriction: we can use a lock or an atomic flag.
I tried adding a lock at the top of the handler and sending heap profile requests concurrently; one of them gets the profile successfully, and the other fails because of a timeout. However, the Pegasus server still starts profiling again after the first request releases the lock, so the server ends up profiling twice. Since fetching the heap profile can affect performance, this is not the best solution.
As a result, I use an atomic flag: if the server is already executing a pprof action, it prints a log and returns an error immediately (a sketch of this guard follows the error output below). The error message looks like this:

> go tool pprof -symbolize=none --seconds=5 http://<ip>:<port>/pprof/heap
Fetching profile over HTTP from http://<ip>:<port>/pprof/heap?seconds=5
Please wait... (5s)
http://<ip>:<port>/pprof/heap: server response: 500 Internal Server Error
failed to fetch any source profiles
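A minimal sketch of the flag-based guard described above, assuming a std::atomic_bool; the handler name, log call, and return value are illustrative placeholders, not the actual code in pprof_http_service.cpp:

```cpp
#include <atomic>
#include <cstdio>

// Sketch only: a single flag serializes all pprof actions (heap, cpu, heap growth).
static std::atomic_bool s_in_pprof_action{false};

bool heap_handler_sketch()
{
    bool expected = false;
    // If another pprof action is running, fail fast instead of queueing.
    if (!s_in_pprof_action.compare_exchange_strong(expected, true)) {
        std::fprintf(stderr,
                     "node is already executing pprof action, please wait and retry\n");
        return false;  // the real handler maps this to 500 Internal Server Error
    }
    // ... collect the heap profile here ...
    s_in_pprof_action.store(false);  // release the flag for later requests
    return true;
}
```

Unlike the lock-based attempt, a concurrent request fails immediately instead of waiting and triggering a second profiling run.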

Command Sample

pprof -symbolize=none --seconds=10 http://<host>:<port>/pprof/heap
Fetching the heap profile of a replica server will heavily affect read/write performance, so please set seconds as short as possible.
Set the symbolize option to none, set PPROF_BINARY_PATH to the Pegasus binary dir, and analyze the profile locally, for example: pprof -symbolize=local <profile_path> or pprof -http=localhost:3000 -symbolize=local <profile_path>

Performance Comparison

| normal case | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 194 | 10299 | - | - | 965 | 4152 |
| (2) run: 3 clients*15 threads | 1:3 | 214 | 14032 | 218 | 714 | 1344 | 6499 |
| (3) run: 3 clients*30 threads | 30:1 | 117 | 86461 | 331 | 878 | 730 | 1545 |
| (4) run: 3 clients*30 threads | 1:0 | 194 | 108425 | 273 | 866 | - | - |

Heap profile for 300 seconds while starting ycsb

| profile replica | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 400 | 4997 | - | - | 1994 | 17887 |
| (2) run: 3 clients*15 threads | 1:3 | 441 | 6798 | 252 | 731 | 2814 | 32228 |
| (3) run: 3 clients*30 threads | 30:1 | 344 | 29043 | 971 | 5945 | 2721 | 12727 |
| (4) run: 3 clients*30 threads | 1:0 | 318 | 66001 | 451 | 3438 | - | - |

| profile meta | read:write rate | running time (s) | TPS (ops/s) | read latency avg | read latency P99 | write latency avg | write latency P99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) load: 3 clients*10 threads | 0:1 | 193 | 10350 | - | - | 960 | 4021 |
| (2) run: 3 clients*15 threads | 1:3 | 221 | 13591 | 215 | 633 | 1390 | 7392 |
| (3) run: 3 clients*30 threads | 30:1 | 121 | 82837 | 345 | 802 | 729 | 1491 |
| (4) run: 3 clients*30 threads | 1:0 | 201 | 97591 | 304 | 877 | - | - |

As the results show, getting a replica server heap profile strongly affects read/write throughput and latency. If the target machine belongs to a latency-sensitive cluster, you SHOULD NOT get a heap profile. If the cluster is less latency-sensitive, you can profile for several seconds, keeping the window as short as possible and preferably no longer than 1 minute.
Getting a meta server heap profile has little impact on throughput and latency. We still recommend setting the time interval shorter than 1 minute.

Check List

  • Need to update document

@hycdong hycdong marked this pull request as ready for review April 17, 2020 07:57
src/dist/http/pprof_http_service.cpp (outdated)
src/dist/http/pprof_http_service.cpp (outdated)
src/dist/http/pprof_http_service.cpp

get_heap_profile(resp.body);
resp.status_code = http_status_code::ok;
resp.body = profile;
Member

Has the problem that the response is too large for the HTTP server to handle been solved?

Contributor Author

I have read the source code and found that GetHeapProfile returns a buffer of at most 1 MB.

// We use buffers of this size in DoGetHeapProfile.
static const int kProfileBufferSize = 1 << 20;

During my tests I observed that the intermediate profile files never exceed 1 MB, which confirms this buffer size limit.
Besides, I have roughly read the Google pprof source code and found that the pprof toolset collects several heap profile files, merges them into one, and then serializes and compresses the result. I suppose this is how the pprof toolset obtains the whole profile.

Comment on lines +468 to +469
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

Contributor Author

I understand that filling the error message into the response body is meant to tell the client why the request failed. However, I tried this and found that pprof only prints the HTTP status code when it meets an error, like below:

> go tool pprof -symbolize=none --seconds=5 http://<ip>:<port>/pprof/heap
Fetching profile over HTTP from http://<ip>:<port>/pprof/heap?seconds=5
Please wait... (5s)
http://<ip>:<port>/pprof/heap: server response: 500 Internal Server Error
failed to fetch any source profiles

I think we should print the error log on the server side; the HTTP status code is enough for the client.

Comment on lines +425 to +426
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

Comment on lines +328 to +329
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
Contributor

Suggested change
dwarn_f("node is already exectuting pprof action, please wait and retry");
resp.status_code = http_status_code::internal_server_error;
resp.body = "node is already exectuting pprof action, please wait and retry";
resp.status_code = http_status_code::internal_server_error;

4 participants