[BACKPORT 2.20][#23927] docdb: Add gflag for minimum thread stack size
Summary:
Original commit: 9ab7806 / D38053

Recent Linux kernels (alma9 and ubuntu 22, but not alma8 / ubuntu 20) are backing our large thread stacks (8 MiB virtual memory default on Linux) with hugepages, causing each thread to use up 2048 KB of RSS (compared to 40 KB on Alma8).

OpenJDK ran into a similar issue as well: https://bugs.openjdk.org/browse/JDK-8303215. OpenJDK's main fix was dropping the default minimum stack size to something smaller than the hugepage size, which prevents the kernel from backing thread stacks with hugepages. Note that this does not limit the max stack size (i.e., the stack can still grow).

NB: Setting the transparent hugepages setting to `madvise` instead of `always` would also work, but tcmalloc documentation (https://google.github.io/tcmalloc/tuning.html#system-level-optimizations) recommends we keep it at `always`.

This diff adds a gflag, `min_thread_stack_size_bytes`, which is passed to `pthread_attr_setstacksize` when creating threads. The default is 512 KiB, which is the macOS default. If a value of <= 0 is provided, we don't set a minimum thread stack size and thus fall back to the system default.

Jira: DB-12828

Test Plan:
Jenkins: urgent

Manual testing. Created an alma9 cluster and created 4 YSQL tables with 150 tablets each:
```
CREATE TABLE t1 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t2 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t3 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t4 (id int) SPLIT INTO 150 TABLETS;
```
The system parameter `transparent_hugepage/enabled` is set to `always`, and `transparent_hugepage/defrag` is set to `madvise`.

The output of `pmap -x <tserver_pid>` on Alma9 before this diff shows many (but not all) thread stacks being backed by hugepages:
```
00007f7cdba00000    8192    2048    2048 rw---   [ anon ]
```
Note that it takes some time (~30 minutes) for the kernel to back these stacks with hugepages. Over time, all thread stacks might be backed by hugepages.

The extended pmap output (`pmap -X`) shows that all thread stacks are THPEligible (though this is true on alma8 as well).

This is the summary line of the pmap output, sampled every 1 minute. RSS grows steadily while the virtual size stays constant:
```
            Kbytes     RSS   Dirty
total kB   9691664  183356  115000
total kB   9691664  183360  115004
total kB   9691664  183364  115008
total kB   9691664  183384  115028
total kB   9691664  185380  117024
total kB   9691664  195360  127004
total kB   9691664  205340  136984
total kB   9691664  215320  146964
total kB   9691664  225300  156944
total kB   9691664  233284  164928
total kB   9691664  241268  172912
total kB   9691664  253252  184896
total kB   9691664  263232  194876
total kB   9691664  273212  204856
total kB   9691664  281196  212840
total kB   9691664  283192  214836
total kB   9691664  293172  224816
total kB   9691664  301156  232800
total kB   9691664  309140  240784
total kB   9691664  315128  246772
total kB   9691664  315128  246772
total kB   9691664  315128  246772
total kB   9691664  317124  248768
total kB   9691664  319120  250764
total kB   9691664  319120  250764
...
```
With this fix, the thread stacks are not backed by hugepages. I.e., all thread stacks look like this:
```
00007f554055d000     512      48      48 rw---   [ anon ]
```
and there is no RSS memory growth over time:
```
            Kbytes     RSS   Dirty
total kB   7333120  178900  111392
total kB   7333116  178932  110612
total kB   7333116  178936  110616
total kB   7333116  178940  110620
total kB   7333116  178960  110640
total kB   7333116  178960  110640
total kB   7333116  178960  110640
```

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38287
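
Illustrative sketch (not the code from this diff): the summary says the value of `min_thread_stack_size_bytes` is passed to `pthread_attr_setstacksize` when creating threads, and that a value <= 0 falls back to the system default. The snippet below shows roughly what that could look like with plain pthreads. The helper names `InitThreadAttr`, `StackSizeOf` and `StartThreadWithMinStack`, and the constant `kDefaultMinThreadStackSizeBytes`, are hypothetical and exist only for this example. The actual change lives in the thread creation code and reads the gflag instead of taking a parameter.

```cpp
#include <pthread.h>

#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the gflag default described above (512 KiB).
constexpr int64_t kDefaultMinThreadStackSizeBytes = 512 * 1024;

// Initializes `attr`. If `min_stack_size_bytes` > 0, requests that stack size;
// otherwise leaves the system default in place. Returns 0 on success or an
// errno-style code, as the pthread functions do.
int InitThreadAttr(pthread_attr_t* attr, int64_t min_stack_size_bytes) {
  int rc = pthread_attr_init(attr);
  if (rc != 0) return rc;
  if (min_stack_size_bytes > 0) {
    rc = pthread_attr_setstacksize(
        attr, static_cast<size_t>(min_stack_size_bytes));
    if (rc != 0) {
      pthread_attr_destroy(attr);
      return rc;
    }
  }
  return 0;
}

// Returns the stack size recorded in `attr`, or 0 if it cannot be read.
size_t StackSizeOf(const pthread_attr_t* attr) {
  size_t size = 0;
  return pthread_attr_getstacksize(attr, &size) == 0 ? size : 0;
}

// Starts a thread running `fn(arg)` using the attributes built above.
// Returns 0 on success or an errno-style code.
int StartThreadWithMinStack(pthread_t* thread, void* (*fn)(void*), void* arg,
                            int64_t min_stack_size_bytes) {
  pthread_attr_t attr;
  int rc = InitThreadAttr(&attr, min_stack_size_bytes);
  if (rc != 0) return rc;
  rc = pthread_create(thread, &attr, fn, arg);
  pthread_attr_destroy(&attr);  // Safe once pthread_create has returned.
  return rc;
}
```

With a 512 KiB request, the stack mapping is smaller than the 2048 KB seen per hugepage-backed stack in the pmap output above, which is consistent with the 512 KB stack mappings observed after the fix.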