[DocDB] Add a gflag to control the default min stack size #23927
Labels
2.20.7.1_blocker, area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/high (High Priority)
Comments
SrivastavaAnubhav added the area/docdb (YugabyteDB core features) and priority/high (High Priority) labels on Sep 14, 2024
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) label on Sep 14, 2024
SrivastavaAnubhav added a commit that referenced this issue on Sep 20, 2024:
Summary: Recent Linux kernels (alma9 and ubuntu 22, but not alma8 / ubuntu 20) are backing our large thread stacks (8 MiB virtual memory default on Linux) with hugepages, causing each thread to use up 2048 KB of RSS (compared to 40 KB on Alma8). OpenJDK ran into a similar issue: https://bugs.openjdk.org/browse/JDK-8303215. OpenJDK's main fix was dropping the default minimum stack size to something smaller than the hugepage size, which prevents the kernel from backing thread stacks with hugepages. Note that this does not limit the max stack size (i.e., the stack can still grow).

NB: Setting the transparent hugepages setting to `madvise` instead of `always` would also work, but the tcmalloc documentation (https://google.github.io/tcmalloc/tuning.html#system-level-optimizations) recommends keeping it at `always`.

This diff adds a gflag, `min_thread_stack_size_bytes`, which is passed to `pthread_attr_setstacksize` when creating threads. The default is 512 KiB, which is the macOS default. If a value <= 0 is provided, we don't set a minimum thread stack size and fall back to the system default.

Jira: DB-12828

Test Plan: Manual testing. Created an alma9 cluster and created 4 YSQL tables, each split into 150 tablets:

```
CREATE TABLE t1 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t2 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t3 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t4 (id int) SPLIT INTO 150 TABLETS;
```

The system parameter `transparent_hugepage/enabled` is set to `always`, and `transparent_hugepage/defrag` is set to `madvise`.

The output of `pmap -x <tserver_pid>` on Alma9 before this diff shows many (but not all) thread stacks being backed by hugepages:

```
00007f7cdba00000    8192    2048    2048 rw---   [ anon ]
```

Note that it takes some time (~30 minutes) for the kernel to back these stacks with hugepages; over time, all thread stacks might be backed by hugepages. The extended pmap output (`pmap -X`) shows that all thread stacks are THPEligible (though this is true on alma8 as well). This is the summary output of pmap every 1 minute:

```
          Kbytes     RSS   Dirty
total kB 9691664  183356  115000
total kB 9691664  183360  115004
total kB 9691664  183364  115008
total kB 9691664  183384  115028
total kB 9691664  185380  117024
total kB 9691664  195360  127004
total kB 9691664  205340  136984
total kB 9691664  215320  146964
total kB 9691664  225300  156944
total kB 9691664  233284  164928
total kB 9691664  241268  172912
total kB 9691664  253252  184896
total kB 9691664  263232  194876
total kB 9691664  273212  204856
total kB 9691664  281196  212840
total kB 9691664  283192  214836
total kB 9691664  293172  224816
total kB 9691664  301156  232800
total kB 9691664  309140  240784
total kB 9691664  315128  246772
total kB 9691664  315128  246772
total kB 9691664  315128  246772
total kB 9691664  317124  248768
total kB 9691664  319120  250764
total kB 9691664  319120  250764
...
```

With this fix, the thread stacks are not backed by hugepages, i.e., all thread stacks look like this:

```
00007f554055d000     512      48      48 rw---   [ anon ]
```

and there is no RSS memory growth over time:

```
          Kbytes     RSS   Dirty
total kB 7333120  178900  111392
total kB 7333116  178932  110612
total kB 7333116  178936  110616
total kB 7333116  178940  110620
total kB 7333116  178960  110640
total kB 7333116  178960  110640
total kB 7333116  178960  110640
```

Reviewers: mlillibridge

Reviewed By: mlillibridge

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38053
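For context on how such a flag takes effect, here is a minimal, self-contained sketch (not the actual YugabyteDB thread code) of applying a stack size like 512 KiB through `pthread_attr_setstacksize`; the plain `min_thread_stack_size_bytes` variable and `ThreadBody` function are illustrative stand-ins for the real gflag and worker code.

```cpp
// Minimal sketch (not the YugabyteDB implementation): request a thread stack
// smaller than the 2 MiB transparent hugepage size via pthread_attr_setstacksize.
#include <pthread.h>
#include <limits.h>   // PTHREAD_STACK_MIN

#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the real gflag min_thread_stack_size_bytes.
static int64_t min_thread_stack_size_bytes = 512 * 1024;  // 512 KiB default

static void* ThreadBody(void*) {
  std::printf("worker thread running\n");
  return nullptr;
}

int main() {
  pthread_attr_t attr;
  pthread_attr_init(&attr);

  if (min_thread_stack_size_bytes > 0) {
    size_t stack_size = static_cast<size_t>(min_thread_stack_size_bytes);
    // pthread_attr_setstacksize() rejects sizes below the platform minimum.
    if (stack_size < PTHREAD_STACK_MIN) {
      stack_size = PTHREAD_STACK_MIN;
    }
    // 512 KiB is below the 2 MiB hugepage size, so THP cannot back this stack.
    pthread_attr_setstacksize(&attr, stack_size);
  }
  // A non-positive flag value leaves the attribute untouched, keeping the
  // system default stack size (8 MiB virtual on Linux).

  pthread_t tid;
  pthread_create(&tid, &attr, ThreadBody, nullptr);
  pthread_join(tid, nullptr);
  pthread_attr_destroy(&attr);
  return 0;
}
```

The key point from the commit summary is that any stack mapping smaller than the 2 MiB hugepage size cannot be collapsed into a hugepage, while a non-positive flag value skips the call entirely and keeps the system default.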
SrivastavaAnubhav added a commit that referenced this issue on Sep 23, 2024:
Summary: Original commit: 9ab7806 / D38053 (the summary and test plan match the original commit above; the test plan additionally ran as Jenkins: urgent)

Jira: DB-12828

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38287
SrivastavaAnubhav added a commit that referenced this issue on Sep 23, 2024:
Summary: Original commit: 9ab7806 / D38053 (the summary and test plan match the original commit above)

Jira: DB-12828

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38285
foucher pushed a commit that referenced this issue on Sep 24, 2024:
Summary:
5d3e83e [PLAT-15199] Change TP API URLs according to latest refactoring
a50a730 [doc][yba] YBDB compatibility (#23984)
0c84dbe [#24029] Update the callhome diagnostics not to send gflags details.
b53ed3a [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
f0eab8f [PLAT-15278]: Fix DB Scoped XCluster replication restart
344bc76 Revert "[PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule"
3628ba7 [PLAT-14459] Swagger fix
bb93ebe [#24021] YSQL: Add --TEST_check_catalog_version_overflow
9ab7806 [#23927] docdb: Add gflag for minimum thread stack size
Excluded: 8c8adc0 [#18822] YSQL: Gate update optimizations behind preview flag
5e86515 [#23768] YSQL: Fix table rewrite DDL before slot creation
123d496 [PLAT-14682] Universe task should only unlock itself and make unlock aware of the lock config
de9d4ad [doc][yba] CIS hardened OS support (#23789)
e131b20 [#23998] DocDB: Update usearch and other header-only third-party dependencies
1665662 Automatic commit by thirdparty_tool: update usearch to commit 240fe9c298100f9e37a2d7377b1595be6ba1f412.
3adbdae Automatic commit by thirdparty_tool: update fp16 to commit 98b0a46bce017382a6351a19577ec43a715b6835.
9a819f7 Automatic commit by thirdparty_tool: update hnswlib to commit 2142dc6f4dd08e64ab727a7bbd93be7f732e80b0.
2dc58f4 Automatic commit by thirdparty_tool: update simsimd to tag v5.1.0.
9a03432 [doc][ybm] Azure private link host (#24086)
039c9a2 [#17378] YSQL: Testing for histogram_bounds in pg_stats
09f7a0f [#24085] DocDB: Refactor HNSW wrappers
555af7d [#24000] DocDB: Shutting down shared exchange could cause TServer to hang
5743a03 [PLAT-15317] Alert emails are not in the correct format.
8642555 [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
253ab07 [PLAT-15400][PLAT-15401][PLAT-13051] - Connection pooling ui issues and other ui issues
57576ae [#16487] YSQL: Fix flakey TestPostgresPid test
bc8ae45 Update ports for CIS hardened (#24098)
6fa33e6 [#18152, #18729] Docdb: Fix test TestPgIndexSelectiveUpdate
cc6d2d1 [docs] added and updated cves (#24046)
Excluded: ed153dc [#24055] YSQL: fix pg_hint_plan regression with executing prepared statement

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, jenkins-bot

Differential Revision: https://phorge.dev.yugabyte.com/D38322
SrivastavaAnubhav added a commit that referenced this issue on Sep 25, 2024:
Summary: Original commit: 9ab7806 / D38053 (the summary and test plan match the original commit above)

Jira: DB-12828

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38286
SrivastavaAnubhav added a commit that referenced this issue on Oct 9, 2024:
Summary: Original commit: 9ab7806 / D38053 (the summary and test plan match the original commit above)

Jira: DB-12828

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38860
Jira Link: DB-12828
Description
Recent Linux kernels (on alma9 and ubuntu 22, but not ubuntu 20) are backing our large thread stacks (8 MiB virtual memory default on Linux) with hugepages, causing each thread to use up 2048 KB of RSS (compared to 40 KB on Alma8). OpenJDK ran into a similar issue as well.
Since we aren't actually using all 8 MiB for the thread stacks, dropping the default minimum stack size (to something smaller than the hugepage size) should prevent the kernel from backing thread stacks with hugepages.
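As a rough way to see why a value such as 512 KiB counts as "smaller than the hugepage size", the sketch below (purely illustrative, not part of the fix) compares a candidate stack size against the THP size the kernel reports; it assumes the `hpage_pmd_size` sysfs file is present, which recent kernels expose.

```cpp
// Illustration only: check whether a candidate stack size stays below the
// transparent hugepage size reported by the kernel (typically 2 MiB on x86_64).
#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
  uint64_t thp_size = 2ULL * 1024 * 1024;  // fallback if the sysfs file is absent
  std::ifstream f("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size");
  if (f) {
    f >> thp_size;  // hugepage size in bytes on kernels that expose this file
  }

  const uint64_t candidate_stack_size = 512 * 1024;  // the proposed 512 KiB default
  std::cout << "THP size: " << thp_size << " bytes, candidate stack: "
            << candidate_stack_size << " bytes\n";
  std::cout << (candidate_stack_size < thp_size
                    ? "stack is smaller than a hugepage; THP will not back it\n"
                    : "stack covers at least one hugepage; THP may back it\n");
  return 0;
}
```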
NB: Setting the transparent hugepages setting to `madvise` instead of `always` would also work, but tcmalloc recommends we keep it at `always`.
Issue Type
kind/enhancement