
[DocDB] Add a gflag to control the default min stack size #23927

Closed
1 task done
SrivastavaAnubhav opened this issue Sep 14, 2024 · 0 comments
Assignees: SrivastavaAnubhav
Labels: 2.20.7.1_blocker, area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/high (High Priority)

Comments

SrivastavaAnubhav (Contributor) commented Sep 14, 2024

Jira Link: DB-12828

Description

Recent Linux kernels (on Alma9 and Ubuntu 22, but not Ubuntu 20) are backing our large thread stacks (8 MiB of virtual memory by default on Linux) with transparent hugepages, causing each thread to use 2048 KB of RSS (compared to ~40 KB on Alma8). OpenJDK ran into a similar issue as well.

Since we aren't actually using all 8 MiB of each thread stack, dropping the default minimum stack size to something smaller than the hugepage size (2 MiB) should prevent the kernel from backing thread stacks with hugepages.
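
For illustration, here is a minimal sketch (not this issue's actual code; the thread body and the specific sizes are just an example) of requesting a sub-hugepage stack via `pthread_attr_setstacksize`:

```
// Illustrative only: request a 512 KiB stack, well below the 2 MiB hugepage size,
// so the kernel has no 2 MiB-aligned stack region it could back with THP.
#include <pthread.h>
#include <cstdio>

static void* ThreadBody(void*) {
  return nullptr;  // real work would go here
}

int main() {
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 512 * 1024);  // instead of the 8 MiB glibc default

  pthread_t tid;
  if (pthread_create(&tid, &attr, ThreadBody, nullptr) != 0) {
    std::perror("pthread_create");
    return 1;
  }
  pthread_join(tid, nullptr);
  pthread_attr_destroy(&attr);
  return 0;
}
```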

NB: Setting transparent hugepages to madvise instead of always would also work, but tcmalloc recommends keeping it at always.
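
To check which THP mode a host is actually running, the active value appears in brackets in sysfs; below is a small illustrative reader (assuming the standard `/sys/kernel/mm/transparent_hugepage/enabled` path):

```
// Prints something like "always [madvise] never"; the bracketed entry is the active mode.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  std::ifstream f("/sys/kernel/mm/transparent_hugepage/enabled");
  std::string line;
  if (!std::getline(f, line)) {
    std::cerr << "could not read THP setting (non-Linux host?)\n";
    return 1;
  }
  std::cout << line << "\n";
  return 0;
}
```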

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
SrivastavaAnubhav added the area/docdb and priority/high labels on Sep 14, 2024
SrivastavaAnubhav self-assigned this on Sep 14, 2024
yugabyte-ci added the kind/enhancement label on Sep 14, 2024
SrivastavaAnubhav added a commit that referenced this issue Sep 20, 2024
Summary:
Recent Linux kernels (Alma9 and Ubuntu 22, but not Alma8 / Ubuntu 20) are backing our large
thread stacks (8 MiB of virtual memory by default on Linux) with hugepages, causing each thread to
use 2048 KB of RSS (compared to ~40 KB on Alma8). OpenJDK ran into a similar issue as well:
https://bugs.openjdk.org/browse/JDK-8303215.

OpenJDK's main fix was dropping the default minimum stack size to something smaller than the
hugepage size, which prevents the kernel from backing thread stacks with hugepages. Note that this
does not limit the max stack size (i.e., the stack can still grow).

NB: Setting the transparent hugepages setting to `madvise` instead of `always` would also work, but
tcmalloc documentation (https://google.github.io/tcmalloc/tuning.html#system-level-optimizations)
recommends we keep it at `always`.

This diff adds a gflag, `min_thread_stack_size_bytes`, which is passed to
`pthread_attr_setstacksize` when creating threads. The default is 512 KiB, which is the macOS
default. If a value <= 0 is provided, we don't set a minimum thread stack size and thus fall back
to the system default.
Jira: DB-12828
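
A hedged sketch of how such a flag could be applied at thread-creation time follows (this is not the actual diff; the helper name `ApplyMinStackSize` and the use of the stock gflags `DEFINE_int64` macro are assumptions for illustration):

```
// Illustrative sketch only, not the YugabyteDB implementation.
#include <gflags/gflags.h>
#include <pthread.h>
#include <limits.h>
#include <algorithm>

DEFINE_int64(min_thread_stack_size_bytes, 512 * 1024,
             "Minimum stack size for new threads. If <= 0, no stack size is set "
             "and the system default is used.");

// Hypothetical helper: applied to the pthread attributes used for every new thread.
void ApplyMinStackSize(pthread_attr_t* attr) {
  if (FLAGS_min_thread_stack_size_bytes <= 0) {
    return;  // fall back to the system default (8 MiB on most Linux distros)
  }
  // pthread_attr_setstacksize() rejects values below PTHREAD_STACK_MIN.
  const size_t size = std::max<size_t>(
      static_cast<size_t>(FLAGS_min_thread_stack_size_bytes),
      static_cast<size_t>(PTHREAD_STACK_MIN));
  pthread_attr_setstacksize(attr, size);
}
```

With the 512 KiB default well below the 2 MiB hugepage size, the resulting stack mappings are too small for THP to back.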

Test Plan:
Manual testing. Created an Alma9 cluster and created 4 YSQL tables, each split into 150 tablets:
```
CREATE TABLE t1 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t2 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t3 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t4 (id int) SPLIT INTO 150 TABLETS;
```

The system parameter `transparent_hugepages/enabled` is set to `always`, and `transparent_hugepages_defrag` is set to `madvise`.

The output of `pmap -x <tserver_pid>` on Alma9 before this diff shows many (but not all) thread stacks being backed by hugepages:
```
00007f7cdba00000    8192    2048    2048 rw---   [ anon ]
```

Note that it takes some time (~30 minutes) for the kernel to back these stacks with hugepages. Over time, all thread stacks might be backed by hugepages. The extended pmap output (`pmap -X`) shows that all thread stacks are THPEligible (though this is true on Alma8 as well). This is the summary output of pmap, sampled every minute:
```
                 Kbytes   RSS     Dirty
total kB         9691664  183356  115000
total kB         9691664  183360  115004
total kB         9691664  183364  115008
total kB         9691664  183384  115028
total kB         9691664  185380  117024
total kB         9691664  195360  127004
total kB         9691664  205340  136984
total kB         9691664  215320  146964
total kB         9691664  225300  156944
total kB         9691664  233284  164928
total kB         9691664  241268  172912
total kB         9691664  253252  184896
total kB         9691664  263232  194876
total kB         9691664  273212  204856
total kB         9691664  281196  212840
total kB         9691664  283192  214836
total kB         9691664  293172  224816
total kB         9691664  301156  232800
total kB         9691664  309140  240784
total kB         9691664  315128  246772
total kB         9691664  315128  246772
total kB         9691664  315128  246772
total kB         9691664  317124  248768
total kB         9691664  319120  250764
total kB         9691664  319120  250764
...
```

With this fix, the thread stacks are not backed by hugepages; i.e., all thread stacks look like this:
```
00007f554055d000     512      48      48 rw---   [ anon ]
```
and there is no RSS memory growth over time:
```
                 Kbytes   RSS     Dirty
total kB         7333120  178900  111392
total kB         7333116  178932  110612
total kB         7333116  178936  110616
total kB         7333116  178940  110620
total kB         7333116  178960  110640
total kB         7333116  178960  110640
total kB         7333116  178960  110640
```

Reviewers: mlillibridge

Reviewed By: mlillibridge

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38053
SrivastavaAnubhav added a commit that referenced this issue Sep 23, 2024
Summary:
Original commit: 9ab7806 / D38053
(Backport; the summary and test plan are identical to the Sep 20, 2024 commit above, with the addition of "Jenkins: urgent" in the test plan.)

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38287
SrivastavaAnubhav added a commit that referenced this issue Sep 23, 2024
Summary:
Original commit: 9ab7806 / D38053
(Backport; the summary and test plan are identical to the Sep 20, 2024 commit above.)

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38285
foucher pushed a commit that referenced this issue Sep 24, 2024
Summary:
 5d3e83e [PLAT-15199] Change TP API URLs according to latest refactoring
 a50a730 [doc][yba] YBDB compatibility (#23984)
 0c84dbe [#24029] Update the callhome diagnostics  not to send gflags details.
 b53ed3a [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
 f0eab8f [PLAT-15278]: Fix DB Scoped XCluster replication restart
 344bc76 Revert "[PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule"
 3628ba7 [PLAT-14459] Swagger fix
 bb93ebe [#24021] YSQL: Add --TEST_check_catalog_version_overflow
 9ab7806 [#23927] docdb: Add gflag for minimum thread stack size
 Excluded: 8c8adc0 [#18822] YSQL: Gate update optimizations behind preview flag
 5e86515 [#23768] YSQL: Fix table rewrite DDL before slot creation
 123d496 [PLAT-14682] Universe task should only unlock itself and make unlock aware of the lock config
 de9d4ad [doc][yba] CIS hardened OS support (#23789)
 e131b20 [#23998] DocDB: Update usearch and other header-only third-party dependencies
 1665662 Automatic commit by thirdparty_tool: update usearch to commit 240fe9c298100f9e37a2d7377b1595be6ba1f412.
 3adbdae Automatic commit by thirdparty_tool: update fp16 to commit 98b0a46bce017382a6351a19577ec43a715b6835.
 9a819f7 Automatic commit by thirdparty_tool: update hnswlib to commit 2142dc6f4dd08e64ab727a7bbd93be7f732e80b0.
 2dc58f4 Automatic commit by thirdparty_tool: update simsimd to tag v5.1.0.
 9a03432 [doc][ybm] Azure private link host (#24086)
 039c9a2 [#17378] YSQL: Testing for histogram_bounds in pg_stats
 09f7a0f [#24085] DocDB: Refactor HNSW wrappers
 555af7d [#24000] DocDB: Shutting down shared exchange could cause TServer to hang
 5743a03 [PLAT-15317]Alert emails are not in the correct format.
 8642555 [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
 253ab07 [PLAT-15400][PLAT-15401][PLAT-13051] - Connection pooling ui issues and other ui issues
 57576ae [#16487] YSQL: Fix flakey TestPostgresPid test
 bc8ae45 Update ports for CIS hardened (#24098)
 6fa33e6 [#18152, #18729] Docdb: Fix test TestPgIndexSelectiveUpdate
 cc6d2d1 [docs] added and updated cves (#24046)
 Excluded: ed153dc [#24055] YSQL: fix pg_hint_plan regression with executing prepared statement

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, jenkins-bot

Differential Revision: https://phorge.dev.yugabyte.com/D38322
SrivastavaAnubhav added a commit that referenced this issue Sep 25, 2024
Summary:
Original commit: 9ab7806 / D38053
(Backport; the summary and test plan are identical to the Sep 20, 2024 commit above.)

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38286
yugabyte-ci reopened this on Oct 9, 2024
SrivastavaAnubhav added a commit that referenced this issue Oct 9, 2024
Summary:
Original commit: 9ab7806 / D38053
(Backport; the summary and test plan are identical to the Sep 20, 2024 commit above.)

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38860