[BACKPORT 2.20.7][#23927] docdb: Add gflag for minimum thread stack size
Summary:
Original commit: 9ab7806 / D38053
Recent Linux kernels (alma9 and ubuntu 22, but not alma8 / ubuntu 20) are backing our large
thread stacks (8 MiB virtual memory default on Linux) with hugepages, causing each thread to use up
2048 KB of RSS (compared to 40 KB on Alma8). OpenJDK ran into a similar issue as well:
https://bugs.openjdk.org/browse/JDK-8303215.

OpenJDK's main fix was dropping the default minimum stack size to something smaller than the
hugepage size, which prevents the kernel from backing thread stacks with hugepages. Note that this
does not limit the max stack size (i.e., the stack can still grow).

NB: Setting the transparent hugepages setting to `madvise` instead of `always` would also work, but
tcmalloc documentation (https://google.github.io/tcmalloc/tuning.html#system-level-optimizations)
recommends we keep it at `always`.
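
To confirm which transparent hugepage mode a host is actually using, the active value can be read from `/sys/kernel/mm/transparent_hugepage/enabled`, where the kernel marks the selected mode with square brackets (for example `always [madvise] never`). The following is a minimal sketch of that parsing step; `ActiveThpMode` is a hypothetical helper for illustration, not something this diff adds.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of this diff).
// Given the contents of a THP sysfs file such as
// /sys/kernel/mm/transparent_hugepage/enabled, e.g. "always [madvise] never",
// returns the bracketed (active) value, or an empty string if none is found.
std::string ActiveThpMode(const std::string& sysfs_contents) {
  const auto open = sysfs_contents.find('[');
  if (open == std::string::npos) {
    return "";
  }
  const auto close = sysfs_contents.find(']', open + 1);
  if (close == std::string::npos) {
    return "";
  }
  return sysfs_contents.substr(open + 1, close - open - 1);
}
```

A caller would read the sysfs file into a string and pass it in; a result of `always` corresponds to the configuration discussed above.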

This diff adds a gflag, `min_thread_stack_size_bytes`, which is passed to
`pthread_attr_setstacksize` when creating threads. The default is 512 KiB, which is the macOS
default. If a value of <= 0 is provided, we don't set a minimum thread stack size and thus fall back
to the system default.
Jira: DB-12828
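
The flag handling described above can be sketched in isolation as follows. This is an illustrative, standalone approximation rather than the code in this diff: `StartThreadWithStackSize` and `NoOp` are hypothetical names, and the `requested_bytes` parameter stands in for `FLAGS_min_thread_stack_size_bytes`. Unlike the diff as shown, the sketch also calls `pthread_attr_destroy`, which POSIX pairs with `pthread_attr_init`.

```cpp
#include <pthread.h>

#include <cassert>
#include <cstddef>

namespace {

// Trivial thread body used only to exercise thread creation.
void* NoOp(void*) { return nullptr; }

}  // namespace

// Hypothetical helper (not part of this diff).
// Starts and joins a thread whose attributes request `requested_bytes` of stack.
// If requested_bytes <= 0, the attribute is left untouched so the system default
// applies, mirroring the flag semantics described in the summary.
// On success returns 0 and stores the stack size recorded in the attributes;
// otherwise returns the pthread error code.
int StartThreadWithStackSize(long requested_bytes, size_t* effective_bytes) {
  pthread_attr_t attr;
  int rc = pthread_attr_init(&attr);
  if (rc != 0) {
    return rc;
  }
  if (requested_bytes > 0) {
    rc = pthread_attr_setstacksize(&attr, static_cast<size_t>(requested_bytes));
  }
  if (rc == 0) {
    rc = pthread_attr_getstacksize(&attr, effective_bytes);
  }
  if (rc == 0) {
    pthread_t thread;
    rc = pthread_create(&thread, &attr, &NoOp, nullptr);
    if (rc == 0) {
      rc = pthread_join(thread, nullptr);
    }
  }
  pthread_attr_destroy(&attr);
  return rc;
}
```

Calling `StartThreadWithStackSize(512 * 1024, &size)` corresponds to the flag's default value, while passing `0` leaves the platform default in place.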

Test Plan:
Manual testing. Created an alma9 cluster and created 4 YSQL tables with 150 tablets each:
```
CREATE TABLE t1 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t2 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t3 (id int) SPLIT INTO 150 TABLETS;
CREATE TABLE t4 (id int) SPLIT INTO 150 TABLETS;
```

The system parameter `transparent_hugepage/enabled` is set to `always`, and `transparent_hugepage/defrag` is set to `madvise`.

The output of `pmap -x <tserver_pid>` on Alma9 before this diff shows many (but not all) thread stacks being backed by hugepages:
```
00007f7cdba00000    8192    2048    2048 rw---   [ anon ]
```
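
Rows like the one above can be picked out of `pmap -x` output mechanically. The sketch below parses the first four columns (address, Kbytes, RSS, Dirty) and applies a heuristic that is an assumption of this example, not something pmap reports: a region whose RSS is a non-zero multiple of 2048 KB is treated as probably hugepage-backed. `PmapRegion`, `ParsePmapLine`, and `LooksHugepageBacked` are hypothetical names.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical types and helpers (not part of this diff).
struct PmapRegion {
  std::string address;
  long kbytes = 0;
  long rss_kb = 0;
  long dirty_kb = 0;
};

// Parses the first four whitespace-separated columns of a `pmap -x` row.
// Returns false if the row does not start with an address and three integers.
bool ParsePmapLine(const std::string& line, PmapRegion* out) {
  std::istringstream in(line);
  return static_cast<bool>(in >> out->address >> out->kbytes >> out->rss_kb >> out->dirty_kb);
}

// Heuristic only (an assumption of this sketch): RSS that is a non-zero
// multiple of the hugepage size suggests the region is backed by hugepages.
bool LooksHugepageBacked(const PmapRegion& region, long hugepage_kb = 2048) {
  return region.rss_kb >= hugepage_kb && region.rss_kb % hugepage_kb == 0;
}
```

The heuristic can produce false positives for other large anonymous mappings, so it is best combined with a check on the region size (8192 KB for a default-sized thread stack).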

Note that it takes some time (~30 minutes) for the kernel to back these stacks with hugepages. Over time, all thread stacks might be backed by hugepages. The extended pmap output (`pmap -X`) shows that all thread stacks are THPeligible (though this is true on alma8 as well). This is the pmap summary output, sampled once per minute:
```
                 Kbytes   RSS     Dirty
total kB         9691664  183356  115000
total kB         9691664  183360  115004
total kB         9691664  183364  115008
total kB         9691664  183384  115028
total kB         9691664  185380  117024
total kB         9691664  195360  127004
total kB         9691664  205340  136984
total kB         9691664  215320  146964
total kB         9691664  225300  156944
total kB         9691664  233284  164928
total kB         9691664  241268  172912
total kB         9691664  253252  184896
total kB         9691664  263232  194876
total kB         9691664  273212  204856
total kB         9691664  281196  212840
total kB         9691664  283192  214836
total kB         9691664  293172  224816
total kB         9691664  301156  232800
total kB         9691664  309140  240784
total kB         9691664  315128  246772
total kB         9691664  315128  246772
total kB         9691664  315128  246772
total kB         9691664  317124  248768
total kB         9691664  319120  250764
total kB         9691664  319120  250764
...
```
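
As a rough sanity check on the samples above, RSS grows from 183356 kB to 319120 kB, a delta of 135764 kB (about 133 MiB). Dividing by the 2048 KB hugepage size gives an approximate count of hugepages that were faulted in over the sampling window. The sketch below does only that arithmetic; `EstimateHugepages` is a hypothetical helper, and the estimate assumes all of the growth comes from hugepage-backed stacks.

```cpp
#include <cassert>

// Hypothetical helper (not part of this diff).
// Estimates how many hugepages of size `hugepage_kb` fit in the RSS growth
// between two samples. Assumes all growth is due to hugepage backing.
long EstimateHugepages(long rss_start_kb, long rss_end_kb, long hugepage_kb = 2048) {
  assert(hugepage_kb > 0);
  return (rss_end_kb - rss_start_kb) / hugepage_kb;
}
```

With the first and last samples shown, `EstimateHugepages(183356, 319120)` evaluates to 66.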

With this fix, thread stacks are not backed by hugepages; every thread stack looks like this:
```
00007f554055d000     512      48      48 rw---   [ anon ]
```
and there is no RSS memory growth over time:
```
                 Kbytes   RSS     Dirty
total kB         7333120  178900  111392
total kB         7333116  178932  110612
total kB         7333116  178936  110616
total kB         7333116  178940  110620
total kB         7333116  178960  110640
total kB         7333116  178960  110640
total kB         7333116  178960  110640
```

Reviewers: mlillibridge, slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38860
SrivastavaAnubhav committed Oct 9, 2024
1 parent 820d01a commit e1fffb0
Showing 3 changed files with 38 additions and 7 deletions.
13 changes: 12 additions & 1 deletion src/yb/rocksdb/util/env_posix.cc
@@ -77,6 +77,8 @@

 #include "yb/util/file_system_posix.h"

+DECLARE_int32(min_thread_stack_size_bytes);
+
 namespace rocksdb {

 namespace {
@@ -804,8 +806,17 @@ void PosixEnv::StartThread(void (*function)(void* arg), void* arg) {
   StartThreadState* state = new StartThreadState;
   state->user_function = function;
   state->arg = arg;
+
+  pthread_attr_t attr;
+  ThreadPool::PthreadCall("init thread attributes struct", pthread_attr_init(&attr));
+  if (FLAGS_min_thread_stack_size_bytes > 0) {
+    ThreadPool::PthreadCall(
+        "set min thread stack size",
+        pthread_attr_setstacksize(&attr, FLAGS_min_thread_stack_size_bytes));
+  }
   ThreadPool::PthreadCall(
-      "start thread", pthread_create(&t, nullptr, &StartThreadWrapper, state));
+      "start thread", pthread_create(&t, &attr, &StartThreadWrapper, state));
+
   ThreadPool::PthreadCall("lock", pthread_mutex_lock(&mu_));
   threads_to_join_.push_back(t);
   ThreadPool::PthreadCall("unlock", pthread_mutex_unlock(&mu_));
30 changes: 24 additions & 6 deletions src/yb/util/thread.cc
@@ -107,6 +107,10 @@ METRIC_DEFINE_gauge_uint64(server, involuntary_context_switches,
                            "Total involuntary context switches",
                            yb::EXPOSE_AS_COUNTER);

+DEFINE_NON_RUNTIME_int32(min_thread_stack_size_bytes, 512 * 1024,
+    "Default minimum stack size for new threads. If set to <=0, the system default will be used. "
+    "Note that the stack can grow larger than this if needed and allowed by system limits.");
+
 namespace yb {

 using std::endl;
@@ -744,6 +748,18 @@ std::string Thread::ToString() const {
   return Substitute("Thread $0 (name: \"$1\", category: \"$2\")", tid_, name_, category_);
 }

+Status Thread::TryStartThread(Thread* t) {
+  pthread_attr_t attr;
+  RETURN_NOT_OK(STATUS_FROM_ERRNO_RV_FN_CALL(pthread_attr_init, &attr));
+  if (FLAGS_min_thread_stack_size_bytes > 0) {
+    RETURN_NOT_OK(STATUS_FROM_ERRNO_RV_FN_CALL(
+        pthread_attr_setstacksize, &attr, FLAGS_min_thread_stack_size_bytes));
+  }
+  RETURN_NOT_OK(STATUS_FROM_ERRNO_RV_FN_CALL(
+      pthread_create, &t->thread_, &attr, &Thread::SuperviseThread, t));
+  return Status::OK();
+}
+
 Status Thread::StartThread(const std::string& category, const std::string& name,
                            ThreadFunctor functor, scoped_refptr<Thread> *holder) {
   InitThreading();
@@ -766,12 +782,14 @@ Status Thread::StartThread(const std::string& category, const std::string& name,
     // Block stack trace collection while we create a thread. This also prevents stack trace
     // collection in the new thread while it is being started since it will inherit our signal
     // masks. SuperviseThread function will unblock the signal as soon as thread begins to run.
-    auto old_signal = VERIFY_RESULT(ThreadSignalMaskBlock({GetStackTraceSignal()}));
-    int ret = pthread_create(&t->thread_, NULL, &Thread::SuperviseThread, t.get());
-    RETURN_NOT_OK(ThreadSignalMaskRestore(old_signal));
-
-    if (ret) {
-      return STATUS(RuntimeError, "Could not create thread", Errno(ret));
+    const auto old_signal = VERIFY_RESULT(ThreadSignalMaskBlock({GetStackTraceSignal()}));
+    const auto thread_start_status = TryStartThread(t.get());
+    const auto mask_restore_status = ThreadSignalMaskRestore(old_signal);
+    if (!thread_start_status.ok() || !mask_restore_status.ok()) {
+      return STATUS_FORMAT(
+          RuntimeError, "Failed to start thread. "
+          "Thread start status: $0, signal mask restore status: $1",
+          thread_start_status, mask_restore_status);
     }
   }

2 changes: 2 additions & 0 deletions src/yb/util/thread.h
@@ -396,6 +396,8 @@ class Thread : public RefCountedThreadSafe<Thread> {
   // Invoked when the user-supplied function finishes or in the case of an
   // abrupt exit (i.e. pthread_exit()). Cleans up after SuperviseThread().
   static void FinishThread(void* arg);
+
+  static Status TryStartThread(Thread* t);
 };

 typedef scoped_refptr<Thread> ThreadPtr;