Possible Memory Leak Found #43729

Closed
DataScientistSamChan opened this issue Apr 8, 2024 · 9 comments

@DataScientistSamChan
Contributor

Steps to reproduce the behavior (Required)

The memory leak is most likely caused by the AWS SDK. To reproduce it, bring up a cluster and stream data into it through the Stream Load HTTP interface. Depending on the size of your data, the leak can become apparent in as little as 1 hour, or it may take longer. It can be identified through the HTTP mem_tracker API, where the memory consumption of the sub-modules does not add up to the total. (In the snapshot below, the sub-modules do not add up to the 56G total, leaving a gap of over 36G.)

[Image: mem_tracker snapshot]
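
The gap check described above is simple arithmetic, and a minimal sketch of it may help when scripting the comparison. The function name `untracked_bytes` and its parameters are hypothetical and are not part of StarRocks or the mem_tracker API; the sketch assumes the total and the per-module figures have already been parsed into byte counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a StarRocks API: returns how many bytes of the
 * process total are not accounted for by any sub-module tracker.
 * Returns 0 when the sub-modules sum to at least the total. */
static uint64_t untracked_bytes(uint64_t process_total,
                                const uint64_t *module_bytes, size_t n)
{
    uint64_t tracked = 0;
    for (size_t i = 0; i < n; i++) {
        tracked += module_bytes[i];
    }
    return process_total > tracked ? process_total - tracked : 0;
}
```

With the figures from this report, a 56G total and a gap of over 36G mean the sub-modules summed to less than 20G. A gap that keeps growing under steady load is the signal to look for.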

Expected behavior (Required)

Memory leak should not happen. Memory consumption of the sub modules should add up to the total.

Real behavior (Required)

Memory leak will render the whole cluster unavailable from time to time.

StarRocks version (Required)

3.1.9-e1c6e4e

@DataScientistSamChan DataScientistSamChan added the type/bug Something isn't working label Apr 8, 2024
@tracymacding tracymacding self-assigned this Apr 8, 2024
@tracymacding
Contributor

tracymacding commented Apr 8, 2024

The memory profile looks like this:

[Image: graphviz rendering of the memory profile]

It looks like the function s2n_drbg_instantiate consumes too much memory (12GB+) and never frees it.

@tracymacding
Contributor

tracymacding commented Apr 8, 2024

StarRocks cloud native uses AWS SDK version 1.10.36, and the s2n version used by that SDK is v1.3.27:

S2N_URI=${CRT_URI_PREFIX}/s2n/zip/15d534e8a9ca1eda6bacee514e37d08b4f38a526

@tracymacding
Contributor

tracymacding commented Apr 8, 2024

The memory is allocated here, and it appears to be freed only at process exit, as shown below:

int s2n_drbg_instantiate(struct s2n_drbg *drbg, struct s2n_blob *personalization_string, const s2n_drbg_mode mode)
{       
    drbg->ctx = EVP_CIPHER_CTX_new();
    ...
}

It is freed in the function s2n_drbg_wipe:

int s2n_drbg_wipe(struct s2n_drbg *drbg)
{   
    if (drbg->ctx) {
        POSIX_GUARD_OSSL(EVP_CIPHER_CTX_cleanup(drbg->ctx), S2N_ERR_DRBG);
            
        EVP_CIPHER_CTX_free(drbg->ctx);
        drbg->ctx = NULL;
    } 
    return 0;
}

But s2n_drbg_wipe is only reached when the process exits:

S2N_RESULT s2n_rand_cleanup_thread(void)
{
    RESULT_GUARD(s2n_drbg_wipe(&s2n_per_thread_rand_state.private_drbg));
    RESULT_GUARD(s2n_drbg_wipe(&s2n_per_thread_rand_state.public_drbg));
    ...
    return S2N_RESULT_OK;
}

static bool s2n_cleanup_atexit_impl(void)
{
    bool cleaned_up = s2n_result_is_ok(s2n_cipher_suites_cleanup())
            && s2n_result_is_ok(s2n_rand_cleanup_thread())
            ...;
    return cleaned_up;
}

static void s2n_cleanup_atexit(void)
{
    (void) s2n_cleanup_atexit_impl();
}

int s2n_init(void)
{
    ...
    if (atexit_cleanup) {
        POSIX_ENSURE_OK(atexit(s2n_cleanup_atexit), S2N_ERR_ATEXIT);
    }
    return S2N_SUCCESS;
}
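
To make the concern concrete, here is a minimal, self-contained model of the pattern. It is not s2n code: `toy_drbg`, `toy_instantiate`, `toy_cleanup_thread`, and `run_workers` are hypothetical names. It assumes the state is thread-local, as the name s2n_per_thread_rand_state suggests, and that each thread lazily allocates its own context. Under those assumptions, a cleanup that runs only in the exiting thread cannot free contexts allocated by worker threads that have already finished.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread DRBG state; not s2n code. */
struct toy_drbg {
    void *ctx; /* plays the role of drbg->ctx */
};

static _Thread_local struct toy_drbg per_thread_state;
static atomic_int live_contexts; /* contexts allocated but not yet freed */

/* Lazily allocate the calling thread's context, like an instantiate step. */
static int toy_instantiate(void)
{
    if (per_thread_state.ctx == NULL) {
        per_thread_state.ctx = malloc(64);
        if (per_thread_state.ctx == NULL) {
            return -1;
        }
        atomic_fetch_add(&live_contexts, 1);
    }
    return 0;
}

/* Free the calling thread's context only; other threads are untouched. */
static void toy_cleanup_thread(void)
{
    if (per_thread_state.ctx != NULL) {
        free(per_thread_state.ctx);
        per_thread_state.ctx = NULL;
        atomic_fetch_sub(&live_contexts, 1);
    }
}

static void *worker(void *arg)
{
    bool cleanup_before_exit = *(bool *) arg;
    (void) toy_instantiate();
    if (cleanup_before_exit) {
        toy_cleanup_thread();
    }
    return NULL;
}

/* Run n short-lived workers one after another and return how many
 * contexts they left behind (or -1 if a thread could not be created). */
static int run_workers(int n, bool cleanup_before_exit)
{
    int before = atomic_load(&live_contexts);
    for (int i = 0; i < n; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, worker, &cleanup_before_exit) != 0) {
            return -1;
        }
        pthread_join(t, NULL);
    }
    return atomic_load(&live_contexts) - before;
}
```

In this model, `run_workers(8, false)` leaves eight contexts allocated, while `run_workers(8, true)` leaves none. Whether the real leak follows exactly this path depends on how the SDK creates and retires its threads, which this thread does not establish.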

@kevincai
Contributor

A few s2n fixes in the AWS SDK that may be related to this leak:

aws/s2n-tls#3771
aws/s2n-tls#3966
aws/s2n-tls#4037

@kevincai
Contributor

aws/aws-sdk-cpp#2373

@kevincai
Contributor

kevincai commented May 13, 2024

I am inclined to give a GO to backporting the AWS SDK version upgrade to v3.2.

@kevincai
Contributor

In #43887 the AWS SDK version was upgraded to 1.11.267; the change is released in 3.3.0-rc01.

@kevincai
Contributor

#45543 is the corresponding change for release 3.2.x.

@DataScientistSamChan
Contributor Author

We upgraded to 3.3-rc01 and have been closely monitoring this issue. We have not observed the memory leak again.
