Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve security-crypto threadpool overflow handling #111369

Merged
merged 7 commits into from
Aug 2, 2024

Conversation

gwbrown
Copy link
Contributor

@gwbrown gwbrown commented Jul 26, 2024

Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.

@gwbrown gwbrown added >bug :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) labels Jul 26, 2024
@elasticsearchmachine elasticsearchmachine added v8.16.0 Team:Security Meta label for security team labels Jul 26, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine
Copy link
Collaborator

Hi @gwbrown, I've created a changelog YAML for you.

}, listener::onFailure));
}, exception -> {
// Crypto threadpool queue is full, invalidate this cache entry and make sure nothing is going to wait on it
logger.warn(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this might cause log spam, but I don't want to pull it all the way back to debug because I think it's good to be able to easily tell when this is happening.

@gwbrown gwbrown requested a review from jakelandis July 26, 2024 20:42
Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM good job, nice catch!

Copy link
Contributor

@jakelandis jakelandis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice job finding this bug

@gwbrown gwbrown added the auto-backport Automatically create backport pull requests when merged label Aug 2, 2024
@gwbrown gwbrown merged commit ece555e into elastic:main Aug 2, 2024
20 checks passed
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 2, 2024
Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.15
7.17 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 111369

gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 2, 2024
Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.
elasticsearchmachine pushed a commit that referenced this pull request Aug 2, 2024
Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.
gwbrown added a commit that referenced this pull request Aug 5, 2024
Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Prior to this PR, when the security-crypto threadpool queue overflows and rejects API key hashing submissions, a toxic value (specifically, a future which will never be completed) is added to the API key auth cache. This toxic cache value causes future authentication attempts with that API key to fail by timeout, because they will attempt to wait for the toxic future, until that value is invalidated and removed from the cache. Additionally, this will hold on to memory for each request that waits on the toxic future, even after the request has timed out.

This PR adds a unit test to replicate this case, and adjusts the code which submits the key hashing task to the security-crypto threadpool to properly handle this point of failure by invalidating the cached future and notifying waiting handlers that the computation has failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged backport pending >bug :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v7.17.24 v8.15.1 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants