
Avoid OOM-killing query if result-level caching fails #17652

Open · wants to merge 1 commit into base: master

Conversation

@jtuglu-netflix (Contributor) commented Jan 22, 2025

Fixes #17651.

Description

Currently, result-level cache population attempts to allocate a buffer large enough to hold the full query results, which can exceed the Integer.MAX_VALUE capacity of ByteArrayOutputStream. ByteArrayOutputStream surfaces this as an OutOfMemoryError, which is not caught and terminates the query. This PR limits the allocated buffer for storing query results to the value of CacheConfig.getResultLevelCacheLimit(). Although we do a check comparing the buffer size to CacheConfig.getResultLevelCacheLimit() here, that check runs only after the exception has already been thrown, which is too late to handle the failure gracefully.
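A minimal sketch of the pattern described above (hypothetical class and names, not Druid's actual LimitedOutputStream): checking the configured cache limit before each write turns an oversized result into a catchable IOException instead of an OutOfMemoryError thrown while ByteArrayOutputStream grows its internal buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the fix's pattern (names assumed, not the actual
// Druid classes): check the configured cache limit *before* writing, so an
// oversized result surfaces as a catchable IOException rather than an
// OutOfMemoryError from ByteArrayOutputStream's internal growth.
class BoundedCacheStream extends OutputStream
{
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final long limit; // e.g. the value of CacheConfig.getResultLevelCacheLimit()
  private long written;

  BoundedCacheStream(long limit)
  {
    this.limit = limit;
  }

  @Override
  public void write(int b) throws IOException
  {
    if (written + 1 > limit) {
      throw new IOException("Result-level cache limit [" + limit + "] exceeded");
    }
    out.write(b);
    written++;
  }

  byte[] toByteArray()
  {
    return out.toByteArray();
  }
}
```

A caller populating the cache can then catch the IOException, abandon cache population, and let the query complete normally.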

Important Note

I opted to use LimitedOutputStream here, as it is already used with ByteArrayOutputStream. While this is fine inside a QueryRunner (single-threaded), it is less than ideal in the general case because it doesn't guarantee strict consistency between overflow-exception delivery and the ordering of writes to the buffer (see the example below). As such, this class is in general *not* thread-safe, and I think it should be refactored to account for this. Because every current use of LimitedOutputStream wraps a ByteArrayOutputStream, which already synchronizes its writes, we would suffer no performance hit by synchronizing the LimitedOutputStream::write methods. This is in the spirit of future-proofing the code: given that we're already paying for locks, we might as well avoid as many future races as we can : ). Since this would require changes to the LimitedOutputStream API (moving away from extending ByteArrayOutputStream directly), I've opted not to change those APIs here, but in a separate PR.

Changes to LimitedOutputStream

  • Expose a public LimitedOutputStream::get() that returns the wrapped output stream for stream-specific operations.
  • Made the written member atomic. This isn't a complete fix for the thread-safety concerns above, but it at least prevents a simple future race in which multiple writing threads can produce an uncaught buffer overflow:

    T1: write(): read written = INT_MAX - 1
    -- context switch --
    T2: write(): read written = INT_MAX - 1
    T2: write written += 1
    T2: write() succeeds
    -- context switch --
    T1: write written += 1
    T1: write() succeeds   (the overflow goes undetected)
    FIN
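The torn-write race traced above can be closed by reserving space atomically before writing. A minimal sketch (hypothetical helper, not the exact Druid patch):

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch (assumed helper, not the exact Druid change): reserving
// bytes via AtomicLong.addAndGet makes the check-and-increment a single
// atomic step, so two racing writers cannot both observe one free byte
// remaining and both succeed.
class LimitCounter
{
  private final AtomicLong written = new AtomicLong();
  private final long limit;

  LimitCounter(long limit)
  {
    this.limit = limit;
  }

  // Reserve len bytes; throws if the reservation would exceed the limit.
  void reserve(long len) throws IOException
  {
    long total = written.addAndGet(len);
    if (total > limit) {
      written.addAndGet(-len); // roll back the failed reservation
      throw new IOException("Byte limit [" + limit + "] exceeded");
    }
  }
}
```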

Release note

Avoid OOM-killing a query when large result-level cache population fails.


Key changed/added classes in this PR
  • processing/src/main/java/org/apache/druid/io/LimitedOutputStream.java
  • server/src/main/java/org/apache/druid/query/ResultLevelCachingQueryRunner.java
  • server/src/test/java/org/apache/druid/query/ResultLevelCachingQueryRunnerTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu-netflix jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch 3 times, most recently from 8bc7891 to 56dfb12 Compare January 22, 2025 16:06
    public class LimitedOutputStream extends OutputStream
    {
      private final OutputStream out;
      private final long limit;
      private final Function<Long, String> exceptionMessageFn;
    - long written;
    + AtomicLong written;
Contributor:

If the class is not thread safe, then I don't see the point of using an AtomicLong here. The hope is that someone would read the above documentation and not share the stream among multiple threads.

@jtuglu-netflix (Author) replied Jan 22, 2025:

I get your point. The AtomicLong effectively just ensures the worst case doesn't happen if someone uses the class incorrectly: a race causing torn writes (two threads write a combined N bytes, and both succeed against a buffer with fewer than N bytes to spare). I can remove it, but until a properly thread-safe, byte-limited version of ByteArrayOutputStream exists, I figured I'd keep it. That will involve more changes unrelated to this bug fix, which I think should be logically separated into another PR. It's effectively defensive programming against the worst-case race if the class is used improperly (in a multi-threaded setting). I can switch back, but I don't see the harm in keeping it.

    @@ -152,6 +153,8 @@ public void after(boolean isDone, Throwable thrown)
          // The resultset identifier and its length is cached along with the resultset
          resultLevelCachePopulator.populateResults();
          log.debug("Cache population complete for query %s", query.getId());
    +   } else { // thrown == null && !resultLevelCachePopulator.isShouldPopulate()
    +     log.error("Failed (and recovered) to populate result level cache for query %s", query.getId());
Contributor:

The error message is a bit confusing. This block will be hit when !resultLevelCachePopulator.isShouldPopulate() evaluates to true. So no attempt would have been made to populate the result level cache. Also, if thrown is null, why was there a failure?

@jtuglu-netflix (Author) replied Jan 22, 2025:

The block will be hit when thrown == null and !resultLevelCachePopulator.isShouldPopulate(). In this case, thrown represents an irrecoverable exception found here, where we re-throw the exception. resultLevelCachePopulator.isShouldPopulate() covers the case where we hit an exception that we can definitively recover from and know how to handle properly (e.g. IOException); that is where stopPopulating() is called.

The distinction is between errors we can (and should) effectively recover from and those where we should re-throw (failing the query).

I can switch to Failed (gracefully) ....
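The recoverable-vs-fatal split described in this thread can be sketched as follows (hypothetical names, not the actual ResultLevelCachingQueryRunner code): a recoverable cache failure such as an IOException flips the populate flag to false and the query proceeds, while any other Throwable propagates and fails the query.

```java
import java.io.IOException;

// Hypothetical sketch of the control flow discussed above (names assumed,
// not the actual Druid classes): catch only the recoverable cache failure
// and let everything else propagate.
class ResultCacheFlow
{
  interface Populator
  {
    void populate() throws IOException;
  }

  private boolean shouldPopulate = true;

  void tryPopulate(Populator populator)
  {
    try {
      populator.populate();
    } catch (IOException e) {
      // Recoverable: stop populating the cache; the query itself succeeds.
      shouldPopulate = false;
    }
    // Any non-IOException Throwable propagates and fails the query.
  }

  boolean isShouldPopulate()
  {
    return shouldPopulate;
  }
}
```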

…for query

Currently, result-level cache population attempts to allocate a buffer large enough to hold the full query results, which can exceed the Integer.MAX_VALUE capacity of ByteArrayOutputStream. ByteArrayOutputStream surfaces this as an OutOfMemoryError, which is not caught and terminates the node. This limits the allocated buffer for storing query results to the value of `CacheConfig.getResultLevelCacheLimit()`.
@jtuglu-netflix jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch from 56dfb12 to 648faf1 Compare January 22, 2025 20:23
Successfully merging this pull request may close these issues:

  • Query Failure due to ResultLevelCache Population OOM