
Avoid OOM-killing query if result-level caching fails #17652

Open · wants to merge 1 commit into base: master

Conversation

@jtuglu-netflix (Contributor) commented Jan 22, 2025

Fixes #17651.

Description

Currently, result-level cache population attempts to allocate a buffer large enough to hold the full query results, which can exceed the Integer.MAX_VALUE capacity of ByteArrayOutputStream. ByteArrayOutputStream surfaces this as an OutOfMemoryError, which is not caught and terminates the query. This PR limits the allocated buffer for storing query results to the value of CacheConfig.getResultLevelCacheLimit(). Although we do a check comparing the buffer size to CacheConfig.getResultLevelCacheLimit() here, that check runs only after the exception has already been thrown, which is too late to handle the failure gracefully.
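A minimal sketch of the pattern described above (hypothetical class and names, not Druid's actual LimitedOutputStream): checking the configured cache limit before each write turns an oversized result into a catchable IOException instead of an OutOfMemoryError thrown while ByteArrayOutputStream grows its internal buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the fix's pattern (names assumed, not the actual
// Druid classes): check the configured cache limit *before* writing, so an
// oversized result surfaces as a catchable IOException rather than an
// OutOfMemoryError from ByteArrayOutputStream's internal growth.
class BoundedCacheStream extends OutputStream
{
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final long limit; // e.g. the value of CacheConfig.getResultLevelCacheLimit()
  private long written;

  BoundedCacheStream(long limit)
  {
    this.limit = limit;
  }

  @Override
  public void write(int b) throws IOException
  {
    if (written + 1 > limit) {
      throw new IOException("Result-level cache limit [" + limit + "] exceeded");
    }
    out.write(b);
    written++;
  }

  byte[] toByteArray()
  {
    return out.toByteArray();
  }
}
```

A caller populating the cache can then catch the IOException, abandon cache population, and let the query complete normally.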

Important Note

I opted to use LimitedOutputStream here, as it is already used with ByteArrayOutputStream. While this is fine inside a QueryRunner (single-threaded), it is less than ideal in the general case because it doesn't guarantee strict consistency between overflow-exception delivery and the ordering of writes to the buffer (see the example below). As such, this class is in general *not* thread-safe, and I think it should be refactored to account for this. Because every current use of LimitedOutputStream wraps a ByteArrayOutputStream, which already synchronizes its writes, we would suffer no performance hit by synchronizing the LimitedOutputStream::write methods. This is in the spirit of future-proofing the code: given that we're already paying for locks, we might as well avoid as many future races as we can : ). Since this would require changes to the LimitedOutputStream API (moving away from extending ByteArrayOutputStream directly), I've opted not to change those APIs here, but in a separate PR.

Changes to LimitedOutputStream

  • Expose a public LimitedOutputStream::get() that returns the wrapped output stream for stream-specific operations.
  • Made the written member atomic. This isn't a complete fix for the thread-safety concerns above, but it at least prevents a simple future race in which multiple writing threads can produce an uncaught buffer overflow:

    T1: write(): read written = INT_MAX - 1
    -- context switch --
    T2: write(): read written = INT_MAX - 1
    T2: write written += 1
    T2: write() succeeds
    -- context switch --
    T1: write written += 1
    T1: write() succeeds   (the overflow goes undetected)
    FIN
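The torn-write race traced above can be closed by reserving space atomically before writing. A minimal sketch (hypothetical helper, not the exact Druid patch):

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch (assumed helper, not the exact Druid change): reserving
// bytes via AtomicLong.addAndGet makes the check-and-increment a single
// atomic step, so two racing writers cannot both observe one free byte
// remaining and both succeed.
class LimitCounter
{
  private final AtomicLong written = new AtomicLong();
  private final long limit;

  LimitCounter(long limit)
  {
    this.limit = limit;
  }

  // Reserve len bytes; throws if the reservation would exceed the limit.
  void reserve(long len) throws IOException
  {
    long total = written.addAndGet(len);
    if (total > limit) {
      written.addAndGet(-len); // roll back the failed reservation
      throw new IOException("Byte limit [" + limit + "] exceeded");
    }
  }
}
```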

Release note

Avoid OOM-killing a query when large result-level cache population fails.


Key changed/added classes in this PR
  • processing/src/main/java/org/apache/druid/io/LimitedOutputStream.java
  • server/src/main/java/org/apache/druid/query/ResultLevelCachingQueryRunner.java
  • server/src/test/java/org/apache/druid/query/ResultLevelCachingQueryRunnerTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu-netflix jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch 3 times, most recently from 8bc7891 to 56dfb12 Compare January 22, 2025 16:06
    public class LimitedOutputStream extends OutputStream
    {
      private final OutputStream out;
      private final long limit;
      private final Function<Long, String> exceptionMessageFn;
    - long written;
    + AtomicLong written;
Contributor:

If the class is not thread safe, then I don't see the point of using an AtomicLong here. The hope is that someone would read the above documentation and not share the stream among multiple threads.

@jtuglu-netflix (Author) replied Jan 22, 2025:

I get your point. The AtomicLong effectively just ensures the worst case doesn't happen if someone uses the class incorrectly: a race causing torn writes (two threads write a combined N bytes, and both succeed against a buffer with fewer than N bytes to spare). I can remove it, but until a properly thread-safe, byte-limited version of ByteArrayOutputStream exists, I figured I'd keep it. That will involve more changes unrelated to this bug fix, which I think should be logically separated into another PR. It's effectively defensive programming against the worst-case race if the class is used improperly (in a multi-threaded setting). I can switch back, but I don't see the harm in keeping it.

    @@ -152,6 +153,8 @@ public void after(boolean isDone, Throwable thrown)
          // The resultset identifier and its length is cached along with the resultset
          resultLevelCachePopulator.populateResults();
          log.debug("Cache population complete for query %s", query.getId());
    +   } else { // thrown == null && !resultLevelCachePopulator.isShouldPopulate()
    +     log.error("Failed (and recovered) to populate result level cache for query %s", query.getId());
Contributor:

The error message is a bit confusing. This block will be hit when !resultLevelCachePopulator.isShouldPopulate() evaluates to true. So no attempt would have been made to populate the result level cache. Also, if thrown is null, why was there a failure?

@jtuglu-netflix (Author) replied Jan 22, 2025:

The block will be hit when thrown == null and !resultLevelCachePopulator.isShouldPopulate(). In this case, thrown represents an irrecoverable exception found here, where we re-throw the exception. resultLevelCachePopulator.isShouldPopulate() covers the case where we hit an exception that we can definitively recover from and know how to handle properly (e.g. IOException); that is where stopPopulating() is called.

The distinction is between errors we can (and should) effectively recover from and those where we should re-throw (failing the query).

I can switch to Failed (gracefully) ....
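The recoverable-vs-fatal split described in this thread can be sketched as follows (hypothetical names, not the actual ResultLevelCachingQueryRunner code): a recoverable cache failure such as an IOException flips the populate flag to false and the query proceeds, while any other Throwable propagates and fails the query.

```java
import java.io.IOException;

// Hypothetical sketch of the control flow discussed above (names assumed,
// not the actual Druid classes): catch only the recoverable cache failure
// and let everything else propagate.
class ResultCacheFlow
{
  interface Populator
  {
    void populate() throws IOException;
  }

  private boolean shouldPopulate = true;

  void tryPopulate(Populator populator)
  {
    try {
      populator.populate();
    } catch (IOException e) {
      // Recoverable: stop populating the cache; the query itself succeeds.
      shouldPopulate = false;
    }
    // Any non-IOException Throwable propagates and fails the query.
  }

  boolean isShouldPopulate()
  {
    return shouldPopulate;
  }
}
```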

…for query

Currently, result-level cache population attempts to allocate a buffer large enough to hold the full query results, which can exceed the Integer.MAX_VALUE capacity of ByteArrayOutputStream. ByteArrayOutputStream surfaces this as an OutOfMemoryError, which is not caught and terminates the node. This limits the allocated buffer for storing query results to the value of `CacheConfig.getResultLevelCacheLimit()`.
@jtuglu-netflix jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch from 56dfb12 to 648faf1 Compare January 22, 2025 20:23
Successfully merging this pull request may close these issues:

  • Query Failure due to ResultLevelCache Population OOM