Avoid OOM-killing query if result-level caching fails #17652
base: master
Conversation
Force-pushed from 8bc7891 to 56dfb12
 */
 public class LimitedOutputStream extends OutputStream
 {
   private final OutputStream out;
   private final long limit;
   private final Function<Long, String> exceptionMessageFn;
-  long written;
+  AtomicLong written;
If the class is not thread safe, then I don't see the point of using an `AtomicLong` here. The hope is that someone would read the above documentation and not share the stream among multiple threads.
I get your point. That `AtomicLong` effectively just ensures the worst case doesn't happen if someone uses it incorrectly (a race that causes torn writes, where two threads write a combined N bytes and both succeed against a buffer with fewer than N bytes to spare). I can remove it, but until a properly thread-safe, byte-limited version of `ByteArrayOutputStream` exists, I figured I'd keep it. That will involve more changes not related to this bug fix, which I think should be logically separated into another PR. It's effectively defensive programming against the worst-case race if the stream is used improperly (in a multi-threaded setting). I can switch back, but I don't see the harm in keeping it.
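To make the trade-off concrete, here is a minimal, hypothetical sketch (a stand-in `BoundedStream`, not Druid's `LimitedOutputStream`) of the check-then-write pattern under discussion; the atomic counter keeps the byte total exact under concurrent use, while the check-plus-write pair itself remains non-atomic:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the actual Druid LimitedOutputStream.
class BoundedStream extends OutputStream
{
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final long limit;
  private final AtomicLong written = new AtomicLong();

  BoundedStream(long limit)
  {
    this.limit = limit;
  }

  @Override
  public void write(int b) throws IOException
  {
    // The atomic increment keeps the counter consistent even if two threads
    // race here, so the limit check cannot be skipped due to a lost update.
    if (written.incrementAndGet() > limit) {
      throw new IOException("Buffer limit of " + limit + " bytes exceeded");
    }
    out.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException
  {
    // With a plain long counter, two concurrent bulk writes could each lose
    // the other's update and slip past the limit; the atomic add keeps the
    // running total exact. The count-then-write pair is still not atomic, so
    // exception delivery and write ordering across threads stay unspecified,
    // which is why the class remains documented as not thread-safe.
    if (written.addAndGet(len) > limit) {
      throw new IOException("Buffer limit of " + limit + " bytes exceeded");
    }
    out.write(b, off, len);
  }
}
```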
@@ -152,6 +153,8 @@ public void after(boolean isDone, Throwable thrown)
         // The resultset identifier and its length is cached along with the resultset
         resultLevelCachePopulator.populateResults();
         log.debug("Cache population complete for query %s", query.getId());
+      } else { // thrown == null && !resultLevelCachePopulator.isShouldPopulate()
+        log.error("Failed (and recovered) to populate result level cache for query %s", query.getId());
The error message is a bit confusing. This block will be hit when `!resultLevelCachePopulator.isShouldPopulate()` evaluates to true. So no attempt would have been made to populate the result level cache. Also, if `thrown` is null, why was there a failure?
The block will be hit when `thrown == null` and `!resultLevelCachePopulator.isShouldPopulate()`. In this case, `thrown` represents the case where an irrecoverable exception was found here, and we re-throw the exception. `resultLevelCachePopulator.isShouldPopulate()` covers the case where we hit an exception that we can definitively recover from and that we know how to handle properly (e.g. `IOException`); that is where `stopPopulating()` is called.
The distinction is between errors we can (and should) effectively recover from and those where we should re-throw (fail the query).
I can switch to `Failed (gracefully) ...`.
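For illustration, a minimal sketch of that split between recoverable and irrecoverable failures; the class, fields, and `consumeRow` method are hypothetical stand-ins, not Druid's actual populator:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical populator sketch illustrating the two failure paths discussed
// above; names and structure are illustrative only.
class CachePopulatorSketch
{
  // In the real code this would be the size-limited stream, which throws an
  // IOException once the configured limit is exceeded.
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private boolean shouldPopulate = true;

  boolean isShouldPopulate()
  {
    return shouldPopulate;
  }

  void consumeRow(byte[] serializedRow)
  {
    try {
      buffer.write(serializedRow);
    } catch (IOException e) {
      // Recoverable: we know how to handle it, so stop caching and let the
      // query keep running (isShouldPopulate() becomes false, and after()
      // later sees thrown == null with no cache population).
      shouldPopulate = false;
      buffer.reset();
    } catch (RuntimeException e) {
      // Irrecoverable: re-throw so the query fails and `thrown` is non-null
      // by the time after(isDone, thrown) runs. (Shown explicitly only to
      // make the contrast with the recoverable branch visible.)
      throw e;
    }
  }
}
```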
Force-pushed from 56dfb12 to 648faf1
Fixes #17651.
Description
Currently, result-level caching that attempts to allocate a large enough buffer to store query results can overflow the `Integer.MAX_VALUE` capacity. `ByteArrayOutputStream` materializes this case as an `OutOfMemoryError`, which is not caught and terminates the query. This PR limits the allocated buffer for storing query results to whatever is set in `CacheConfig.getResultLevelCacheLimit()`. Although we do a check comparing buffer size to `CacheConfig.getResultLevelCacheLimit()` here, this comes after the exception is thrown and is too late to gracefully catch the issue.
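A rough sketch of the guarded path this enables, assuming a `LimitedOutputStream` constructor of the shape implied by its fields in the diff above (wrapped stream, byte limit, message function); `buildCacheValueOrNull` is a hypothetical stand-in for the populator's logic, and only `CacheConfig.getResultLevelCacheLimit()` is the real configuration method:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.druid.io.LimitedOutputStream;

// Sketch only: the LimitedOutputStream constructor shape is assumed from the
// fields shown in this PR's diff, not verified against the current API.
class GuardedCachePopulationSketch
{
  static byte[] buildCacheValueOrNull(byte[] serializedResults, long resultLevelCacheLimit)
  {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    LimitedOutputStream limited = new LimitedOutputStream(
        buffer,
        resultLevelCacheLimit,
        limit -> "Result-level cache limit of " + limit + " bytes exceeded"
    );
    try {
      limited.write(serializedResults);
      return buffer.toByteArray();   // fits within the configured limit: cache it
    } catch (IOException e) {
      // The limit trips before the backing buffer can approach Integer.MAX_VALUE,
      // so we land in a catchable IOException here instead of dying on an
      // OutOfMemoryError, and the query simply proceeds without a cached result.
      return null;
    }
  }
}
```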
Important Note
I opted to use `LimitedOutputStream` here as it is already used with `ByteArrayOutputStream`. While OK in a QueryRunner (single-threaded), this is still less than ideal in the general case because it doesn't guarantee strict consistency between overflow exception delivery and the ordering of writes to the buffer (see another example below). As such, this class is in general *not* thread-safe, and I think it should be refactored to account for this. Since every use of `LimitedOutputStream` already goes through `ByteArrayOutputStream`, which *is* already using locks, we should suffer no performance hit by synchronizing the `LimitedOutputStream::write` methods. This is just in the general spirit of future-proofing code: given that we're already using locks, we might as well avoid as many future races as we can :). Given that this would take some changes to the `LimitedOutputStream` API (from extending `ByteArrayOutputStream` directly), I've opted not to change these APIs here, but in a separate PR.
Changes to LimitedOutputStream
- Added `LimitedOutputStream::get()`, which returns the output stream for stream-specific operations.
- Changed the `wrapped` member to be atomic. This isn't a complete fix for the thread-safety concerns above, but at least it prevents a simple future race case where multiple threads writing can result in uncaught buffer overflows (a minimal illustration follows below):
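A minimal, self-contained illustration of that race (hypothetical demo, not Druid code): two threads bump a shared byte counter; the plain `long` loses updates, so a limit check based on it can stay silent after the real writes have blown past the limit, while the `AtomicLong` keeps the total exact:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: shows why a plain long counter can under-count concurrent
// writes while an AtomicLong cannot.
public class CounterRaceDemo
{
  static long plainWritten = 0;                              // racy counter
  static final AtomicLong atomicWritten = new AtomicLong();  // safe counter

  public static void main(String[] args) throws InterruptedException
  {
    Runnable writer = () -> {
      for (int i = 0; i < 1_000_000; i++) {
        plainWritten += 1;            // read-modify-write: updates can be lost
        atomicWritten.incrementAndGet();
      }
    };
    Thread t1 = new Thread(writer);
    Thread t2 = new Thread(writer);
    t1.start();
    t2.start();
    t1.join();
    t2.join();

    // The atomic counter always reports 2,000,000 "bytes written"; the plain
    // counter usually reports less, so a check like (plainWritten > limit)
    // could stay false even after the real writes exceeded the limit.
    System.out.println("plain  = " + plainWritten);
    System.out.println("atomic = " + atomicWritten.get());
  }
}
```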
Release note
Avoid OOM-killing a query if large result-level cache population fails.
Key changed/added classes in this PR
processing/src/main/java/org/apache/druid/io/LimitedOutputStream.java
server/src/main/java/org/apache/druid/query/ResultLevelCachingQueryRunner.java
server/src/test/java/org/apache/druid/query/ResultLevelCachingQueryRunnerTest.java
This PR has: