
[SPARK-22982] Remove unsafe asynchronous close() call from FileDownloadChannel #20179

Conversation

JoshRosen
Contributor

What changes were proposed in this pull request?

This patch fixes a severe asynchronous IO bug in Spark's Netty-based file transfer code. At a high level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call then this invalid read may cause unrelated file operations to return incorrect results. **One manifestation of this problem is incorrect query results.**

For a high-level overview of how file download works, take a look at the control flow in `NettyRpcEnv.openChannel()`: this code creates a pipe to buffer results, then submits an asynchronous stream request to a lower-level TransportClient. The callback passes received data to the sink end of the pipe. The source end of the pipe is passed back to the caller of `openChannel()`. Thus `openChannel()` returns immediately and callers interact with the returned pipe source channel.
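For readers skimming the thread, here is a minimal, self-contained illustration of that pipe pattern. Everything in it (PipeDownloadSketch, the payload parameter, the background thread standing in for the Netty callback) is invented for the example; it shows the shape of the control flow, not Spark's actual NettyRpcEnv code.

import java.nio.ByteBuffer
import java.nio.channels.{Pipe, ReadableByteChannel}

object PipeDownloadSketch {
  // A toy openChannel(): the pipe buffers asynchronously delivered data, a background
  // "callback" feeds the sink end, and the source end is handed back to the caller
  // immediately, before the transfer has finished.
  def openChannel(payload: Array[Byte]): ReadableByteChannel = {
    val pipe = Pipe.open()
    val sink = pipe.sink()
    new Thread(new Runnable {
      override def run(): Unit = {
        // Stand-in for the stream callback: write the received data, then close the sink.
        try sink.write(ByteBuffer.wrap(payload)) finally sink.close()
      }
    }).start()
    pipe.source() // the caller read()s from this and is responsible for closing it
  }

  def main(args: Array[String]): Unit = {
    val source = openChannel("hello".getBytes("UTF-8"))
    val buf = ByteBuffer.allocate(64) // large enough for the toy payload
    try {
      while (source.read(buf) != -1) {} // returns -1 once the sink side is closed
    } finally {
      source.close()
    }
    println(new String(buf.array(), 0, buf.position(), "UTF-8"))
  }
}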

Because the underlying stream request is asynchronous, errors may occur after `openChannel()` has returned and after that method's caller has started to `read()` from the returned channel. For example, if a client requests an invalid stream from a remote server then the "stream does not exist" error may not be received from the remote server until after `openChannel()` has returned. In order to be able to propagate the "stream does not exist" error to the file-fetching application thread, this code wraps the pipe's source channel in a special `FileDownloadChannel` which adds a `setError(t: Throwable)` method, then calls this `setError()` method in the FileDownloadCallback's `onFailure` method.

It is possible for `FileDownloadChannel`'s `read()` and `setError()` methods to be called concurrently from different threads: the `setError()` method is called from within the Netty RPC system's stream callback handlers, while the `read()` methods are called from higher-level application code performing remote stream reads.

The problem lies in `setError()`: the existing code closed the wrapped pipe source channel. Because `read()` and `setError()` occur in different threads, this means it is possible for one thread to be calling `source.read()` while another asynchronously calls `source.close()`. Java's IO libraries do not guarantee that this will be safe and, in fact, it's possible for these operations to interleave in such a way that a lower-level `read()` system call occurs right after a `close()` call. In the best case, this fails as a read of a closed file descriptor; in the worst case, the file descriptor number has been re-used by an intervening `open()` operation and the read corrupts the result of an unrelated file IO operation being performed by a different thread.

The solution here is to remove the `source.close()` call from `setError()`: the thread that is performing the `read()` calls is responsible for closing the stream in a `finally` block, so there's no need to close it here. If that thread is blocked in a `read()` then it will become unblocked when the sink end of the pipe is closed in `FileDownloadCallback.onFailure()`.

After making this change, we also need to refine the `read()` method to always check for a `setError()` result, even if the underlying channel `read()` call has succeeded.
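Putting the last three paragraphs together, the fixed FileDownloadChannel has roughly the following shape (reconstructed from this description and the diff hunks quoted later in the thread; it is a sketch, not the verbatim Spark source):

import java.nio.ByteBuffer
import java.nio.channels.{Pipe, ReadableByteChannel}
import scala.util.{Failure, Success, Try}

class FileDownloadChannel(source: Pipe.SourceChannel) extends ReadableByteChannel {

  @volatile private var error: Throwable = null

  // After the fix: only record the error. The source channel is NOT closed here;
  // the reading thread owns it and closes it in its own finally block. The failure
  // callback closes the sink end of the pipe instead, which unblocks a pending read().
  def setError(e: Throwable): Unit = {
    error = e
  }

  // Check for an asynchronously reported error even when the underlying read succeeded.
  override def read(dst: ByteBuffer): Int = Try(source.read(dst)) match {
    case _ if error != null => throw error
    case Success(bytesRead) => bytesRead
    case Failure(readErr) => throw readErr
  }

  override def close(): Unit = source.close()
  override def isOpen(): Boolean = source.isOpen()
}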

This patch also makes a slight cleanup to a dodgy-looking `catch e: Exception` block to use a safer `try-finally` error handling idiom.

This bug was introduced in SPARK-11956 / #9941 and is present in Spark 1.6.0+.

How was this patch tested?

This fix was tested manually against a workload which non-deterministically hit this bug.

@SparkQA

SparkQA commented Jan 8, 2018

Test build #85773 has finished for PR 20179 at commit 8e5ffa4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 8, 2018

Test build #85795 has finished for PR 20179 at commit 8e5ffa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case e: Exception =>
exceptionThrown = false
} finally {
if (exceptionThrown) {
pipe.sink().close()

@pathikrit pathikrit Jan 9, 2018


Instead of using this `var exceptionThrown`, can we move these two `close()` calls into the `catch` block? Current code:

var exceptionThrown = true
try {
  // code
  exceptionThrown = false
} finally {
  if (exceptionThrown) {
    pipe.sink().close()
    source.close()
  }
}

Proposed code:

try {
  // code
} catch {
  case e: Throwable =>
    pipe.sink().close()
    source.close()
    throw e
}

Contributor Author


To preserve the same behavior, I think we'd need to do

try {
  // code
} catch {
  case t: Throwable =>
    pipe.sink().close()
    source.close()
    throw t
}

to ensure that we propagate the original Throwable.

It could be clearer (and safer, in the case of exceptions thrown from `close()` calls) to use `Utils.tryWithSafeFinallyAndFailureCallbacks` here, so let me make that change.
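For reference, the refactor being described could look roughly like the following. Utils here is Spark's internal org.apache.spark.util.Utils helper; openChannelSketch and startDownload are hypothetical placeholders standing in for the surrounding NettyRpcEnv code, so treat this as a sketch rather than the merged change.

import java.nio.channels.{Pipe, ReadableByteChannel}
import org.apache.spark.util.Utils

def openChannelSketch(
    startDownload: (Pipe.SinkChannel, FileDownloadChannel) => Unit): ReadableByteChannel = {
  val pipe = Pipe.open()
  val source = new FileDownloadChannel(pipe.source())
  Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
    // Stand-in for creating the client and submitting the async stream request,
    // with the callback wired to pipe.sink() and source.setError(...).
    startDownload(pipe.sink(), source)
  })(catchBlock = {
    // Runs only if the setup itself threw; exceptions from these close() calls are
    // recorded as suppressed exceptions instead of masking the original failure.
    pipe.sink().close()
    source.close()
  })
  source
}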

val offset = in.readLong()
val nextOffset = in.readLong()
val actualPosition = channel.position()
val expectedPosition = blockId.reduceId * 8 + 16
if (actualPosition != expectedPosition) {


Maybe an assert like `assert(actualPosition == expectedPosition, $msg)` would be better for things like this, so we can elide these checks using compiler flags if desired.
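For context on the compiler-flag point: Scala's Predef.assert is marked @elidable, so a check written in that form can be stripped at build time. A hypothetical version of the check (message text invented for the example):

// Compiled out entirely when building with: scalac -Xdisable-assertions ...
assert(actualPosition == expectedPosition,
  s"Expected index channel position $expectedPosition but found $actualPosition")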

Contributor Author


I considered this, but I don't think there's ever a case where we want to elide this particular check: if we read an incorrect offset here then there's (potentially) no other mechanism to detect this error, leading to silent wrong answers.

@cloud-fan
Contributor

also cc @jerryshao

@@ -376,18 +374,13 @@ private[netty] class NettyRpcEnv(

     def setError(e: Throwable): Unit = {
       error = e
-      source.close()
     }

     override def read(dst: ByteBuffer): Int = {
Contributor


So the caller of `read()` would close the source channel?

Contributor Author


Yes. This currently happens in two places:

// class of issue from re-occurring in the future which is why they are left here even though
// SPARK-22982 is fixed.
val channel = Files.newByteChannel(indexFile.toPath)
channel.position(blockId.reduceId * 8)
Contributor


cc @zsxwing I recall you mentioned a performance issue with skipping data in the file channel; do we have this problem here?

Contributor Author


I made sure to incorporate @zsxwing's changes here. The problem originally related to calling skip(), but this change is from his fix to explicitly use position on a FileChannel instead.
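For context, the positioned-read pattern under discussion looks roughly like this. readIndexEntry is a hypothetical helper written for illustration (it assumes the shuffle index layout of one 8-byte long per partition boundary described elsewhere in this thread), not the actual shuffle resolver code.

import java.io.{DataInputStream, File}
import java.nio.channels.Channels
import java.nio.file.Files

def readIndexEntry(indexFile: File, reduceId: Int): (Long, Long) = {
  // Seek with position() on the channel instead of skip()-ing through an InputStream.
  val channel = Files.newByteChannel(indexFile.toPath)
  channel.position(reduceId * 8L)
  val in = new DataInputStream(Channels.newInputStream(channel))
  try {
    val offset = in.readLong()
    val nextOffset = in.readLong()
    // Fail fast, in the spirit of the SPARK-22982 guard: after two readLong() calls the
    // channel should sit exactly 16 bytes past where we positioned it.
    val expectedPosition = reduceId * 8L + 16
    if (channel.position() != expectedPosition) {
      throw new Exception("Incorrect channel position after index file reads: " +
        s"expected $expectedPosition but actual position was ${channel.position()}")
    }
    (offset, nextOffset)
  } finally {
    in.close() // closes the underlying channel as well
  }
}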

Contributor


Sorry, I'm not clear whether the change here is related to the "asynchronous close()" issue?

Contributor


It's used to detect bugs like the "asynchronous close()" issue earlier, should they recur in the future.

Contributor


I see. Thanks!

Contributor Author


For some more background: the asynchronous close() bug can cause reads from a closed-and-subsequently-reassigned file descriptor number and in principle this can affect almost any IO operation anywhere in the application. For example, if the closed file descriptor number is immediately recycled by opening a socket then the invalid read can cause that socket read to miss data (since the data would have been consumed by the invalid reader and won't be delivered to the legitimate new user of the file descriptor).

Given this, I see how it might be puzzling that this patch is adding a check only here. There are two reasons for this:

  1. Many other IO operations have implicit checksumming such that dropping data due to an invalid read will be detected and cause an exception. For example, many compression codecs have block-level checksumming (and magic numbers at the beginning of the stream), so dropping data (especially at the start of a read) will be detected. This particular shuffle index file, however, does not have mechanisms to detect corruption: skipping forward in the read by a multiple of 8 bytes will still read structurally valid data (but it will be the wrong data, causing the wrong output to be read from the shuffle data file).

  2. In the investigation which uncovered this bug, the invalid reads were predominantly impacting shuffle index lookups for reading local blocks. In a nutshell, there's a subtle race condition where Janino codegen compilation triggers attempted remote classloading of classes which don't exist, triggering the error-handling / error-propagation paths in FileDownloadChannel and causing the invalid asynchronous close() call to be performed. At the same time that this close() call was being performed, another task from the same stage attempts to read the shuffle index files of local blocks and experiences an invalid read due to the falsely-shared file descriptor.

    This is a very hard-to-trigger bug: we were only able to reproduce it on large clusters with very fast machines and shuffles that contain large numbers of map and reduce tasks (more shuffle blocks means more index file reads and more chances for the race to occur; faster machines increase the likelihood of the race occurring; larger clusters give us more chances for the error to occur). In our reproduction, this race occurred on a microsecond timescale (measured via kernel syscall tracing) and occurred relatively rarely, requiring many iterations until we could trigger a reproduction.

While investigating, I added these checks so that the index read fails fast when this issue occurs, which made it significantly easier to reproduce and diagnose the root cause (fixed by the other changes in this patch).

There are a number of interesting details in the story of how we worked from the original high-level data corruption symptom to this low-level IO bug. I'll see about writing up the complete story in a blog post at some point.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85828 has finished for PR 20179 at commit c1278b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val actualPosition = channel.position()
val expectedPosition = blockId.reduceId * 8 + 16
if (actualPosition != expectedPosition) {
  throw new Exception(s"SPARK-22982: Incorrect channel position after index file reads: " +
Contributor


Maybe we'd better change this to a more specific Exception type here.

Contributor Author


Any suggestions for a better exception subtype? I don't expect this to be a recoverable error and wanted to avoid the possibility that downstream code catches and handles this error. Maybe I should go further and make it a RuntimeException to make it even more fatal?

Contributor


I see. Thanks!

    }

    override def read(dst: ByteBuffer): Int = {
      Try(source.read(dst)) match {
        case _ if error != null => throw error
Member


I think it is better to also add a short comment here. This bug is subtle and there is no test against it now. Just from this code, it is hard to know why we check error even on success.

Contributor Author


I added a pair of comments to explain the flow of calls involving setError() and pipe closes.

@viirya
Member

viirya commented Jan 9, 2018

LGTM

Member

@zsxwing zsxwing left a comment


LGTM

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85885 has finished for PR 20179 at commit 9e5f20e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 10, 2018
[SPARK-22982] Remove unsafe asynchronous close() call from FileDownloadChannel

Author: Josh Rosen <[email protected]>

Closes #20179 from JoshRosen/SPARK-22982-fix-unsafe-async-io-in-file-download-channel.

(cherry picked from commit edf0a48)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in edf0a48 Jan 10, 2018
@JoshRosen JoshRosen deleted the SPARK-22982-fix-unsafe-async-io-in-file-download-channel branch January 10, 2018 07:37
@srowen
Member

srowen commented Jan 12, 2018

I cherry-picked to 2.2 as well, at least.

asfgit pushed a commit that referenced this pull request Jan 12, 2018
[SPARK-22982] Remove unsafe asynchronous close() call from FileDownloadChannel

Author: Josh Rosen <[email protected]>

Closes #20179 from JoshRosen/SPARK-22982-fix-unsafe-async-io-in-file-download-channel.

(cherry picked from commit edf0a48)
Signed-off-by: Sean Owen <[email protected]>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-22982] Remove unsafe asynchronous close() call from FileDownloadChannel

Author: Josh Rosen <[email protected]>

Closes apache#20179 from JoshRosen/SPARK-22982-fix-unsafe-async-io-in-file-download-channel.

(cherry picked from commit edf0a48)
Signed-off-by: Sean Owen <[email protected]>