[SPARK-15736][CORE][branch-1.6] Gracefully handle loss of DiskStore files #13479

JoshRosen · 2016-06-02T22:21:01Z

If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.
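The failure mode and the fix described above can be sketched with a small toy model. This is not Spark code: `ToyMaster`, `ToyExecutor`, `run_task`, `demo`, and the `invalidate_on_missing` flag are hypothetical names introduced only for illustration, and a single executor stands in for the scheduler's preference for cached locations. The cap of four attempts is an assumption chosen to mirror the default of `spark.task.maxFailures`.

```python
"""Toy model (not Spark code) of an executor that loses an on-disk cached block."""


class ToyMaster:
    """Tracks which executors advertise themselves as holding each block."""

    def __init__(self):
        self.locations = {}  # block_id -> set of executor ids

    def register(self, block_id, executor_id):
        self.locations.setdefault(block_id, set()).add(executor_id)

    def unregister(self, block_id, executor_id):
        self.locations.get(block_id, set()).discard(executor_id)

    def get_locations(self, block_id):
        return set(self.locations.get(block_id, set()))


class ToyExecutor:
    """Hypothetical executor with local block metadata and a fake disk store."""

    def __init__(self, executor_id, master, invalidate_on_missing):
        self.executor_id = executor_id
        self.master = master
        self.invalidate_on_missing = invalidate_on_missing
        self.disk = {}           # stands in for DiskStore files
        self.block_info = set()  # blocks this executor believes it has cached

    def cache_on_disk(self, block_id, data):
        self.disk[block_id] = data
        self.block_info.add(block_id)
        self.master.register(block_id, self.executor_id)

    def read_cached(self, block_id):
        if block_id not in self.block_info:
            return None  # cache miss: the caller should recompute
        if block_id not in self.disk:
            if self.invalidate_on_missing:
                # The fix: drop the stale metadata and stop advertising the block.
                self.block_info.discard(block_id)
                self.master.unregister(block_id, self.executor_id)
            raise IOError("cached file for %s is missing" % block_id)
        return self.disk[block_id]


def run_task(executor, block_id, compute, max_attempts=4):
    """Retry the task on the same executor; return (value, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            cached = executor.read_cached(block_id)
            if cached is not None:
                return cached, attempt
            value = compute()  # recompute the partition and re-cache it
            executor.cache_on_disk(block_id, value)
            return value, attempt
        except IOError:
            continue  # task attempt failed; schedule another attempt
    raise RuntimeError("job failed after %d task attempts" % max_attempts)


def demo(invalidate_on_missing):
    master = ToyMaster()
    executor = ToyExecutor("exec-1", master, invalidate_on_missing)
    executor.cache_on_disk("rdd_0_0", [1, 2, 3])
    del executor.disk["rdd_0_0"]  # simulate losing the DiskStore file
    return run_task(executor, "rdd_0_0", lambda: [1, 2, 3])
```

In this model, `demo(False)` keeps retrying the doomed read until `run_task` raises, while `demo(True)` fails once, invalidates the block, and recomputes the partition on the second attempt.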

This patch fixes this bug and adds an end-to-end regression test (in FailureSuite) and a set of unit tests (in BlockManagerSuite).

This is a branch-1.6 backport of #13473.

JoshRosen · 2016-06-02T22:21:50Z

/cc @andrewor14, this is the branch-1.6 backport of my other patch.

SparkQA · 2016-06-03T00:21:31Z

Test build #59889 has finished for PR 13479 at commit 8f04720.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2016-06-03T00:47:19Z

Merging into 1.6.

…iles

If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

This is a branch-1.6 backport of #13473.

Author: Josh Rosen <[email protected]>

Closes #13479 from JoshRosen/handle-missing-cache-files-branch-1.6.

andrewor14 · 2016-06-03T00:48:22Z

Can you delete the branch?

…iles

If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

This is a branch-1.6 backport of apache#13473.

Author: Josh Rosen <[email protected]>

Closes apache#13479 from JoshRosen/handle-missing-cache-files-branch-1.6.

(cherry picked from commit 4259a28)

JoshRosen added 3 commits June 2, 2016 14:55

Add failing regression test.

fa40a80

Add failing unit tests in BlockManagerSuite.

36536d7

Fix bug.

8f04720

JoshRosen closed this Jun 3, 2016

JoshRosen deleted the handle-missing-cache-files-branch-1.6 branch June 3, 2016 00:52
