[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

jlowe · 2022-03-23T19:34:19Z

Is your feature request related to a problem? Please describe.
Certain CUDA errors, like illegal memory access, are "sticky," meaning that all CUDA operations to the GPU after the error will continue to return the same error over and over. No GPU operations will succeed after that point.

Describe the solution you'd like
The RAPIDS Accelerator should take measures to prevent further task execution on the executor once these "sticky" exceptions are detected. Tearing down the executor process is probably the best option, at least in the short-term. Without an external shuffle handler we will lose the shuffle of tasks that have completed, but this is probably a better way to "fail fast" then allow the executor to keep accepting new tasks only to have them fail the first time they touch the GPU.

sperlingxx · 2022-03-25T07:32:24Z

Hi @jlowe, I have a rough idea on this issue: failing fast through TaskFailureListener.

Add TaskFailureListener at the entries of GPU processing, such as: GpuScan, GpuColumnarToRow, GpuShuffleCoalesce. Skip adding if current tast context has already included one.
The listener analyzes errors, and calls system.exit if the error stack contains any "sticky" CUDA error. We can add a config to represent the max depth of error stack we will search for the "sticky" errors, just like spark.executor.killOnFatalError.depth.
We also need to list common "sticky" errors and figure out how to capture them.

jlowe · 2022-03-25T13:33:07Z

I think using the ExecutorPlugin.onTaskFailure interface would be a cleaner approach, as it avoids needing to ensure we install it in every task attempt. This API was added in 3.1.0 which is now our minimum supported version. Our ExecutorPlugin implementation could override this function and match on the TaskFailedReason parameter to handle ExceptionFailure classes which can provide the Throwable that killed the task. We can then walk the chain of exceptions looking for specific exception types (and possibly messages within those types to further discriminate).

Ideally we should update the cudf bindings to throw a different type of exception for these sticky exceptions which will make them easier to classify in Java/Scala code. There's centralized code in the cudf Java bindings where this mapping can take place (i.e.: the CATCH_STD macro and underlying utility methods).

As for which errors are "sticky" that should be primarily driven from the CUDA documentation on CUDA error codes. For example, any error would be considered "sticky" if it has this text in the description:

This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

I would also add CudaErrorUnknown to that list to be on the safe side.

tgravescs · 2022-03-31T19:29:54Z

may also want to add cudaErrorECCUncorrectable to that list.

tgravescs · 2022-03-31T21:57:30Z

A couple of examples of the Exceptions:

22/03/14 06:00:57 ERROR Executor: Exception in task 2195.0 in stage 7.0 (TID 5960)
ai.rapids.cudf.CudfException: CUDA error encountered at: ../src/io/utilities/hostdevice_vector.hpp:57: 999 cudaErrorUnknown unknown error
        at ai.rapids.cudf.Table.readParquet(Native Method)
        at ai.rapids.cudf.Table.readParquet(Table.java:1006)
        at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readToTable$1(GpuParquetScanBase.scala:1535)

22/03/28 14:00:05 ERROR Executor: Exception in task 59755.0 in stage 6.0 (TID 62795)
ai.rapids.cudf.CudfException: CUDA error encountered at: ../src/io/utilities/hostdevice_vector.hpp:57: 214 cudaErrorECCUncorrectable uncorrectable ECC error encountered
        at ai.rapids.cudf.Table.readParquet(Native Method)
        at ai.rapids.cudf.Table.readParquet(Table.java:1006)
        at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToTable$4(GpuParquetScanBase.scala:1397)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
        at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToTable$3(GpuParquetScanBase.scala:1396)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)

This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, which enables cuDF JNI to throw CUDA errors with specific error code. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others. With this improvement, it is supposed to be easier to track CUDA errors triggered by JVM APIs. Authors: - Alfred Xu (https://github.com/sperlingxx) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #10551

sameerz · 2022-06-01T18:45:21Z

Moving to 22.08 as there are cudf dependencies that will be in 22.08

Closes #5029 Detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applys a more comprehensive way to determine whether a CUDA error is fatal or not. Signed-off-by: sperlingxx <[email protected]> Co-authored-by: Jason Lowe <[email protected]>

jlowe added feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 23, 2022

sperlingxx self-assigned this Mar 28, 2022

sameerz removed the ? - Needs Triage Need team to review and classify label Mar 29, 2022

sameerz mentioned this issue Mar 29, 2022

[TASK] Big Reliability Epic #1870

Closed

14 tasks

sperlingxx mentioned this issue Mar 31, 2022

JNI: throw CUDA errors more specifically rapidsai/cudf#10551

Merged

tgravescs mentioned this issue Mar 31, 2022

On task failure catch some CUDA exceptions and kill executor [databricks] #5118

Merged

revans2 added P0 Must have for release reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Apr 12, 2022

sperlingxx mentioned this issue Apr 28, 2022

Halt Spark executor when encountering unrecoverable CUDA errors #5350

Merged

sameerz added this to the May 2 - May 20 milestone Apr 29, 2022

sameerz modified the milestones: May 2 - May 20, May 23 - Jun 3 May 20, 2022

sameerz removed the feature request New feature or request label Jun 3, 2022

sameerz modified the milestones: May 23 - Jun 3, Jun 6 - Jun 17 Jun 8, 2022

sperlingxx closed this as completed in #5350 Jun 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

jlowe commented Mar 23, 2022

sperlingxx commented Mar 25, 2022 •

edited

Loading

jlowe commented Mar 25, 2022

tgravescs commented Mar 31, 2022

tgravescs commented Mar 31, 2022

sameerz commented Jun 1, 2022

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

Comments

jlowe commented Mar 23, 2022

sperlingxx commented Mar 25, 2022 • edited Loading

jlowe commented Mar 25, 2022

tgravescs commented Mar 31, 2022

tgravescs commented Mar 31, 2022

sameerz commented Jun 1, 2022

sperlingxx commented Mar 25, 2022 •

edited

Loading