
Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] #8296

Merged: 10 commits into NVIDIA:branch-23.06 on May 16, 2023

Conversation

tgravescs (Collaborator)

Part of #8210

The issue here is that our multi-threaded and coalescing readers fail to read with: org.apache.spark.util.TaskCompletionListenerException: org.apache.spark.SparkException: Missing Credential Scope.

This happens when the Unity Catalog feature on Databricks is enabled. Interestingly enough, it only has to be enabled at the account level, even though this particular job isn't using it directly.

This PR only handles Parquet files because that is all I'm currently able to test with. I am working on setting up our own environment to test Unity Catalog, but at the moment I have only tested this fix in a customer's Databricks 11.3 environment. I tested the multithreaded, coalescing, and per-file readers. For coalescing I also tested with filtering in parallel both on and off.

The only way I've been able to reproduce this so far is to do a Delta write to one table and then a read from another one. I have limited options since it's a customer environment. My guess is that the write sets the credential scope to one thing, and then when we go to do the read the scope is set wrong, so we have to explicitly set it for the read to work.
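For reference, a minimal sketch of that reproduction, with placeholder database/table names (the actual customer tables aren't public):

```scala
// Hypothetical reproduction sketch; table names are placeholders and `spark`
// is the usual SparkSession in a Databricks notebook or spark-shell.
// 1. Delta write to one table (this appears to establish a credential scope).
spark.range(1000).write.format("delta").mode("overwrite").saveAsTable("some_db.table_a")

// 2. Read from a different table; with the multi-threaded or coalescing reader
//    this failed with "Missing Credential Scope" before this fix.
spark.read.table("some_db.table_b").collect()
```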

For the fix, I found a way to get a Hadoop conf that seems to carry the necessary credential settings, and the reader threads now use that conf.
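A rough sketch of the idea (not the exact spark-rapids code; credentialAwareHadoopConf is a hypothetical helper name):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Capture a Hadoop conf on the task thread, where the Databricks
// credential-scope settings are present, and hand that same conf to the
// multi-threaded reader's worker threads instead of letting each thread
// derive its own.
def credentialAwareHadoopConf(): Configuration = {
  // newHadoopConf() folds the session's runtime options (including any
  // credential-related entries) into the base Hadoop configuration.
  SparkSession.active.sessionState.newHadoopConf()
}
```

The reader threads would then open files against this captured conf, e.g. with `new Path(file).getFileSystem(conf)`, rather than against one built inside the thread pool.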

tgravescs self-assigned this on May 15, 2023

tgravescs (Collaborator, Author)

need to fix 3.4

sameerz added the bug (Something isn't working) label on May 16, 2023
tgravescs (Collaborator, Author)

Spark 3.4 changed PartitionedFile.filePath to be a SparkPath, so I had to shim it.

jlowe (Member) commented on May 16, 2023

> Spark 3.4 changed PartitionedFile.filePath to be a SparkPath, so I had to shim it.

Note that the shims could have been avoided (with a minuscule perf hit on Spark 3.4+) by simply tacking a toString onto the path access before wielding it as a string argument to the Path constructor.
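For illustration, that shim-free alternative would look roughly like this (a sketch, assuming partFile is the PartitionedFile in hand):

```scala
import org.apache.hadoop.fs.Path

// Compiles on both sides of the Spark 3.4 boundary: before 3.4, filePath is
// a String (toString is a no-op); on 3.4+ it is a SparkPath whose toString
// yields the underlying path string.
val hadoopPath = new Path(partFile.filePath.toString)
```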

tgravescs changed the title from "Fix Multithreaded Readers working with Unity Catalog on Databricks" to "Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks]" on May 16, 2023
tgravescs (Collaborator, Author)

build

jlowe previously approved these changes on May 16, 2023
tgravescs (Collaborator, Author)

Failure fetching a jar; going to re-kick the tests.

tgravescs (Collaborator, Author)

build

Commit: …hims/ReaderUtils.scala

Co-authored-by: Alessandro Bellina <[email protected]>
tgravescs merged commit 5f40711 into NVIDIA:branch-23.06 on May 16, 2023
tgravescs deleted the fixUnityMultiReads branch on May 16, 2023 at 22:03
razajafri added commits to razajafri/spark-rapids that referenced this pull request on Apr 30, 2024
razajafri added a commit that referenced this pull request on May 1, 2024:
…tabricks] (#10756)

* Use cached ThreadPoolExecutor

* Revert "Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] (#8296)"

* Signing off

Signed-off-by: Raza Jafri <[email protected]>

* Removed spark311 version of ReaderUtils.scala

---------

Signed-off-by: Raza Jafri <[email protected]>