
Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] #8296

Merged: 10 commits into NVIDIA:branch-23.06 on May 16, 2023

Conversation

tgravescs (Collaborator)

Part of #8210

The issue here is that our multi-threaded and coalescing readers fail to read with: org.apache.spark.util.TaskCompletionListenerException: org.apache.spark.SparkException: Missing Credential Scope.

This happens when the Unity Catalog feature on Databricks is enabled. Interestingly enough, it only has to be enabled at the account level, even though this particular job isn't using it directly.

This PR only handles Parquet files because that is all I'm currently able to test with. I am working on setting up our own environment to test Unity Catalog, but at the moment I have only tested this fix in a customer's Databricks 11.3 environment. I tested the multithreaded, coalescing, and per-file readers. For coalescing I also tested with filtering in parallel both on and off.

The only way I've been able to reproduce this so far is to do a Delta write to one table and then a read from another one. I have limited options since it's a customer environment. My guess is that the write sets the credential scope to one thing, and then when we go to do the read the scope is set wrong, so we have to explicitly set it for the read to work.
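For reference, a minimal sketch of that reproduction, with placeholder database/table names (the actual customer tables aren't public):

```scala
// Hypothetical reproduction sketch; table names are placeholders and `spark`
// is the usual SparkSession in a Databricks notebook or spark-shell.
// 1. Delta write to one table (this appears to establish a credential scope).
spark.range(1000).write.format("delta").mode("overwrite").saveAsTable("some_db.table_a")

// 2. Read from a different table; with the multi-threaded or coalescing reader
//    this failed with "Missing Credential Scope" before this fix.
spark.read.table("some_db.table_b").collect()
```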

For the fix, I found a way to get a Hadoop conf that seems to carry the necessary credential settings, and the reader threads now use that conf.
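A rough sketch of the idea (not the exact spark-rapids code; credentialAwareHadoopConf is a hypothetical helper name):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Capture a Hadoop conf on the task thread, where the Databricks
// credential-scope settings are present, and hand that same conf to the
// multi-threaded reader's worker threads instead of letting each thread
// derive its own.
def credentialAwareHadoopConf(): Configuration = {
  // newHadoopConf() folds the session's runtime options (including any
  // credential-related entries) into the base Hadoop configuration.
  SparkSession.active.sessionState.newHadoopConf()
}
```

The reader threads would then open files against this captured conf, e.g. with `new Path(file).getFileSystem(conf)`, rather than against one built inside the thread pool.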

tgravescs self-assigned this on May 15, 2023

tgravescs (Collaborator, Author)

need to fix 3.4

sameerz added the bug (Something isn't working) label on May 16, 2023
tgravescs (Collaborator, Author)

Spark 3.4 changed PartitionedFile.filePath to be a SparkPath, so I had to shim it.

jlowe (Member) commented on May 16, 2023

> Spark 3.4 changed PartitionedFile.filePath to be a SparkPath, so I had to shim it.

Note that the shims could have been avoided (with a minuscule perf hit on Spark 3.4+) by simply tacking a toString onto the path access before wielding it as a string argument to the Path constructor.
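For illustration, that shim-free alternative would look roughly like this (a sketch, assuming partFile is the PartitionedFile in hand):

```scala
import org.apache.hadoop.fs.Path

// Compiles on both sides of the Spark 3.4 boundary: before 3.4, filePath is
// a String (toString is a no-op); on 3.4+ it is a SparkPath whose toString
// yields the underlying path string.
val hadoopPath = new Path(partFile.filePath.toString)
```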

tgravescs changed the title from "Fix Multithreaded Readers working with Unity Catalog on Databricks" to "Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks]" on May 16, 2023
tgravescs (Collaborator, Author)

build

jlowe previously approved these changes on May 16, 2023
tgravescs (Collaborator, Author)

Failure fetching a jar; going to re-kick the tests.

tgravescs (Collaborator, Author)

build

Commit: …hims/ReaderUtils.scala

Co-authored-by: Alessandro Bellina <[email protected]>
tgravescs merged commit 5f40711 into NVIDIA:branch-23.06 on May 16, 2023
tgravescs deleted the fixUnityMultiReads branch on May 16, 2023 at 22:03
razajafri added commits to razajafri/spark-rapids that referenced this pull request on Apr 30, 2024
razajafri added a commit that referenced this pull request on May 1, 2024:
…tabricks] (#10756)

* Use cached ThreadPoolExecutor

* Revert "Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] (#8296)"

* Signing off

Signed-off-by: Raza Jafri <[email protected]>

* Removed spark311 version of ReaderUtils.scala

---------

Signed-off-by: Raza Jafri <[email protected]>