Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] #8296
Conversation
Signed-off-by: Thomas Graves <[email protected]>
need to fix 3.4
Signed-off-by: Thomas Graves <[email protected]>
Spark 3.4 changed PartitionedFile.filePath to be a SparkPath, so it had to be shimmed
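The shim mentioned above can be illustrated with a minimal sketch. This is not the actual spark-rapids shim code; `ReaderUtils` and `getPartitionedFilePath` are illustrative names, and the two objects would live in separate version-specific shim source trees (as the per-version directories like `spark330db` in this PR suggest), never compiled together.

```scala
// Sketch only: illustrative shim layering for the Spark 3.4
// PartitionedFile.filePath type change (String -> SparkPath).

// --- Shim for Spark 3.3 and earlier (e.g. a spark33x source tree) ---
// filePath is a plain String here, so it can be returned directly.
//
// import org.apache.spark.sql.execution.datasources.PartitionedFile
//
// object ReaderUtils {
//   def getPartitionedFilePath(pf: PartitionedFile): String = pf.filePath
// }

// --- Shim for Spark 3.4+ (e.g. a spark340 source tree) ---
// filePath is now a SparkPath; convert it back to a String so the
// common (unshimmed) reader code keeps a single type to work with.
//
// object ReaderUtils {
//   def getPartitionedFilePath(pf: PartitionedFile): String =
//     pf.filePath.toString
// }
```

Common code then calls `ReaderUtils.getPartitionedFilePath` and the build picks whichever shim matches the Spark version on the classpath.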
Note shims could have been avoided (with a minuscule perf hit on Spark 3.4+) by simply tacking a |
build
sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/ReaderUtils.scala
failure fetching a jar, going to rekick tests
build
…hims/ReaderUtils.scala Co-authored-by: Alessandro Bellina <[email protected]>
…ricks [databricks] (NVIDIA#8296)" This reverts commit 5f40711.
…ricks [databricks] (NVIDIA#8296)"
…tabricks] (#10756)
* Use cached ThreadPoolExecutor
* Revert "Fix Multithreaded Readers working with Unity Catalog on Databricks [databricks] (#8296)"
* Signing off Signed-off-by: Raza Jafri <[email protected]>
* Removed spark311 version of ReaderUtils.scala
Signed-off-by: Raza Jafri <[email protected]>
Part of #8210
The issue here is that our multi-threaded and coalescing readers fail to read with: org.apache.spark.util.TaskCompletionListenerException: org.apache.spark.SparkException: Missing Credential Scope.
This happens when the Unity Catalog feature on Databricks is enabled. Interestingly enough, it only has to be enabled at the account level, even though this particular job isn't using it directly.
This PR only handles Parquet files because that is all I'm currently able to test with. I am working on setting up our own environment to test Unity Catalog with, but for the moment I tested this fix in a customer's Databricks 11.3 environment. I tested the multithreaded, coalescing, and perfile readers. For coalescing I also tested with filtering in parallel on and off.
The only way I've been able to reproduce this so far is to do a Delta write to one table and then a read from another one. I have limited options since it's a customer environment. My guess is that the write sets up the credential scope for one thing, and then when we go to do the read the scope is set wrong, so we have to explicitly set it for the read to work.
For the fix, I found a way to get a Hadoop conf that seems to have the necessary configs for credentials, and that conf is used in the reader threads.
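The shape of the fix can be sketched as follows. This is a hedged illustration, not the PR's actual code: the exact conf-retrieval call used in the shimmed `ReaderUtils` is not shown in this conversation, so the sketch assumes the session-derived Hadoop conf is the one carrying the Databricks credential-scope properties.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Sketch: build the Hadoop conf from the active Spark session on the
// task/driver side, so any credential-scope properties injected by
// Databricks (for Unity Catalog) are present in it.
val hadoopConf: Configuration =
  SparkSession.active.sessionState.newHadoopConf()

// The key point of the fix: reader threads in the multithreaded and
// coalescing readers must reuse this conf when opening files, instead
// of constructing a fresh `new Configuration()` (or an otherwise
// incomplete conf), which would lack the credential scope and fail
// with "Missing Credential Scope".
```

The design choice here is to capture the conf once, where the credential scope is known to be correct, and hand it to the background reader threads, rather than letting each thread derive its own.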