Encode the file path from Iceberg when converting to a PartitionedFile [databricks] #9717
Iceberg always gives the raw data path, but the RAPIDS multi-file readers expect the file path to be encoded and URL-safe. Signed-off-by: Firestarman <[email protected]>
...ugin/src/main/java/com/nvidia/spark/rapids/iceberg/spark/source/GpuMultiFileBatchReader.java
Arrays.stream(pFiles).forEach(pFile -> {
  String partFilePathString = pFile.filePath().toString();
  FileScanTask fst = files.get(partFilePathString);
  FilteredParquetFileInfo filteredInfo = filterParquetBlocks(fst, partFilePathString);
nit: Let's add a comment explaining why we need to pass in partFilePathString rather than using fst.file().path().toString() to get it.
We could use toEncodedPathString(fst) here instead of partFilePathString, but I don't want to do the encoding again, since we already have the encoded path string in the PartitionedFile.
Done
Signed-off-by: Firestarman <[email protected]>
close #9697
Iceberg always gives the raw data path, but the RAPIDS multi-file readers expect the file path to be encoded and URL-safe.
This PR therefore encodes the file path string for Iceberg multi-file reads.
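To illustrate the mismatch described above, here is a minimal sketch of how a raw path and its encoded form can diverge, and why a lookup keyed by one fails with the other. It uses only java.net.URI from the JDK. The class name PathEncodingSketch, the helper encodePath, and the sample path are hypothetical; this is not the plugin's toEncodedPathString implementation, which may well encode differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not code from the spark-rapids plugin.
class PathEncodingSketch {

  // Assumed helper: percent-encodes characters that are illegal in a URI path
  // (for example spaces) by going through the multi-argument URI constructor.
  static String encodePath(String rawPath) throws URISyntaxException {
    return new URI(null, null, rawPath, null).toString();
  }

  public static void main(String[] args) throws URISyntaxException {
    // A raw path of the kind a table format might hand back (made-up example).
    String raw = "/warehouse/db/my table/part=a b/data.parquet";
    String encoded = encodePath(raw);

    // Key the lookup map by the encoded string, as a reader that only ever
    // sees encoded paths would need.
    Map<String, String> tasksByPath = new HashMap<>();
    tasksByPath.put(encoded, "scan task for " + raw);

    System.out.println(encoded);
    // Looking up with the raw path misses; the encoded path hits.
    System.out.println(tasksByPath.containsKey(raw));
    System.out.println(tasksByPath.containsKey(encoded));
  }
}
```

The point of the sketch is that the map key and the path carried by each file handle must use the same form. Encoding once and reusing that string, as discussed in the review thread, avoids both the mismatch and a second encoding pass.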