Pass metadata extractors to FileScanRDD [databricks] #10616

razajafri · 2024-03-20T20:34:52Z

This PR handles a change made in Spark: we now pass the metadata extractors from the fileFormat through to the FileScanRDD.

Changes

Created a shim for 350+ to pass the metadata extractors

fixes #8766

Signed-off-by: Raza Jafri <[email protected]>

gerashegalov · 2024-03-20T21:18:35Z

sql-plugin/src/main/spark350/scala/com/nvidia/spark/rapids/shims/SparkShims.scala

+      metadataColumns: Seq[AttributeReference] = Seq.empty): RDD[InternalRow] = {
+    if (relation.isDefined) {
+      new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
+        metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)


If we pass relation only to reach fileFormat, and fileFormat only to reach fileConstantMetadataExtractors, and only the latter is shim-specific, should we just pass fileFormat as an Option to getFileScanRDD?

gerashegalov · 2024-03-20T21:27:44Z

sql-plugin/src/main/spark350/scala/com/nvidia/spark/rapids/shims/SparkShims.scala

+    if (relation.isDefined) {
+      new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
+        metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)
+    } else {
+      new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns)
+    }


Typically in Scala calling .get on an Option is an anti-pattern.

Suggested change

-    if (relation.isDefined) {
-      new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
-        metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)
-    } else {
-      new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns)
-    }
+    new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
+      relation.map(_.fileFormat.fileConstantMetadataExtractors).getOrElse(Map.empty))
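
For anyone reading along, here is a minimal, self-contained sketch of the two styles compared in the suggestion above. `FakeFileFormat`, `FakeRelation`, and the `Extractors` alias are hypothetical stand-ins invented for illustration, not the real Spark classes, and the extractor function type is simplified to `String => Any`.

```scala
// Hypothetical stand-ins for Spark's FileFormat and HadoopFsRelation.
// They exist only so the sketch compiles without Spark on the classpath.
object OptionSketch {
  // Simplified: the extractors here take a String path, which is an assumption.
  type Extractors = Map[String, String => Any]

  final case class FakeFileFormat(fileConstantMetadataExtractors: Extractors)
  final case class FakeRelation(fileFormat: FakeFileFormat)

  // Style flagged in the review: isDefined followed by .get.
  // It works, but the check and the .get can drift apart during refactoring.
  def extractorsWithGet(relation: Option[FakeRelation]): Extractors =
    if (relation.isDefined) relation.get.fileFormat.fileConstantMetadataExtractors
    else Map.empty

  // Suggested style: map over the Option and fall back to an empty Map.
  def extractorsWithMap(relation: Option[FakeRelation]): Extractors =
    relation.map(_.fileFormat.fileConstantMetadataExtractors).getOrElse(Map.empty)
}
```

Both functions return the same result for `Some` and `None`; the second simply never calls `.get`, so there is no path that can throw on an empty Option.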

gerashegalov · 2024-03-20T21:29:52Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/SparkShims.scala

@@ -78,6 +78,7 @@ trait SparkShims {
      readFunction: (PartitionedFile) => Iterator[InternalRow],
      filePartitions: Seq[FilePartition],
      readDataSchema: StructType,
+      relation: Option[HadoopFsRelation],


if you move this arg to the last position and give it a default value None, you probably will have fewer lines to modify
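
A rough sketch of why a trailing parameter with a default tends to shrink the diff: existing call sites keep compiling unchanged, and only the callers that need the new value name it. The signature below is hypothetical and heavily simplified (plain `String` values instead of Spark types); it is not the real `getFileScanRDD`.

```scala
object DefaultArgSketch {
  // Hypothetical, simplified signature. The new parameter sits last and
  // defaults to None, so callers that do not care about it need no edits.
  def getFileScanRDD(
      readDataSchema: Seq[String],
      metadataColumns: Seq[String] = Seq.empty,
      fileFormat: Option[String] = None): String = {
    val schema = readDataSchema.mkString(",")
    s"schema=$schema meta=${metadataColumns.size} format=${fileFormat.getOrElse("none")}"
  }

  // A pre-existing call site: compiles without modification.
  val oldCall: String = getFileScanRDD(Seq("a", "b"))

  // A new call site: passes only the added argument, by name.
  val newCall: String = getFileScanRDD(Seq("a", "b"), fileFormat = Some("parquet"))
}
```

Note that every shim overriding the method would still need its own signature updated; the saving is at the call sites.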

razajafri · 2024-03-21T05:18:12Z

Thanks for the review and suggestions. PTAL again

gerashegalov

LGTM, but the copyrights need updating.

You can either use a pre-commit hook or invoke the script directly:

export SPARK_RAPIDS_AUTO_COPYRIGHTER=ON 
git diff origin/branch-24.04..HEAD --name-status | \
  awk '/^M\s+/ { print $2}' | \
  xargs ./scripts/auto-copyrighter.sh

gerashegalov · 2024-03-21T16:53:47Z

sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/Spark330PlusShims.scala

Update copyright

gerashegalov · 2024-03-21T16:54:58Z

sql-plugin/src/main/spark311/scala/com/nvidia/spark/rapids/shims/Spark31Xuntil33XShims.scala

Update copyright year

gerashegalov

LGTM

gerashegalov · 2024-03-21T18:16:36Z

build

razajafri added 2 commits March 20, 2024 13:28

Pass metadata extractors to FileScanRDD

87fcc7d

Signing off

b2d133b

Signed-off-by: Raza Jafri <[email protected]>

gerashegalov reviewed Mar 20, 2024

addressed review comments

a641cff

gerashegalov reviewed Mar 21, 2024

updated copyrights manually

03b843f

gerashegalov approved these changes Mar 21, 2024

razajafri merged commit 09a0081 into NVIDIA:branch-24.04 Mar 22, 2024
42 of 43 checks passed

razajafri deleted the SP-8766-file-source-scan branch March 22, 2024 16:42

This was referenced Apr 11, 2024

Merge branch-24.04 into main NvTimLiu/spark-rapids-jni#6

Merged

Update latest changelog [skip ci] #10683

Merged

razajafri restored the SP-8766-file-source-scan branch April 23, 2024 17:30
