[SPARK-18702][SQL] input_file_block_start and input_file_block_length #16133

Closed

@rxin (Contributor) commented Dec 4, 2016

What changes were proposed in this pull request?

We currently have the function input_file_name to get the path of the input file, but no functions to get the block start offset or block length. This patch introduces two functions:

  1. input_file_block_start: returns the file block start offset, or -1 if not available.

  2. input_file_block_length: returns the file block length, or -1 if not available.

How was this patch tested?

Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
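The -1 defaults above suggest a per-thread holder that the file reader populates and the expressions consult. A minimal plain-Scala sketch of that pattern (the object and method names here are illustrative, not Spark's actual InputFileBlockHolder API):

```scala
object InputFileBlockSketch {
  // (path, blockStart, blockLength); -1 means "not available".
  private val state = new ThreadLocal[(String, Long, Long)] {
    override def initialValue(): (String, Long, Long) = ("", -1L, -1L)
  }

  // Called by the reader before handing rows to downstream expressions.
  def set(path: String, start: Long, length: Long): Unit =
    state.set((path, start, length))

  // Called when the task is done with the file.
  def unset(): Unit = state.remove()

  // What the three expressions would read.
  def inputFileName: String = state.get()._1
  def blockStart: Long = state.get()._2
  def blockLength: Long = state.get()._3
}
```

Because the state is thread-local, concurrent tasks in the same JVM each see only their own file and block.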

@rxin
Copy link
Contributor Author

rxin commented Dec 4, 2016

cc @ueshin

-    val inputMetrics = context.taskMetrics().inputMetrics
-    val existingBytesRead = inputMetrics.bytesRead
+    private val inputMetrics = context.taskMetrics().inputMetrics
+    private val existingBytesRead = inputMetrics.bytesRead
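The diff above snapshots bytesRead once, at construction, so the iterator can later report only the bytes it read itself. A small plain-Scala sketch of that snapshot-and-delta pattern (BytesReadTracker is an illustrative name, not Spark code):

```scala
// Snapshot-and-delta pattern: record the metric's value at construction so
// later readings can be attributed to this reader alone.
class BytesReadTracker(currentBytesRead: () => Long) {
  private val existingBytesRead = currentBytesRead() // snapshot once
  def bytesReadByThisReader(): Long = currentBytesRead() - existingBytesRead
}
```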

// Sets the thread local variable for the file's name
(Member) commented:

nit. Updating comment together?

@ueshin (Member) commented Dec 4, 2016:

I agree, and the same applies to the comment in HadoopRDD.

@SparkQA commented Dec 4, 2016

Test build #69632 has finished for PR 16133 at commit 7713ebe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = spark.read.parquet(dir.getCanonicalPath).select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

Should we add input_file_block_start() and input_file_block_length() to functions.scala and use them the same way as input_file_name()?

@rxin (Contributor, author) commented:

I'm actually intentionally not adding those because I don't know how common these expressions will be.

(Member) commented:

I see.
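Had the two expressions been exposed in functions.scala (they were not, per the decision above), the wrappers would likely have followed the existing input_file_name() convention of building a Column around an expression. A stand-alone sketch with stub types (Expression, Column, and withExpr here are simplified stand-ins, not Spark's actual classes):

```scala
object functionsSketch {
  // Stub stand-ins so this sketch compiles without a Spark dependency.
  case class Expression(name: String)
  case class Column(expr: Expression)

  // Mirrors the withExpr helper pattern used in functions.scala.
  private def withExpr(e: Expression): Column = Column(e)

  // Hypothetical wrappers following the input_file_name() convention.
  def input_file_block_start(): Column = withExpr(Expression("InputFileBlockStart"))
  def input_file_block_length(): Column = withExpr(Expression("InputFileBlockLength"))
}
```

With such wrappers, the tests quoted below could call the functions directly instead of going through expr("input_file_block_start()").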

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

ditto.

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

ditto.

@SparkQA commented Dec 4, 2016

Test build #69642 has finished for PR 16133 at commit 097bfec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* The thread variable for the name of the current file being read. This is used by
(Member) commented:

nit: this comment can be updated too.

@rxin (Contributor, author) commented Dec 5, 2016

Thanks - going to merge this in master.

@asfgit asfgit closed this in e9730b7 Dec 5, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017