[SPARK-18702][SQL] input_file_block_start and input_file_block_length #16133
cc @ueshin
val inputMetrics = context.taskMetrics().inputMetrics
val existingBytesRead = inputMetrics.bytesRead
private val inputMetrics = context.taskMetrics().inputMetrics
private val existingBytesRead = inputMetrics.bytesRead

// Sets the thread local variable for the file's name
nit: should this comment be updated as well?
I agree. The same comment in HadoopRDD should be updated as well.
Test build #69632 has finished for PR 16133 at commit
checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = spark.read.parquet(dir.getCanonicalPath).select(
  input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
Should we add input_file_block_start() and input_file_block_length() to functions.scala and use them the same way as input_file_name()?
I'm actually intentionally not adding those because I don't know how common these expressions will be.
I see.
checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
  input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
ditto.
checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
  input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
ditto.
Test build #69642 has finished for PR 16133 at commit
}

/**
 * The thread variable for the name of the current file being read. This is used by
nit: this comment can be updated too.
Thanks - going to merge this into master.
## What changes were proposed in this pull request? We currently have function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions: 1. input_file_block_start: returns the file block start offset, or -1 if not available. 2. input_file_block_length: returns the file block length, or -1 if not available. ## How was this patch tested? Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions. Author: Reynold Xin <[email protected]> Closes apache#16133 from rxin/SPARK-18702.
## What changes were proposed in this pull request?
We currently have the function input_file_name to get the path of the input file, but we don't have functions to get the block start offset and length. This patch introduces two functions:
1. input_file_block_start: returns the file block start offset, or -1 if not available.
2. input_file_block_length: returns the file block length, or -1 if not available.
## How was this patch tested?
Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
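For readers who want to try the two functions described above, here is a minimal sketch modeled on the test snippets quoted in this conversation. It assumes spark-sql is on the classpath and a local master. The object name `InputFileBlockExample`, the helper `blockInfo`, and the temporary directory are hypothetical, introduced only for illustration. Because the review discussion above indicates the two new functions were intentionally not added to functions.scala, the sketch calls them through `expr`.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, input_file_name}

// Hypothetical example object; not part of the patch.
object InputFileBlockExample {

  // Selects the file name, block start offset, and block length for each row
  // read from the Parquet data at `path`.
  def blockInfo(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).select(
      input_file_name(),
      expr("input_file_block_start()"),
      expr("input_file_block_length()"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("input-file-block-example")
      .getOrCreate()

    // Write a small Parquet data set to a fresh path under a temporary directory.
    val path = Files.createTempDirectory("input_file_block").resolve("data").toString
    spark.range(10).write.parquet(path)

    // When reading from files, each row carries the block information of the
    // file split it came from.
    blockInfo(spark, path).show(truncate = false)

    spark.stop()
  }
}
```

Per the description above, both functions return -1 when block information is not available, so callers that depend on the offsets may want to check for that value.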