[SPARK-18702][SQL] input_file_block_start and input_file_block_length #16133

Closed

@rxin (Contributor) commented Dec 4, 2016

What changes were proposed in this pull request?

We currently have the function input_file_name to get the path of the input file, but no functions to get the block start offset or block length. This patch introduces two functions:

  1. input_file_block_start: returns the file block start offset, or -1 if not available.

  2. input_file_block_length: returns the file block length, or -1 if not available.

How was this patch tested?

Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
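The -1 defaults above suggest a per-thread holder that the file reader populates and the expressions consult. A minimal plain-Scala sketch of that pattern (the object and method names here are illustrative, not Spark's actual InputFileBlockHolder API):

```scala
object InputFileBlockSketch {
  // (path, blockStart, blockLength); -1 means "not available".
  private val state = new ThreadLocal[(String, Long, Long)] {
    override def initialValue(): (String, Long, Long) = ("", -1L, -1L)
  }

  // Called by the reader before handing rows to downstream expressions.
  def set(path: String, start: Long, length: Long): Unit =
    state.set((path, start, length))

  // Called when the task is done with the file.
  def unset(): Unit = state.remove()

  // What the three expressions would read.
  def inputFileName: String = state.get()._1
  def blockStart: Long = state.get()._2
  def blockLength: Long = state.get()._3
}
```

Because the state is thread-local, concurrent tasks in the same JVM each see only their own file and block.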

@rxin
Copy link
Contributor Author

rxin commented Dec 4, 2016

cc @ueshin

-    val inputMetrics = context.taskMetrics().inputMetrics
-    val existingBytesRead = inputMetrics.bytesRead
+    private val inputMetrics = context.taskMetrics().inputMetrics
+    private val existingBytesRead = inputMetrics.bytesRead
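The diff above snapshots bytesRead once, at construction, so the iterator can later report only the bytes it read itself. A small plain-Scala sketch of that snapshot-and-delta pattern (BytesReadTracker is an illustrative name, not Spark code):

```scala
// Snapshot-and-delta pattern: record the metric's value at construction so
// later readings can be attributed to this reader alone.
class BytesReadTracker(currentBytesRead: () => Long) {
  private val existingBytesRead = currentBytesRead() // snapshot once
  def bytesReadByThisReader(): Long = currentBytesRead() - existingBytesRead
}
```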

// Sets the thread local variable for the file's name
(Member) commented:

nit. Updating comment together?

@ueshin (Member) commented Dec 4, 2016:

I agree, and the same applies to the comment in HadoopRDD.

@SparkQA commented Dec 4, 2016

Test build #69632 has finished for PR 16133 at commit 7713ebe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = spark.read.parquet(dir.getCanonicalPath).select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

Should we add input_file_block_start() and input_file_block_length() to functions.scala and use them the same way as input_file_name()?

@rxin (Contributor, author) commented:

I'm actually intentionally not adding those because I don't know how common these expressions will be.

(Member) commented:

I see.
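Had the two expressions been exposed in functions.scala (they were not, per the decision above), the wrappers would likely have followed the existing input_file_name() convention of building a Column around an expression. A stand-alone sketch with stub types (Expression, Column, and withExpr here are simplified stand-ins, not Spark's actual classes):

```scala
object functionsSketch {
  // Stub stand-ins so this sketch compiles without a Spark dependency.
  case class Expression(name: String)
  case class Column(expr: Expression)

  // Mirrors the withExpr helper pattern used in functions.scala.
  private def withExpr(e: Expression): Column = Column(e)

  // Hypothetical wrappers following the input_file_name() convention.
  def input_file_block_start(): Column = withExpr(Expression("InputFileBlockStart"))
  def input_file_block_length(): Column = withExpr(Expression("InputFileBlockLength"))
}
```

With such wrappers, the tests quoted below could call the functions directly instead of going through expr("input_file_block_start()").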

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

ditto.

checkAnswer(data.select(input_file_name()).limit(1), Row(""))
// Test the 3 expressions when reading from files
val q = df.select(
input_file_name(), expr("input_file_block_start()"), expr("input_file_block_length()"))
(Member) commented:

ditto.

@SparkQA commented Dec 4, 2016

Test build #69642 has finished for PR 16133 at commit 097bfec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* The thread variable for the name of the current file being read. This is used by
(Member) commented:

nit: this comment can be updated too.

@rxin (Contributor, author) commented Dec 5, 2016

Thanks - going to merge this in master.

@asfgit asfgit closed this in e9730b7 Dec 5, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017