
[SPARK-27259][CORE] Allow setting -1 as length for FileBlock #26123

Closed
wants to merge 3 commits into from

Conversation

praneetsharma

@praneetsharma praneetsharma commented Oct 15, 2019

What changes were proposed in this pull request?

This PR aims to update the validation check on `length` from `length >= 0` to `length >= -1` in order to allow setting `-1` to keep the default value.

Why are the changes needed?

In Apache Spark 2.2.0, SPARK-18702 added `class FileBlock` with a default `length` value of `-1`.

There is no way to set `filePath` alone while keeping `length` at `-1`.

```scala
  def set(filePath: String, startOffset: Long, length: Long): Unit = {
    require(filePath != null, "filePath cannot be null")
    require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
    require(length >= 0, s"length ($length) cannot be negative")
    inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
  }
```

For compressed files (like GZ), the split size can be set to -1. This was allowed through Spark 2.1 but regressed starting with Spark 2.2.x. Note that a split length of -1 also means the length is unknown, which is a valid scenario. Thus, a split length of -1 should be accepted, as it was before Spark 2.2.
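The fix itself is a one-line relaxation of the check so that `-1` ("length unknown") passes validation. A minimal sketch of the relaxed check, with the surrounding `InputFileBlockHolder` plumbing omitted and the object and method names here purely illustrative:

```scala
object SplitLengthCheck {
  // Relaxed validation: -1 is accepted as "length unknown",
  // matching FileBlock's default length value. Anything below
  // -1 is still rejected as a genuinely invalid length.
  def validateLength(length: Long): Unit =
    require(length >= -1, s"length ($length) cannot be smaller than -1")
}
```

With this check, `validateLength(-1L)` no longer throws `IllegalArgumentException`, while `validateLength(-2L)` still does.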

Does this PR introduce any user-facing change?

No

How was this patch tested?

This updates a corner case in the requirement check. Manually checked the code.

For compressed files, the split size can be set to -1. This was
allowed through Spark 2.1 but regressed starting with Spark 2.2.x

This commit removes the validation on length being at least 0.
@praneetsharma praneetsharma changed the title Fix for SPARK-27259 SPARK-27259 Fix Oct 15, 2019
@praneetsharma praneetsharma changed the title SPARK-27259 Fix [SPARK-27259][CORE] Allow setting -1 as split size for InputFileBlock Oct 15, 2019
@dongjoon-hyun
Member

dongjoon-hyun commented Oct 15, 2019

Hi, @praneetsharma .
Could you give us the valid reference for the following claim?

Please note that split length of -1 also means the length was unknown - a valid scenario.

I'm just wondering if we have an official reference for the following.

Certain custom InputFormats set split size to be -1 for compressed files like GZ.

@dongjoon-hyun
Member

ok to test

Incorporating comment of modifying the validation to be `length >= -1`
Formatting fix
@SparkQA

SparkQA commented Oct 16, 2019

Test build #112132 has finished for PR 26123 at commit c67bcf3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112133 has finished for PR 26123 at commit 840f1ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112131 has finished for PR 26123 at commit 53ae391.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @praneetsharma .
Merged to master/3.0-preview/2.4.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27259][CORE] Allow setting -1 as split size for InputFileBlock [SPARK-27259][CORE] Allow setting -1 as length for FileBlock Oct 16, 2019
dongjoon-hyun pushed a commit that referenced this pull request Oct 16, 2019
### What changes were proposed in this pull request?

This PR aims to update the validation check on `length` from `length >= 0` to `length >= -1` in order to allow setting `-1` to keep the default value.

### Why are the changes needed?

In Apache Spark 2.2.0, [SPARK-18702](https://github.com/apache/spark/pull/16133/files#diff-2c5519b1cf4308d77d6f12212971544fR27-R38) added `class FileBlock` with a default `length` value of `-1`.

There is no way to set `filePath` alone while keeping `length` at `-1`.
```scala
  def set(filePath: String, startOffset: Long, length: Long): Unit = {
     require(filePath != null, "filePath cannot be null")
     require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
     require(length >= 0, s"length ($length) cannot be negative")
     inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
   }
```

For compressed files (like GZ), the split size can be set to -1. This was allowed through Spark 2.1 but regressed starting with Spark 2.2.x. Note that a split length of -1 also means the length is unknown, which is a valid scenario. Thus, a split length of -1 should be accepted, as it was before Spark 2.2.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This updates a corner case in the requirement check. Manually checked the code.

Closes #26123 from praneetsharma/fix-SPARK-27259.

Authored-by: prasha2 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 57edb42)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Oct 16, 2019