Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23049][SQL] spark.sql.files.ignoreCorruptFiles should work for ORC files #20240

Closed
wants to merge 5 commits into from
Closed

Conversation

dongjoon-hyun
Copy link
Member

What changes were proposed in this pull request?

When spark.sql.files.ignoreCorruptFiles=true, we should ignore corrupted ORC files.

How was this patch tested?

Pass the Jenkins with a newly added test case.

@@ -608,4 +609,33 @@ class OrcQuerySuite extends OrcQueryTest with SharedSQLContext {
}
}
}

test("Enabling/disabling ignoreCorruptFiles") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the same test coverage with a Parquet test case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about hive.orc format?

new Path(basePath, "third").toString)
checkAnswer(
df,
Seq(Row(0), Row(1)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this inlined?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thank you for review, @HyukjinKwon .
It comes from Parquet test suite. So, I keep the same shape so far.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll inline it together.

}
} catch {
case e: org.apache.orc.FileFormatException =>
if (true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if true ...?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jan 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. My bad.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #85995 has finished for PR 20240 at commit 96a2196.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile and @cloud-fan .
Could you review this, too?

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #85999 has finished for PR 20240 at commit 33ae3ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86014 has finished for PR 20240 at commit 33ae3ca.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86038 has finished for PR 20240 at commit 33ae3ca.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86047 has finished for PR 20240 at commit 33ae3ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86055 has finished for PR 20240 at commit 33ae3ca.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86079 has finished for PR 20240 at commit 33ae3ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

testIgnoreCorruptFiles()
}

withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a test case when the schema is explicitly set too while we are here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll update Parquet test case together in order to keep the same test coverage.

@HyukjinKwon
Copy link
Member

LGTM otherwise.

@gatorsmile
Copy link
Member

We need to address both ORC formats.

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @gatorsmile . Sure!

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86098 has finished for PR 20240 at commit f2e1e35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86102 has finished for PR 20240 at commit 1432057.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86107 has finished for PR 20240 at commit 277e6f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @HyukjinKwon and @gatorsmile .
It's ready for review again. Thanks!

@HyukjinKwon
Copy link
Member

LGTM

asfgit pushed a commit that referenced this pull request Jan 15, 2018
…or ORC files

## What changes were proposed in this pull request?

When `spark.sql.files.ignoreCorruptFiles=true`, we should ignore corrupted ORC files.

## How was this patch tested?

Pass the Jenkins with a newly added test case.

Author: Dongjoon Hyun <[email protected]>

Closes #20240 from dongjoon-hyun/SPARK-23049.

(cherry picked from commit 9a96bfc)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3!

@asfgit asfgit closed this in 9a96bfc Jan 15, 2018
@dongjoon-hyun
Copy link
Member Author

Thank you so much for considering this at the last minute! @cloud-fan , @HyukjinKwon , and @gatorsmile .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23049 branch January 15, 2018 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants