
[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV #29516

Closed

Conversation


@srowen srowen commented Aug 22, 2020

What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of \0), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.
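The effect can be illustrated with a small sketch (plain Python with invented names; this is not Spark's actual code):

```python
# Sketch of the bug: '\u0000' is the sentinel meaning "comment disabled",
# but code paths that skip the isCommentSet check still apply it as a
# live comment character.
COMMENT_UNSET = "\u0000"

def filter_comments_buggy(lines, comment=COMMENT_UNSET):
    # Bug: applies comment filtering unconditionally, so the sentinel
    # '\u0000' acts as a real comment character.
    return [line for line in lines if not line.startswith(comment)]

def filter_comments_fixed(lines, comment=COMMENT_UNSET):
    # Fix: only filter when the comment option is actually set.
    if comment == COMMENT_UNSET:
        return list(lines)
    return [line for line in lines if not line.startswith(comment)]

rows = ["\u0000abc,1", "def,2"]
print(filter_comments_buggy(rows))  # ['def,2'] -- first row silently dropped
print(filter_comments_fixed(rows))  # both rows kept
```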

Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

Does this PR introduce any user-facing change?

Nothing beyond the effect of the bug fix.

How was this patch tested?

Existing tests plus new test case.

@srowen srowen self-assigned this Aug 22, 2020

@srowen srowen left a comment


CC maybe @HyukjinKwon and/or @dongjoon-hyun

if (options.isCommentSet) {
  val commentPrefix = options.comment.toString
  iter.filter { line =>
    val trimmed = line.trim
Member Author


I just added this while I was here, to avoid trimming twice.

@@ -220,7 +220,9 @@ class CSVOptions(
     format.setQuote(quote)
     format.setQuoteEscape(escape)
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
-    format.setComment(comment)
+    if (isCommentSet) {
Member Author


Arguably we should rework the handling of 'optional' configs to not use this default of \u0000 to mean "none", but I avoided that here. One consequence is that you cannot currently use \u0000 as a comment char.


@codealways codealways Aug 23, 2020


If we change it that way, it might impact existing users for whom \u0000 is the comment character by default. So I would say a separate optional config is a better solution. What I am saying here is that we need to wait for univocity 3.0.0, where the new changes will be available; then we can add the Spark changes in a proper manner.

Member Author


You are correct, but this has never been a valid comment character, and the flip side is the bug you describe: it's always a comment character. I think it's reasonable to fix as a bug. I don't think we need yet another config, as I think it would be quite obscure to use this non-printing control code for comments in a CSV file.


I agree, but once the changes are done, \u0000 won't be treated as a comment character, which resolves this bug. But then the default comment character will be #, as that is univocity's default comment character. So if a data row starts with #, will the row be processed now? If not, it will break many existing jobs.

Member Author


I agree; I'll fix that in the next commit. We need to set the comment char to whatever Spark is using no matter what. However, it looks like we are going to need your univocity fix to really fix this. It looks like that was just released in 2.9.0: uniVocity/univocity-parsers@f392311

Let me try that.

@dongjoon-hyun it is a correctness issue, but I wouldn't hold up a release for it. We should address it, but it doesn't absolutely have to happen in 2.4.7 or 3.0.1. It's not a regression.


Thanks; if you are fine with it, I can also raise a PR for this.

Member


I would think this is rather a bug fix. If comment is not set, it shouldn't assume anything else is a comment.

Member


That's also what we documented, see also DataFrameReader.csv.

Member Author


Right, yeah, so we have to use the new method in univocity 2.9.0 to turn off its comment handling if it's unset in Spark (= \u0000).

Member Author


Oh right, this stanza is for writer settings. There is no setCommentProcessingEnabled for writers in univocity; comments aren't generated. In fact the comment setting doesn't really matter here?

@dongjoon-hyun
Member

Oh, this sounds like a kind of correctness issue. Did I understand correctly, @srowen ?

@dongjoon-hyun
Member

cc @ScrapCodes for Apache Spark 2.4.7 and @zhengruifeng for Apache Spark 3.0.1 because this is a correctness issue.

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127788 has finished for PR 29516 at commit 6358727.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen changed the title [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV Aug 23, 2020
def skipComments(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
  if (options.isCommentSet) {
    val commentPrefix = options.comment.toString
    iter.dropWhile { line =>
-     line.trim.isEmpty || line.trim.startsWith(commentPrefix)
+     line.trim.isEmpty || line.startsWith(commentPrefix)
Member Author


I think it's correct not to trim the string that's checked to see if it starts with a comment, which is a slightly separate issue. \u0000 can't be used as a comment char, but other non-printable chars could be.
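A tiny sketch of the point (plain Python, names are mine): trimming before the startsWith check would strip a whitespace-like comment char before it can be matched.

```python
def is_comment(line: str, prefix: str) -> bool:
    # Check the raw line; trimming first would strip a whitespace-like
    # comment char (e.g. '\t') before the startswith test can see it.
    return line.startswith(prefix)

line = "\tthis line is a comment"
assert is_comment(line, "\t")             # matched on the raw line
assert not line.strip().startswith("\t")  # trimming first loses the match
```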

@@ -1902,25 +1902,26 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa

 test("SPARK-25387: bad input should not cause NPE") {
   val schema = StructType(StructField("a", IntegerType) :: Nil)
-  val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+  val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))
Member Author


I think this test was wrong in two ways. First, it relied on ignoring lines starting with \u0000, which is the very bug we're fixing. You can see below that it asserted there is no result at all, when there should be some result.


 checkAnswer(
   spark.read
     .option("columnNameOfCorruptRecord", "_corrupt_record")
     .schema(schema)
     .csv(input),
-  Row(null, null))
-assert(spark.read.csv(input).collect().toSet == Set(Row()))
+  Row(null, "\u0001\u0000\u0001234"))
Member Author


The other problem, I think, is that this asserted there is no corrupt record (no result at all), when clearly the test should result in a single row with a corrupt record.

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127793 has finished for PR 29516 at commit 87e8b65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127817 has finished for PR 29516 at commit f3d14c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

retest this please

@srowen
Member Author

srowen commented Aug 24, 2020

BTW I think we may still have a real test failure here; I'm looking into it.

@SparkQA

SparkQA commented Aug 24, 2020

Test build #127824 has finished for PR 29516 at commit f3d14c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2020

Test build #127843 has finished for PR 29516 at commit b685f1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@HyukjinKwon HyukjinKwon left a comment


Looks good otherwise.

@srowen srowen changed the title [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV Aug 25, 2020
@srowen
Member Author

srowen commented Aug 25, 2020

@HyukjinKwon did this go into 3.0 as well? I think it's meant to.

HyukjinKwon pushed a commit that referenced this pull request Aug 25, 2020
…for CSV

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

Nothing beyond the effect of the bug fix.

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a9d4e60)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Member

Merged to master and branch-3.0.

Yes, there was a minor conflict which I resolved manually.

@tooptoop4
Contributor

Can this go in 2.4.7?

@srowen
Member Author

srowen commented Aug 28, 2020

@tooptoop4 I think it could go into 2.4.x. Do you want to try a back-port PR to see if it picks cleanly and passes tests? There's a bump in the univocity version, which I'd want to be sure doesn't change other behavior.

@tooptoop4
Contributor

@srowen I would have, but my last PR (#27697) got shot down for no reason.

@srowen
Member Author

srowen commented Aug 29, 2020

@tooptoop4 that change looks unrelated? There was also quite a bit of reason given.
I don't think it's relevant to creating a back-port, which I am saying is worth evaluating.

@srowen srowen deleted the SPARK-32614 branch September 12, 2020 21:11
@srowen
Member Author

srowen commented Mar 5, 2021 via email

Isn't that expected? Or can you set the comment char to something else?

On Thu, Mar 4, 2021 at 5:41 PM koertkuipers wrote: "this has unintended side effect of now dropping rows that start with # we ran into this because we had comments disabled but we noticed that rows in a csv that start with # were dropped"

@koertkuipers
Contributor

If in Spark CSV the comment is not set (isCommentSet is false), then univocity should process with the comment feature disabled. Per the univocity documentation, the way to do this is to set the comment to \0. I realize this seems to be not working exactly as desired, although I do not yet fully grasp how or why.

But what we are doing now instead is: if in Spark the comment is not set (isCommentSet is false), then we leave the default comment in univocity, which is #. That is not the same as unsetting/disabling the comment feature. I feel like this might be confusing and maybe can also have unintended consequences?

I am still unsure how this is impacting us, but what I see is that when we disable the comment feature in Spark CSV, univocity quotes values that start with # upon writing. Since we are generating bar-delimited output for systems that do not support quotes, this causes trouble for us.
We had actually disabled quote (which sets it to \0), leading to # becoming \0#\0 upon writing, which then in older versions of Spark was considered a comment line and got dropped! So I spent a few days going down the rabbit hole of trying to understand why we were losing records... it came down to this change in behavior here.
You are right that for my particular issue this can be fixed by explicitly setting the comment to something else (as long as it's not \0).
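A rough sketch of the failure mode described above (plain Python with invented names, not univocity's actual API): with univocity's default '#' comment char left active, the writer quotes values starting with '#'; if the quote char was "disabled" by setting it to \0, the written row starts with \u0000, which pre-fix readers treated as a comment line and dropped.

```python
QUOTE_DISABLED = "\u0000"

def write_field(value, comment_char="#", quote_char=QUOTE_DISABLED):
    # The writer quotes any first-column value that starts with the
    # active comment char; with quote "disabled" (set to '\u0000'),
    # the quotes become null chars.
    if value.startswith(comment_char):
        return quote_char + value + quote_char
    return value

written = write_field("#abc")
assert written == "\u0000#abc\u0000"

def pre_fix_reader_keeps(line, comment_char="\u0000"):
    # Before the fix, readers treated '\u0000' as a live comment char.
    return not line.startswith(comment_char)

assert not pre_fix_reader_keeps(written)  # the row is silently dropped
```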

@manuzhang
Contributor

Same issue as @koertkuipers. Even when explicitly setting the comment to something else, we still can't select rows with a column value '#' in Spark 2.3.1 (univocity-parsers 2.5.9):

select * from table where column = '#'

@manuzhang
Contributor

I created uniVocity/univocity-parsers#505 to request optionally disabling the quoting of a row-starting comment char.

srowen pushed a commit that referenced this pull request Feb 8, 2023
…ers set it explicitly in CSV dataSource

### What changes were proposed in this pull request?
Pass the comment option through to univocity if users set it explicitly in CSV dataSource.

### Why are the changes needed?
In #29516, in order to fix some bugs, univocity-parsers was upgraded from 2.8.3 to 2.9.0. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column that start with the comment character, which was a breaking change for downstream users handling a whole row as input.
Before this change:
#abc,1
After this change:
"#abc",1
We change the related `isCommentSet` check logic to let users keep the previous behavior.

### Does this PR introduce _any_ user-facing change?
Yes, a little. If users previously set the comment option to '\u0000' explicitly, they should now remove it to keep the comment option unset.

### How was this patch tested?
Add a full new test.

Closes #39878 from wayneguow/comment.

Authored-by: wayneguow <[email protected]>
Signed-off-by: Sean Owen <[email protected]>