[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV #29516
Conversation
CC maybe @HyukjinKwon and/or @dongjoon-hyun
```scala
if (options.isCommentSet) {
  val commentPrefix = options.comment.toString
  iter.filter { line =>
    val trimmed = line.trim
```
I just added this while I was here, to avoid trimming twice.
```diff
@@ -220,7 +220,9 @@ class CSVOptions(
     format.setQuote(quote)
     format.setQuoteEscape(escape)
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
-    format.setComment(comment)
+    if (isCommentSet) {
```
Arguably we should rework the handling of 'optional' configs to not use this default of `\u0000` to mean "none", but I avoided that here. One consequence is that you cannot use `\u0000` as a comment char right now.
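A minimal sketch of that `\u0000`-as-"unset" convention; the `parameters` map and exact definitions here are illustrative, not the verbatim CSVOptions code:

```scala
// Illustrative sketch of the '\u0000' sentinel discussed above.
// `parameters` stands in for the CSV options map; Spark's actual code may differ.
val parameters: Map[String, String] = Map.empty // e.g. Map("comment" -> "#")

val comment: Char =
  parameters.get("comment").map(_.charAt(0)).getOrElse('\u0000')

// '\u0000' doubles as "no comment char configured", which is why it
// can't currently be used as a real comment character.
val isCommentSet: Boolean = comment != '\u0000'
```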
If we change it that way, it might impact existing users for whom `\u0000` is the comment character by default. So I would say a separate optional config is a better solution. What I am saying here is that we need to wait for univocity 3.0.0 to be available, where the new changes will be included; then we can add the Spark changes in a proper manner.
You are correct, but, this has never been a valid comment character, and the flip side is the bug you describe: it's always a comment character. I think it's reasonable to fix as a bug. I don't think we need yet another config, as I think it would be quite obscure to use this non-printing control code for comments in a CSV file.
I agree, but once the changes are done, `\u0000` won't be treated as a comment character, which will resolve this bug. But then the default comment character will be `#`, as that is univocity's default. So if my data row starts with `#`, will the row still be processed? If not, it will break most existing jobs.
I agree, I'll fix that in the next commit - we need to set the comment char to whatever Spark is using no matter what. However, it looks like we are going to need your univocity fix to really fix this. It looks like that was just released in 2.9.0: uniVocity/univocity-parsers@f392311 - let me try that.
@dongjoon-hyun it is a correctness issue, but I wouldn't hold up a release for it. We should address it, but it doesn't absolutely have to happen in 2.4.7 or 3.0.1. It's not a regression.
Thanks, if you are fine with it, I can also raise a PR for this.
I would think this is rather a bug fix. If comment is not set, it shouldn't assume anything else is a comment.
That's also what we documented; see also DataFrameReader.csv.
Right, yeah, so we have to use the new method in univocity 2.9.0 to turn off its comment handling if it's unset in Spark (= `\u0000`).
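A hedged sketch of that wiring on the parser side. `setCommentProcessingEnabled` is the univocity 2.9.0 method referenced above; the `comment`/`isCommentSet` values stand in for the ones from CSVOptions, so treat this as illustrative rather than the actual Spark code:

```scala
import com.univocity.parsers.csv.CsvParserSettings

val comment = '\u0000'                   // stand-in for CSVOptions.comment
val isCommentSet = comment != '\u0000'

val settings = new CsvParserSettings()
if (isCommentSet) {
  settings.getFormat.setComment(comment) // honor Spark's configured char
} else {
  // Leave univocity's default ('#') alone but turn comment handling off,
  // so no line is treated as a comment when Spark's option is unset.
  settings.setCommentProcessingEnabled(false)
}
```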
Oh right, this stanza is for writer settings. There is no `setCommentProcessingEnabled` for writers in univocity; comments aren't generated on write. In fact the comment setting doesn't really matter here, does it?
Oh, this sounds like a kind of correctness issue. Did I understand correctly, @srowen?
cc @ScrapCodes for Apache Spark 2.4.7 and @zhengruifeng for Apache Spark 3.0.1 because this is a correctness issue.
Test build #127788 has finished for PR 29516 at commit
```diff
     }
   }

   def skipComments(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
     if (options.isCommentSet) {
       val commentPrefix = options.comment.toString
       iter.dropWhile { line =>
-        line.trim.isEmpty || line.trim.startsWith(commentPrefix)
+        line.trim.isEmpty || line.startsWith(commentPrefix)
```
I think it's correct to not trim the string that's checked to see if it starts with a comment, which is a slightly separate issue. `\u0000` can't be used as a comment char, but other non-printable chars could.
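To illustrate the behavioral difference (hypothetical input, not from the PR):

```scala
// With trimming, leading whitespace before the comment char also drops the
// line; without it, only lines that literally start with the char are skipped.
val commentPrefix = "#"
val line = "  #looks like a comment only after trimming"

line.trim.startsWith(commentPrefix) // true  -> old behavior: line dropped
line.startsWith(commentPrefix)      // false -> new behavior: line kept
```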
```diff
@@ -1902,25 +1902,26 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa

   test("SPARK-25387: bad input should not cause NPE") {
     val schema = StructType(StructField("a", IntegerType) :: Nil)
-    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+    val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))
```
I think this test was wrong in two ways. First, it relied on actually ignoring lines starting with `\u0000`, which is the very bug we're fixing. You can see below that it asserted there is no result at all, when there should be some result.
```diff
     checkAnswer(
       spark.read
         .option("columnNameOfCorruptRecord", "_corrupt_record")
         .schema(schema)
         .csv(input),
-      Row(null, null))
-    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+      Row(null, "\u0001\u0000\u0001234"))
```
The other problem I think is that this was asserting there is no corrupt record -- no result at all -- when I think clearly the test should result in a single row with a corrupt record.
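For reference, a sketch of the corrupt-record mechanics at play here, assuming an active SparkSession `spark`; this is illustrative, not the suite code verbatim:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types._

// The malformed line from the test above.
val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))(Encoders.STRING)

// Declare the corrupt-record column in the schema so a malformed line
// surfaces as Row(null, <original line>) instead of disappearing.
val schemaWithCorrupt = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

spark.read
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .csv(input)
  .show(truncate = false) // expect one row: [null, \u0001\u0000\u0001234]
```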
Test build #127793 has finished for PR 29516 at commit
Test build #127817 has finished for PR 29516 at commit
retest this please
BTW I think we may still have a real test failure here, I'm looking into it.
Test build #127824 has finished for PR 29516 at commit
Test build #127843 has finished for PR 29516 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
Looks good otherwise.
@HyukjinKwon did this go into 3.0 as well? I think it's meant to.
…for CSV

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

Nothing beyond the effect of the bug fix.

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a9d4e60)
Signed-off-by: HyukjinKwon <[email protected]>
Merged to master and branch-3.0. Yes, there was a minor conflict which I resolved manually.
Can this go in 2.4.7?
@tooptoop4 I think it could go into 2.4.x. Do you want to try a back-port PR to see if it picks cleanly and passes tests? There's a bump in the univocity version which I'd want to be sure doesn't change other behavior.
@tooptoop4 that change looks unrelated? There was also quite a bit of reason given.
Isn't that expected? Or can you set the comment char to something else?

> On Thu, Mar 4, 2021 at 5:41 PM koertkuipers wrote:
> this has unintended side effect of now dropping rows that start with #
> we ran into this because we had comments disabled but we noticed that rows
> in a csv that start with # were dropped
if in spark csv comment is not set (`isCommentSet` is false) then univocity should process with the comment feature disabled. per univocity documentation the way to do this is to set comment to `'\0'`, but what we are doing now instead is: if in spark comment is not set (`isCommentSet` is false) then we leave the default comment in univocity, which is `#`. i am still unsure how this is impacting us, but what i see is that when we disable the comment feature in spark csv, we see univocity quote values that start with `#`.
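A sketch of that writer-side quoting effect (univocity 2.9.x; the expected output is taken from the follow-up PR description below, so treat this as illustrative):

```scala
import java.io.StringWriter
import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

val settings = new CsvWriterSettings() // writer's comment char defaults to '#'
val out = new StringWriter()
val writer = new CsvWriter(out, settings)
writer.writeRow("#abc", "1")           // first value starts with the comment char
writer.close()
println(out)                           // univocity 2.9.x emits: "#abc",1
```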
Same issue as @koertkuipers. Even if explicitly setting comment to something else, we still can't select rows with a column value starting with the comment char.
I created uniVocity/univocity-parsers#505 to request optionally disabling quoting row-starting comment char. |
…ers set it explicitly in CSV dataSource

### What changes were proposed in this pull request?

Pass the comment option through to univocity if users set it explicitly in CSV dataSource.

### Why are the changes needed?

In #29516, in order to fix some bugs, univocity-parsers was upgraded from 2.8.3 to 2.9.0, which also brought in a new univocity-parsers feature: quoting values of the first column that start with the comment character. This was a breaking change for downstream users who handle a whole row as input.

Before this change: #abc,1
After this change: "#abc",1

We change the related `isCommentSet` check logic to let users keep the behavior as before.

### Does this PR introduce _any_ user-facing change?

Yes, a little. If users set the comment option as '\u0000' explicitly, now they should remove it to keep the comment option unset.

### How was this patch tested?

Add a full new test.

Closes #39878 from wayneguow/comment.

Authored-by: wayneguow <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

### Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

### Does this PR introduce _any_ user-facing change?

Nothing beyond the effect of the bug fix.

### How was this patch tested?

Existing tests plus new test case.
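A hedged reproduction sketch of the bug being fixed, assuming a local SparkSession; the input mirrors the null-char example discussed above, and everything else is illustrative:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// A line that begins with the null char, with no comment option set.
val input = spark.createDataset(Seq("\u0000a,1", "b,2"))(Encoders.STRING)

// Before this fix: the first line was silently dropped, because '\u0000'
// (the "unset" sentinel) was still passed to univocity as a comment char.
// After this fix: both lines are parsed as ordinary rows.
spark.read.csv(input).show()
```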