Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource #39878

Closed
wants to merge 4 commits into from

Conversation

wayneguow
Copy link
Contributor

@wayneguow wayneguow commented Feb 3, 2023

What changes were proposed in this pull request?

Pass the comment option through to univocity if users set it explicitly in CSV dataSource.

Why are the changes needed?

In #29516 , in order to fix some bugs, univocity-parsers was upgrade from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input.
Before this change:
#abc,1
After this change:
"#abc",1
We change the related isCommentSet check logic to enable users to keep behavior as before.

Does this PR introduce any user-facing change?

Yes, a little. If users set comment option as '\u0000' explicitly, now they should remove it to keep comment option unset.

How was this patch tested?

Add a full new test.

@github-actions github-actions bot added the SQL label Feb 3, 2023
@wayneguow
Copy link
Contributor Author

@@ -283,6 +286,8 @@ class CSVOptions(
charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
if (isCommentSet) {
format.setComment(comment)
} else if (legacyDefaultUnicodeNullAsWrittenComment) {
Copy link
Contributor Author

@wayneguow wayneguow Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If univocity-parsers#518 is merged, the code can be changed to:
else { writerSettings.setCommentProcessingEnabled(false) }

@srowen
Copy link
Member

srowen commented Feb 3, 2023

Wait, if the comment char is # then isn't quoting needed to avoid ambiguity? Or why not just set the comment char to something else on write if desired?

@wayneguow
Copy link
Contributor Author

Wait, if the comment char is # then isn't quoting needed to avoid ambiguity? Or why not just set the comment char to something else on write if desired?

@srowen Actually, the real goal is that we don't want any character to be used as a comment when writing CSV files, and just keeping output as the original. If set to another char, it would cause that the fist column of rows starting with the new char will be quoted, the problem still exists.

@srowen
Copy link
Member

srowen commented Feb 3, 2023

But you set it to \u0000 here. The caller can already do that.

@wayneguow
Copy link
Contributor Author

But you set it to \u0000 here. The caller can already do that.

With the isCommentSet condition:
val isCommentSet = this.comment != '\u0000'
when users set comment as '\u0000' explicitly, it can not be passed to univocity-parsers. The default value in univocity-parsers is '#'.

@srowen
Copy link
Member

srowen commented Feb 3, 2023

I see, can we fix that instead and adopt this behavior? So I can set the comment char in Spark with similar semantics? It's just not clear why we need a different flag for this vs letting users select the comment, like Univocity does

@wayneguow
Copy link
Contributor Author

wayneguow commented Feb 3, 2023

How do you think of fixing that if users set comment as '\u0000' explicitly, Spark passes it to univocity-parsers?
If you agree with it, I would change the related issue description and recommit some code for this pr.

@wayneguow wayneguow marked this pull request as draft February 3, 2023 14:42
@srowen
Copy link
Member

srowen commented Feb 3, 2023

That seems simpler yeah, and means these semantics are available now, in non-'legacy' uses

@wayneguow wayneguow changed the title [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource Feb 5, 2023
@wayneguow wayneguow marked this pull request as ready for review February 6, 2023 03:48
@wayneguow
Copy link
Contributor Author

Gentle ping @srowen
I have made some changes for isCommentSet , but it will bring some little differences with before. I put the behavior comparison in a table in SPARK-42335 's description. I'm not sure it's worth it.

The best way is to add a else condition like parsers after this pr univocity-parsers#518 is merged, but univocity-parsers seems to be inactive, there is no any update of for a long time.

@srowen
Copy link
Member

srowen commented Feb 7, 2023

LGTM. The behavior changes are more like bug fixes. Where someone has \u0000 in data they can pick another comment char that isn't used. Arguably we can achieve this while keeping the special case behavior for \u0000 for setCommentChar, but I actually like not special casing this. Hm is there a way now to not set any comment char in univocity?

@wayneguow
Copy link
Contributor Author

Hm is there a way now to not set any comment char in univocity?

As I know, there is no way to not set it currently.

@srowen
Copy link
Member

srowen commented Feb 8, 2023

Merged to master

@srowen srowen closed this in 67b6f0e Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants