[SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource #39878
@@ -283,6 +286,8 @@ class CSVOptions(
charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
if (isCommentSet) {
format.setComment(comment)
} else if (legacyDefaultUnicodeNullAsWrittenComment) {
If univocity-parsers#518 is merged, the code can be changed to:
else { writerSettings.setCommentProcessingEnabled(false) }
Wait, if the comment char is # then isn't quoting needed to avoid ambiguity? Or why not just set the comment char to something else on write if desired?
@srowen Actually, the real goal is that we don't want any character to be treated as a comment when writing CSV files, so that the output stays the same as the original. If we set it to another char, the first column of rows starting with that new char would be quoted, so the problem still exists.
But you set it to \u0000 here. The caller can already do that.
With the
I see, can we fix that instead and adopt this behavior? So I can set the comment char in Spark with similar semantics? It's just not clear why we need a different flag for this vs letting users select the comment, like Univocity does
What do you think of fixing it so that, if users set comment to '\u0000' explicitly, Spark passes it to univocity-parsers?
That seems simpler yeah, and means these semantics are available now, in non-'legacy' uses
Gentle ping @srowen The best way is to add a
LGTM. The behavior changes are more like bug fixes. Where someone has \u0000 in data they can pick another comment char that isn't used. Arguably we can achieve this while keeping the special case behavior for \u0000 for setCommentChar, but I actually like not special casing this. Hm is there a way now to not set any comment char in univocity?
As far as I know, there is currently no way to not set it.
Merged to master
What changes were proposed in this pull request?
Pass the comment option through to univocity if users set it explicitly in CSV dataSource.
Why are the changes needed?
In #29516, in order to fix some bugs, univocity-parsers was upgraded from 2.8.3 to 2.9.0. The upgrade also brought in a new univocity-parsers feature that quotes values in the first column when they start with the comment character. This was a breaking change for downstream users that handle a whole row as input.
Before this change:
#abc,1
After this change:
"#abc",1
We change the related isCommentSet check logic to enable users to keep the behavior as before.
Does this PR introduce any user-facing change?
Yes, a little. If users set the comment option to '\u0000' explicitly, they should now remove it to keep the comment option unset.
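To illustrate the pass-through semantics, here is a minimal, self-contained sketch. It is not the actual CSVOptions code and does not call univocity-parsers; the names `CommentOptionSketch`, `explicitComment`, `effectiveComment`, and `writeRow` are hypothetical. It assumes the writer's default comment character is '#' and that only a first-column value starting with the comment character gets quoted.

```scala
object CommentOptionSketch {
  // Assumption: the underlying writer falls back to '#' when no comment char is passed.
  val WriterDefaultComment: Char = '#'

  // The option counts as "set" only if the user supplied exactly one character,
  // including '\u0000'. This models the pass-through behavior discussed in this PR.
  def explicitComment(options: Map[String, String]): Option[Char] =
    options.get("comment").filter(_.length == 1).map(_.head)

  // Comment character the (modelled) writer ends up using.
  def effectiveComment(options: Map[String, String]): Char =
    explicitComment(options).getOrElse(WriterDefaultComment)

  // Simplified writer: quote the first column only when it starts with the comment char.
  def writeRow(fields: Seq[String], comment: Char): String = {
    val rendered = fields.zipWithIndex.map {
      case (value, 0) if value.nonEmpty && value.head == comment => "\"" + value + "\""
      case (value, _) => value
    }
    rendered.mkString(",")
  }
}
```

Under these assumptions, leaving the option unset writes the row `#abc`, `1` as `"#abc",1`, while setting `comment` to `"\u0000"` explicitly writes it as `#abc,1`.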
How was this patch tested?
Added a new test.