[FEA] Fuzz testing of CSV #6926

revans2 · 2022-10-27T14:20:46Z

Is your feature request related to a problem? Please describe.
CSV is not a well-defined standard. There are many different options for parsing values, escaping characters, and configuring delimiters. Because of this complexity, we should develop a fuzz testing framework to verify that our code behaves the same as Spark on the CPU. We should concentrate on the default settings:

format: UTF-8
delimiter: ,
quote: "
escape: \
lineSeparator: (not set so it is \r|\n|\r\n)
charToEscapeQuoteEscaping: not set
comment: \u0000 (aka not set)
ignoreLeadingWhiteSpace: false
ignoreTrailingWhiteSpace: false
emptyValue: (empty string)
unescapedQuoteHandling: STOP_AT_DELIMITER

A schema is also provided.

It would be great to expand this further in the future, but for now the default configuration is the most important case. The next thing to test would be changing the delimiter.
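As a rough illustration of the input-generation half of such a framework, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `random_field` and `random_csv`, the alphabet, and the row and column counts are hypothetical and not part of any existing test code. It deliberately makes no attempt to produce well-formed CSV, so that quotes, escapes, and line separators show up in awkward positions.

```python
import random

# Characters that are significant under the default settings listed above
# (delimiter, quote, escape, line separators, whitespace), plus a few
# ordinary characters. The mix is an arbitrary, illustrative choice.
SPECIAL = [',', '"', '\\', '\r', '\n', ' ', '\t']
ORDINARY = list("abc019-.")


def random_field(rng, max_len=8):
    """Return a raw field of random characters, with no quoting or escaping applied."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(SPECIAL + ORDINARY) for _ in range(length))


def random_csv(seed, rows=5, cols=3, delimiter=","):
    """Build CSV-like text from random fields; the same seed always gives the same text."""
    rng = random.Random(seed)
    lines = []
    for _ in range(rows):
        fields = [random_field(rng) for _ in range(cols)]
        lines.append(delimiter.join(fields))
    return "\n".join(lines) + "\n"
```

The seed makes any failing input reproducible. The comparison step is not shown: presumably each generated file would be read with the same schema on both the CPU and the GPU, and the resulting rows compared. The `delimiter` parameter is there so the same generator could later cover the non-default delimiter case.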


revans2 added the "feature request", "? - Needs Triage", and "test" labels on Oct 27, 2022

revans2 mentioned this issue on Oct 27, 2022 in [BUG] Fix CSV Parsing #2063 (open, 38 tasks)

sameerz removed the "? - Needs Triage" label on Nov 16, 2022
