Describe the bug

I am not totally sure if this is a bug or an epic, so I have marked it as both. CSV parsing has been enabled by default, but it has a lot of inconsistencies with what Spark does, and this can cause problems. The current plan is to mitigate this by disabling CSV by default and then working through the issues until we can enable it again everywhere.

Important Issues:

- [BUG] CSV parsing of malformed lines is empty string not null #2068 — cudf interprets missing values at the end of a line as empty strings, while Spark interprets them as null values. This is fine when the configured null value is the default (an empty string), but it is a problem if you set it to anything else.
- `columnNameOfCorruptRecord`
- Dates and timestamps need to deal with `spark.sql.legacy.timeParserPolicy` when parsing CSV files #1111

Investigate:

- Header with lots of comments at the beginning. We currently play games with turning off header checking in all but the first partition of a file. This is fine so long as the first partition contains more than just comments. This is minor, and possibly a bug in Spark as well.
- Support alternate character sets (the `encoding` option).
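As a rough illustration of the #2068 divergence above, here is a minimal plain-Python sketch (not the plugin's real code; the `parse_line` helper, its parameters, and the `"NULL"` marker are invented for this example) showing why the empty-string behavior only matches Spark when the null value is left at its default:

```python
import csv
import io

def parse_line(line, num_cols, null_value="", spark_like=True):
    """Parse one CSV line against a fixed column count.

    Spark-like behavior: columns missing at the end of the line become None.
    cudf-like behavior (the bug): missing columns become empty strings, which
    only matches Spark when null_value is itself the empty string.
    """
    fields = next(csv.reader(io.StringIO(line)))
    missing = num_cols - len(fields)
    if spark_like:
        fields += [None] * missing   # Spark: missing trailing values -> null
    else:
        fields += [""] * missing     # cudf: missing trailing values -> empty string
    # Fields equal to the configured null marker also become null.
    return [None if f == null_value else f for f in fields]

# With a non-default null marker the two behaviors diverge:
parse_line("a,b", 3, null_value="NULL", spark_like=True)   # ['a', 'b', None]
parse_line("a,b", 3, null_value="NULL", spark_like=False)  # ['a', 'b', '']
```

With the default (empty-string) null marker the empty string is itself converted to null, which is why the inconsistency stayed hidden until `nullValue` was changed.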
Low Priority Issues:
- Support `FAILFAST` mode. By default, when Spark sees malformed data it converts it to null (`PERMISSIVE` mode). In `FAILFAST` mode it throws an exception instead. There is also a `DROPMALFORMED` mode that is supposed to drop bad data, but it looks like CSV does not support it.
- `positiveInf`, `negativeInf`, and `nanValue` #4644
- `multiLine` support

Tests:
- Add tests for `enforceSchema` set to false. TODO: file issues. Oddly, for Spark, setting this to false results in more schema enforcement; the "enforce" here really means forcing the configured schema onto the files. It looks like this works for our plugin, but we want tests to verify it.
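For reference, the three parse modes discussed under Low Priority Issues can be sketched in plain Python (this is only an illustration of the intended semantics; the `parse_rows` helper and its row format are invented here, not the plugin's or Spark's real API):

```python
def parse_rows(lines, num_cols, mode="PERMISSIVE"):
    """Illustrate PERMISSIVE / DROPMALFORMED / FAILFAST handling of bad rows."""
    rows = []
    for line in lines:
        fields = line.split(",")
        malformed = len(fields) != num_cols
        if not malformed:
            rows.append(fields)
        elif mode == "PERMISSIVE":
            rows.append([None] * num_cols)   # malformed row -> all nulls
        elif mode == "DROPMALFORMED":
            continue                          # silently drop the row
        elif mode == "FAILFAST":
            raise ValueError(f"Malformed CSV record: {line!r}")
    return rows

data = ["1,2", "oops", "3,4"]
parse_rows(data, 2)                      # PERMISSIVE: bad row becomes nulls
parse_rows(data, 2, "DROPMALFORMED")     # bad row silently dropped
# parse_rows(data, 2, "FAILFAST")        # raises ValueError on "oops"
```

Real Spark `PERMISSIVE` mode also records the raw record in the column named by `columnNameOfCorruptRecord`, which this sketch omits.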