[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data #16928
Conversation
Test build #72877 has finished for PR 16928 at commit
@HyukjinKwon Could you check this and give me any insights before the committers do?
@maropu, I just ran some similar tests with the JSON datasource. What do you think about matching it to JSON's behaviour by introducing columnNameOfCorruptRecord? I ran with the data and schema as below:

Seq("""{"a": "a", "b" : 1}""").toDF().write.text("/tmp/path")
val schema = StructType(StructField("a", IntegerType, true) :: StructField("b", StringType, true) :: StructField("_corrupt_record", StringType, true) :: Nil)

scala> spark.read.schema(schema).option("mode", "FAILFAST").json("/tmp/path").show()
org.apache.spark.sql.catalyst.json.SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {"a": "a", "b" : 1}

scala> spark.read.schema(schema).option("mode", "DROPMALFORMED").json("/tmp/path").show()
+---+---+---------------+
| a| b|_corrupt_record|
+---+---+---------------+
+---+---+---------------+

scala> spark.read.schema(schema).option("mode", "PERMISSIVE").json("/tmp/path").show()
+----+----+-------------------+
| a| b| _corrupt_record|
+----+----+-------------------+
|null|null|{"a": "a", "b" : 1}|
+----+----+-------------------+
Aha, looks good to me. Just a sec, and I'll modify the code.
(force-pushed from 5486f5d to f09a899)
cc @cloud-fan, could I ask if you think matching it seems reasonable? I wrote some details below for you to track this issue easily.

Currently, JSON produces

Seq("""{"a": "a", "b" : 1}""").toDF().write.text("/tmp/path")
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", StringType, true) ::
  StructField("_corrupt_record", StringType, true) :: Nil)
spark.read.schema(schema)
  .option("mode", "PERMISSIVE")
  .json("/tmp/path").show()

+----+----+-------------------+
| a| b| _corrupt_record|
+----+----+-------------------+
|null|null|{"a": "a", "b" : 1}|
+----+----+-------------------+

whereas CSV produces

Seq("""a,1""").toDF().write.text("/tmp/path")
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", StringType, true) :: Nil)
spark.read.schema(schema)
  .option("mode", "PERMISSIVE")
  .csv("/tmp/path").show()

java.lang.NumberFormatException: For input string: "a"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:580)

To cut it short, the problem is not parsing itself but the case when a value cannot be converted. In the case of JSON, it fills the value in the column specified in columnNameOfCorruptRecord.

It seems there are two ways to fix this. One is what @maropu initially suggested - permissively filling it as null. The other way, which I suggested here, is matching it to JSON's behaviour: storing the value in the column specified in columnNameOfCorruptRecord. JSON schema inference produces this column when the JSON is malformed, as below:

Seq("""{"a": "a", "b" :""").toDF().write.text("/tmp/test123")
spark.read.json("/tmp/test123").printSchema

root
 |-- _corrupt_record: string (nullable = true)
Let me try my best to review further once it is decided.
I also keep considering other ways to fix this...
Test build #72971 has finished for PR 16928 at commit
Test build #72969 has finished for PR 16928 at commit
(force-pushed from f09a899 to df39e39)
Test build #72991 has finished for PR 16928 at commit
Jenkins, retest this please.
Test build #72993 has finished for PR 16928 at commit
Definitely, we should match the behavior of JSON.
Thanks, let me review further by tomorrow.
@maropu, I left my opinions on the code. I hope they make sense.
@@ -101,6 +101,12 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
    val broadcastedHadoopConf =
      sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))

    val columnNameOfCorruptRecord = csvOptions.columnNameOfCorruptRecord.getOrElse(sparkSession
      .sessionState.conf.columnNameOfCorruptRecord)
Maybe, we could put columnNameOfCorruptRecord as an argument in CSVOptions, consistently with JSONOptions. I guess then it won't require the changes between 129L-133L as well.
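For reference, a rough sketch of what that suggestion could look like; the exact parameter list and defaults here are assumptions, not the final code:
```
// Hypothetical shape only: pass the session-level default into CSVOptions,
// mirroring JSONOptions, so call sites don't have to look it up separately.
private[csv] class CSVOptions(
    parameters: Map[String, String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String) {

  // A per-read option overrides the session-level default.
  val columnNameOfCorruptRecord: String =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)
}
```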
@@ -222,12 +250,6 @@ private[csv] class UnivocityParser(
        logWarning("Parse exception. " +
          s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
      }
      if (numMalformedRecords == options.maxMalformedLogPerPartition - 1) {
Maybe I missed something, but do you mind elaborating on why this was removed?
Oh, my bad. I'll revert this part.
          row(i) = value
        }
      } catch {
        case _: NumberFormatException | _: IllegalArgumentException
Could we maybe decouple this parse-mode handling logic into convertWithParseMode?
  private val parser = new CsvParser(options.asParserSettings)

  private var numMalformedRecords = 0

  private val row = new GenericInternalRow(requiredSchema.length)

  private val indexArr: Array[Int] = {
  private val shouldHandleCorruptRecord = columnNameOfCorruptRecord.isDefined
  private val corruptIndex = columnNameOfCorruptRecord.flatMap { fn =>
It is trivial, but maybe

private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)

if possible.
    fields.map(schema.indexOf(_: StructField)).toArray
    fields.zipWithIndex.filter { case (_, i) => i != corruptIndex }.map { case (f, i) =>
      (inputSchema.indexOf(f), i)
    }.toArray
Let me check out this PR soon and see if this bit can be cleaner, but it would be even nicer if you could double-check, please.
@@ -26,7 +26,7 @@ import org.apache.commons.lang3.time.FastDateFormat
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs, ParseModes}

private[csv] class CSVOptions(
private[sql] class CSVOptions(
The execution package itself is meant to be private, according to 511f52f. We could just remove this if it should be accessible outside the csv package.
@HyukjinKwon okay, thanks! I'll check soon
(force-pushed from df39e39 to 873a383)
Test build #73109 has finished for PR 16928 at commit
@maropu, I left some more comments that might be helpful. Let me take a further look in the following few days.
    val parsedOptions = new CSVOptions(
      options,
      sparkSession.sessionState.conf.sessionLocalTimeZone,
      sparkSession.sessionState.conf.columnNameOfCorruptRecord)
(It seems CSVOptions is created twice above :)).
Fixed
@@ -95,6 +104,9 @@ private[csv] class CSVOptions(
  val dropMalformed = ParseModes.isDropMalformedMode(parseMode)
  val permissive = ParseModes.isPermissiveMode(parseMode)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)
Maybe, we should add this in readwriter.py too and document it in readwriter.py, DataFrameReader and DataStreamReader.
Added doc descriptions in readwriter.py, DataFrameReader, and DataStreamReader.
  private val parser = new CsvParser(options.asParserSettings)

  private var numMalformedRecords = 0

  private val row = new GenericInternalRow(requiredSchema.length)

  private val indexArr: Array[Int] = {
  private val corruptIndex =
    requiredSchema.getFieldIndex(options.columnNameOfCorruptRecord).getOrElse(-1)
Could we make this an Option and use it?

private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)
corruptFieldIndex.foreach { idx =>
  require(schema(idx).dataType == StringType, "blabla")
  require(schema(idx).nullable, "blabla")
}
        // (e.g., SparkException about unsupported types).
        case _: NumberFormatException | _: IllegalArgumentException
          if options.permissive && shouldHandleCorruptRecord =>
          row(corruptIndex) = UTF8String.fromString(tokens.mkString(options.delimiter.toString))
Ah, so it had to be manually concatenated. I think it is okay to rename convertWithParseMode to parseWithMode and make it take input: String as an argument. I worry that the original input might be different from what is stored in the corrupt column.
Then, we would be able to do something as below:

case NonFatal(e) if options.permissive =>
  val row = new GenericInternalRow(requiredSchema.length)
  corruptFieldIndex.foreach { idx =>
    row(idx) = UTF8String.fromString(input)
  }
I think catching all NonFatal seems consistent, but could you check if the behaviour is consistent with JSON for unsupported types in PERMISSIVE mode? If we should throw an exception, we could utilize CSVUtils.verifySchema to throw one ahead of time.
Aha, it seems reasonable to me and I'll re-check again.
I added CSVUtils.verifySchema at the head of buildReader, then modified the code above to handle all the NonFatal exceptions in permissive mode in parseWithMode.
      var i: Int = 0
      while (i < indexArr.length) {
        val pos = indexArr(i)
      indexArr.foreach { case (pos, i) =>
(Oh, BTW, we should just use while in the critical path if we are not sure the byte code is virtually the same or more efficient. This is also written as "Use while loops instead of for loops or functional transformations (e.g. map, foreach)" in scala-style-guide#traversal-and-zipwithindex.)
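(For reference, a minimal sketch of the while-loop form, assuming indexArr holds (position-in-tokens, position-in-row) pairs as in the diff above; convertTo is a hypothetical stand-in for the per-field converter, not an actual method name.)
```
// Hot-path iteration with a plain while loop instead of foreach, so no closure
// is allocated per row. `convertTo` is a placeholder for the real conversion.
var i = 0
while (i < indexArr.length) {
  val (pos, rowIdx) = indexArr(i)
  row(rowIdx) = convertTo(tokens(pos), requiredSchema(rowIdx).dataType)
  i += 1
}
```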
(off-topic: oh, I found the style-guide has great Chinese and Korean versions!)
fixed
I'll update in a day, thanks!
Test build #73166 has finished for PR 16928 at commit
(force-pushed from bf286c7 to 4df4bc6)
Test build #73167 has finished for PR 16928 at commit
…` field violates requirements in json formats

## What changes were proposed in this pull request?
This pr comes from #16928 and fixed a json behaviour along with the CSV one.

## How was this patch tested?
Added tests in `JsonSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes #17023 from maropu/SPARK-19695.
python/pyspark/sql/readwriter.py
Outdated
@@ -367,10 +368,18 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
        If None is set, it uses the default value, session local timezone.

        * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record.
            When a schema is set by user, it sets ``null`` for extra fields.
            If users set a string type field named ``columnNameOfCorruptRecord`` in a
Can we just copy-paste the doc from json? Or you can make some changes, but please make them consistent.
okay
    require(schema(corrFieldIndex).nullable)
  }

  private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
JSON doesn't do this; why the difference? cc @HyukjinKwon
It is because parsing CSV depends on the order and length of the schema and tokens. In the case of JSON, fields can simply be mapped by key, but for CSV it depends on the order and length of the schema. So, this filters the corrupt field out in order to match the data schema with the parsed tokens.
(For example, this code path is dependent on the length of the data schema. It drops/adds tokens after comparing the length of the data schema with the number of tokens. If we kept the corrupt column, the schema length would differ from the token count.)
      } else {
        tokens
      }

      try {
        Some(convert(checkedTokens))
      } catch {
        case NonFatal(e) if options.permissive =>
          val row = new GenericInternalRow(requiredSchema.length)
          corruptFieldIndex.map(row(_) = UTF8String.fromString(input))
map -> foreach
BTW, I would like to note that this is actually an important pattern to avoid; it could cause bugs that are hard to figure out (see SPARK-16694).
Aha, fixed. Thanks!
      val checkedTokens = if (options.permissive && inputSchema.length > tokens.length) {
        tokens ++ new Array[String](inputSchema.length - tokens.length)
      } else if (options.permissive && inputSchema.length < tokens.length) {
        tokens.take(inputSchema.length)
Not related to this PR, but can you add some comments for this code block? It's kind of hard to follow the logic here.
okay, I'll update soon
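For reference, a commented sketch of the block quoted above; inputSchema is assumed to be the data schema without the corrupt column, and tokens the raw strings from the parser:
```
// Normalize the token count to the data schema length in PERMISSIVE mode:
// - if the row has fewer tokens than expected, pad the tail with nulls so the
//   missing columns end up as null after conversion;
// - if it has more tokens than expected, drop the extra trailing tokens.
val checkedTokens = if (options.permissive && inputSchema.length > tokens.length) {
  tokens ++ new Array[String](inputSchema.length - tokens.length)
} else if (options.permissive && inputSchema.length < tokens.length) {
  tokens.take(inputSchema.length)
} else {
  tokens
}
```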
Test build #73322 has finished for PR 16928 at commit
Test build #73328 has started for PR 16928 at commit
Test build #73331 has started for PR 16928 at commit
        tokens
      }

      // If we need to handle corrupt fields, it adds an extra token to skip a field for malformed
@HyukjinKwon Does this fix satisfy your intention? I slightly modified the code based on yours.
python/pyspark/sql/readwriter.py
Outdated
            ``null`` for extra fields.
            ``columnNameOfCorruptRecord``. An user-defined schema can include \
            a string type field named ``columnNameOfCorruptRecord`` for corrupt records. \
            When a schema is set by user, it sets ``null`` for extra fields.
what about the other 2 modes? do they also set null for extra fields?
No. The two modes do not set null. In failFast mode, it fails the job. In dropMalformed mode, it drops the malformed lines whose length is shorter or longer than the schema.
does json have similar behavior?
Ah..., a bit different, I think. As @HyukjinKwon said above (#16928 (comment)), CSV formats depend on the length of parsed tokens (if the length is shorter, it fills null, and if longer, it drops the extra tokens in permissive mode). On the other hand, in JSON formats, fields in a required schema are mapped by key. In case of missing keys, JSON just sets null in those fields in all three modes. cc: @HyukjinKwon
e.g.)
import org.apache.spark.sql.types._
scala> Seq("""{"a": "a", "b" : 1}""", """{"a": "a"}""").toDF().write.text("/Users/maropu/Desktop/data")
scala> val dataSchema = StructType(StructField("a", StringType, true) :: StructField("b", IntegerType, true) :: Nil)
scala> spark.read.schema(dataSchema).option("mode", "PERMISSIVE").json("/Users/maropu/Desktop/data").show()
+---+----+
| a| b|
+---+----+
| a| 1|
| a|null|
+---+----+
scala> spark.read.schema(dataSchema).option("mode", "FAILFAST").json("/Users/maropu/Desktop/data").show()
+---+----+
| a| b|
+---+----+
| a| 1|
| a|null|
+---+----+
scala> spark.read.schema(dataSchema).option("mode", "DROPMALFORMED").json("/Users/maropu/Desktop/data").show()
+---+----+
| a| b|
+---+----+
| a| 1|
| a|null|
+---+----+
      val checkedTokens = if (options.permissive) {
        // If a length of parsed tokens is not equal to expected one, it makes the length the same
        // with the expected. If the length is shorter, it adds extra tokens in the tail.
        // If longer, it drops extra tokens.
we should revisit this in the future. If the token length doesn't match the expected schema, we should treat it as a malformed record. cc @HyukjinKwon
Should we also put that malformed record (shorter or longer) into a corrupt field?
Yup, I agree in a way, but I guess "it is pretty common that CSV is malformed in this way" (said by the analysis team in my company). Could we leave it as is for now? Let me try to raise a different JIRA after checking R's read.csv or other libraries.
Jenkins, retest this please.
      // strings when loading parsed tokens into a resulting `row`.
      corruptFieldIndex.map { corrFieldIndex =>
        val (front, back) = lengthSafeTokens.splitAt(corrFieldIndex)
        front ++ new Array[String](1) ++ back
This will introduce a lot of extra object allocation; I think the previous version is better.
We have two options: (1) just revert this part, or (2) modify this part to avoid the allocation, based on the code below. cc: @HyukjinKwon
e.g.) This is just an example and it may be a little hard to understand:
val parsedTokens = new Array[String](schema.length)
...
// If we need to handle corrupt fields, it adds an extra token to skip a field for malformed
// strings when loading parsed tokens into a resulting `row`.
corruptFieldIndex.map { corrFieldIndex =>
lengthSafeTokens.splitAt(corrFieldIndex) match { case (front, back) =>
front.zipWithIndex.foreach { case (s, i) =>
parsedTokens(i) = s
}
back.zipWithIndex.foreach { case (s, i) =>
parsedTokens(schema.length - back.length + i) = s
}
}
parsedTokens
We should go with the committer's opinion. I am fine if we revert. I personally prefer (1), reverting this change, if this does not sound good.
Probably, it'd be better to leave comments here as TODO.
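For reference, a minimal sketch of what option (2) could look like without the intermediate arrays from splitAt; schema, corruptFieldIndex, and lengthSafeTokens are assumed to be defined as in the snippets above:
```
// Copy tokens into one pre-sized array, leaving an empty slot at the corrupt-field
// position, instead of allocating the splitAt halves and the concatenated result.
val parsedTokens = new Array[String](schema.length)
corruptFieldIndex match {
  case Some(corrIdx) =>
    var i = 0
    while (i < lengthSafeTokens.length) {
      // Tokens at or after the corrupt column shift one slot to the right.
      val target = if (i < corrIdx) i else i + 1
      parsedTokens(target) = lengthSafeTokens(i)
      i += 1
    }
  case None =>
    System.arraycopy(lengthSafeTokens, 0, parsedTokens, 0, lengthSafeTokens.length)
}
```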
python/pyspark/sql/readwriter.py
Outdated
@@ -193,8 +193,9 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

        * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
            record and puts the malformed string into a new field configured by \
            ``columnNameOfCorruptRecord``. When a schema is set by user, it sets \
            ``null`` for extra fields.
            ``columnNameOfCorruptRecord``. An user-defined schema can include \
Please rephrase this documentation a little bit to make it clearer.
okay, I'll brush it up
Updated; could you check this again? Thanks!
    require(schema(corrFieldIndex).nullable)
  }

  private val dataSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
I just realised we only use the length of dataSchema now. Could we just use the length if more commits should be pushed?
ok, I'll update
I reverted some parts of the code, and now dataSchema is used for more than just its length (https://github.com/apache/spark/pull/16928/files#diff-d19881aceddcaa5c60620fdcda99b4c4R57). So, I kept this variable as it is.
Test build #73335 has finished for PR 16928 at commit
Test build #73348 has finished for PR 16928 at commit
Test build #73353 has finished for PR 16928 at commit
thanks, merging to master!
… CSV data

## What changes were proposed in this pull request?
This pr added a logic to put malformed tokens into a new field when parsing CSV data in case of permissive modes. In the current master, if the CSV parser hits these malformed ones, it throws an exception below (and then a job fails);
```
Caused by: java.lang.IllegalArgumentException
  at java.sql.Date.valueOf(Date.java:143)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
  at
```
In case that users load large CSV-formatted data, the job failure makes users get some confused. So, this fix set NULL for original columns and put malformed tokens in a new field.

## How was this patch tested?
Added tests in `CSVSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes apache#16928 from maropu/SPARK-18699-2.
            a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
            schema. If a schema does not have the field, it drops corrupt records during \
            parsing. When inferring a schema, it implicitly adds a \
            ``columnNameOfCorruptRecord`` field in an output schema.
@maropu For JSON, we implicitly add the columnNameOfCorruptRecord field during schema inference when the mode is PERMISSIVE. What is the reason we are not doing the same thing for CSV schema inference?
(Sorry for interrupting) Yea, it should be consistent and we probably should change it. We should probably also treat records with fewer or more tokens than the schema as malformed records in PERMISSIVE mode, rather than filling some of them in. @cloud-fan raised this issue before and I had a talk with some data analysts; it looked like some agreed and others did not, so I decided not to change the current behaviour for now.
To cut it short, the reason (I assume) is that I could not imagine a simple, common case that fails to parse CSV (not during conversion) with the current implementation. If there is one, we should match the behaviour.
I am currently outside and this is my phone. I will double-check this when I get to my computer, but this should be correct unless I have missed some changes in this code path.
In CSV, records with fewer or more tokens than the schema are already viewed as malformed records in (at least) 2.2. I did not check the previous versions. I think we need to implicitly add the columnNameOfCorruptRecord column during schema inference too.
There is more than one issue here. The default of columnNameOfCorruptRecord does not respect the session conf spark.sql.columnNameOfCorruptRecord.
Will submit a PR soon for fixing both issues.
Now, users have to manually add the columnNameOfCorruptRecord column to see these malformed records.
@gatorsmile, I just got to my laptop.
I checked that when the length of tokens is more than the schema, it fills the malformed column, with the data below:

a,a

(BTW, it looks like it respects spark.sql.columnNameOfCorruptRecord?)
scala> spark.read.schema("a string, _corrupt_record string").csv("test.csv").show()
+---+---------------+
| a|_corrupt_record|
+---+---------------+
| a| a,a|
+---+---------------+
scala> spark.conf.set("spark.sql.columnNameOfCorruptRecord", "abc")
scala> spark.read.schema("a string, abc string").csv("test.csv").show()
+---+---+
| a|abc|
+---+---+
| a|a,a|
+---+---+
And, I found another bug (when the length is less than the schema), with this data:
a
a
a
a
a
scala> spark.read.schema("a string, b string, _corrupt_record string").csv("test.csv").show()
prints ...
17/06/05 09:45:26 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.NullPointerException
at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:50)
at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:43)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:64)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:236)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:230)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It looks like getCurrentInput produces null as the input is all parsed.
Another thing I would like to note (just so we don't forget the difference): JSON produces null in the columns and puts the contents in the malformed column. With the input:
{"a": 1, "b": "a"}
scala> spark.read.json("test.json").show()
+---+---+
| a| b|
+---+---+
| 1| a|
+---+---+
scala> spark.read.schema("a string, b int, _corrupt_record string").json("test.json").show()
+----+----+------------------+
| a| b| _corrupt_record|
+----+----+------------------+
|null|null|{"a": 1, "b": "a"}|
+----+----+------------------+
Oh... I was writing my comments before seeing yours. Yes, I agree with your comments.
Let me take a shot at fixing the bug I found above (NullPointerException). I think this can be easily fixed (though I am pretty sure the behaviour could be arguable). I will open a PR and cc you to show what it looks like.
Sorry for my late response. Yea, I also think these behaviours should be the same. I tried in this PR, but I couldn't do it because (as you both already noticed) we can't easily add a new column in the CSV code path. So, I think we probably need some refactoring of DataSource to make this behaviour consistent.
What changes were proposed in this pull request?
This PR added logic to put malformed tokens into a new field when parsing CSV data in permissive mode. In the current master, if the CSV parser hits such a malformed token, it throws an exception (for example, an IllegalArgumentException from java.sql.Date.valueOf) and the job fails.
When users load large CSV-formatted data, such a job failure can be confusing. So, this fix sets NULL for the original columns and puts the malformed tokens into a new field.
How was this patch tested?
Added tests in CSVSuite.
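A minimal usage sketch of the behaviour described above, mirroring the examples earlier in this thread; the paths, data, and the output shown in comments are illustrative:
```
import org.apache.spark.sql.types._

// A CSV file where the first row has a token that cannot be cast to IntegerType.
Seq("a,1", "2,b").toDF().write.text("/tmp/csv-malformed")

val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", StringType, true) ::
  StructField("_corrupt_record", StringType, true) :: Nil)

// In PERMISSIVE mode the bad row no longer fails the job: the typed columns are
// set to null and the raw line is kept in the user-provided corrupt-record column.
// Expected output, roughly:
// +----+----+---------------+
// |   a|   b|_corrupt_record|
// +----+----+---------------+
// |null|null|            a,1|
// |   2|   b|           null|
// +----+----+---------------+
spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .csv("/tmp/csv-malformed")
  .show()
```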