[SPARK-20978][SQL] Set null for malformed column when the number of tokens is less than schema in CSV read/permissive mode #18200
Conversation
Test build #77731 has started for PR 18200 at commit
cc @cloud-fan and @gatorsmile, I think this needs a decision (putting it in that column or not). It looks like either is better than an NPE. Please let me know.
retest this please
Test build #77743 has finished for PR 18200 at commit
Sorry for interrupting you though, is this a blocker for the v2.2 release? It seems v2.1 accepts this case:
@maropu, I think it is fine. It only happens when the malformed column is selected. For example, the case below is fine:

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(new StructType().add("a", IntegerType).add("b", IntegerType)).csv("test.csv").show
+---+----+
|  a|   b|
+---+----+
|  1|   3|
|  1|null|
+---+----+
```

but it only happens when the malformed column exists in the schema:

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(new StructType().add("a", IntegerType).add("b", IntegerType).add("_corrupt_record", StringType)).csv("test.csv").show
17/06/07 05:54:55 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
  at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
  at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
  at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
  at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
  at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:50)
  at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:43)
  at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:64)
```

I think we introduced ...
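(For reference, the contents of `test.csv` are not shown in this thread; a minimal sketch of a file that would reproduce the output above, assuming a spark-shell session with implicits in scope, could be written like this.)

```scala
// Hypothetical test.csv contents inferred from the output above:
// "1,3" fills both columns, while "1" is missing the second column.
import spark.implicits._

Seq("1,3", "1").toDS().write.text("test.csv")
```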
@HyukjinKwon oh, I see. Thanks for your kind explanation!
.schema("a string, b string, unparsed string") | ||
.option("columnNameOfCorruptRecord", "unparsed") | ||
.csv(Seq("a").toDS()) | ||
checkAnswer(df, Row("a", null, null)) |
I did not read the code yet, but it looks like the result is wrong? We should output the `a` in the corrupted column.
In addition, this scenario works if we pass CSV files that contain fewer column values than the schema, right?
Sorry for cutting in, but how to handle these longer/shorter cases is somewhat arguable, as @HyukjinKwon said in #16928 (comment). IMHO we currently just regard the shorter cases as not corrupted so as to keep the existing behaviour (the previous Spark releases have regarded these cases as not corrupted, so other developers said that this behaviour change was not acceptable in earlier PRs).
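(For illustration only, the "longer" and "shorter" cases discussed here, against a hypothetical two-column schema; the literals below are made up, not from this thread.)

```scala
// Hypothetical schema: "a int, b int" (two columns).
val longer  = "1,2,3"  // more tokens than schema columns (the "longer" case)
val shorter = "1"      // fewer tokens than schema columns (the "shorter" case)
```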
Without this PR, you can try this example.

```scala
val corruptCVSRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """1997, "b"""" ::
    """2015""" :: Nil))(Encoders.STRING)

val df1 = spark.read
  .schema("col1 int, col2 String, __corrupted_column_name string")
  .option("columnNameOfCorruptRecord", "__corrupted_column_name")
  .option("mode", PermissiveMode.name)
  .csv(corruptCVSRecords)
df1.show()
```

The output result shows that `PERMISSIVE` mode also treats the records with fewer columns as corrupted:

```
+----+----+-----------------------+
|col1|col2|__corrupted_column_name|
+----+----+-----------------------+
|1997| "b"|                   null|
|2015|null|                   2015|
+----+----+-----------------------+
```
```diff
@@ -53,7 +53,8 @@ class UnivocityParser(
   // Retrieve the raw record string.
   private def getCurrentInput: UTF8String = {
-    UTF8String.fromString(tokenizer.getContext.currentParsedContent().stripLineEnd)
+    UTF8String.fromString(
+      Option(tokenizer.getContext.currentParsedContent()).map(_.stripLineEnd).orNull)
```
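(As a side note, a minimal standalone sketch of the null-safe pattern used in the line above; the helper name here is illustrative, not from the PR.)

```scala
// Wrapping a possibly-null Java-style return value in Option means
// .stripLineEnd is only applied when there is actually parsed content;
// otherwise null is propagated instead of throwing a NullPointerException.
def safeCurrentInput(raw: String): String =
  Option(raw).map(_.stripLineEnd).orNull

assert(safeCurrentInput("a,b\n") == "a,b")
assert(safeCurrentInput(null) == null)  // this case previously caused the NPE
```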
BTW, this looks like it is only called in PERMISSIVE mode, when the malformed column is defined.
Looks so; is something wrong with this?
(Sorry, I just meant it as a note for the review ...)
(oh, ... sorry)
@maropu and @gatorsmile, I will ask and double-check this one in the Univocity parser (it looks like I misunderstood). I will reopen this PR with more details soon. Thanks for your review.
I tried to upgrade
Let me give the example here.

Case 1:

```scala
val corruptCVSRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """1, "b"""" ::
    """2015""" :: Nil))(Encoders.STRING)
```

Case 2:

```scala
val corruptCVSRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """1, "b"""" ::
    """2""" :: Nil))(Encoders.STRING)
```

When passing the `corruptCVSRecords`:

```scala
val df1 = spark.read
  .schema("col1 int, col2 String, __corrupted_column_name string")
  .option("columnNameOfCorruptRecord", "__corrupted_column_name")
  .option("mode", PermissiveMode.name)
  .csv(corruptCVSRecords)
df1.show()
```
Thanks for trying it out. I also opened an issue -
@maropu and @gatorsmile, I double-checked, and it is fixed in the newer Univocity version:

```scala
val df = spark.read
  .schema("a string, b string, unparsed string")
  .option("columnNameOfCorruptRecord", "unparsed")
  .csv(Seq("a").toDS())
df.show()
```

```scala
val corruptCVSRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """1, "b"""" ::
    """2""" :: Nil))(Encoders.STRING)

val df1 = spark.read
  .schema("col1 int, col2 String, __corrupted_column_name string")
  .option("columnNameOfCorruptRecord", "__corrupted_column_name")
  .option("mode", PermissiveMode.name)
  .csv(corruptCVSRecords)
df1.show()
```
Thank you for checking this. I will double-check and will propose to bump this library up after the release.
Looks great!
It might be too early to bump to 2.5.x, which might introduce other bugs that have not been exposed yet. We should still be able to resolve the issue without the fix in the parser.
Yea, I do understand the concern. I asked this in the issue to be sure. I will check the other related changes before opening a PR.
Instead of upgrading the parser, as @gatorsmile suggested above, should we look for another way to keep the original text for this case?
I am not against fixing this within Spark for now. That would be a workaround anyway, and I guess we would want to remove it when upgrading Univocity. If the fix is simple, I could work around it for now, but I am not sure if it is that simple (uniVocity/univocity-parsers@08b67c3). I guess using existing functionalities should be safe. Also, there is another (not for now) related issue fixed in ... I will ask the author again whether it is safe or not when I am about to open a PR. I guess we will have some time to check this after upgrading (in the early stage of a release, Spark 2.3.0, assuming the 2.2.0 release is soon), and it should be good to find/report other bugs in Univocity if there are any.
What changes were proposed in this pull request?
This PR proposes to fix an NPE thrown when the number of tokens is less than the number of fields in the schema (namely, the input was fully parsed but the number of tokens does not match that of the schema).
Currently, `tokenizer.getContext.currentParsedContent()` can return `null`, as the actual input was successfully parsed and, at this stage, parsing of the next record has not started (and therefore there is no current parsed content).

Before
After
It looks like we might not need to put this in the malformed column since, in this case, the input is parsed correctly and putting it there might not be very useful (whereas when the number of tokens exceeds the schema, we need to know the truncated contents).
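For concreteness, a minimal sketch of the expected behaviour after this change, based on the unit test added in this PR (the `assert` form below is illustrative; the test itself uses `checkAnswer`, and a spark-shell session with implicits in scope is assumed):

```scala
import org.apache.spark.sql.Row
import spark.implicits._

val df = spark.read
  .schema("a string, b string, unparsed string")
  .option("columnNameOfCorruptRecord", "unparsed")
  .csv(Seq("a").toDS())

// The single-token record "a" is shorter than the three-column schema:
// the missing field and the corrupt-record column "unparsed" become null,
// instead of the read failing with a NullPointerException.
assert(df.collect().toSeq == Seq(Row("a", null, null)))
```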
How was this patch tested?

Unit tests added in `CSVSuite`.