
[BUG] rereading a written csv file drops rows #6917

Open · Tracked by #2063
eordentlich opened this issue Oct 25, 2022 · 1 comment
Labels: bug (Something isn't working)

@eordentlich (Contributor)

Describe the bug
Reading a csv file results in a dataframe with 120000 rows. Writing the dataframe to a new csv file and then rereading it with spark-rapids results in a dataframe with fewer rows (119470 if the write is done by spark-rapids, 117308 if the write is done with pyspark). No error is thrown.

Steps/Code to reproduce bug
This was observed with the csv file downloadable here.

Note that this csv file quotes the second column to allow commas within fields, and also contains escaped quotes.

Use the pyspark API; tested in a Databricks Python notebook. Haven't checked a local setup.

Download and save the file to dbfs:/news_category_train.csv.

Run the following:

# Initial read: expected to yield 120000 rows
news_df = spark.read.option("header", True).csv("dbfs:/news_category_train.csv")
print(news_df.count())
# Round-trip: write the dataframe back out as csv, then reread it
news_df.write.csv("dbfs:/news_category_train_rapids.csv", header=True)
news_df_reread = spark.read.option("header", True).csv("dbfs:/news_category_train_rapids.csv")
print(news_df_reread.count())

The two counts were observed to differ: the first is 120000 and the second 119470. The behavior is similar if the first read and the write are done without spark-rapids and only the second read uses spark-rapids.

Baseline Spark yields 120000 rows in all cases, even when reading a spark-rapids-written csv.
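
Not part of the original report, but one quick way to surface which rows go missing is to diff the two DataFrames (a sketch assuming Spark >= 2.4, where exceptAll is available; it reuses the variable names from the snippet above):

# Hypothetical diagnostic (not from the issue): list rows present in the
# original DataFrame but absent from the reread one. Run with the GPU plugin
# disabled so the comparison itself is not affected by the reader bug.
spark.conf.set("spark.rapids.sql.enabled", "false")
missing = news_df.exceptAll(news_df_reread)
print(missing.count())           # number of dropped rows
missing.show(5, truncate=False)  # inspect a few of them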

Expected behavior
No rows should be dropped when a DataFrame originally read from a csv file is written back out as csv and reread.

Environment details (please complete the following information)

  • Environment location: Azure Databricks notebook, spark-rapids 22.10, 2 T4 worker nodes.
  • Spark configuration settings related to the issue
spark.task.resource.gpu.amount 1
spark.task.cpus 8
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.databricks.delta.preview.enabled true
spark.kryoserializer.buffer.max 2000M
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.10.0.jar:/databricks/spark/python
spark.jsl.settings.pretrained.cache_folder dbfs:/eordentlich/spark-nlp/cached_pretrained
spark.sql.execution.arrow.maxRecordsPerBatch 100000
spark.jsl.settings.annotator.log_folder dbfs:/eordentlich/spark-nlp/annotator_logs
spark.executor.cores 8
spark.rapids.memory.gpu.minAllocFraction 0.0001
spark.rapids.memory.gpu.allocFraction 0.25
spark.plugins com.nvidia.spark.SQLPlugin
spark.locality.wait 0s
spark.rapids.memory.gpu.pooling.enabled true
spark.rapids.sql.explain ALL
spark.kryo.registrator com.nvidia.spark.rapids.GpuKryoRegistrator
spark.rapids.memory.gpu.reserve 20
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 2G
spark.python.daemon.module rapids.daemon_databricks
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.rapids.sql.enabled true
spark.databricks.delta.optimizeWrite.enabled false
spark.rapids.sql.concurrentGpuTasks 2


@eordentlich added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Oct 25, 2022
@firestarman (Collaborator) commented Oct 27, 2022

The root cause is probably the same as in #6435 (comment).

Some rows in the written CSV file differ from the original file, and some of the changed rows now contain sequences like \"", which the GPU reader cannot handle correctly, as mentioned in #6435 (comment).

Here is a minimal repro case.
The 4 rows are picked from the written file; the first row ends with \"", which causes the following two rows to be skipped by the GPU reader.

category,description
Sci/Tech,The U.S. Forest Service on Wednesday rejected environmentalists' appeal of a plan to poison a stream south of Lake Tahoe to aid what wildlife officials call \"the rarest trout in America.\""
Sci/Tech,One of the pleasures of    stargazing is noticing and enjoying the various colors that stars display in    dark skies. These hues offer direct visual evidence of how stellar temperatures    vary.
Sci/Tech,"Britain granted its first license for human cloning Wednesday, joining South Korea on the leading edge of stem cell research, which is restricted by the Bush administration and which many scientists believe may lead to new treatments for a range of diseases."
Sci/Tech,Meteorologists at North Carolina State University are working on a way to more accurately measure rainfall in small areas.
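
For context (an illustration of mine, not from the thread): with Spark's default CSV writer options (quote = ", escape = \), a field whose content ends in a quote character is written as a backslash-escaped quote immediately followed by the closing quote, which is exactly the trailing \"" sequence above:

# Illustration (assumes Spark's default CSV options: quote='"', escape='\\').
df = spark.createDataFrame([("Sci/Tech", 'rarest trout in America."')],
                           ["category", "description"])
df.write.csv("/tmp/6917_demo", header=True)
# The data row in the output file looks like:
#   Sci/Tech,"rarest trout in America.\""
# i.e. the escaped embedded quote (\") directly followed by the closing quote (").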

CPU

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|One of the pleasu...|
|Sci/Tech|Britain granted i...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+

GPU

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+
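
One possible mitigation, not mentioned in the thread and untested here: Spark's CSV source accepts an escape option on both read and write. Setting it to the quote character itself makes the writer emit RFC-4180-style doubled quotes ("") instead of backslash-escaped ones (\"), avoiding the \"" sequences entirely. A sketch, reusing news_df from the repro above (the output path is hypothetical):

# Hypothetical workaround sketch: use doubled-quote escaping on write...
news_df.write.option("escape", '"').csv("dbfs:/news_category_train_rfc4180.csv",
                                        header=True)
# ...and the matching escape setting on read.
reread = (spark.read
          .option("header", True)
          .option("escape", '"')
          .csv("dbfs:/news_category_train_rfc4180.csv"))
print(reread.count())  # should match the original 120000 if the workaround holds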
