
[BUG] rereading a written csv file drops rows #6917

Open · Tracked by #2063
eordentlich opened this issue Oct 25, 2022 · 1 comment
Labels: bug (Something isn't working)

@eordentlich (Contributor)

Describe the bug
Reading a csv file results in a dataframe with 120000 rows. Writing the dataframe to a new csv file and then rereading it with spark-rapids results in a dataframe with fewer rows (119470 if the write is done by spark-rapids, 117308 if the write is done with pyspark). No error is thrown.

Steps/Code to reproduce bug
This was observed with the csv file downloadable here.

Note that this csv file quotes the second column to allow commas within fields, and also contains escaped quotes.

Use the pyspark API; tested in a Databricks Python notebook. Haven't checked a local setup.

Download and save the file to dbfs:/news_category_train.csv.

Run the following:

# Initial read: expected to yield 120000 rows
news_df = spark.read.option("header", True).csv("dbfs:/news_category_train.csv")
print(news_df.count())
# Round-trip: write the dataframe back out as csv, then reread it
news_df.write.csv("dbfs:/news_category_train_rapids.csv", header=True)
news_df_reread = spark.read.option("header", True).csv("dbfs:/news_category_train_rapids.csv")
print(news_df_reread.count())

The two counts were observed to differ: the first is 120000 and the second 119470. The behavior is similar if the first read and the write are done without spark-rapids and only the second read uses spark-rapids.

Baseline Spark yields 120000 rows in all cases, even when reading a spark-rapids-written csv.
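
Not part of the original report, but one quick way to surface which rows go missing is to diff the two DataFrames (a sketch assuming Spark >= 2.4, where exceptAll is available; it reuses the variable names from the snippet above):

# Hypothetical diagnostic (not from the issue): list rows present in the
# original DataFrame but absent from the reread one. Run with the GPU plugin
# disabled so the comparison itself is not affected by the reader bug.
spark.conf.set("spark.rapids.sql.enabled", "false")
missing = news_df.exceptAll(news_df_reread)
print(missing.count())           # number of dropped rows
missing.show(5, truncate=False)  # inspect a few of them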

Expected behavior
No rows should be dropped when a DataFrame originally read from a csv file is written back out as csv and reread.

Environment details (please complete the following information)

  • Environment location: Azure Databricks notebook, spark-rapids 22.10, 2 T4 worker nodes.
  • Spark configuration settings related to the issue
spark.task.resource.gpu.amount 1
spark.task.cpus 8
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.databricks.delta.preview.enabled true
spark.kryoserializer.buffer.max 2000M
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.10.0.jar:/databricks/spark/python
spark.jsl.settings.pretrained.cache_folder dbfs:/eordentlich/spark-nlp/cached_pretrained
spark.sql.execution.arrow.maxRecordsPerBatch 100000
spark.jsl.settings.annotator.log_folder dbfs:/eordentlich/spark-nlp/annotator_logs
spark.executor.cores 8
spark.rapids.memory.gpu.minAllocFraction 0.0001
spark.rapids.memory.gpu.allocFraction 0.25
spark.plugins com.nvidia.spark.SQLPlugin
spark.locality.wait 0s
spark.rapids.memory.gpu.pooling.enabled true
spark.rapids.sql.explain ALL
spark.kryo.registrator com.nvidia.spark.rapids.GpuKryoRegistrator
spark.rapids.memory.gpu.reserve 20
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 2G
spark.python.daemon.module rapids.daemon_databricks
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.rapids.sql.enabled true
spark.databricks.delta.optimizeWrite.enabled false
spark.rapids.sql.concurrentGpuTasks 2


@eordentlich added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Oct 25, 2022
@firestarman (Collaborator) commented Oct 27, 2022

The root cause is probably the same as in #6435 (comment).

Some rows in the written CSV file differ from the original file, and some of the changed rows now contain sequences like \"", which the GPU reader cannot handle correctly, as mentioned in #6435 (comment).

Here is a minimal repro case.
The 4 rows are picked from the written file; the first row ends with \"", which causes the following two rows to be skipped by the GPU reader.

category,description
Sci/Tech,The U.S. Forest Service on Wednesday rejected environmentalists' appeal of a plan to poison a stream south of Lake Tahoe to aid what wildlife officials call \"the rarest trout in America.\""
Sci/Tech,One of the pleasures of    stargazing is noticing and enjoying the various colors that stars display in    dark skies. These hues offer direct visual evidence of how stellar temperatures    vary.
Sci/Tech,"Britain granted its first license for human cloning Wednesday, joining South Korea on the leading edge of stem cell research, which is restricted by the Bush administration and which many scientists believe may lead to new treatments for a range of diseases."
Sci/Tech,Meteorologists at North Carolina State University are working on a way to more accurately measure rainfall in small areas.
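
For context (an illustration of mine, not from the thread): with Spark's default CSV writer options (quote = ", escape = \), a field whose content ends in a quote character is written as a backslash-escaped quote immediately followed by the closing quote, which is exactly the trailing \"" sequence above:

# Illustration (assumes Spark's default CSV options: quote='"', escape='\\').
df = spark.createDataFrame([("Sci/Tech", 'rarest trout in America."')],
                           ["category", "description"])
df.write.csv("/tmp/6917_demo", header=True)
# The data row in the output file looks like:
#   Sci/Tech,"rarest trout in America.\""
# i.e. the escaped embedded quote (\") directly followed by the closing quote (").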

CPU

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|One of the pleasu...|
|Sci/Tech|Britain granted i...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+

GPU

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+
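
One possible mitigation, not mentioned in the thread and untested here: Spark's CSV source accepts an escape option on both read and write. Setting it to the quote character itself makes the writer emit RFC-4180-style doubled quotes ("") instead of backslash-escaped ones (\"), avoiding the \"" sequences entirely. A sketch, reusing news_df from the repro above (the output path is hypothetical):

# Hypothetical workaround sketch: use doubled-quote escaping on write...
news_df.write.option("escape", '"').csv("dbfs:/news_category_train_rfc4180.csv",
                                        header=True)
# ...and the matching escape setting on read.
reread = (spark.read
          .option("header", True)
          .option("escape", '"')
          .csv("dbfs:/news_category_train_rfc4180.csv"))
print(reread.count())  # should match the original 120000 if the workaround holds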
