
[BUG] spark.rapids.sql.exec.CollectLimitExec=true can mess up the CSV header row #6814

Open
Tracked by #2063
viadea opened this issue Oct 14, 2022 · 5 comments
Labels
bug Something isn't working


viadea commented Oct 14, 2022

If we enable spark.rapids.sql.exec.CollectLimitExec=true on a 2-node cluster, a CSV read with a header may be messed up.

For example, let's use this sample CSV file:

wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv

The format of this csv file is like this:

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.

I can reproduce the issue on both Databricks and Dataproc.
Here is the minimal repro on Dataproc:

1. After a 2-node Dataproc cluster is ready, SSH to the master node:

gcloud compute ssh $CLUSTER_NAME-w-0 --project=rapids-spark --zone=$ZONE
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
hadoop fs -put news_category_train.csv /tmp/

2. In spark-shell:

spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", true)
spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)

3. Run the above command a few times; sometimes it will show a result like:
scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sci/Tech|Scot Wingo, author of eBay Strategies: 10 Proven Methods to Maximize Your eBay Business, will answer reader questions about the online marketplace. Wingo is president and chief executive of ChannelAdvisor, an eBay consignment franchise.|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Business|                                                                                                                                                                                           Short sellers, Wall Street's dwindling band of...|
|Business|                                                                                                                                                                                           Private investment firm Carlyle Group, which h...|

Sometimes it will show the correct result:

scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows

Env:
I can reproduce this using the latest 22.10 snapshot and also the 22.06 GA jar.
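A hedged side note (my suggestion, not something verified in this thread): since the corruption shows up in the inferred header, possible workarounds are to leave spark.rapids.sql.exec.CollectLimitExec at its default of false, or to pass an explicit schema so the read does not depend on header/schema inference at all:

```scala
// spark-shell config/usage fragment; these are assumed workarounds, not confirmed fixes.
// 1) Keep CollectLimitExec on the CPU (false is the plugin default):
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", false)

// 2) Or supply an explicit schema so Spark never infers one from a collected row
//    (with header=true the header line is still skipped at parse time):
import org.apache.spark.sql.types.{StructField, StructType, StringType}
val schema = StructType(Seq(
  StructField("category", StringType),
  StructField("description", StringType)))
spark.read.option("header", true).schema(schema)
  .csv("/tmp/news_category_train.csv").show(5, truncate = 50)
```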

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 14, 2022
@sameerz sameerz mentioned this issue Oct 14, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 18, 2022
@firestarman

firestarman commented Nov 1, 2022

Hi @viadea, does "2 nodes cluster" mean two executors or two workers?
And could you share the Spark version being used?

@firestarman

Actually this is not a CSV issue, but an issue with the GPU version of CollectLimitExec.

The GPU version of CollectLimitExec does not always return the header line as the first row, due to a shuffle inside it, but Spark treats the single returned row as the CSV header line when inferring the CSV schema.

Details can be found in the earlier issue #882, especially the comment #882 (comment).
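The order dependence can be modeled outside Spark. The sketch below is hypothetical illustration code, not plugin code: a collect-limit that consumes partitions in order always hands back the real header line, while one that lets any partition arrive first (as a shuffle allows) can hand schema inference an ordinary data row instead.

```scala
// Hypothetical standalone model (no Spark involved) of why a non-order-preserving
// "collect limit 1" breaks CSV header inference: the header is only the first
// line of the first partition.
object CollectLimitSketch {
  // Two "partitions" of a header-bearing CSV, split line by line.
  val partitions: Vector[Vector[String]] = Vector(
    Vector("category,description",
           "Business,\" Short sellers, Wall Street's dwindling band...\""),
    Vector("Business,\" Private investment firm Carlyle Group...\"")
  )

  // CPU-style collect-limit: partitions are consumed in order, so the first
  // row returned is always the real header line.
  def firstRowOrdered: String = partitions.flatten.head

  // Shuffle-style collect-limit: whichever partition arrives first supplies
  // the row, so the "header" Spark sees may be an ordinary data row.
  def firstRowShuffled(firstPartition: Int): String =
    partitions(firstPartition).head
}
```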

@firestarman

Filed a tracking issue #7005. And I am going to mark it as done in the CSV epic issue.

@viadea

viadea commented Nov 9, 2022

@firestarman "2 nodes cluster" means 2 worker servers

@GaryShen2008

Moving out of the 22.12 target.
