
[BUG] spark.rapids.sql.exec.CollectLimitExec=true can mess up the CSV header row #6814

Open
Tracked by #2063
viadea opened this issue Oct 14, 2022 · 5 comments
Labels
bug Something isn't working


viadea commented Oct 14, 2022

If we enable spark.rapids.sql.exec.CollectLimitExec=true on a 2-node cluster, a CSV read with a header may be messed up.

For example, let's use this sample CSV file:

wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv

The format of this csv file is like this:

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.

I can reproduce the issue on both Databricks and Dataproc.
Here is the minimal repro on Dataproc:

1. After a 2-node Dataproc cluster is ready, SSH to the master node:

gcloud compute ssh $CLUSTER_NAME-w-0 --project=rapids-spark --zone=$ZONE
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
hadoop fs -put news_category_train.csv /tmp/

2. In spark-shell:

spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", true)
spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)

3. Run the above command a few times; sometimes it will show a result like:
scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sci/Tech|Scot Wingo, author of eBay Strategies: 10 Proven Methods to Maximize Your eBay Business, will answer reader questions about the online marketplace. Wingo is president and chief executive of ChannelAdvisor, an eBay consignment franchise.|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Business|                                                                                                                                                                                           Short sellers, Wall Street's dwindling band of...|
|Business|                                                                                                                                                                                           Private investment firm Carlyle Group, which h...|

Sometimes it will show the correct result:

scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows

Env:
I can reproduce this using the latest 22.10 snapshot and also the 22.06 GA jar.
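A hedged side note (my suggestion, not something verified in this thread): since the corruption shows up in the inferred header, possible workarounds are to leave spark.rapids.sql.exec.CollectLimitExec at its default of false, or to pass an explicit schema so the read does not depend on header/schema inference at all:

```scala
// spark-shell config/usage fragment; these are assumed workarounds, not confirmed fixes.
// 1) Keep CollectLimitExec on the CPU (false is the plugin default):
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", false)

// 2) Or supply an explicit schema so Spark never infers one from a collected row
//    (with header=true the header line is still skipped at parse time):
import org.apache.spark.sql.types.{StructField, StructType, StringType}
val schema = StructType(Seq(
  StructField("category", StringType),
  StructField("description", StringType)))
spark.read.option("header", true).schema(schema)
  .csv("/tmp/news_category_train.csv").show(5, truncate = 50)
```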

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 14, 2022
@sameerz sameerz mentioned this issue Oct 14, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 18, 2022
@firestarman

firestarman commented Nov 1, 2022

Hi @viadea, does "2 nodes cluster" mean two executors or two workers?
And could you share the Spark version being used?

@firestarman

Actually this is not a CSV issue, but an issue with the GPU version of CollectLimitExec.

The GPU version of CollectLimitExec does not always return the header line as the first row, due to a shuffle inside it, but Spark treats the single returned row as the CSV header line when inferring the CSV schema.

Details can be found in the earlier issue #882, especially the comment #882 (comment).
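The order dependence can be modeled outside Spark. The sketch below is hypothetical illustration code, not plugin code: a collect-limit that consumes partitions in order always hands back the real header line, while one that lets any partition arrive first (as a shuffle allows) can hand schema inference an ordinary data row instead.

```scala
// Hypothetical standalone model (no Spark involved) of why a non-order-preserving
// "collect limit 1" breaks CSV header inference: the header is only the first
// line of the first partition.
object CollectLimitSketch {
  // Two "partitions" of a header-bearing CSV, split line by line.
  val partitions: Vector[Vector[String]] = Vector(
    Vector("category,description",
           "Business,\" Short sellers, Wall Street's dwindling band...\""),
    Vector("Business,\" Private investment firm Carlyle Group...\"")
  )

  // CPU-style collect-limit: partitions are consumed in order, so the first
  // row returned is always the real header line.
  def firstRowOrdered: String = partitions.flatten.head

  // Shuffle-style collect-limit: whichever partition arrives first supplies
  // the row, so the "header" Spark sees may be an ordinary data row.
  def firstRowShuffled(firstPartition: Int): String =
    partitions(firstPartition).head
}
```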

@firestarman

Filed a tracking issue #7005. And I am going to mark it as done in the CSV epic issue.

@viadea

viadea commented Nov 9, 2022

@firestarman "2 nodes cluster" means 2 worker servers

@GaryShen2008

Moving out of the 22.12 target.
