If we enable spark.rapids.sql.exec.CollectLimitExec=true on a 2-node cluster, reading a CSV file with a header may get messed up. For example, let's use this example csv file; its format is like this:

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
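For a self-contained repro, a stand-in file can be written to the path used below from inside spark-shell on the cluster (see the repro steps next). This is only a sketch and not part of the original report: the real file has many more rows, including the Sci/Tech row that shows up in the wrong output further down.

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: write the sample rows above to the path the repro
// reads. On Dataproc an unqualified path resolves against HDFS, so this
// goes through Hadoop's FileSystem API rather than java.io.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path("/tmp/news_category_train.csv"), true)
out.write(
  """category,description
    |Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
    |Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
    |Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
    |""".stripMargin.getBytes("UTF-8"))
out.close()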
I can reproduce the issue on both Databricks and Dataproc. Here is the minimal repro on Dataproc:
After a two-node Dataproc cluster is ready, SSH to the master node and start spark-shell.
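The exact spark-shell launch command is not preserved in this report; assume it was started with the spark-rapids jar on the classpath and --conf spark.plugins=com.nvidia.spark.SQLPlugin. Inside the shell, enable the GPU CollectLimitExec, which is disabled by default:

// Turn on the GPU version of CollectLimitExec for this session;
// the plugin leaves it disabled by default.
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", "true")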
Run the read command below a couple of times. Sometimes it will show a result like this:
scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sci/Tech|Scot Wingo, author of eBay Strategies: 10 Proven Methods to Maximize Your eBay Business, will answer reader questions about the online marketplace. Wingo is president and chief executive of ChannelAdvisor, an eBay consignment franchise.|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
Note that the "column names" above are just the contents of an arbitrary data row. Sometimes it will show the correct result:
scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------+
|category| description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows
Env:
I can reproduce this with the latest 22.10 snapshot jar and also the 22.06 GA jar.
Actually this is not a CSV issue, but an issue in the GPU version of CollectLimitExec.
Due to a shuffle inside it, the GPU version of CollectLimitExec will not always return the file's head line as the first row, yet Spark treats the single row it returns as the CSV header line when inferring the CSV schema.
Details can be found in the earlier issue #882, especially the comment #882 (comment).
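Two workaround sketches follow from that explanation; neither is from the original thread, so treat them as assumptions rather than confirmed fixes.

// Sketch 1: fall back to Spark's CPU CollectLimitExec (the plugin's
// default), so the limit that fetches the header row keeps the file's
// first line first.
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", "false")

// Sketch 2: supply an explicit schema, which skips schema inference
// entirely; header=true then only drops the header line.
val df = spark.read
  .option("header", true)
  .schema("category STRING, description STRING")
  .csv("/tmp/news_category_train.csv")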