Describe the bug
An inner join loses data when one of the tables is written bucketed; see the example below for the specifics.
With the code below, the result is empty:
+---+--------+---+-----------+
|Url|SatCount|Url|ResolvedUrl|
+---+--------+---+-----------+
+---+--------+---+-----------+
It should be:
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|Url |SatCount|Url |ResolvedUrl |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|https://fooblog.com/blog-entry-116.html|
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|http://fooblog.com/blog-entry-116.html |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
Note that if I turn off the GPU shuffle (spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")) then it works and produces the correct results. I also tried changing the join to a CPU join, but it still failed, which makes me think it's not the join itself but something higher up in the plan. Throwing in an extra repartition after reading from the table also makes it work.
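As a rough sketch, the two workarounds look like this in spark-shell (`url_resolved` is the hypothetical bucketed table from the sketch under the repro steps below):

```scala
// Workaround A: disable the GPU shuffle exchange entirely; the join then
// produces the expected two rows.
spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")

// Workaround B: add an explicit repartition after reading the bucketed table,
// which replaces the bucketed output partitioning with a fresh shuffle.
import org.apache.spark.sql.functions.col
val bucketed = spark.table("url_resolved").repartition(col("Url"))
```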
Steps/Code to reproduce bug
$SPARK_HOME/bin/spark-shell --master local --jars ~/.m2/repository/ai/rapids/cudf/0.15-SNAPSHOT/cudf-0.15-SNAPSHOT-cuda10-1.jar,/home/tgraves/workspace/spark-rapids-another/dist/target/rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar --conf spark.driver.extraJavaOptions=-Duser.timezone=GMT --conf spark.sql.session.timeZone=UTC --conf spark.executor.extraJavaOptions=-Duser.timezone=GMT --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.explain="NOT_ON_GPU"
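A minimal sketch of the repro, run inside the spark-shell launched above, assuming a hypothetical table name `url_resolved` and an arbitrary bucket count (the column names match the expected output shown earlier):

```scala
// Minimal repro sketch: one side of the join is written bucketed on the
// join key. Table name and bucket count are assumptions for illustration.
import spark.implicits._

val counts = Seq(
  ("http://fooblog.com/blog-entry-116.html", 21)
).toDF("Url", "SatCount")

val resolved = Seq(
  ("http://fooblog.com/blog-entry-116.html", "https://fooblog.com/blog-entry-116.html"),
  ("http://fooblog.com/blog-entry-116.html", "http://fooblog.com/blog-entry-116.html")
).toDF("Url", "ResolvedUrl")

// Write one side bucketed on the join key, then read it back as a table.
resolved.write.bucketBy(2, "Url").sortBy("Url").saveAsTable("url_resolved")
val bucketed = spark.table("url_resolved")

// Inner join on Url; with the GPU shuffle enabled this comes back empty
// instead of returning the two expected rows.
counts.join(bucketed, counts("Url") === bucketed("Url"), "inner").show(false)
```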