
[BUG] Inner Join dropping data with bucketed Table input #780

Closed
tgravescs opened this issue Sep 16, 2020 · 1 comment · Fixed by #785
Labels: bug (Something isn't working), P0 (Must have for release)

tgravescs commented Sep 16, 2020

Describe the bug

Inner join is losing data when one of the tables is written as a bucketed table. See the example below for the specifics.

With the code below, the result is empty:
+---+--------+---+-----------+
|Url|SatCount|Url|ResolvedUrl|
+---+--------+---+-----------+
+---+--------+---+-----------+

It should be:
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|Url |SatCount|Url |ResolvedUrl |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|https://fooblog.com/blog-entry-116.html|
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|http://fooblog.com/blog-entry-116.html |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+

Note that if I turn off the GPU shuffle (spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")) then it works and produces the correct results. I tried changing the join to be a CPU join, but it still failed, which makes me think it's not the join itself but something higher up. Also, if you throw in an extra repartition after reading from the table, then it works as well.
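
For reference, a minimal sketch of the two workarounds mentioned above (assuming the spark-shell session from the repro below; the repartition count of 200 is arbitrary):

// Workaround 1: keep the shuffle exchange on the CPU, which presumably makes it
// use the same hashing as the CPU writer that produced the bucketed table.
spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")

// Workaround 2: add an explicit repartition after reading the bucketed table so
// both join inputs go through the same shuffle ($"Url" relies on spark-shell implicits).
val testurls = spark.sql("SELECT Url, SatCount FROM tgravesfeatureset").repartition(200, $"Url")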

Steps/Code to reproduce bug

$SPARK_HOME/bin/spark-shell --master local --jars ~/.m2/repository/ai/rapids/cudf/0.15-SNAPSHOT/cudf-0.15-SNAPSHOT-cuda10-1.jar,/home/tgraves/workspace/spark-rapids-another/dist/target/rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar --conf spark.driver.extraJavaOptions=-Duser.timezone=GMT --conf spark.sql.session.timeZone=UTC --conf spark.executor.extraJavaOptions=-Duser.timezone=GMT --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.explain="NOT_ON_GPU"

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val rdd = sc.parallelize(Seq(("http://fooblog.com/blog-entry-116.html", "https://fooblog.com/blog-entry-116.html"), ("http://fooblog.com/blog-entry-116.html", "http://fooblog.com/blog-entry-116.html")))

val resolved = rdd.toDF("Url", "ResolvedUrl")
val rdd2 = sc.parallelize(Seq(("http://fooblog.com/blog-entry-116.html", "21")))

val feature = rdd2.toDF("Url", "SatCount")
feature.write
  .bucketBy(4000, "Url")
  .sortBy("Url")
  .format("parquet")
  .mode("overwrite")
  .option("path", "/home/tgraves/tgravesfeatureset")
  .saveAsTable("tgravesfeatureset")

val testurls = spark.sql("SELECT Url, SatCount FROM tgravesfeatureset")
val res = testurls.join(resolved, testurls("Url") === resolved("Url"), "inner")
res.show(false)
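
If it helps triage: one way to see where the plan diverges (an assumption on my part, not verified for this report) is to print the physical plan and check whether the bucketed scan avoids an exchange while the other join input is shuffled on the GPU:

res.explain()
// Look for a GpuShuffleExchangeExec (or ShuffleExchangeExec) feeding one side of
// the join while the scan of tgravesfeatureset uses its bucketing with no exchange.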

tgravescs added the bug (Something isn't working) and P0 (Must have for release) labels on Sep 16, 2020
tgravescs (Collaborator, Author) commented:

So it seems we should not have been allowing this to run on the GPU, because the hashing won't be the same across CPU and GPU.
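
A rough sketch of my understanding of why the result comes back empty (assuming the GPU exchange does not use Spark's Murmur3 hash): the bucketed table's files are laid out by pmod(hash(Url), numBuckets) on the CPU side, so if the other join input is partitioned with a different hash function, matching Urls never meet in the same task:

import org.apache.spark.sql.functions.expr

// Bucket id the CPU writer assigns to each row: Murmur3 hash of the bucket
// column, mod the bucket count (4000 in the repro above).
spark.table("tgravesfeatureset")
  .withColumn("cpuBucket", expr("pmod(hash(Url), 4000)"))
  .show(false)
// If the exchange on the other side of the join partitions rows with a different
// hash, rows with the same Url land in different partitions and the join sees no matches.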

tgravescs self-assigned this Sep 16, 2020
sameerz added this to the Sep 14 - Sep 25 milestone Sep 16, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#780)

Signed-off-by: spark-rapids automation <[email protected]>