
[BUG] Inner Join dropping data with bucketed Table input #780

Closed
tgravescs opened this issue Sep 16, 2020 · 1 comment · Fixed by #785
Labels: bug (Something isn't working), P0 (Must have for release)

tgravescs commented Sep 16, 2020

Describe the bug

Inner join is losing data when one of the tables is written as a bucketed table. See the example below for the specifics.

With the code below, the result is empty:
+---+--------+---+-----------+
|Url|SatCount|Url|ResolvedUrl|
+---+--------+---+-----------+
+---+--------+---+-----------+

It should be:
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|Url |SatCount|Url |ResolvedUrl |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|https://fooblog.com/blog-entry-116.html|
|http://fooblog.com/blog-entry-116.html|21 |http://fooblog.com/blog-entry-116.html|http://fooblog.com/blog-entry-116.html |
+--------------------------------------+--------+--------------------------------------+---------------------------------------+

Note that if I turn off the GPU shuffle (spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")) then it works and produces the correct results. I tried changing the join to be a CPU join, but it still failed, which makes me think it's not the join itself but something higher up. Also, if you throw in an extra repartition after reading from the table, then it works as well.
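
For reference, a minimal sketch of the two workarounds mentioned above (assuming the spark-shell session from the repro below; the repartition count of 200 is arbitrary):

// Workaround 1: keep the shuffle exchange on the CPU, which presumably makes it
// use the same hashing as the CPU writer that produced the bucketed table.
spark.conf.set("spark.rapids.sql.exec.ShuffleExchangeExec", "false")

// Workaround 2: add an explicit repartition after reading the bucketed table so
// both join inputs go through the same shuffle ($"Url" relies on spark-shell implicits).
val testurls = spark.sql("SELECT Url, SatCount FROM tgravesfeatureset").repartition(200, $"Url")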

Steps/Code to reproduce bug

$SPARK_HOME/bin/spark-shell --master local --jars ~/.m2/repository/ai/rapids/cudf/0.15-SNAPSHOT/cudf-0.15-SNAPSHOT-cuda10-1.jar,/home/tgraves/workspace/spark-rapids-another/dist/target/rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar --conf spark.driver.extraJavaOptions=-Duser.timezone=GMT --conf spark.sql.session.timeZone=UTC --conf spark.executor.extraJavaOptions=-Duser.timezone=GMT --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.explain="NOT_ON_GPU"

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val rdd = sc.parallelize(Seq(("http://fooblog.com/blog-entry-116.html", "https://fooblog.com/blog-entry-116.html"), ("http://fooblog.com/blog-entry-116.html", "http://fooblog.com/blog-entry-116.html")))

val resolved = rdd.toDF("Url", "ResolvedUrl")
val rdd2 = sc.parallelize(Seq(("http://fooblog.com/blog-entry-116.html", "21")))

val feature = rdd2.toDF("Url", "SatCount")
feature.write
  .bucketBy(4000, "Url")
  .sortBy("Url")
  .format("parquet")
  .mode("overwrite")
  .option("path", "/home/tgraves/tgravesfeatureset")
  .saveAsTable("tgravesfeatureset")

val testurls = spark.sql("SELECT Url, SatCount FROM tgravesfeatureset")
val res = testurls.join(resolved, testurls("Url") === resolved("Url"), "inner")
res.show(false)
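
If it helps triage: one way to see where the plan diverges (an assumption on my part, not verified for this report) is to print the physical plan and check whether the bucketed scan avoids an exchange while the other join input is shuffled on the GPU:

res.explain()
// Look for a GpuShuffleExchangeExec (or ShuffleExchangeExec) feeding one side of
// the join while the scan of tgravesfeatureset uses its bucketing with no exchange.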

tgravescs added the bug (Something isn't working) and P0 (Must have for release) labels on Sep 16, 2020
tgravescs (Collaborator, Author) commented:

So it seems we should not have been allowing this to run on the GPU, because the hashing won't be the same across CPU and GPU.
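
A rough sketch of my understanding of why the result comes back empty (assuming the GPU exchange does not use Spark's Murmur3 hash): the bucketed table's files are laid out by pmod(hash(Url), numBuckets) on the CPU side, so if the other join input is partitioned with a different hash function, matching Urls never meet in the same task:

import org.apache.spark.sql.functions.expr

// Bucket id the CPU writer assigns to each row: Murmur3 hash of the bucket
// column, mod the bucket count (4000 in the repro above).
spark.table("tgravesfeatureset")
  .withColumn("cpuBucket", expr("pmod(hash(Url), 4000)"))
  .show(false)
// If the exchange on the other side of the join partitions rows with a different
// hash, rows with the same Url land in different partitions and the join sees no matches.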

tgravescs self-assigned this Sep 16, 2020
sameerz added this to the Sep 14 - Sep 25 milestone Sep 16, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#780)

Signed-off-by: spark-rapids automation <[email protected]>