
[BUG] Mismatch cases in json_tuple function when json items have leading zeroes #8607

Closed
thirtiseven opened this issue Jun 25, 2023 · 2 comments · Fixed by #8594
Labels: bug (Something isn't working), invalid (This doesn't seem right)

@thirtiseven (Collaborator)

Describe the bug
Found some CPU/GPU mismatch cases in the json_tuple function while changing the StringGen in IT. In these cases the results from the plugin look correct, but the corresponding CPU results are null. I guess that's a bug in Spark, but either way it leads to a mismatch.

Steps/Code to reproduce bug
Simply increasing the length makes the test fail:

@pytest.mark.parametrize('json_str_pattern', json_str_patterns, ids=idfn)
def test_json_tuple(json_str_pattern):
    gen = mk_json_str_gen(json_str_pattern)
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen, length=1000).selectExpr(
            'json_tuple(a, "a", "email", "owner", "b", "b$", "b$$")'),
        conf={'spark.sql.parser.escapedStringLiterals': 'true'})

Here is a case that can be reproduced in spark-shell:

scala> val data = spark.createDataset("""{"store": {"fruit": [{"weight":1,"type":"khtwotsqj"}], "bicycle":{"price":96.99,"color":"shfa"}},"email":"[email protected]","owner":"fzgbbtbm"}""" :: 
"""{"store": {"fruit": [{"weight":0,"type":"ooelwdeww"}], "bicycle":{"price":12.72,"color":"bcnt"}},"email":"[email protected]","owner":"zrpixdyb"}""" :: 
"""""" :: 
"""{"store": {"fruit": [{"weight":7,"type":"qdvtwxvyi"}], "bicycle":{"price":48.57,"color":"mpob"}},"email":"[email protected]","owner":"crdswdlu"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"sxtgvplzw"}], "bicycle":{"price":57.36,"color":"oxmf"}},"email":"[email protected]","owner":"qskumfra"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"yjtokpbma"}], "bicycle":{"price":04.02,"color":"lwlv"}},"email":"[email protected]","owner":"ccezwsja"}""" :: 
"""{"store": {"fruit": [{"weight":7,"type":"dokyjoisr"}], "bicycle":{"price":29.03,"color":"wrmq"}},"email":"[email protected]","owner":"prmhpbkd"}""" :: 
"""{"store": {"fruit": [{"weight":1,"type":"coyjmgtvt"}], "bicycle":{"price":64.23,"color":"mvif"}},"email":"[email protected]","owner":"gcvgoqzu"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"badtnynju"}], "bicycle":{"price":15.55,"color":"lzfu"}},"email":"[email protected]","owner":"ocdrlqus"}""" :: 
"""{"store": {"fruit": [{"weight":9,"type":"lgbzopaom"}], "bicycle":{"price":44.99,"color":"spew"}},"email":"[email protected]","owner":"ieuiziyq"}""" :: Nil)

scala> val df = data.toDF("c1")

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> df.select(json_tuple(col("c1"), "a", "email", "owner", "b", "b$", "b$$")).show(false)

CPU result:

+----+--------------------+--------+----+----+----+
|c0  |c1                  |c2      |c3  |c4  |c5  |
+----+--------------------+--------+----+----+----+
|null|[email protected]|fzgbbtbm|null|null|null|
|null|[email protected]|zrpixdyb|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|crdswdlu|null|null|null|
|null|[email protected]|qskumfra|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|prmhpbkd|null|null|null|
|null|[email protected]|gcvgoqzu|null|null|null|
|null|[email protected]|ocdrlqus|null|null|null|
|null|[email protected]|ieuiziyq|null|null|null|
+----+--------------------+--------+----+----+----+
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> df.select(json_tuple(col("c1"), "a", "email", "owner", "b", "b$", "b$$")).show(false)

GPU result:

23/06/25 19:44:34 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <GenerateExec> will run on GPU
    *Expression <JsonTuple> json_tuple(c1#4, a, email, owner, b, b$, b$$) will run on GPU
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> c1#4 could run on GPU

+----+--------------------+--------+----+----+----+
|c0  |c1                  |c2      |c3  |c4  |c5  |
+----+--------------------+--------+----+----+----+
|null|[email protected]|fzgbbtbm|null|null|null|
|null|[email protected]|zrpixdyb|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|crdswdlu|null|null|null|
|null|[email protected]|qskumfra|null|null|null|
|null|[email protected]|ccezwsja|null|null|null|
|null|[email protected]|prmhpbkd|null|null|null|
|null|[email protected]|gcvgoqzu|null|null|null|
|null|[email protected]|ocdrlqus|null|null|null|
|null|[email protected]|ieuiziyq|null|null|null|
+----+--------------------+--------+----+----+----+

Note: the strings generated by StringGen in test_json_tuple may not cover all cases because of #8593.

Expected behavior
The GPU should produce the same results as the CPU. If we don't plan to fix it, at least the related IT cases shouldn't fail because of it.

Environment details (please complete the following information)
Latest code (23.08) and Spark 3.3.0

thirtiseven added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jun 25, 2023
@thirtiseven (Collaborator, Author)

It seems to be caused by the possible leading zero in the price of the bicycle: changing the price in the missing line from 04.02 to 14.02 makes the line appear in the CPU results. I will avoid generating leading zeros in test cases as a workaround; a minimal repro isolating this is sketched below.

This issue also affects test_get_json_object.
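A minimal sketch of the suspected trigger, runnable in spark-shell on the CPU (assuming stock Spark JSON parsing, where a parse failure nulls out every requested field; the values are just illustrative):

import org.apache.spark.sql.functions.{col, json_tuple}
import spark.implicits._

val probe = Seq(
  """{"price":04.02,"owner":"x"}""",  // leading zero: expected to come back as null on the CPU
  """{"price":14.02,"owner":"y"}"""   // no leading zero: expected to return "y"
).toDF("c1")

probe.select(json_tuple(col("c1"), "owner")).show(false)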

thirtiseven changed the title from "[BUG] CPU/GPU mismatch cases in json_tuple function" to "[BUG] Mismatch cases in json_tuple function when json items have leading zeroes" on Jun 26, 2023
@thirtiseven (Collaborator, Author)

So the RAPIDS plugin strips leading zeros from all numbers, but allowNumericLeadingZeros is set to false in Spark, so such records fail to parse on the CPU. Simply avoiding generating leading zeros should fix it.

Reference: https://github.com/NVIDIA/spark-rapids/blob/branch-23.08/docs/compatibility.md#json-options
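The doc's point can be double-checked with from_json, which (unlike json_tuple) accepts parser options; a sketch assuming the standard allowNumericLeadingZeros JSON option:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StructType}
import spark.implicits._

val schema = new StructType().add("price", DoubleType)
val df = Seq("""{"price":04.02}""").toDF("c1")

// Default options: the leading zero fails to parse, so the struct is null.
df.select(from_json(col("c1"), schema)).show(false)

// With the option enabled, the same record should parse to 4.02.
df.select(from_json(col("c1"), schema,
  Map("allowNumericLeadingZeros" -> "true"))).show(false)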
