
[BUG] Mismatch cases in json_tuple function when json items have leading zeroes #8607

Closed
thirtiseven opened this issue Jun 25, 2023 · 2 comments · Fixed by #8594
Labels: bug (Something isn't working), invalid (This doesn't seem right)

@thirtiseven (Collaborator)

Describe the bug
Found some CPU/GPU mismatch cases in the json_tuple function while changing the StringGen in IT. In these cases the results from the plugin look correct, but the corresponding CPU results are null. I guess that's a bug in Spark, but either way it leads to a mismatch.

Steps/Code to reproduce bug
Simply increasing the length makes the test fail:

@pytest.mark.parametrize('json_str_pattern', json_str_patterns, ids=idfn)
def test_json_tuple(json_str_pattern):
    gen = mk_json_str_gen(json_str_pattern)
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen, length=1000).selectExpr(
            'json_tuple(a, "a", "email", "owner", "b", "b$", "b$$")'),
        conf={'spark.sql.parser.escapedStringLiterals': 'true'})

Here is a case that can be reproduced in spark-shell:

scala> val data = spark.createDataset("""{"store": {"fruit": [{"weight":1,"type":"khtwotsqj"}], "bicycle":{"price":96.99,"color":"shfa"}},"email":"[email protected]","owner":"fzgbbtbm"}""" :: 
"""{"store": {"fruit": [{"weight":0,"type":"ooelwdeww"}], "bicycle":{"price":12.72,"color":"bcnt"}},"email":"[email protected]","owner":"zrpixdyb"}""" :: 
"""""" :: 
"""{"store": {"fruit": [{"weight":7,"type":"qdvtwxvyi"}], "bicycle":{"price":48.57,"color":"mpob"}},"email":"[email protected]","owner":"crdswdlu"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"sxtgvplzw"}], "bicycle":{"price":57.36,"color":"oxmf"}},"email":"[email protected]","owner":"qskumfra"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"yjtokpbma"}], "bicycle":{"price":04.02,"color":"lwlv"}},"email":"[email protected]","owner":"ccezwsja"}""" :: 
"""{"store": {"fruit": [{"weight":7,"type":"dokyjoisr"}], "bicycle":{"price":29.03,"color":"wrmq"}},"email":"[email protected]","owner":"prmhpbkd"}""" :: 
"""{"store": {"fruit": [{"weight":1,"type":"coyjmgtvt"}], "bicycle":{"price":64.23,"color":"mvif"}},"email":"[email protected]","owner":"gcvgoqzu"}""" :: 
"""{"store": {"fruit": [{"weight":2,"type":"badtnynju"}], "bicycle":{"price":15.55,"color":"lzfu"}},"email":"[email protected]","owner":"ocdrlqus"}""" :: 
"""{"store": {"fruit": [{"weight":9,"type":"lgbzopaom"}], "bicycle":{"price":44.99,"color":"spew"}},"email":"[email protected]","owner":"ieuiziyq"}""" :: Nil)

scala> val df = data.toDF("c1")

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> df.select(json_tuple(col("c1"), "a", "email", "owner", "b", "b$", "b$$")).show(false)

CPU result:

+----+--------------------+--------+----+----+----+
|c0  |c1                  |c2      |c3  |c4  |c5  |
+----+--------------------+--------+----+----+----+
|null|[email protected]|fzgbbtbm|null|null|null|
|null|[email protected]|zrpixdyb|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|crdswdlu|null|null|null|
|null|[email protected]|qskumfra|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|prmhpbkd|null|null|null|
|null|[email protected]|gcvgoqzu|null|null|null|
|null|[email protected]|ocdrlqus|null|null|null|
|null|[email protected]|ieuiziyq|null|null|null|
+----+--------------------+--------+----+----+----+
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> df.select(json_tuple(col("c1"), "a", "email", "owner", "b", "b$", "b$$")).show(false)

GPU result:

23/06/25 19:44:34 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <GenerateExec> will run on GPU
    *Expression <JsonTuple> json_tuple(c1#4, a, email, owner, b, b$, b$$) will run on GPU
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> c1#4 could run on GPU

+----+--------------------+--------+----+----+----+
|c0  |c1                  |c2      |c3  |c4  |c5  |
+----+--------------------+--------+----+----+----+
|null|[email protected]|fzgbbtbm|null|null|null|
|null|[email protected]|zrpixdyb|null|null|null|
|null|null                |null    |null|null|null|
|null|[email protected]|crdswdlu|null|null|null|
|null|[email protected]|qskumfra|null|null|null|
|null|[email protected]|ccezwsja|null|null|null|
|null|[email protected]|prmhpbkd|null|null|null|
|null|[email protected]|gcvgoqzu|null|null|null|
|null|[email protected]|ocdrlqus|null|null|null|
|null|[email protected]|ieuiziyq|null|null|null|
+----+--------------------+--------+----+----+----+

Note: the strings generated by StringGen in test_json_tuple may not cover all cases because of #8593.

Expected behavior
The GPU should produce the same results as the CPU. If we don't plan to fix it, at least the related IT cases shouldn't fail because of it.

Environment details (please complete the following information)
Latest code (23.08) and Spark 3.3.0

thirtiseven added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jun 25, 2023
@thirtiseven (Collaborator, Author)

It seems to be caused by the possible leading zero in the price of the bicycle: changing the price in the missing line from 04.02 to 14.02 makes the line appear in the CPU results. I will avoid generating leading zeros in test cases as a workaround; a minimal repro isolating this is sketched below.

This issue also affects test_get_json_object.
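A minimal sketch of the suspected trigger, runnable in spark-shell on the CPU (assuming stock Spark JSON parsing, where a parse failure nulls out every requested field; the values are just illustrative):

import org.apache.spark.sql.functions.{col, json_tuple}
import spark.implicits._

val probe = Seq(
  """{"price":04.02,"owner":"x"}""",  // leading zero: expected to come back as null on the CPU
  """{"price":14.02,"owner":"y"}"""   // no leading zero: expected to return "y"
).toDF("c1")

probe.select(json_tuple(col("c1"), "owner")).show(false)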

thirtiseven changed the title from "[BUG] CPU/GPU mismatch cases in json_tuple function" to "[BUG] Mismatch cases in json_tuple function when json items have leading zeroes" on Jun 26, 2023
@thirtiseven (Collaborator, Author)

So the RAPIDS plugin strips leading zeros from all numbers, but allowNumericLeadingZeros is set to false in Spark, so such records fail to parse on the CPU. Simply avoiding generating leading zeros should fix it.

Reference: https://github.com/NVIDIA/spark-rapids/blob/branch-23.08/docs/compatibility.md#json-options
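The doc's point can be double-checked with from_json, which (unlike json_tuple) accepts parser options; a sketch assuming the standard allowNumericLeadingZeros JSON option:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StructType}
import spark.implicits._

val schema = new StructType().add("price", DoubleType)
val df = Seq("""{"price":04.02}""").toDF("c1")

// Default options: the leading zero fails to parse, so the struct is null.
df.select(from_json(col("c1"), schema)).show(false)

// With the option enabled, the same record should parse to 4.02.
df.select(from_json(col("c1"), schema,
  Map("allowNumericLeadingZeros" -> "true"))).show(false)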
