[BUG] orc_write_test.py::test_write_sql_save_table failed intermittently w/ pascal GPU #8336

pxLi · 2023-05-22T01:06:08Z

Describe the bug
We found intermittent failures in the integration test run on a Pascal GPU.
Job: rapids_it-PASCAL, build IDs: 153, 150 (failed twice in the 5 most recent runs)

The CPU and GPU outputs are mismatched:

[2023-05-18T13:41:12.893Z]         elif (t is dict):
[2023-05-18T13:41:12.893Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2023-05-18T13:41:12.893Z]             # so sort the items to do our best with ignoring the order of dicts
[2023-05-18T13:41:12.893Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2023-05-18T13:41:12.893Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2023-05-18T13:41:12.893Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2023-05-18T13:41:12.893Z]         elif (t is int):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif (t is float):
[2023-05-18T13:41:12.893Z]             if (math.isnan(cpu)):
[2023-05-18T13:41:12.893Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]             else:
[2023-05-18T13:41:12.893Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, str):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, datetime):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, date):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, bool):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, Decimal):
[2023-05-18T13:41:12.893Z] >           assert cpu == gpu, "GPU and CPU decimal values are different at {}".format(path)
[2023-05-18T13:41:12.893Z] E           AssertionError: GPU and CPU decimal values are different at [1264, '_c0', 'child10']
[2023-05-18T13:41:12.893Z] 
[2023-05-18T13:41:12.893Z] ../../src/main/python/asserts.py:91: AssertionError
....

FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MICROS-[Struct(['child0', Byte],
['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],
['child8', Date],['child9', Timestamp],['child10', Decimal(7,3)],['child11', Decimal(12,2)],['child12', Decimal(20,2)]), Struct(['child0', Byte],['child1', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],
['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Decimal(7,3)],
['child11', Decimal(12,2)],['child12', Decimal(20,2)])]), Struct(['child0', Array(Short)],['child1', Double])]][INJECT_OOM]

Steps/Code to reproduce bug
Run the integration tests on a Pascal GPU.

ttnghia · 2023-05-26T22:23:22Z

Sorry, I have a Tesla P40 and ran that test 10 times (built from the latest 23.06 commit, Spark 3.3) but could not see any failure. Can you share more details about the environment in which it reproduces, please?

pxLi · 2023-05-29T01:07:27Z

> Sorry, I have a Tesla P40 and ran that test 10 times (built from the latest 23.06 commit, Spark 3.3) but could not see any failure. Can you share more details about the environment in which it reproduces, please?

We saw this in our nightly CI job rapids_it-PASCAL, build IDs 153 and 150. From our monitoring, it is not always reproducible.

Commands used in this run:

export PARALLELISM=3
BASE_SPARK_SUBMIT_ARGS=" --conf spark.sql.adaptive.enabled=true" bash jenkins/spark-tests.sh

ttnghia · 2023-05-31T00:44:59Z

Just checked the log in CI. rapids_it-PASCAL also uses a P40 GPU, the same as in my test.

ttnghia · 2023-05-31T02:24:43Z

Okay, now I can reproduce it. The issue shows up very rarely: in each batch of 10 runs it may show up, but only once or twice. I have now started to investigate.
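For anyone else trying to reproduce an intermittent failure like this, repeated runs can be automated with a small harness. The following is only a sketch: `count_failures` is a hypothetical helper, not part of the plugin's test tooling, and the demo command is a placeholder that always succeeds. Substitute whatever test invocation applies in your environment.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a failing run is expected, so do not raise
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Placeholder command (an assumption for illustration): a trivial Python
# process that always exits with status 0.
demo_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
print(count_failures(demo_cmd, runs=3))
```

Counting exit codes keeps the harness independent of the test framework, at the cost of not distinguishing this particular assertion failure from unrelated errors.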

ttnghia · 2023-06-01T16:12:37Z

I ran the test 400 times and got 3 failures, a failure rate of 0.75%. I dumped the input data of the failed tests and started to investigate with libcudf.
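As a rough sanity check on these numbers, the sketch below computes the observed failure rate and the chance that a batch of 10 runs shows no failure at all. It assumes each run fails independently with the same probability, which nothing in this thread establishes, so treat the result as an estimate only.

```python
failures, runs = 3, 400

# Observed per-run failure rate from the 400-run experiment.
rate = failures / runs


def prob_no_failure(rate, n):
    """Probability that n independent runs all pass at the given failure rate."""
    return (1.0 - rate) ** n


p10 = prob_no_failure(rate, 10)
print(f"observed failure rate: {rate:.4%}")
print(f"chance that 10 runs all pass: {p10:.1%}")
```

Under that assumption, a clean batch of 10 runs is the expected outcome rather than evidence that the bug is absent, which is consistent with the first reproduction attempt seeing no failures.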

ttnghia · 2023-06-06T22:14:36Z

I ran a test 100,000 times in cudf using the same input table that caused the failure in the plugin test. All runs passed.

Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from MyTest
[ RUN      ] MyTest.Test
num rows: 2048, num cols: 3
test iter: 0
test iter: 1
...
test iter: 99999
[       OK ] MyTest.Test (2517872 ms)
[----------] 1 test from MyTest (2517872 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2517872 ms total)
[  PASSED  ] 1 test.

So I will probably deprioritize this a bit and think of a better way to test/debug it at lower priority.
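To put the 100,000 clean runs in perspective, the sketch below compares them with the 3-in-400 rate seen in the plugin test. It assumes independent runs with a constant failure probability, and the two tests are not the same test, so the comparison is only suggestive.

```python
import math

plugin_rate = 3 / 400   # failure rate observed in the plugin test
cudf_runs = 100_000     # standalone cudf iterations, all of which passed

# Log10 of the probability that every cudf run passes if cudf failed at the
# plugin rate. Computed in log space because the value underflows a float.
log10_p_all_pass = cudf_runs * math.log10(1.0 - plugin_rate)

# One-sided 95% upper bound on the cudf failure rate given zero failures:
# solve (1 - p) ** n = 0.05 for p.
upper_bound = 1.0 - 0.05 ** (1.0 / cudf_runs)

print(f"log10 P(all pass at plugin rate): {log10_p_all_pass:.1f}")
print(f"95% upper bound on cudf failure rate: {upper_bound:.2e}")
```

Under these assumptions, the standalone cudf test almost certainly does not fail at the plugin's rate, which suggests the trigger depends on something outside that isolated test (for example the surrounding plugin or test environment) rather than on the input table alone.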

mattahrens · 2023-12-01T19:39:12Z

Removing Pascal support in 24.02

pxLi added the bug, ? - Needs Triage, and test labels May 22, 2023

ttnghia self-assigned this May 23, 2023

ttnghia removed the ? - Needs Triage label May 23, 2023

pxLi mentioned this issue Dec 1, 2023

[FEA] Remove Pascal support #9692

Closed

mattahrens closed this as completed Dec 1, 2023

sameerz added the wontfix label Dec 1, 2023
