[BUG] orc_write_test.py::test_write_sql_save_table failed intermittently w/ pascal GPU #8336

Closed
pxLi opened this issue May 22, 2023 · 7 comments
Assignees: ttnghia
Labels: bug (Something isn't working), test (Only impacts tests), wontfix (This will not be worked on)

Comments

@pxLi
Collaborator

pxLi commented May 22, 2023

Describe the bug
We found intermittent failures in the integration test run on a Pascal GPU
(rapids_it-PASCAL, build IDs 153 and 150; it failed twice in the last 5 runs).

The CPU and GPU outputs do not match:

[2023-05-18T13:41:12.893Z]         elif (t is dict):
[2023-05-18T13:41:12.893Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2023-05-18T13:41:12.893Z]             # so sort the items to do our best with ignoring the order of dicts
[2023-05-18T13:41:12.893Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2023-05-18T13:41:12.893Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2023-05-18T13:41:12.893Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2023-05-18T13:41:12.893Z]         elif (t is int):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif (t is float):
[2023-05-18T13:41:12.893Z]             if (math.isnan(cpu)):
[2023-05-18T13:41:12.893Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]             else:
[2023-05-18T13:41:12.893Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, str):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, datetime):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, date):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, bool):
[2023-05-18T13:41:12.893Z]             assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
[2023-05-18T13:41:12.893Z]         elif isinstance(cpu, Decimal):
[2023-05-18T13:41:12.893Z] >           assert cpu == gpu, "GPU and CPU decimal values are different at {}".format(path)
[2023-05-18T13:41:12.893Z] E           AssertionError: GPU and CPU decimal values are different at [1264, '_c0', 'child10']
[2023-05-18T13:41:12.893Z] 
[2023-05-18T13:41:12.893Z] ../../src/main/python/asserts.py:91: AssertionError
....

FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MICROS-[Struct(['child0', Byte],
['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],
['child8', Date],['child9', Timestamp],['child10', Decimal(7,3)],['child11', Decimal(12,2)],['child12', Decimal(20,2)]), Struct(['child0', Byte],['child1', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],
['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Decimal(7,3)],
['child11', Decimal(12,2)],['child12', Decimal(20,2)])]), Struct(['child0', Array(Short)],['child1', Double])]][INJECT_OOM]
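
A side observation on the traceback above, unrelated to the Decimal mismatch itself: the dict branch calls `list(...).sort(key=_RowCmp)`. In Python, `list.sort()` sorts in place and returns `None`, so both `cpu_items` and `gpu_items` would be `None` and the following comparison cannot see the dict contents. The sketch below illustrates the difference. It uses `key=repr` as a stand-in for `_RowCmp`, and the helper names are hypothetical, not taken from `asserts.py`.

```python
from decimal import Decimal


def sort_in_place_result(d):
    # Mirrors the pattern in the traceback: list.sort() sorts in place
    # and returns None, so the sorted list is discarded.
    # key=repr is a stand-in for _RowCmp (assumption).
    return list(d.items()).sort(key=repr)


def sorted_items(d):
    # sorted() returns a new sorted list, which is what the comparison needs.
    return sorted(d.items(), key=repr)


cpu = {"a": Decimal("1.000"), "b": Decimal("2.500")}
gpu = {"b": Decimal("9.999"), "a": Decimal("1.000")}

# Both results are None, so comparing them hides the differing value for "b".
print(sort_in_place_result(cpu), sort_in_place_result(gpu))

# With sorted(), the mismatch is visible.
print(sorted_items(cpu) == sorted_items(gpu))
```

If this reading is right, switching to `sorted(...)` would make the map comparison effective, though it would not change the Decimal failure reported here.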

Steps/Code to reproduce bug
Run the integration tests with a Pascal GPU.

pxLi added the bug, ? - Needs Triage, and test labels May 22, 2023
ttnghia self-assigned this May 23, 2023
ttnghia removed the ? - Needs Triage label May 23, 2023
@ttnghia
Collaborator

ttnghia commented May 26, 2023

Sorry, I have a Tesla P40 and ran that test 10 times (with the latest 23.06 commit compiled, Spark 3.3) but could not see any failure. Can you share more details about the environment where it reproduces, please?

@pxLi
Collaborator Author

pxLi commented May 29, 2023

> Sorry, I have a Tesla P40 and ran that test 10 times (with the latest 23.06 commit compiled, Spark 3.3) but could not see any failure. Can you share more details about the environment where it reproduces, please?

We saw this in our nightly CI job rapids_it-PASCAL (build IDs 153 and 150). From our monitoring, it is not always reproducible.

Commands used in that job:

export PARALLELISM=3
BASE_SPARK_SUBMIT_ARGS=" --conf spark.sql.adaptive.enabled=true" bash jenkins/spark-tests.sh

@ttnghia
Collaborator

ttnghia commented May 31, 2023

I just checked the CI log. rapids_it-PASCAL also uses a P40 GPU, the same as in my test.

@ttnghia
Collaborator

ttnghia commented May 31, 2023

Okay, now I can reproduce it. The issue shows up very rarely: in each batch of 10 runs, it may show up just once or twice. I have started investigating.
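
For anyone who wants to repeat this kind of batch run, here is a minimal sketch of a harness that reruns a command and counts the failing runs. The function name `count_failures` and the always-passing demo command are assumptions for illustration. The thread does not show how the runs were actually scripted.

```python
import os
import subprocess
import sys


def count_failures(cmd, runs, extra_env=None):
    """Run `cmd` `runs` times and return how many runs exited nonzero."""
    env = dict(os.environ)
    env.update(extra_env or {})
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in command that always succeeds (hypothetical, for demonstration only).
demo_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
print(count_failures(demo_cmd, runs=3))
```

To drive the CI command reported earlier in this thread, one could pass `["bash", "jenkins/spark-tests.sh"]` with `extra_env={"PARALLELISM": "3", "BASE_SPARK_SUBMIT_ARGS": " --conf spark.sql.adaptive.enabled=true"}`. Note that this runs the whole script rather than a single test. How to select only `test_write_sql_save_table` is not shown in the thread.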

@ttnghia
Collaborator

ttnghia commented Jun 1, 2023

I ran the test 400 times and got 3 failures, so the failure rate is quite small (3/400, or 0.75%). I dumped the input data of the failed tests and started investigating with libcudf.
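
As a rough aid for reading these numbers, the sketch below estimates how likely a batch of runs is to show no failure at all, and how many runs are needed to see at least one failure with a given confidence. It assumes every run is independent with a constant failure probability equal to the observed 3/400. That is a simplifying assumption, not something established in this thread, and the function names are illustrative.

```python
import math


def p_no_failure(failure_rate, runs):
    # Probability that `runs` independent runs all pass.
    return (1.0 - failure_rate) ** runs


def runs_for_confidence(failure_rate, confidence):
    # Smallest n with 1 - (1 - failure_rate)**n >= confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


observed_rate = 3 / 400  # 3 failures in 400 runs, as reported above

print(f"observed failure rate: {observed_rate:.2%}")
print(f"P(no failure in 10 runs): {p_no_failure(observed_rate, 10):.3f}")
print(f"runs for a 95% chance of >= 1 failure: {runs_for_confidence(observed_rate, 0.95)}")
```

Under that assumption, a batch of 10 runs passes cleanly roughly 93% of the time, and on the order of 400 runs are needed for a 95% chance of hitting the failure at least once, which fits how hard the issue was to reproduce at first.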

@ttnghia
Collaborator

ttnghia commented Jun 6, 2023

I ran a test 100,000 times in cudf using the same input table that caused the failure in the plugin test. All runs passed.

Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from MyTest
[ RUN      ] MyTest.Test
num rows: 2048, num cols: 3
test iter: 0
test iter: 1
...
test iter: 99999
[       OK ] MyTest.Test (2517872 ms)
[----------] 1 test from MyTest (2517872 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2517872 ms total)
[  PASSED  ] 1 test.

So I will probably deprioritize this a bit and think of a better way to test/debug it at lower priority.

@mattahrens
Collaborator

Removing Pascal support in 24.02

sameerz added the wontfix label Dec 1, 2023

4 participants