[BUG] Loading timestamps from ORC sometimes returns incorrect results #6947

Closed
jlowe opened this issue Dec 8, 2020 · 0 comments · Fixed by #6959
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments


jlowe commented Dec 8, 2020

Describe the bug
Some of our Spark integration tests for ORC have been failing with timestamp data mismatches. The test writes randomly generated data (with a deterministic RNG seed) using Spark on the CPU, then reads the data back with Spark on the CPU and again with Spark using the RAPIDS Accelerator plugin. Intermittently, the two reads disagree on timestamp data, e.g.:

03:58:29  cpu = datetime.datetime(1898, 5, 12, 21, 45, 8, 426000)
03:58:29  gpu = datetime.datetime(1898, 5, 13, 6, 47, 12, 426000)
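
For reference, the gap between the two values above can be computed directly. This is only arithmetic on the logged values, not a diagnosis of the cause:

```python
from datetime import datetime

# Values copied from the failing test output above.
cpu = datetime(1898, 5, 12, 21, 45, 8, 426000)
gpu = datetime(1898, 5, 13, 6, 47, 12, 426000)

delta = gpu - cpu
print(delta)                  # 9:02:04
print(delta.total_seconds())  # 32524.0
print(delta.microseconds)     # 0
```

The two readings differ by a whole number of seconds (9 h 2 min 4 s) and agree in the sub-second part.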

Steps/Code to reproduce bug
This has been hard to reproduce outside of the Spark integration tests, and even outside of the Jenkins pipelines that run them. Our current theory is that it is triggered by using the GMT timezone on a CentOS 7 machine; we have never seen it fail on an Ubuntu 18.04 machine with the GMT timezone. @nvdbaranec has had some success isolating the issue.

Expected behavior
ORC timestamps are loaded via cudf the same way they are loaded by Spark for the GMT timezone.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Dec 8, 2020
@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Dec 8, 2020
@vuule vuule added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Dec 10, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 12, 2020
…6959)

Fixes #6947 

When a TZif file has no transitions (e.g. GMT), `build_timezone_transition_table` performs an out-of-bounds read, which leads to undefined behavior and intermittent failures.
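
One way to check whether a given zone file falls into the "no transitions" case is to read the `timecnt` field from the TZif header. The sketch below is a minimal Python illustration and is not part of cudf. The function names are hypothetical. It inspects only the first (version 1) header; version 2+ files carry a second header after the version 1 data block, and its counts may differ.

```python
import struct

# TZif header layout: 4-byte magic "TZif", 1-byte version, 15 reserved bytes,
# then six big-endian 32-bit counts (44 bytes in total).
TZIF_HEADER = struct.Struct(">4sc15s6I")


def tzif_header_counts(data: bytes) -> dict:
    """Parse the first TZif header and return its six count fields."""
    if len(data) < TZIF_HEADER.size:
        raise ValueError("too short to contain a TZif header")
    (magic, _version, _reserved,
     isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt) = TZIF_HEADER.unpack_from(data)
    if magic != b"TZif":
        raise ValueError("not a TZif file")
    return {
        "isutcnt": isutcnt,
        "isstdcnt": isstdcnt,
        "leapcnt": leapcnt,
        "timecnt": timecnt,   # number of transition times
        "typecnt": typecnt,   # number of local time type records
        "charcnt": charcnt,
    }


def has_no_transitions(data: bytes) -> bool:
    """True when the first header declares zero transition times."""
    return tzif_header_counts(data)["timecnt"] == 0
```

On a typical Linux system this could be applied to the bytes of a zone file, for example `open("/usr/share/zoneinfo/GMT", "rb").read()`; that path is an assumption and varies by distribution.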

This PR makes two changes to behavior:
1. When there are no transitions, the ancient rule is initialized from the first time offset (instead of the first transition rule, which does not exist in this case).
2. When there are no transitions and the time offset is zero, an empty table is returned (to avoid using a no-op table in CUDA).
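
The two changes above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the libcudf implementation: the function name, the `(time, offset)` table layout, the `ANCIENT_TIME` sentinel, and the reading of "first transition rule" as the offset referenced by the first transition are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical sentinel marking the rule that applies before any transition.
ANCIENT_TIME = -(2 ** 63)


def build_transition_table_sketch(
    transition_times: List[int],
    transition_type_idx: List[int],
    utc_offsets: List[int],
) -> List[Tuple[int, int]]:
    """Return [(time, utc_offset), ...] with the 'ancient' rule first."""
    if not utc_offsets:
        raise ValueError("need at least one time offset")

    if not transition_times:
        # Change 1: with no transitions, take the ancient rule from the first
        # time offset instead of indexing a first transition that does not exist.
        ancient_offset = utc_offsets[0]
        # Change 2: a zero offset with no transitions would make the table a
        # no-op, so return an empty table instead.
        if ancient_offset == 0:
            return []
    else:
        # Assumption: the ancient rule uses the offset of the first transition.
        ancient_offset = utc_offsets[transition_type_idx[0]]

    table = [(ANCIENT_TIME, ancient_offset)]
    table.extend(
        (t, utc_offsets[i]) for t, i in zip(transition_times, transition_type_idx)
    )
    return table
```

In this sketch a GMT-like input (`[]`, `[]`, `[0]`) yields an empty table, so a caller can skip the timezone adjustment entirely.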

Authors:
  - vuule
  - Vukasin Milovanovic

Approvers:
  - GALI PREM SAGAR
  - null
  - Ram (Ramakrishna Prabhu)
  - David

URL: #6959