[BUG] Loading timestamps from ORC sometimes returns incorrect results #6947

jlowe · 2020-12-08T23:20:38Z

Describe the bug
Some of our Spark integration tests for ORC have been failing with timestamp data mismatches. The test writes out randomly generated data (with a deterministic RNG seed) using Spark on the CPU, then reads the data back with Spark on the CPU and with Spark using the RAPIDS Accelerator plugin. Intermittently, the two reads disagree on timestamp data, e.g.:

03:58:29  cpu = datetime.datetime(1898, 5, 12, 21, 45, 8, 426000)
03:58:29  gpu = datetime.datetime(1898, 5, 13, 6, 47, 12, 426000)
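For reference, the size of the mismatch in the log above can be computed directly from the two values. This is only arithmetic on the logged timestamps; it does not by itself say where the offset comes from.

```python
from datetime import datetime

# The two values copied from the failing test log above.
cpu = datetime(1898, 5, 12, 21, 45, 8, 426000)
gpu = datetime(1898, 5, 13, 6, 47, 12, 426000)

# Difference between the GPU and CPU results.
delta = gpu - cpu
print(delta)                  # 9:02:04
print(delta.total_seconds())  # 32524.0
```

The sub-second part (426000 microseconds) is identical in both values; the GPU result is ahead by 9 hours, 2 minutes and 4 seconds.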

Steps/Code to reproduce bug
This has been hard to reproduce outside of the Spark integration tests, and even outside of the Jenkins pipelines for those tests. We currently believe it is triggered by using the GMT timezone on a CentOS 7 machine; we have never seen it fail on an Ubuntu 18.04 machine with the GMT timezone. @nvdbaranec has had some success trying to isolate the issue.
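Since the fix referenced in this issue attributes the failure to TZif files that contain no transitions, one way to check a given machine is to read the transition count from its zone file. The following is a minimal sketch, assuming the header layout described in RFC 8536 (4-byte magic, 1-byte version, 15 reserved bytes, six big-endian 32-bit counts, and a second header after the v1 data block in version 2+ files). The names `read_header` and `transition_count` are hypothetical and this is not the parser cudf uses.

```python
import struct

# magic, version, 15 reserved bytes, then six big-endian uint32 counts:
# isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt
HEADER = struct.Struct(">4sc15x6I")


def read_header(data: bytes, offset: int = 0):
    """Return (version, counts) for the TZif header found at `offset`."""
    magic, version, isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt = (
        HEADER.unpack_from(data, offset)
    )
    if magic != b"TZif":
        raise ValueError("not a TZif header")
    counts = {
        "isutcnt": isutcnt, "isstdcnt": isstdcnt, "leapcnt": leapcnt,
        "timecnt": timecnt, "typecnt": typecnt, "charcnt": charcnt,
    }
    return version, counts


def transition_count(data: bytes) -> int:
    """Number of transitions, taken from the v2+ header when one is present."""
    version, c = read_header(data)
    if version != b"\x00":
        # Skip the v1 data block (4-byte times, 8-byte leap records).
        v1_body = (c["timecnt"] * 5 + c["typecnt"] * 6 + c["charcnt"]
                   + c["leapcnt"] * 8 + c["isstdcnt"] + c["isutcnt"])
        _, c = read_header(data, HEADER.size + v1_body)
    return c["timecnt"]


# Synthetic v1-only file with one time type, a 4-byte abbreviation, no transitions.
synthetic = HEADER.pack(b"TZif", b"\x00", 0, 0, 0, 0, 1, 4) + bytes(6 + 4)
print(transition_count(synthetic))
```

To inspect a real zone, pass the bytes of the zone file instead of `synthetic`; a path such as `/usr/share/zoneinfo/GMT` is typical on Linux but is an assumption about the machine, not something stated in this issue.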

Expected behavior
ORC timestamps are loaded via cudf the same way they are loaded by Spark for the GMT timezone.


…6959)

Fixes #6947

When TZif file has no transitions (e.g. GMT), `build_timezone_transition_table` has an out-of-bounds read that leads to undefined behavior and intermittent issues. This PR makes two changes to behavior:
1. When there are no transitions, the ancient rule is initialized from the first time offset (instead of the first transition rule, which does not exist in this case).
2. When there are no transitions and the time offset is zero, an empty table is returned (avoid using a no-op table in CUDA).

Authors:
- vuule <[email protected]>
- Vukasin Milovanovic <[email protected]>

Approvers:
- GALI PREM SAGAR
- null
- Ram (Ramakrishna Prabhu)
- David

URL: #6959
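The two behavior changes described in the commit message can be illustrated with a simplified model. This is a sketch under stated assumptions, not the libcudf implementation: `ZoneData`, `ANCIENT`, and this Python `build_transition_table` are hypothetical stand-ins, and the table is modeled as a plain list of (time, UTC offset) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Sentinel meaning "before any recorded transition" (an illustrative choice).
ANCIENT = -(2**63)


@dataclass
class ZoneData:
    """Hypothetical, simplified view of parsed TZif contents."""
    transition_times: List[int]   # UTC seconds at which the rule changes
    transition_types: List[int]   # index into utc_offsets for each transition
    utc_offsets: List[int]        # UTC offset in seconds for each time type


def build_transition_table(zone: ZoneData) -> List[Tuple[int, int]]:
    """Return (time, offset) pairs, handling zones that have no transitions."""
    if not zone.utc_offsets:
        raise ValueError("zone has no time types")

    if not zone.transition_times:
        # Change 1: with no transitions there is no first transition rule to
        # read, so the ancient rule comes from the first time offset.
        first_offset = zone.utc_offsets[0]
        if first_offset == 0:
            # Change 2: a zero offset would make the table a no-op, so return
            # an empty table instead.
            return []
        return [(ANCIENT, first_offset)]

    # With transitions, the ancient rule is taken from the first transition.
    table = [(ANCIENT, zone.utc_offsets[zone.transition_types[0]])]
    for time, type_idx in zip(zone.transition_times, zone.transition_types):
        table.append((time, zone.utc_offsets[type_idx]))
    return table


# A GMT-like zone: no transitions, single zero offset.
print(build_transition_table(ZoneData([], [], [0])))
```

In this model, indexing `transition_types[0]` on a zone without transitions is the analogue of the out-of-bounds read; Python would raise an `IndexError` there, whereas the commit message describes undefined behavior in the original code.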

jlowe added the bug ("Something isn't working"), Needs Triage ("Need team to review and classify"), and cuIO ("cuIO issue") labels Dec 8, 2020

jlowe assigned vuule Dec 8, 2020

jlowe added the Spark ("Functionality that helps Spark RAPIDS") label Dec 8, 2020

jlowe mentioned this issue Dec 9, 2020

[BUG] orc_test.py is failing NVIDIA/spark-rapids#1061

Closed

vuule mentioned this issue Dec 9, 2020

Fix timestamp parsing in ORC reader for timezones without transitions #6959

Merged

vuule added the libcudf ("Affects libcudf (C++/CUDA) code.") label and removed the Needs Triage ("Need team to review and classify") label Dec 10, 2020

rapids-bot bot closed this as completed in #6959 Dec 12, 2020
