Skip to content

Commit

Permalink
Fix timestamp parsing in ORC reader for timezones without transitions(#…
Browse files Browse the repository at this point in the history
…6959)

Fixes #6947 

When TZif file has no transitions (e.g. GMT), `build_timezone_transition_table` has an out-of-bounds read that leads to undefined behavior and intermittent issues.

This PR makes two changes to behavior:
1. When there are no transitions, the ancient rule is initialized from the first time offset (instead of the first transition rule, which does not exist in this case).
2. When there are no transitions and the time offset is zero, an empty table is returned (avoid using a no-op table in CUDA).

Authors:
  - vuule <[email protected]>
  - Vukasin Milovanovic <[email protected]>

Approvers:
  - GALI PREM SAGAR
  - null
  - Ram (Ramakrishna Prabhu)
  - David

URL: #6959
  • Loading branch information
vuule authored Dec 12, 2020
1 parent ab8c931 commit 929c3f4
Show file tree
Hide file tree
Showing 3 changed files with 27 additions and 3 deletions.
18 changes: 15 additions & 3 deletions cpp/src/io/orc/timezone.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -398,9 +398,21 @@ timezone_table build_timezone_transition_table(std::string const &timezone_name)
offsets.push_back(utcoff);
if (!earliest_std_idx && !tzf.ttype[idx].isdst) { earliest_std_idx = ttimes.size() - 1; }
}
if (!earliest_std_idx) { earliest_std_idx = 1; }
ttimes[0] = ttimes[earliest_std_idx];
offsets[0] = offsets[earliest_std_idx];

if (tzf.timecnt() != 0) {
if (!earliest_std_idx) { earliest_std_idx = 1; }
ttimes[0] = ttimes[earliest_std_idx];
offsets[0] = offsets[earliest_std_idx];
} else {
if (tzf.typecnt() == 0 || tzf.ttype[0].utcoff == 0) {
// No transitions, offset is zero; Table would be a no-op.
// Return an empty table to speed up parsing.
return {};
}
// No transitions to use for the time/offset - use the first offset and apply to all timestamps
ttimes[0] = std::numeric_limits<int64_t>::max();
offsets[0] = tzf.ttype[0].utcoff;
}

// Generate entries for times after the last transition
auto future_std_offset = offsets[tzf.timecnt()];
Expand Down
Binary file not shown.
12 changes: 12 additions & 0 deletions python/cudf/cudf/tests/test_orc.py
Original file line number Diff line number Diff line change
Expand Up @@ -584,3 +584,15 @@ def test_orc_write_bool_statistics(tmpdir, datadir, nrows):
"number_of_values"
]
assert normalized_equals(actual_valid_count, stats_valid_count)


def test_orc_reader_gmt_timestamps(datadir):
path = datadir / "TestOrcFile.gmt.orc"
try:
orcfile = pa.orc.ORCFile(path)
except pa.ArrowIOError as e:
pytest.skip(".orc file is not found: %s" % e)

pdf = orcfile.read().to_pandas()
gdf = cudf.read_orc(path, engine="cudf").to_pandas()
assert_eq(pdf, gdf)

0 comments on commit 929c3f4

Please sign in to comment.