Fix ORC reader issue with reading empty string columns (#7656)
The reader had a [condition that skipped updating the stream pointer when the data size was zero](https://github.com/rapidsai/cudf/blob/8773a40f4c8ce63f56ed6eb67b4eaf959106939f/cpp/src/io/orc/reader_impl.cu#L538).
However, `["", ""]` is valid data with a size of 0, and because of that condition it was being read as `[null, null]`. This change removes the condition.
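
To illustrate why guarding on the stream length loses empty strings, here is a minimal pure-Python sketch. It is a toy model and not the cuDF decoder: `assign_stream`, `decode_strings`, and the `guard_on_length` flag are hypothetical names, and the rule that an unset data pointer yields null rows is an assumption made only to reproduce the symptom described above.

```python
from typing import List, Optional


def assign_stream(dst_base: int, dst_pos: int, strm_len: int,
                  guard_on_length: bool) -> Optional[int]:
    """Return the stream's start offset, or None if it is left unset."""
    if guard_on_length and strm_len == 0:
        # Old behaviour: a zero-length stream never gets a pointer.
        return None
    # New behaviour: the pointer is always assigned, even for length 0.
    return dst_base + dst_pos


def decode_strings(present: List[bool], lengths: List[int],
                   data: bytes, data_ptr: Optional[int]) -> List[Optional[str]]:
    """Toy decoder (assumption): an unset data pointer makes every row null."""
    out: List[Optional[str]] = []
    pos = 0
    for is_present, n in zip(present, lengths):
        if not is_present or data_ptr is None:
            out.append(None)
            continue
        start = data_ptr + pos
        out.append(data[start:start + n].decode("utf-8"))
        pos += n
    return out


# Two valid (non-null) empty strings: zero bytes of character data in total.
present, lengths, data = [True, True], [0, 0], b""

old_ptr = assign_stream(0, 0, len(data), guard_on_length=True)
new_ptr = assign_stream(0, 0, len(data), guard_on_length=False)

old_result = decode_strings(present, lengths, data, old_ptr)
new_result = decode_strings(present, lengths, data, new_ptr)
```

Under these assumptions, `old_result` is `[None, None]` while `new_result` is `["", ""]`: nullness comes from the validity information, not from the amount of character data.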

I have also added test cases to validate the fix.

closes #7620

Authors:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

Approvers:
  - Devavret Makkar (@devavret)
  - Vukasin Milovanovic (@vuule)
  - Keith Kraus (@kkraus14)

URL: #7656
rgsl888prabhu authored Mar 19, 2021
1 parent 7086732 commit 217d702
Showing 2 changed files with 17 additions and 3 deletions.
4 changes: 1 addition & 3 deletions cpp/src/io/orc/reader_impl.cu
@@ -535,9 +535,7 @@ table_with_metadata reader::impl::read(size_type skip_rows,
           chunk.ts_clock_rate = to_clockrate(_timestamp_type.id());
         }
         for (int k = 0; k < gpu::CI_NUM_STREAMS; k++) {
-          if (chunk.strm_len[k] > 0) {
-            chunk.streams[k] = dst_base + stream_info[chunk.strm_id[k]].dst_pos;
-          }
+          chunk.streams[k] = dst_base + stream_info[chunk.strm_id[k]].dst_pos;
         }
       }
       stripe_start_row += stripe_info->numberOfRows;
16 changes: 16 additions & 0 deletions python/cudf/cudf/tests/test_orc.py
@@ -738,3 +738,19 @@ def test_nanoseconds_overflow():
 
     pyarrow_got = pa.orc.ORCFile(buffer).read()
     assert_eq(expected.to_pandas(), pyarrow_got.to_pandas())
+
+
+@pytest.mark.parametrize(
+    "data", [[None, ""], ["", None], [None, None], ["", ""]]
+)
+def test_empty_string_columns(data):
+    buffer = BytesIO()
+
+    expected = cudf.DataFrame({"string": data}, dtype="str")
+    expected.to_orc(buffer)
+
+    expected_pdf = pd.read_orc(buffer)
+    got_df = cudf.read_orc(buffer)
+
+    assert_eq(expected, got_df)
+    assert_eq(expected_pdf, got_df)
