[BUG] Higher memory footprint when writing strings to orc #7661

ayushdg · 2021-03-19T23:12:53Z

Describe the bug
df.to_orc has a higher memory footprint in 0.19 nightlies vs 0.18 when writing string columns.

Steps/Code to reproduce bug

import cudf

nrows = 20_000_000
df = cudf.DataFrame()
# One column built from a host list, five columns broadcast from a scalar string
df['a'] = ['abc'] * nrows
for i in range(5):
    df[f'e{i}'] = "random_string"

df.to_orc('./test.orc', compression='snappy')

Peak memory usage in 0.18: 5864 MB
Peak memory usage in 0.19 @ a568432: 8432 MB

Expected behavior
Similar memory usage

Environment overview (please complete the following information)

Environment location: Docker

Method of cuDF install: Docker

If method of install is [Docker], provide docker pull & docker run commands used
0.18 release CUDA 10.2, python 3.8

0.19 nightly, CUDA 10.2, python 3.8

cudf         0.19.0a210318   cuda_10.2_py38_ga568432872_236   rapidsai-nightly
cudf_kafka   0.19.0a210318   py38_ga568432872_236             rapidsai-nightly
dask-cudf    0.19.0a210318   py38_ga568432872_236             rapidsai-nightly
libcudf      0.19.0a210318   cuda10.2_ga568432872_236         rapidsai-nightly

Additional Context
Script used to measure memory usage.
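
The measurement script itself is not reproduced in this thread. As a rough sketch of how a peak figure like the ones above could be collected, the following polls a memory reading in a background thread and keeps the maximum. `PeakSampler` and `read_used_bytes` are hypothetical names, not part of cuDF; on an NVIDIA GPU the reader could plausibly wrap pynvml's `nvmlDeviceGetMemoryInfo(handle).used`, but that wiring is an assumption and is left out so the block stays self-contained.

```python
import threading


class PeakSampler:
    """Track the maximum value returned by a sampling callable.

    `read_used_bytes` is a hypothetical caller-supplied function that
    returns the current memory usage as a number.
    """

    def __init__(self, read_used_bytes, interval_s=0.01):
        self._read = read_used_bytes
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak = 0

    def _sample(self):
        self.peak = max(self.peak, self._read())

    def _run(self):
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval):
            self._sample()

    def __enter__(self):
        self._sample()  # baseline reading before the workload starts
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self._sample()  # final reading after the workload ends
        return False
```

The workload (here, the `df.to_orc(...)` call) would run inside `with PeakSampler(read_used_bytes) as sampler:`, and `sampler.peak` holds the result afterwards. Polling can miss spikes shorter than the sampling interval, so numbers gathered this way are a lower bound on the true peak.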

cc: @randerzander

devavret · 2021-03-25T15:04:29Z

Narrowed it down to the size of encoded_data, allocated in cudf/cpp/src/io/orc/writer_impl.cu (line 606 in 34cccfe):

rmm::device_uvector<uint8_t> encoded_data(stream_offsets.data_size(), stream);

It is 1.72 GB but it shouldn't be.
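
For scale, here is a back-of-the-envelope estimate of the string character data in the reproducer. It is not the writer's sizing logic: it ignores index, length, and null streams, and simply contrasts storing every row's characters with storing each distinct value once per column, as a dictionary encoding would. The variable names are illustrative.

```python
nrows = 20_000_000

# Column values taken from the reproducer: one 'abc' column and five
# columns that each repeat "random_string".
column_values = ["abc"] + ["random_string"] * 5

# Direct encoding stores every row's characters.
direct_bytes = sum(len(v.encode("utf-8")) * nrows for v in column_values)

# Dictionary encoding stores each distinct string once per column
# (index and length streams are ignored in this estimate).
dictionary_bytes = sum(len(v.encode("utf-8")) for v in column_values)

print(f"direct: {direct_bytes / 1e9:.2f} GB")
print(f"dictionary: {dictionary_bytes} bytes")
```

Under these assumptions the raw characters alone come to about 1.36 GB, while the distinct values total 68 bytes, which gives a sense of how much a buffer sized for the wrong encoding could overshoot.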

vuule · 2021-03-25T15:46:48Z

I messed with this code in a recent PR, can take a look today.

devavret · 2021-03-25T15:49:50Z

It looks like the code hasn't changed functionally. The part that is supposed to calculate how much memory to allocate is unchanged.

@vuule

Fixes #7661
Corrects the field order in `std::accumulate` that computes the string column size w.r.t. encoding.
Authors:
- Vukasin Milovanovic (@vuule)
Approvers:
- Kumar Aatish (@kaatish)
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
URL: #7737

Addresses #7661.
Dictionary-related device_uvectors are now released after use.
Authors:
- Kumar Aatish (https://github.com/kaatish)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
URL: #7719

ayushdg added the bug and Needs Triage labels Mar 19, 2021

ayushdg added the cuIO label Mar 19, 2021

vuule assigned kaatish and vuule Mar 22, 2021

devavret self-assigned this Mar 24, 2021

kaatish mentioned this issue Mar 25, 2021

Reduce peak device memory usage in ORC writer #7719

Merged

devavret mentioned this issue Mar 26, 2021

Fix dictionary size computation in ORC writer #7737

Merged

kkraus14 added the libcudf label and removed the Needs Triage label Mar 26, 2021

rapids-bot bot closed this as completed in #7737 Mar 27, 2021

kaatish mentioned this issue Mar 31, 2021

Add peak memory usage tracking to cuIO benchmarks #7770

Merged
