[BUG] Higher memory footprint when writing strings to orc #7661

Closed
ayushdg opened this issue Mar 19, 2021 · 3 comments · Fixed by #7737
Labels: bug, cuIO, libcudf

ayushdg commented Mar 19, 2021

Describe the bug
df.to_orc has a higher memory footprint in 0.19 nightlies vs 0.18 when writing string columns.

Steps/Code to reproduce bug

import cudf

nrows = 20_000_000
df = cudf.DataFrame()
df['a'] = ['abc'] * nrows
# Five more string columns, each holding one repeated value
for i in range(5):
    df[f'e{i}'] = "random_string"

df.to_orc('./test.orc', compression='snappy')

Peak memory usage in 0.18: 5864 MB
Peak memory usage in 0.19 @ a568432: 8432 MB

Expected behavior
Similar memory usage

Environment overview

  • Environment location: Docker
  • Method of cuDF install: Docker
    • 0.18 release: CUDA 10.2, Python 3.8
    • 0.19 nightly: CUDA 10.2, Python 3.8
      cudf                      0.19.0a210318   cuda_10.2_py38_ga568432872_236    rapidsai-nightly
      cudf_kafka                0.19.0a210318   py38_ga568432872_236    rapidsai-nightly
      dask-cudf                 0.19.0a210318   py38_ga568432872_236    rapidsai-nightly
      libcudf                   0.19.0a210318   cuda10.2_ga568432872_236    rapidsai-nightly
      

Additional Context
Script used to measure memory usage.
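
The linked script is not reproduced in this thread. As a rough illustration only, one way to get a peak figure like the ones above is to poll a "used bytes" reading in a background thread while the workload runs. The `PeakTracker` class and its `read_used_bytes` parameter below are hypothetical names for this sketch, not the script the reporter used.

```python
import threading


class PeakTracker:
    """Track the maximum value returned by a caller-supplied sampler."""

    def __init__(self, read_used_bytes, interval_s=0.01):
        # read_used_bytes: hypothetical callable taking no arguments and
        # returning the memory currently in use, in bytes.
        self._read = read_used_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None
        self.peak_bytes = 0

    def sample(self):
        """Take one reading and update the running peak."""
        self.peak_bytes = max(self.peak_bytes, self._read())

    def _poll(self):
        # Sample repeatedly until the context manager exits.
        while not self._stop.is_set():
            self.sample()
            self._stop.wait(self._interval_s)

    def __enter__(self):
        self.sample()  # baseline reading before the workload starts
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.sample()  # final reading after the workload finishes
        return False
```

With pynvml, the sampler could be something like `lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used`, with the `df.to_orc(...)` call placed inside the `with PeakTracker(...)` block. Polling can miss short spikes between samples, so the result is a lower bound on the true peak.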

cc: @randerzander

ayushdg added the bug and Needs Triage labels Mar 19, 2021
ayushdg added the cuIO label Mar 19, 2021
devavret self-assigned this Mar 24, 2021
devavret commented

Narrowed it down to the size of encoded_data

rmm::device_uvector<uint8_t> encoded_data(stream_offsets.data_size(), stream);

It is 1.72 GB but it shouldn't be.
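
For scale, here is a back-of-envelope check of the string data in the repro. It is an illustration added for context, not part of the original comment, and it counts character bytes only: it ignores length streams, index data, and any per-stripe or per-rowgroup duplication the writer may need.

```python
nrows = 20_000_000

# Column name -> the single value repeated in every row of the repro.
columns = {"a": "abc"}
columns.update({f"e{i}": "random_string" for i in range(5)})

# Character bytes if every row's string is stored in full.
raw_char_bytes = sum(len(v.encode("utf-8")) * nrows for v in columns.values())

# Character bytes if each distinct value is stored once per column.
unique_char_bytes = sum(len(v.encode("utf-8")) for v in columns.values())

print(f"all rows, character bytes: {raw_char_bytes / 1e9:.2f} GB")
print(f"one copy of each distinct value: {unique_char_bytes} bytes")
```

Under these assumptions the raw characters come to about 1.36 GB, while one copy of each distinct value is only 68 bytes, so a size estimate based on the wrong encoding can be off by a very large factor.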


vuule commented Mar 25, 2021

> Narrowed it down to the size of encoded_data
>
> rmm::device_uvector<uint8_t> encoded_data(stream_offsets.data_size(), stream);
>
> It is 1.72 GB but it shouldn't be.

I messed with this code in a recent PR, can take a look today.

devavret commented

> > Narrowed it down to the size of encoded_data
> >
> > rmm::device_uvector<uint8_t> encoded_data(stream_offsets.data_size(), stream);
> >
> > It is 1.72 GB but it shouldn't be.
>
> I messed with this code in a recent PR, can take a look today.

It looks like the code hasn't changed in function. The part which is supposed to calculate the amount of memory to allocate is unchanged.

kkraus14 added the libcudf label and removed the Needs Triage label Mar 26, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 27, 2021
Fixes #7661

Corrects the field order in `std::accumulate` that computes the string column size w.r.t encoding.

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Kumar Aatish (@kaatish)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: #7737
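
The libcudf code itself is not shown in this thread. As a loose analogy for the kind of mistake the commit message describes, the Python sketch below accumulates a two-field size estimate and shows how passing the fields positionally in the wrong order scrambles the totals. The `Sizes` type, its field names, and the numbers are all hypothetical.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical per-column estimate: bytes needed under each encoding.
Sizes = namedtuple("Sizes", ["direct", "dictionary"])

columns = [
    Sizes(direct=60, dictionary=3),
    Sizes(direct=260, dictionary=13),
    Sizes(direct=260, dictionary=13),
]


def add_swapped(acc, col):
    # Bug: positional arguments are given in the wrong field order,
    # so the two running totals trade places on every step.
    return Sizes(acc.dictionary + col.dictionary, acc.direct + col.direct)


def add_fixed(acc, col):
    # Keyword arguments make the field order explicit.
    return Sizes(direct=acc.direct + col.direct,
                 dictionary=acc.dictionary + col.dictionary)


buggy = reduce(add_swapped, columns, Sizes(0, 0))
fixed = reduce(add_fixed, columns, Sizes(0, 0))

print("swapped order:", buggy)
print("correct order:", fixed)
```

Here the swapped version ends with direct=276, dictionary=333 instead of direct=580, dictionary=29. Any buffer sized from such totals would be wrong, even though the summing logic looks unchanged at a glance.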
rapids-bot bot pushed a commit that referenced this issue Apr 23, 2021
Addresses #7661. Dictionary-related device_uvector buffers are now released after use.

Authors:
  - Kumar Aatish (https://github.com/kaatish)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #7719