[BUG] Huge Memory expansion when writing a wide uint8 cudf DataFrame to parquet #8890

Closed
VibhuJawa opened this issue Jul 28, 2021 · 5 comments · Fixed by #8476
Labels: bug, cuIO, libcudf

Comments

@VibhuJawa
Member

Describe the bug

I am seeing a large memory expansion when writing a wide uint8 DataFrame to parquet: GPU memory use grows from 1420 MiB to 26762 MiB.

Steps/Code to reproduce bug

import cupy as cp
import cudf
cudf.set_allocator(pool=True, initial_pool_size=1e+6)

df = cudf.DataFrame({'{}'.format(i): cp.ones(shape=80_000,dtype=cp.uint8) for i in range(0,5000)})

nvidia-smi after creating the frame

|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   38C    P0    83W / 350W |   1420MiB / 32510MiB |      0%      Default |

nvidia-smi after writing to disk

df.to_parquet('test.parquet')
!nvidia-smi
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   38C    P0    83W / 350W |  26762MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A

Expected behavior
I would not expect this sort of memory expansion.

Environment details

cudf                      21.08.00a210727 cuda_11.2_py38_ga69a8a43b5_324    rapidsai-nightly
dask-cudf                 21.08.00a210727 py38_ga69a8a43b5_324    rapidsai-nightly
libcudf                   21.08.00a210727 cuda11.2_ga69a8a43b5_324    rapidsai-nightly
VibhuJawa added the bug, Needs Triage, and cuIO labels Jul 28, 2021
@vuule
Contributor

vuule commented Jul 29, 2021

I may be misunderstanding the issue. AFAICT we should expect no increase in memory usage after to_parquet is done. Is this right?
CC @devavret

@VibhuJawa
Member Author

VibhuJawa commented Jul 29, 2021

> I may be misunderstanding the issue. AFAICT we should expect no increase in memory usage after to_parquet is done. Is this right?
> CC @devavret

Sorry, it's the peak memory that's the problem. Once to_parquet completes, you don't see an increase in memory.

What you are seeing here is the pool expansion, which I used as a way to track the peak memory in use. So for essentially writing a 400 MB frame, we hit about 25000 MiB of peak memory, which is prohibitive when writing such wide frames.
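For reference, the 400 MB figure follows from the shape of the reproducer (5000 columns of 80,000 one-byte values). Below is a minimal sketch of that arithmetic using the two nvidia-smi readings from the report. The helper name `frame_bytes` is only illustrative, and the nvidia-smi baseline also includes CUDA context overhead, so the ratio is a rough estimate rather than a measurement.

```python
MIB = 2**20

def frame_bytes(n_columns: int, n_rows: int, itemsize: int = 1) -> int:
    """Raw data size of a dense frame, ignoring masks and metadata."""
    return n_columns * n_rows * itemsize

# Shape taken from the reproducer: 5000 uint8 columns, 80,000 rows each.
data_bytes = frame_bytes(n_columns=5000, n_rows=80_000)

# nvidia-smi readings from the report, in MiB.
before_mib, after_mib = 1420, 26762
pool_growth_mib = after_mib - before_mib

# Ratio of pool growth to the raw data size (an estimate, see the note above).
expansion = pool_growth_mib / (data_bytes / MIB)

print(f"frame data: {data_bytes / 1e6:.0f} MB ({data_bytes / MIB:.0f} MiB)")
print(f"pool growth: {pool_growth_mib} MiB, about {expansion:.0f}x the data size")
```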

@vuule
Contributor

vuule commented Jul 29, 2021

Can you run the same code without using the pool and get the peak memory use? I don't know what the actual peak usage is, since it depends on how the pool expands.

@devavret
Contributor

This is happening in dictionary encoding. A wide dataframe results in very thin rowgroups, and each rowgroup contains a dictionary. Because the rowgroups are thin, there are a lot of them (>40000). The scratch space needed for one dictionary is 256KB, so the total temp allocation is >10GB (40000 × 256KB). There's just no way around it in new code or old. Coupled with the pool allocator, which allocates double the requirement, I can see why it would go over 25000MiB.
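The estimate above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: 256KB is read as 256 KiB, the rowgroup count uses the 40000 lower bound quoted in the comment, and the factor of two for the pool is taken from the comment rather than measured. The function name is hypothetical.

```python
KIB = 1024
MIB = 2**20

def dictionary_scratch_bytes(n_rowgroups: int, scratch_per_dict: int = 256 * KIB) -> int:
    """Total temporary space if every rowgroup gets its own dictionary scratch buffer."""
    return n_rowgroups * scratch_per_dict

# Lower bound from the comment: more than 40000 rowgroups.
scratch_lower_bound = dictionary_scratch_bytes(40_000)

# Assumption from the comment: the pool grows to about double the request.
pool_lower_bound = 2 * scratch_lower_bound

print(f"scratch: at least {scratch_lower_bound / 1e9:.2f} GB")
print(f"pool after doubling: at least {pool_lower_bound / MIB:.0f} MiB")
```

Since the real rowgroup count is above 40000, both figures are lower bounds, which is consistent with the reported peak of over 25000MiB.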

In the meantime, I think an option to turn off dictionary encoding (#7197 (comment)) would unblock this.

For cuIO insiders:

The real way to fix it is to have fewer rowgroups, which is only really possible after dictionary encoding is refactored so that it no longer depends on InitFragments. After that, the only purpose of InitFragments would be to calculate the rowgroup boundaries, which we can replace with row_bit_count.

beckernick added the libcudf label and removed the Needs Triage label Jul 29, 2021
devavret self-assigned this Aug 5, 2021
devavret added a commit to devavret/cudf that referenced this issue Aug 9, 2021
@devavret
Contributor

This turned out to be a lot easier to fix with the new dictionary encoding.

rapids-bot bot pushed a commit that referenced this issue Aug 19, 2021
Replaces the previous Parquet dictionary encoding code with one that uses `cuCollections`' static map.

Adds [`cuCollections`](https://github.com/NVIDIA/cuCollections) to `libcudf`

Closes #7873
Fixes #8890 

**Currently blocked on Pascal support for static_map in cuCollections**

(More details to be added)

Authors:
  - Devavret Makkar (https://github.com/devavret)
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #8476