[BUG] Huge Memory expansion when writing a wide uint8 cudf DataFrame to parquet #8890
Comments
I may be misunderstanding the issue. AFAICT we should expect no increase in memory usage after the write completes.
Sorry, it's the peak memory that's the problem. Once `to_parquet` completes you don't see an increase in memory. What you are seeing here is the increase in the pool expansion, which I used as a way to track the peak memory being used. So for essentially writing a ~1.4 GiB dataframe, the pool expands to over 26 GiB.
Can you run the same code without using the pool and get the peak memory use? I don't know what the actual peak usage is, since it depends on how the pool expands.
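One way to get that number is to route allocations through RMM's statistics adaptor instead of a pool. A minimal sketch, assuming `rmm.mr.StatisticsResourceAdaptor` and its `allocation_counts` mapping are available in the installed RMM version (the exact API may differ across releases):

```python
import cudf
import rmm

# Track allocations through a plain (non-pooled) CUDA resource so the
# reported peak reflects actual usage rather than pool growth.
# StatisticsResourceAdaptor / allocation_counts are assumed to exist in
# the installed RMM version; names may differ across releases.
stats_mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.CudaMemoryResource())
rmm.mr.set_current_device_resource(stats_mr)

# Build a (small, illustrative) wide uint8 frame and write it out.
df = cudf.DataFrame({f"c{i}": [0] * 1000 for i in range(100)}).astype("uint8")
df.to_parquet("wide_uint8.parquet")

# Peak bytes allocated through RMM during the run.
print(stats_mr.allocation_counts["peak_bytes"])
```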
This is happening in dictionary encoding. A wide dataframe causes very thin rowgroups, and each rowgroup contains its own dictionary. Because the rowgroups are thin, there are a lot of them (>40000). The scratch space needed for one dictionary is 256 KB, so the temporary allocations total more than 10 GB. There's just no way around it in new code or old. Coupled with the pool allocator, which allocates double the requirement, I can see why it would go over 25000 MiB.

In the meantime, I think an option to turn off dictionary encoding (#7197 (comment)) would unblock this.

For cuIO insiders: The real way to fix this is to have fewer rowgroups, which is only really possible after dictionary encoding is refactored to not depend on InitFragments. After that, the only purpose of InitFragments would be to calculate the rowgroup boundaries, which we can replace with row_bit_count.
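As a back-of-the-envelope check of those figures (the counts below are the illustrative numbers from this comment, not measured values):

```python
# Rough estimate of the temporary scratch space for per-rowgroup dictionaries.
num_rowgroups = 40_000                # ">40000" thin rowgroups for the wide frame
scratch_per_dictionary = 256 * 1024   # 256 KB of scratch per dictionary
total_scratch = num_rowgroups * scratch_per_dictionary
print(f"{total_scratch / 2**30:.1f} GiB")  # ~9.8 GiB of temporary allocations
# A pool that grows by roughly doubling its reservation can push the apparent
# footprint well past this, consistent with the >25000 MiB observed.
```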
This turned out a lot easier to fix with the new dictionary encoding.
Replaces the previous parquet dictionary encoding code with one that uses `cuCollections`' static map. Adds [`cuCollections`](https://github.com/NVIDIA/cuCollections) to `libcudf`.

Closes #7873
Fixes #8890

**Currently blocked on Pascal support for static_map in cuCollections** (More details to be added)

Authors:
- Devavret Makkar (https://github.com/devavret)
- Mark Harris (https://github.com/harrism)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #8476
Describe the bug
I am seeing quite a bit of memory expansion when writing a wide uint8 dataframe (`1420MiB` -> `26762MiB`).

Steps/Code to reproduce bug
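The original reproduction snippet did not survive extraction; below is a hedged sketch of an equivalent setup. The pool configuration and the frame shape (20,000 uint8 columns by 75,000 rows, ~1.4 GiB) are assumptions chosen only to match the reported 1420 MiB footprint:

```python
import cudf
import cupy as cp
import rmm

# Use a pool allocator so growth is visible in nvidia-smi
# (initial pool size is an assumption, not from the original report).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Wide uint8 frame: many columns, comparatively few rows.
# 20_000 columns x 75_000 rows of uint8 is ~1.4 GiB, roughly matching
# the reported 1420 MiB after frame creation.
n_cols, n_rows = 20_000, 75_000
df = cudf.DataFrame(
    {f"c{i}": cp.zeros(n_rows, dtype="uint8") for i in range(n_cols)}
)

# Writing to parquet is where the large temporary allocations
# (dictionary scratch space per rowgroup) occur.
df.to_parquet("wide_uint8.parquet")
```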
`!nvidia-smi` after creating the frame
`!nvidia-smi` after writing to disk
Expected behavior
I would not expect this sort of memory expansion.
Environment details