[BUG] Huge Memory expansion when writing a wide uint8 cudf DataFrame to parquet #8890

VibhuJawa · 2021-07-28T22:50:49Z

Describe the bug

I am seeing significant memory expansion when writing a wide uint8 DataFrame to parquet (1420MiB -> 26762MiB).

Steps/Code to reproduce bug

import cupy as cp
import cudf
cudf.set_allocator(pool=True, initial_pool_size=1e+6)

df = cudf.DataFrame({'{}'.format(i): cp.ones(shape=80_000,dtype=cp.uint8) for i in range(0,5000)})

nvidia-smi after creating the frame

|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   38C    P0    83W / 350W |   1420MiB / 32510MiB |      0%      Default |

nvidia-smi after writing to disk

df.to_parquet('test.parquet')

!nvidia-smi

|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   38C    P0    83W / 350W |  26762MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A

Expected behavior
I would not expect this sort of memory expansion.

Environment details

cudf                      21.08.00a210727 cuda_11.2_py38_ga69a8a43b5_324    rapidsai-nightly
dask-cudf                 21.08.00a210727 py38_ga69a8a43b5_324    rapidsai-nightly
libcudf                   21.08.00a210727 cuda11.2_ga69a8a43b5_324    rapidsai-nightly


vuule · 2021-07-29T01:00:37Z

I may be misunderstanding the issue. AFAICT we should expect no increase in memory usage after to_parquet is done. Is this right?
CC @devavret

VibhuJawa · 2021-07-29T01:05:31Z

> I may be misunderstanding the issue. AFAICT we should expect no increase in memory usage after to_parquet is done. Is this right?
> CC @devavret

Sorry, it's the peak memory that's the problem. Once to_parquet completes, you don't see an increase in memory.

What you are seeing here is the pool expansion, which I used as a way to track the peak memory being used. So for essentially writing a 400 MB frame we hit about 25000MiB in peak memory, which is prohibitive when writing such wide frames.
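For reference, the numbers in this report can be checked with a quick back-of-the-envelope calculation. This is only a sketch based on the figures quoted above; the variable names are illustrative, and the nvidia-smi readings include pool growth and other overheads, so the ratio is an upper bound on the true peak rather than a measurement of it.

```python
# Sizes taken from the reproducer: 5000 uint8 columns of 80,000 rows each.
N_COLUMNS = 5000
N_ROWS = 80_000
BYTES_PER_VALUE = 1  # uint8

frame_bytes = N_COLUMNS * N_ROWS * BYTES_PER_VALUE
frame_mib = frame_bytes / 2**20

# nvidia-smi readings quoted in the report (MiB).
before_mib = 1420
after_mib = 26762

growth_mib = after_mib - before_mib
growth_vs_frame = growth_mib / frame_mib

print(f"frame size:      {frame_bytes / 1e6:.0f} MB ({frame_mib:.1f} MiB)")
print(f"observed growth: {growth_mib} MiB")
print(f"growth / frame:  {growth_vs_frame:.1f}x")
```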

vuule · 2021-07-29T01:08:04Z

Can you run the same code without using the pool and get the peak memory use? I don't know what the actual peak usage is, since it depends on how the pool expands.

devavret · 2021-07-29T11:28:12Z

This is happening in dictionary encoding. A wide dataframe results in very thin rowgroups, and each rowgroup contains a dictionary. Because the rowgroups are thin, there are a lot of them (>40000). The scratch space needed for one dictionary is 256KB, so the total temp allocation is >10GB. There's just no way around it in new code or old. Coupled with the pool allocator, which allocates double the requirement, I can see why it would go over 25000MiB.
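The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch, not libcudf code: the function names are hypothetical, "256KB" is assumed to mean 256 KiB, the rowgroup count is taken at the stated lower bound of 40000, and the factor of 2 for the pool allocator is taken from the comment rather than from RMM's documentation.

```python
KIB = 1024
MIB = 1024 * KIB


def dictionary_scratch_bytes(num_rowgroups, scratch_per_dict=256 * KIB):
    """Total temporary space if every rowgroup needs its own dictionary scratch."""
    return num_rowgroups * scratch_per_dict


def pooled_estimate_bytes(required_bytes, pool_factor=2):
    """Footprint if the pool ends up holding pool_factor times the requirement
    (the doubling is an assumption taken from the comment above)."""
    return required_bytes * pool_factor


# Lower bound from the comment: more than 40000 rowgroups.
scratch = dictionary_scratch_bytes(40_000)
pooled = pooled_estimate_bytes(scratch)

print(f"scratch:   {scratch / 1e9:.2f} GB")
print(f"with pool: {pooled // MIB} MiB")
```

With exactly 40000 rowgroups this gives roughly 10.5 GB of scratch and 20000 MiB once doubled, so a rowgroup count somewhat above that lower bound is consistent with the ~25000 MiB growth seen in nvidia-smi.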

In the meantime, I think an option to turn off dictionary encoding (#7197 (comment)) would unblock this.

For cuIO insiders:

The real way to fix it is to have fewer rowgroups, which is only really possible after dictionary encoding is refactored so that it no longer depends on InitFragments. After that, the only purpose of InitFragments would be to calculate the rowgroup boundaries, which we can replace with row_bit_count.

devavret · 2021-08-11T21:58:35Z

This turned out a lot easier to fix with the new dictionary encoding.

Replaces previous parquet dictionary encoding code with one that uses `cuCollections`' static map.
Adds [`cuCollections`](https://github.com/NVIDIA/cuCollections) to `libcudf`

Closes #7873
Fixes #8890

**Currently blocked on Pascal support for static_map in cuCollections**

(More details to be added)

Authors:
- Devavret Makkar (https://github.com/devavret)
- Mark Harris (https://github.com/harrism)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #8476

VibhuJawa added the bug (Something isn't working), Needs Triage (Need team to review and classify), and cuIO (cuIO issue) labels Jul 28, 2021

beckernick added the libcudf (Affects libcudf (C++/CUDA) code.) label and removed the Needs Triage (Need team to review and classify) label Jul 29, 2021

devavret self-assigned this Aug 5, 2021

devavret added a commit to devavret/cudf that referenced this issue Aug 9, 2021

Fix for rapidsai#8890

b401a5f

devavret mentioned this issue Aug 9, 2021

Parquet writer dictionary encoding refactor #8476

Merged

rapids-bot bot closed this as completed in #8476 Aug 19, 2021
