[FEA] Pinned memory pools for parquet decode #14314
Comments
The new design for |
Note that cuda::mr::memory_resource doesn't directly expand support to include pinned host memory pools, it just enables us to more easily reuse the implementation of stream-ordered device memory pools for other kinds of device-accessible memory. |
@abellina It's really hard to tell in the image how much time is saved by this change. Can you provide comparable benchmark results? |
Sure, this is for NDS @ 3TB in our performance cluster, the benchmark was executed 3 times for baseline vs test. We see around 6% improvement. I need to look at query95 more because it found a regression there, but overall this was a win:
|
So what are the other ones with speedups of 21%, 19%, etc.? |
Sorry, I should describe the benchmark better. This is NDS https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds where we are running it in "power run" mode. In this case we are running queries in series, one after the other, in a cluster running spark rapids and 8xA100 GPUs. The results above show when one of the queries has a significant regression or speedup, and there is also a "benchmark" section at the end for the overall result (the sum of all query times compared between baseline and test). I ran these sets of tests three times, so the comparison tool is looking at the means and the variance and figuring out what is noise and what isn't. Of the queries that have a significant speedup, query9 and query88 are two queries we know are parquet/scan bound. Looking at traces for these, you'll mostly find unsnap and parquet decode kernels. We see a lot of benefit here. Is this helping with your question @harrism, or am I missing it? |
Yes. So the 6% is overall average benefit. Good to know. |
As of 24.02 you can create a a) The static resource RMM currently stores for There is some design work around how and where to put this, and how to wire up configuration knobs for the initial / maximum pool size. Also how to expose that to Python and Spark. There are other places where this pool could be useful but let's start here. |
Thank you @harrism for kicking off the design discussion. Thank you for your help |
This issue describes a): rapidsai/rmm#826 Basically the function-local static works on Linux but not on Windows, because each binary gets its own instance, so the instances won't be the same across DLLs. Although I suppose adding a duplicate of the incompatibility reason doesn't make it more incompatible, it's just more of the same. I still think it should be tested in libcudf first. For b) the answer is to model it after get/set_current_device_resource(), which is the function with the function-local static. But mostly I was suggesting that libcudf should implement it the way libcudf wants it to be for libcudf. Thank you too! |
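A minimal sketch of the accessor pattern mentioned above, loosely modeled on the shape of get/set_current_device_resource(). Every name here (`host_resource`, `malloc_resource`, `get_pinned_resource`, `set_pinned_resource`) is hypothetical and is not libcudf or RMM API; `std::malloc` stands in for a pinned allocation so the sketch runs without CUDA. It does not address the cross-DLL problem on Windows.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a host memory resource interface.
struct host_resource {
  virtual ~host_resource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* ptr, std::size_t bytes) = 0;
};

// Hypothetical default resource: malloc/free in place of pinned memory.
struct malloc_resource final : host_resource {
  void* allocate(std::size_t bytes) override { return std::malloc(bytes); }
  void deallocate(void* ptr, std::size_t) override { std::free(ptr); }
};

namespace detail {
// One function-local static holds the current resource pointer.
// Its initialization is thread-safe since C++11.
inline std::atomic<host_resource*>& current_resource()
{
  static malloc_resource default_mr;
  static std::atomic<host_resource*> current{&default_mr};
  return current;
}
}  // namespace detail

// Returns the resource currently installed for this binary.
inline host_resource* get_pinned_resource() { return detail::current_resource().load(); }

// Installs a new resource and returns the previous one. nullptr is rejected.
inline host_resource* set_pinned_resource(host_resource* mr)
{
  assert(mr != nullptr);
  return detail::current_resource().exchange(mr);
}
```

The atomic exchange only makes the pointer swap itself safe. Memory obtained from the previous resource still has to be returned to that same resource, and the previous resource has to outlive those allocations.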
I just discussed this idea a bit more with @abellina, @mattahrens, @GregoryKimball, and @vuule. We're leaning towards adding a parameter for The @abellina is going to investigate a bit further and pursue an implementation for review. |
Sounds good. Because For an example, see #14873 |
One thing to be aware of for deciding your initial pool size is that if you make it too small, you will get unnecessary fragmentation. The reason is that the RMM pool MR grows by just allocating a new chunk. This isn't TOO bad because it uses a geometric growth strategy, but allocations that don't fit will cause new allocations that can't be merged with the previous pool chunks, hence fragmentation. |
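The effect described above can be illustrated with a toy model. This is a sketch under stated assumptions, not RMM's implementation: `toy_pool` is a hypothetical type, chunks are bump-allocated and never freed, and the doubling growth rule is chosen only for illustration. It shows how a small initial size leaves free space split across chunks that cannot be merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One upstream allocation; free space is the tail [used, size).
struct chunk {
  std::size_t size;
  std::size_t used;
};

// Hypothetical toy pool: grows by adding a new, separate chunk.
struct toy_pool {
  std::vector<chunk> chunks;

  explicit toy_pool(std::size_t initial)
  {
    assert(initial > 0);
    chunks.push_back({initial, 0});
  }

  void allocate(std::size_t bytes)
  {
    // Use the first chunk whose free tail is large enough.
    for (auto& c : chunks) {
      if (c.size - c.used >= bytes) {
        c.used += bytes;
        return;
      }
    }
    // Nothing fits: grow geometrically (doubling is an assumption here).
    std::size_t const grown = std::max(bytes, 2 * chunks.back().size);
    chunks.push_back({grown, bytes});
  }

  // Sum of free bytes across all chunks.
  std::size_t total_free() const
  {
    std::size_t sum = 0;
    for (auto const& c : chunks) { sum += c.size - c.used; }
    return sum;
  }

  // Largest single contiguous free region; chunks cannot be merged.
  std::size_t largest_free() const
  {
    std::size_t best = 0;
    for (auto const& c : chunks) { best = std::max(best, c.size - c.used); }
    return best;
  }
};
```

With an initial size of 64 units and three 48-unit requests, the toy pool ends up with two chunks and 48 free units in total, but the largest contiguous free region is only 32 units, so a fourth 48-unit request would force yet another chunk. Starting at 256 units, the same requests leave a single chunk with 112 contiguous free units.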
Thank you for the meeting today and thank you @bdice for sharing this summary.
I'm curious also to hear from @vuule. Plus @nvdbaranec and @etseidl if you would like to weigh in. |
The introduction of the I have a separate concern about having a |
I am more of a fan of having a global function that sets an rmm allocator to be used for pinned allocations. For now we could limit it to cuIO only (which would really mean just Maybe |
This is a good observation. My primary concern was about ensuring thread safety for the global/static approach (what happens if that allocator is changed while in use / before its previous allocations are freed?). If that's not a problem, then I'm okay with using a global/static pinned (pool) host memory resource. I thought the |
Just measured the pinned memory peak use in the Parquet reader. In the benchmarks, which create tables with 512MB of data, the largest peak pinned memory use I saw was about 3.8MB, with integer columns. Many cases use around 1MB. So we should be able to make great use of a <1GB pool even with many threads. |
But isn't 512MB much smaller than real world use cases? |
BTW after @vuule and @nvdbaranec pointed it out I also think starting with a global config for an allocator that should be used internally in cuIO is a much better place to start than plumbing it in everywhere for flexibility. |
@vuule could spark-rapids still provide its own memory pool for this? We wouldn't want a default pinned allocator to be initialized and then torn down because it is going to be replaced by our own allocator. |
Sure, but the measurements show that pinned memory requirements are less than 1% of the device memory required to read a PQ file. Even when fully using a 50GB GPU we won't fill a 1GB pinned pool. |
Oh, I guess I don't know how parquet reading works. :) |
…mo of pooled-pinned allocation. (#15079)

This PR adds a new interface to cuIO which controls where host memory allocations come from. Addresses #14314.

It adds two core functions:

```
rmm::host_async_resource_ref set_host_memory_resource(rmm::host_async_resource_ref mr);
rmm::host_async_resource_ref get_host_memory_resource();
```

`cudf::io::hostdevice_vector` was previously implemented in terms of a `thrust::host_vector<>` that explicitly uses an allocator called `pinned_host_vector`. I copied that and made a new class called `rmm_host_vector` which takes any host_resource_ref. This probably makes `pinned_host_vector` obsolete.

Parquet benchmarks have a new command-line option which lets you toggle between two modes:

```
--cuio_host_mem pinned (the default, an unpooled, pinned memory source)
--cuio_host_mem pinned_pool (the pooled/pinned resource)
```

The ultimate intent here is to reduce the CPU-side overhead of the setup code that comes before the decode kernels in the parquet reader. The wins are pretty significant for our faster kernels (that is, where we are less dominated by GPU time).

Edit: Updated to use newly minted resource ref types from rmm itself. I also switched the type to be `host_async_resource_ref` even though in this case the user (`thrust::host_vector`) doesn't explicitly go through the async path. In addition, the pageable memory path (an experimental feature) has been removed.
Pinned

```
| data_type | io_type       | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| INTEGRAL  | DEVICE_BUFFER | 0           | 1          | 25x     | 20.443 ms | 0.45% | 20.438 ms | 0.45% | 26268890178      | 1.072 GiB         | 498.123 MiB       |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 1          | 26x     | 19.571 ms | 0.42% | 19.565 ms | 0.42% | 27440146729      | 756.210 MiB       | 161.438 MiB       |
| INTEGRAL  | DEVICE_BUFFER | 0           | 32         | 28x     | 18.150 ms | 0.18% | 18.145 ms | 0.18% | 29587789525      | 602.424 MiB       | 27.720 MiB        |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 32         | 29x     | 17.306 ms | 0.37% | 17.300 ms | 0.37% | 31032523423      | 597.181 MiB       | 14.403 MiB        |
```

Pooled/pinned

```
| data_type | io_type       | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| INTEGRAL  | DEVICE_BUFFER | 0           | 1          | 117x    | 17.258 ms | 0.50% | 17.254 ms | 0.50% | 31115706389      | 1.072 GiB         | 498.123 MiB       |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 1          | 31x     | 16.413 ms | 0.43% | 16.408 ms | 0.43% | 32719609450      | 756.210 MiB       | 161.438 MiB       |
| INTEGRAL  | DEVICE_BUFFER | 0           | 32         | 576x    | 14.885 ms | 0.58% | 14.881 ms | 0.58% | 36077859564      | 602.519 MiB       | 27.720 MiB        |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 32         | 36x     | 14.069 ms | 0.48% | 14.065 ms | 0.48% | 38171646940      | 597.243 MiB       | 14.403 MiB        |
```

Authors:
- https://github.com/nvdbaranec

Approvers:
- Mark Harris (https://github.com/harrism)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15079
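As a rough reading of the two tables above, the relative CPU-time reduction per row can be computed as (pinned - pooled) / pinned. The sketch below copies the CPU Time values from the tables, in row order; the `row` type and `reduction_percent` helper are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>

// CPU Time in milliseconds for the same benchmark row in both modes.
struct row {
  double pinned_ms;
  double pooled_ms;
};

// Values copied from the "Pinned" and "Pooled/pinned" tables, in row order.
static const std::array<row, 4> kRows{{
  {20.443, 17.258},  // cardinality 0,    run_length 1
  {19.571, 16.413},  // cardinality 1000, run_length 1
  {18.150, 14.885},  // cardinality 0,    run_length 32
  {17.306, 14.069},  // cardinality 1000, run_length 32
}};

// Percentage of CPU time saved by the pooled resource relative to unpooled.
inline double reduction_percent(row r)
{
  assert(r.pinned_ms > 0.0);
  return 100.0 * (r.pinned_ms - r.pooled_ms) / r.pinned_ms;
}
```

For these four rows the reduction works out to roughly 15% to 19%, with the larger savings on the rows that have the smaller encoded file sizes.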
We are investigating using a pinned memory pool at the cuDF layer and replacing `cudaFreeHost` calls in `pinned_host_vector`, due to traces we have seen that indicate synchronization or a "lining up" of kernels during parquet decode. Here's query88 from NDS at 3TB on our performance cluster running with an A100. In the nsys trace (pardon the amount of streams), we can see parquet nvcomp and decode kernels working on the first three quarters of the trace:

The bottom trace is cuDF without changes. The top trace is a modified cuDF where we replaced calls to `cudaMallocHost` and `cudaFreeHost` with `allocate` and `deallocate` against a modified RMM `pool_memory_resource` that isn't stream aware and has a single free list.

When we run with the modified cuDF, our NDS benchmark shows a 5% improvement at 3TB and a 6% improvement between old cuDF and new cuDF if we allow all 16 spark threads to submit work concurrently. In other words, we believe the `cudaFreeHost` calls specifically are preventing parquet-heavy jobs from using more of the GPU due to synchronization.

The proposal here is to allow a pinned memory pool to be passed to parquet primarily, but there are probably other formats and areas in cuDF that might benefit from this.

Note that another experiment we wanted to attempt was to remove pinned memory altogether, which cuDF already has a flag for (`LIBCUDF_IO_PREFER_PAGEABLE_TMP_MEMORY=1`), but we ran into issues for parquet only (#14311). Before I found this flag, I had tried replacing `cudaMallocHost` and `cudaFreeHost` with `malloc` and `free` and I ran into the same issue, so I think the parquet code is dependent on some sort of synchronization in the CUDA host pinned memory allocator.
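The kind of pool used in this experiment (not stream aware, a single free list) can be sketched roughly as follows. This is an illustration under stated assumptions, not the modified RMM `pool_memory_resource`: `single_list_pool` is a hypothetical class, the 256-byte alignment is arbitrary, and `std::malloc` stands in for `cudaMallocHost` so the sketch runs without CUDA. The arena is obtained once and released once, so individual `deallocate` calls never reach the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iterator>
#include <map>
#include <mutex>

// Toy pool: one address-ordered free list, one mutex, no stream awareness.
class single_list_pool {
 public:
  explicit single_list_pool(std::size_t bytes)
    : base_(static_cast<char*>(std::malloc(bytes)))  // stand-in for cudaMallocHost
  {
    assert(base_ != nullptr);
    free_[0] = bytes;  // one free block covering the whole arena
  }
  ~single_list_pool() { std::free(base_); }  // stand-in for cudaFreeHost
  single_list_pool(single_list_pool const&) = delete;
  single_list_pool& operator=(single_list_pool const&) = delete;

  // First-fit allocation; returns nullptr when no free block is large enough.
  void* allocate(std::size_t bytes)
  {
    bytes = align_up(bytes);
    std::lock_guard<std::mutex> lock(mtx_);
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < bytes) { continue; }
      std::size_t const offset = it->first;
      std::size_t const rest   = it->second - bytes;
      free_.erase(it);
      if (rest > 0) { free_[offset + bytes] = rest; }  // keep the unused tail
      return base_ + offset;
    }
    return nullptr;
  }

  // Returns a block to the free list and merges it with adjacent free blocks.
  void deallocate(void* ptr, std::size_t bytes)
  {
    bytes = align_up(bytes);
    std::size_t offset = static_cast<std::size_t>(static_cast<char*>(ptr) - base_);
    std::lock_guard<std::mutex> lock(mtx_);
    auto next = free_.lower_bound(offset);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == offset) {  // merge with left neighbor
        offset = prev->first;
        bytes += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && offset + bytes == next->first) {  // merge with right neighbor
      bytes += next->second;
      free_.erase(next);
    }
    free_[offset] = bytes;
  }

  // Number of separate free blocks currently tracked.
  std::size_t free_blocks() const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return free_.size();
  }

 private:
  // Rounds a request up to a multiple of 256 bytes (at least 256).
  static std::size_t align_up(std::size_t n) { return (n == 0 ? 1 : (n + 255) / 256) * 256; }

  char* base_;
  std::map<std::size_t, std::size_t> free_;  // offset -> size of each free block
  mutable std::mutex mtx_;
};
```

Because the pool ignores streams, a caller would have to make sure any GPU work still reading or writing a buffer has finished before handing it back with `deallocate`; otherwise the block could be reused while a copy is in flight.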