[FEA] Pinned memory pools for parquet decode #14314

abellina · 2023-10-23T18:16:42Z

We are investigating using pinned memory pool at the cuDF layer and replacing cudaFreeHost calls in pinned_host_vector due to traces we have seen that indicate synchronization or a "lining up" of kernels during parquet decode. Here's query88 from NDS at 3TB on our performance cluster running with an A100. In the nsys trace (pardon the amount of streams), we can see parquet nvcomp and decode kernels working on the first three quarters of the trace:

The bottom trace is cuDF without changes. The top trace is a modified cuDF where we replaced calls to cudaMallocHost and cudaFreeHost with allocate and deallocate against a modified RMM pool_memory_resource that isn't stream aware and has a single free list.

When we run with the modified cuDF, our NDS benchmark shows a 5% improvement at 3TB and a 6% improvement between old cuDF and new cuDF if we allow all 16 spark threads to submit work concurrently. In other words, we believe the cudaFreeHost calls specifically are preventing parquet heavy jobs from using more of the GPU due to synchronization.

The proposal here is to allow a pinned memory pool to be passed to parquet primarily, but there are probably other formats and areas in cuDF that might benefit from this.

Note that another experiment we wanted to attempt was to remove pinned memory alltogether, which cuDF already has a flag for LIBCUDF_IO_PREFER_PAGEABLE_TMP_MEMORY=1, but we ran into issues for parquet only (#14311). Before I found this flag, I had tried replacing cudaMallocHost and cudaFreeHost with malloc and free and I ran into the same issue, so I think the parquet code is dependent on some sort of synchronization in the CUDA host pinned memory allocator.

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2023-10-25T03:26:43Z

The new design for cuda::mr::memory_resource, which expands support in RMM to include pinned host memory pools, is tracked in this branch rapidsai/rmm#1095.

harrism · 2023-10-25T05:09:44Z

Note that cuda::mr::memory_resource doesn't directly expand support to include pinned host memory pools, it just enables us to more easily reuse the implementation of stream-ordered device memory pools for other kinds of device-accessible memory.

harrism · 2023-10-25T05:10:39Z

@abellina It's really hard to tell in the image how much time is saved by this change. Can you provide comparable benchmark results?

abellina · 2023-10-25T22:42:12Z

Sure, this is for NDS @ 3TB in our performance cluster, the benchmark was executed 3 times for baseline vs test. We see around 6% improvement. I need to look at query95 more because it found a regression there, but overall this was a win:

--------------------------------------------------------------------
Name = query2
Means = 3088.3333333333335, 2494.6666666666665
Time diff = 593.666666666667
Speedup = 1.2379743452699092
T-Test (test statistic, p value, df) = 7.8686774190832445, 0.0014098464938042464, 4.0
T-Test Confidence Interval = 384.1927204113131, 803.1406129220209
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query9
Means = 11011.666666666666, 8408.0
Time diff = 2603.666666666666
Speedup = 1.3096653980336188
T-Test (test statistic, p value, df) = 8.329919755519573, 0.0011349387959612939, 4.0
T-Test Confidence Interval = 1735.8386701912746, 3471.4946631420576
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query23_part2
Means = 23309.0, 22541.666666666668
Time diff = 767.3333333333321
Speedup = 1.0340406654343808
T-Test (test statistic, p value, df) = 3.5857045809141983, 0.023049979254582895, 4.0
T-Test Confidence Interval = 173.17984709011603, 1361.4868195765482
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query28
Means = 8749.666666666666, 7877.0
Time diff = 872.6666666666661
Speedup = 1.1107866785155092
T-Test (test statistic, p value, df) = 3.675737490963056, 0.021283509364083526, 4.0
T-Test Confidence Interval = 213.50341001605773, 1531.8299233172743
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query31
Means = 3846.3333333333335, 3217.6666666666665
Time diff = 628.666666666667
Speedup = 1.1953796747125247
T-Test (test statistic, p value, df) = 3.476522797723856, 0.02543224663133342, 4.0
T-Test Confidence Interval = 126.59646864855881, 1130.7368646847751
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query74
Means = 5813.333333333333, 5295.0
Time diff = 518.333333333333
Speedup = 1.0978910922253697
T-Test (test statistic, p value, df) = 4.145944413928419, 0.014307288000811205, 4.0
T-Test Confidence Interval = 171.21723564533386, 865.4494310213322
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query75
Means = 8002.0, 7139.0
Time diff = 863.0
Speedup = 1.120885278050147
T-Test (test statistic, p value, df) = 6.9462130566740035, 0.0022563913623439248, 4.0
T-Test Confidence Interval = 518.0534649259677, 1207.9465350740325
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query83
Means = 11889.0, 10728.333333333334
Time diff = 1160.666666666666
Speedup = 1.1081870436538759
T-Test (test statistic, p value, df) = 7.983215413800722, 0.0013345137828189723, 4.0
T-Test Confidence Interval = 757.0038418026304, 1564.3294915307017
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query88
Means = 8288.333333333334, 6815.666666666667
Time diff = 1472.666666666667
Speedup = 1.2160708172348023
T-Test (test statistic, p value, df) = 14.785515131088996, 0.00012180808800452909, 4.0
T-Test Confidence Interval = 1196.1272209995159, 1749.206112333818
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 8023.0, 9527.333333333334
Time diff = -1504.333333333334
Speedup = 0.8421034217339584
T-Test (test statistic, p value, df) = -5.098045536832821, 0.006992118575497059, 4.0
T-Test Confidence Interval = -2323.6078748695036, -685.0587917971645
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 418333.3333333333, 394333.3333333333
Time diff = 24000.0
Speedup = 1.0608622147083686
T-Test (test statistic, p value, df) = 5.622255427989818, 0.004920931820960854, 4.0
T-Test Confidence Interval = 12148.051368670785, 35851.948631329215
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed

harrism · 2023-10-26T00:50:41Z

So what are the other ones with speedups of 21%, 19%, etc.?

abellina · 2023-10-26T13:01:14Z

So what are the other ones with speedups of 21%, 19%, etc.?

Sorry I should describe the benchmark better. This is NDS https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds where we are running it in "power run" mode. In this case we are running queries in series one after the other in a cluster running spark rapids and 8xA100 GPUs. The results above show when one of the queries has significant regressions or speedups, and it also has a "benchmark" section at the end for the overall (sum of all query times compared between baseline and test). I ran these sets of tests three times, so the comparison tool is looking at the means and the variance and figuring out what is noise and what isn't.

From the queries that have significant speedup, query9 and query88 are two queries we know are parquet/scan bound. Looking at traces for these, you'll mostly fine unsnap and parquet decode kernels. We see a lot of benefit here.

Is this helping with your question @harrism, or am I missing it?

harrism · 2023-10-26T19:41:52Z

Yes. So the 6% is overall average benefit. Good to know.

harrism · 2024-01-25T02:37:49Z

As of 24.02 you can create a pool_memory_resource<pinned_host_memory_resource>. I suggest that libcudf / cuIO adds a cudf::get/set_current_pinned_host_memory_resource to use for this. This could also be added to RMM, but

a) The static resource RMM currently stores for get/set_current_device_memory_resource is one of the reasons RMM is not Windows compatible, so I'm averse to adding a second instance of that. Whereas libcudf has a binary library component so it can maintain that state without affecting platform compatibility.
b) I think it's good to try out this feature in libcudf and then move it lower in the stack later if it would be useful to other clients.

There is some design work around how and where to put this, and how to wire up configuration knobs for the initial / maximum pool size. Also how to expose that to Python and Spark.

There are other places where this pool could be useful but let's start here.

GregoryKimball · 2024-01-31T04:15:27Z

Thank you @harrism for kicking off the design discussion.
a) would you please clarify the Windows compatibility issue? Is it the presence of a static object with certain properties in a header, or some other issue?
b) If libcudf added cudf::get/set_current_pinned_host_memory_resource, would you please share a bit more about the components in RMM this would be modeled after? I'm not familiar enough with RMM to understand the scope of this suggestion.

Thank you for your help

harrism · 2024-01-31T08:19:42Z

This issue describes a): rapidsai/rmm#826

Basically the function local static works on Linux but not on windows because each binary gets a it's own instance and so they won't be the same across DLLs.

Although I suppose adding a duplicate of the incompatibility reason doesn't make it more incompatible, it's just more of the same.

I still think it should be tested in libcudf first.

For b) the answer is to model it after get/set_current_device_resource(), which is the function with the function local static.

But mostly I was suggesting that libcudf should implement it the way libcudf wants it to be for libcudf.

Thank you too!

bdice · 2024-01-31T23:52:47Z

I just discussed this idea a bit more with @abellina, @mattahrens, @GregoryKimball, and @vuule.

We're leaning towards adding a parameter for host_mr to the Parquet reading APIs. We're leaning away from using a global/static host allocator because we want this behavior to be friendly/simple if dealing with multiple threads like Spark does. Also callers may want to use different host allocators.

The host_mr would be optional, and would use a non-pooled allocator by default. This avoids some of the concerns about determining a default host pool size, and how that would be handled by Python cudf and other consumers of the library. The host_mr should allow any of: a pinned host allocator (which is effectively the current behavior), a pooling pinned host allocator, or possibly even an adapter that implements an "opportunistically pinned pool" (Spark's desired use case; this would attempt allocating pinned memory, and fall back to allocating paged memory rather than failing to allocate).

@abellina is going to investigate a bit further and pursue an implementation for review.

harrism · 2024-02-01T00:54:26Z

Sounds good. Because rmm::mr::pinned_host_memory_resource is not a device_memory_resource, you will need to use the new rmm::device_async_resource_ref for the host_mr parameter in the Parquet reading APIs. We will be transitioning to that in all of RAPIDS, and in the interim, this will make the parameter compatible with both legacy device_memory_resource/host_memory_resource MRs AND the newer MRs like pinned_host_memory_resource that just implement the cuda::async_memory_resource policy.

For an example, see #14873

harrism · 2024-02-01T03:46:52Z

One thing to be aware of for deciding your initial pool size is that if you make it too small, you will get unnecessary fragmentation. The reason is that the RMM pool MR grows by just allocating a new chunk. This isn't TOO bad because it uses a geometric growth strategy, but allocations that don't fit will cause new allocations that can't be merged with the previous pool chunks, hence fragmentation.

See https://github.com/harrism/rmm/blob/6bb0ef2973821c2a1b8f952298046e7b406dc9d2/include/rmm/mr/device/pool_memory_resource.hpp#L352

GregoryKimball · 2024-02-01T05:24:30Z

Thank you for the meeting today and thank you @bdice for sharing this summary.

Adding a host_mr parameter to the Parquet reading APIs would be consistent with the way we use the mr parameter today. For this approach I would love to sketch out the design changes to hostdevice_vector and pinned_host_vector that would be required to make use of an input host_mr.
Using a static memory resource in pinned_host_vector as in the prototype by @abellina seems to have the advantage of not adding additional plumbing to the IO modules that use the hostdevice_vector utility.

I'm curious also to hear from @vuule. Plus @nvdbaranec and @etseidl if you would like to weigh in.

vuule · 2024-02-01T18:19:36Z

The introduction of the host_mr option would require substantial code changes to pass it to all places where the hostdevice_vectors are created. And this would still be limited the PQ reader, until we do the same for other components.
I don't think we know at this point if this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT), with a global default, and recently included stream parameters in most APIs.

I have a separate concern about having a host_mr parameter that is not an equivalent to the existing mr param. The pinned mem allocator does not change the output for the user, it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.

nvdbaranec · 2024-02-01T19:17:06Z

I am more of a fan of having a global function that sets an rmm allocator to be used for pinned allocations. For now we could limit it to cuIO only (which would really mean just hostdevice_vector). Something similar to the way we initialize rmm (rmm::mr::set_current_device_resource(resource.get());)

Maybe cudf::io::set_current_pinned_memory_resource

bdice · 2024-02-02T00:23:07Z

The pinned mem allocator does not change the output for the user, it's purely an internal optimization

This is a good observation. My primary concern was about ensuring thread safety for the global/static approach (what happens if that allocator is changed while in use / before its previous allocations are freed?). If that's not a problem, then I'm okay with using a global/static pinned (pool) host memory resource. I thought the host_mr argument might give us more flexibility for our potential long term needs, but I don't want to overengineer for that case given how little we rely on pinned memory today.

vuule · 2024-02-02T03:13:52Z

Just measured the pinned memory peak use in the Parquet reader. In the benchmarks, which create tables with 512MB of data, the largest peak pinned memory use I saw was about 3.8MB, with integer columns. Many cases use around 1MB. So we should be able to make great use of a <1GB pool even with many threads.

harrism · 2024-02-02T08:26:13Z

But isn't 512MB much smaller than real world use cases?

harrism · 2024-02-02T08:28:32Z

BTW after @vuule and @nvdbaranec pointed it out I also think starting with a global config for an allocator that should be used internally in cuIO is a much better place to start than plumbing it in everywhere for flexibility.

abellina · 2024-02-02T15:52:43Z

The introduction of the host_mr option would require substantial code changes to pass it to all places where the hostdevice_vectors are created. And this would still be limited the PQ reader, until we do the same for other components. I don't think we know at this point if this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT), with a global default, and recently included stream parameters in most APIs.

I have a separate concern about having a host_mr parameter that is not an equivalent to the existing mr param. The pinned mem allocator does not change the output for the user, it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.

@vuule could spark-rapids still provide its own memory pool for this? We wouldn't want a default pinned allocator to initialize, then uninitialized because it is going to be replaced by our own allocator.

vuule · 2024-02-02T18:40:54Z

But isn't 512MB much smaller than real world use cases?

Sure, but the measurements show that pinned memory requirement are less than 1% of the device memory required to read a PQ file. Even when fully using a 50GB GPU we won't fill a 1GB pinned pool.

harrism · 2024-02-06T02:40:20Z

Oh, I guess I don't know how parquet reading works. :)

…mo of pooled-pinned allocation. (#15079) This PR adds a new interface to cuIO which controls where host memory allocations come from. It adds two core functions: Addresses #14314 ``` rmm::host_async_resource_ref set_host_memory_resource(rmm::host_async_resource_ref mr); rmm::host_async_resource_ref get_host_memory_resource(); ``` `cudf::io::hostdevice_vector` was currently implemented in terms of a `thrust::host_vector<>` that explicitly uses an allocator called `pinned_host_vector`. I copied that and made a new class called `rmm_host_vector` which takes any host_resource_ref. This probably makes `pinned_host_vector` obsolete. Parquet benchmarks have a new commandline option which lets you toggle between 3 modes: ``` --cuio_host_mem pinned (the default, an unpooled, pinned memory source) --cuio_host_mem pinned_pool (the pooled/pinned resource) ``` The ultimate intent here is to reduce the cpu-side overhead of the setup code that comes before the decode kernels in the parquet reader. The wins are pretty significant for our faster kernels (that is, where we are less dominated by gpu time) Edit: Updated to use newly minted resource ref types from rmm itself. I also switched the type to be `host_async_resource_ref` even though in this case the user (`thrust::host_vector`) doesn't explicitly go through the async path. In addition, the pageable memory path (an experimental feature) has been removed. Pinned ``` | data_type | io_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size | |-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------| | INTEGRAL | DEVICE_BUFFER | 0 | 1 | 25x | 20.443 ms | 0.45% | 20.438 ms | 0.45% | 26268890178 | 1.072 GiB | 498.123 MiB | | INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 26x | 19.571 ms | 0.42% | 19.565 ms | 0.42% | 27440146729 | 756.210 MiB | 161.438 MiB | | INTEGRAL | DEVICE_BUFFER | 0 | 32 | 28x | 18.150 ms | 0.18% | 18.145 ms | 0.18% | 29587789525 | 602.424 MiB | 27.720 MiB | | INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 29x | 17.306 ms | 0.37% | 17.300 ms | 0.37% | 31032523423 | 597.181 MiB | 14.403 MiB | ``` Pooled/pinned ``` | data_type | io_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size | |-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------| | INTEGRAL | DEVICE_BUFFER | 0 | 1 | 117x | 17.258 ms | 0.50% | 17.254 ms | 0.50% | 31115706389 | 1.072 GiB | 498.123 MiB | | INTEGRAL | DEVICE_BUFFER | 1000 | 1 | 31x | 16.413 ms | 0.43% | 16.408 ms | 0.43% | 32719609450 | 756.210 MiB | 161.438 MiB | | INTEGRAL | DEVICE_BUFFER | 0 | 32 | 576x | 14.885 ms | 0.58% | 14.881 ms | 0.58% | 36077859564 | 602.519 MiB | 27.720 MiB | | INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 36x | 14.069 ms | 0.48% | 14.065 ms | 0.48% | 38171646940 | 597.243 MiB | 14.403 MiB | ``` Authors: - https://github.com/nvdbaranec Approvers: - Mark Harris (https://github.com/harrism) - Vukasin Milovanovic (https://github.com/vuule) URL: #15079

abellina added feature request New feature or request Needs Triage Need team to review and classify labels Oct 23, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Oct 23, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Oct 23, 2023

abellina mentioned this issue Oct 23, 2023

[FEA] Consider creating a pinned host memory resource that can leverage all of the allocation strategies rapidsai/rmm#618

Closed

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. 0 - Blocked Cannot progress due to external reasons Performance Performance related issue Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Nov 9, 2023

GregoryKimball added this to libcudf Nov 9, 2023

GregoryKimball moved this to To be revisited in libcudf Nov 9, 2023

abellina mentioned this issue Jan 18, 2024

[FEA][JNI] Adopt rmm::pinned_host_memory_resource in java #14782

Open

GregoryKimball added this to the Parquet continuous improvement milestone Feb 15, 2024

GregoryKimball mentioned this issue Feb 15, 2024

[FEA] improve performance of chunked parquet write by not using pinned memory #12186

Closed

nvdbaranec mentioned this issue Feb 16, 2024

Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. #15079

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Pinned memory pools for parquet decode #14314

[FEA] Pinned memory pools for parquet decode #14314

abellina commented Oct 23, 2023

GregoryKimball commented Oct 25, 2023

harrism commented Oct 25, 2023

harrism commented Oct 25, 2023

abellina commented Oct 25, 2023

harrism commented Oct 26, 2023

abellina commented Oct 26, 2023 •

edited

Loading

harrism commented Oct 26, 2023

harrism commented Jan 25, 2024

GregoryKimball commented Jan 31, 2024

harrism commented Jan 31, 2024

bdice commented Jan 31, 2024

harrism commented Feb 1, 2024

harrism commented Feb 1, 2024

GregoryKimball commented Feb 1, 2024 •

edited

Loading

vuule commented Feb 1, 2024

nvdbaranec commented Feb 1, 2024

bdice commented Feb 2, 2024 •

edited

Loading

vuule commented Feb 2, 2024

harrism commented Feb 2, 2024

harrism commented Feb 2, 2024

abellina commented Feb 2, 2024 •

edited

Loading

vuule commented Feb 2, 2024

harrism commented Feb 6, 2024

[FEA] Pinned memory pools for parquet decode #14314

[FEA] Pinned memory pools for parquet decode #14314

Comments

abellina commented Oct 23, 2023

GregoryKimball commented Oct 25, 2023

harrism commented Oct 25, 2023

harrism commented Oct 25, 2023

abellina commented Oct 25, 2023

harrism commented Oct 26, 2023

abellina commented Oct 26, 2023 • edited Loading

harrism commented Oct 26, 2023

harrism commented Jan 25, 2024

GregoryKimball commented Jan 31, 2024

harrism commented Jan 31, 2024

bdice commented Jan 31, 2024

harrism commented Feb 1, 2024

harrism commented Feb 1, 2024

GregoryKimball commented Feb 1, 2024 • edited Loading

vuule commented Feb 1, 2024

nvdbaranec commented Feb 1, 2024

bdice commented Feb 2, 2024 • edited Loading

vuule commented Feb 2, 2024

harrism commented Feb 2, 2024

harrism commented Feb 2, 2024

abellina commented Feb 2, 2024 • edited Loading

vuule commented Feb 2, 2024

harrism commented Feb 6, 2024

abellina commented Oct 26, 2023 •

edited

Loading

GregoryKimball commented Feb 1, 2024 •

edited

Loading

bdice commented Feb 2, 2024 •

edited

Loading

abellina commented Feb 2, 2024 •

edited

Loading