Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Pinned memory pools for parquet decode #14314

Open
abellina opened this issue Oct 23, 2023 · 23 comments
Open

[FEA] Pinned memory pools for parquet decode #14314

abellina opened this issue Oct 23, 2023 · 23 comments
Labels
0 - Blocked Cannot progress due to external reasons feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Copy link
Contributor

We are investigating using pinned memory pool at the cuDF layer and replacing cudaFreeHost calls in pinned_host_vector due to traces we have seen that indicate synchronization or a "lining up" of kernels during parquet decode. Here's query88 from NDS at 3TB on our performance cluster running with an A100. In the nsys trace (pardon the amount of streams), we can see parquet nvcomp and decode kernels working on the first three quarters of the trace:

Screenshot from 2023-10-23 13-08-15

The bottom trace is cuDF without changes. The top trace is a modified cuDF where we replaced calls to cudaMallocHost and cudaFreeHost with allocate and deallocate against a modified RMM pool_memory_resource that isn't stream aware and has a single free list.

When we run with the modified cuDF, our NDS benchmark shows a 5% improvement at 3TB and a 6% improvement between old cuDF and new cuDF if we allow all 16 spark threads to submit work concurrently. In other words, we believe the cudaFreeHost calls specifically are preventing parquet heavy jobs from using more of the GPU due to synchronization.

The proposal here is to allow a pinned memory pool to be passed to parquet primarily, but there are probably other formats and areas in cuDF that might benefit from this.

Note that another experiment we wanted to attempt was to remove pinned memory alltogether, which cuDF already has a flag for LIBCUDF_IO_PREFER_PAGEABLE_TMP_MEMORY=1, but we ran into issues for parquet only (#14311). Before I found this flag, I had tried replacing cudaMallocHost and cudaFreeHost with malloc and free and I ran into the same issue, so I think the parquet code is dependent on some sort of synchronization in the CUDA host pinned memory allocator.

@GregoryKimball
Copy link
Contributor

The new design for cuda::mr::memory_resource, which expands support in RMM to include pinned host memory pools, is tracked in this branch rapidsai/rmm#1095.

@harrism
Copy link
Member

harrism commented Oct 25, 2023

Note that cuda::mr::memory_resource doesn't directly expand support to include pinned host memory pools, it just enables us to more easily reuse the implementation of stream-ordered device memory pools for other kinds of device-accessible memory.

@harrism
Copy link
Member

harrism commented Oct 25, 2023

@abellina It's really hard to tell in the image how much time is saved by this change. Can you provide comparable benchmark results?

@abellina
Copy link
Contributor Author

Sure, this is for NDS @ 3TB in our performance cluster, the benchmark was executed 3 times for baseline vs test. We see around 6% improvement. I need to look at query95 more because it found a regression there, but overall this was a win:

--------------------------------------------------------------------
Name = query2
Means = 3088.3333333333335, 2494.6666666666665
Time diff = 593.666666666667
Speedup = 1.2379743452699092
T-Test (test statistic, p value, df) = 7.8686774190832445, 0.0014098464938042464, 4.0
T-Test Confidence Interval = 384.1927204113131, 803.1406129220209
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query9
Means = 11011.666666666666, 8408.0
Time diff = 2603.666666666666
Speedup = 1.3096653980336188
T-Test (test statistic, p value, df) = 8.329919755519573, 0.0011349387959612939, 4.0
T-Test Confidence Interval = 1735.8386701912746, 3471.4946631420576
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query23_part2
Means = 23309.0, 22541.666666666668
Time diff = 767.3333333333321
Speedup = 1.0340406654343808
T-Test (test statistic, p value, df) = 3.5857045809141983, 0.023049979254582895, 4.0
T-Test Confidence Interval = 173.17984709011603, 1361.4868195765482
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query28
Means = 8749.666666666666, 7877.0
Time diff = 872.6666666666661
Speedup = 1.1107866785155092
T-Test (test statistic, p value, df) = 3.675737490963056, 0.021283509364083526, 4.0
T-Test Confidence Interval = 213.50341001605773, 1531.8299233172743
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query31
Means = 3846.3333333333335, 3217.6666666666665
Time diff = 628.666666666667
Speedup = 1.1953796747125247
T-Test (test statistic, p value, df) = 3.476522797723856, 0.02543224663133342, 4.0
T-Test Confidence Interval = 126.59646864855881, 1130.7368646847751
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query74
Means = 5813.333333333333, 5295.0
Time diff = 518.333333333333
Speedup = 1.0978910922253697
T-Test (test statistic, p value, df) = 4.145944413928419, 0.014307288000811205, 4.0
T-Test Confidence Interval = 171.21723564533386, 865.4494310213322
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query75
Means = 8002.0, 7139.0
Time diff = 863.0
Speedup = 1.120885278050147
T-Test (test statistic, p value, df) = 6.9462130566740035, 0.0022563913623439248, 4.0
T-Test Confidence Interval = 518.0534649259677, 1207.9465350740325
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query83
Means = 11889.0, 10728.333333333334
Time diff = 1160.666666666666
Speedup = 1.1081870436538759
T-Test (test statistic, p value, df) = 7.983215413800722, 0.0013345137828189723, 4.0
T-Test Confidence Interval = 757.0038418026304, 1564.3294915307017
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query88
Means = 8288.333333333334, 6815.666666666667
Time diff = 1472.666666666667
Speedup = 1.2160708172348023
T-Test (test statistic, p value, df) = 14.785515131088996, 0.00012180808800452909, 4.0
T-Test Confidence Interval = 1196.1272209995159, 1749.206112333818
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 8023.0, 9527.333333333334
Time diff = -1504.333333333334
Speedup = 0.8421034217339584
T-Test (test statistic, p value, df) = -5.098045536832821, 0.006992118575497059, 4.0
T-Test Confidence Interval = -2323.6078748695036, -685.0587917971645
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 418333.3333333333, 394333.3333333333
Time diff = 24000.0
Speedup = 1.0608622147083686
T-Test (test statistic, p value, df) = 5.622255427989818, 0.004920931820960854, 4.0
T-Test Confidence Interval = 12148.051368670785, 35851.948631329215
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed

@harrism
Copy link
Member

harrism commented Oct 26, 2023

So what are the other ones with speedups of 21%, 19%, etc.?

@abellina
Copy link
Contributor Author

abellina commented Oct 26, 2023

So what are the other ones with speedups of 21%, 19%, etc.?

Sorry I should describe the benchmark better. This is NDS https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds where we are running it in "power run" mode. In this case we are running queries in series one after the other in a cluster running spark rapids and 8xA100 GPUs. The results above show when one of the queries has significant regressions or speedups, and it also has a "benchmark" section at the end for the overall (sum of all query times compared between baseline and test). I ran these sets of tests three times, so the comparison tool is looking at the means and the variance and figuring out what is noise and what isn't.

From the queries that have significant speedup, query9 and query88 are two queries we know are parquet/scan bound. Looking at traces for these, you'll mostly fine unsnap and parquet decode kernels. We see a lot of benefit here.

Is this helping with your question @harrism, or am I missing it?

@harrism
Copy link
Member

harrism commented Oct 26, 2023

Yes. So the 6% is overall average benefit. Good to know.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. 0 - Blocked Cannot progress due to external reasons Performance Performance related issue Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@GregoryKimball GregoryKimball moved this to To be revisited in libcudf Nov 9, 2023
@harrism
Copy link
Member

harrism commented Jan 25, 2024

As of 24.02 you can create a pool_memory_resource<pinned_host_memory_resource>. I suggest that libcudf / cuIO adds a cudf::get/set_current_pinned_host_memory_resource to use for this. This could also be added to RMM, but

a) The static resource RMM currently stores for get/set_current_device_memory_resource is one of the reasons RMM is not Windows compatible, so I'm averse to adding a second instance of that. Whereas libcudf has a binary library component so it can maintain that state without affecting platform compatibility.
b) I think it's good to try out this feature in libcudf and then move it lower in the stack later if it would be useful to other clients.

There is some design work around how and where to put this, and how to wire up configuration knobs for the initial / maximum pool size. Also how to expose that to Python and Spark.

There are other places where this pool could be useful but let's start here.

@GregoryKimball
Copy link
Contributor

Thank you @harrism for kicking off the design discussion.
a) would you please clarify the Windows compatibility issue? Is it the presence of a static object with certain properties in a header, or some other issue?
b) If libcudf added cudf::get/set_current_pinned_host_memory_resource, would you please share a bit more about the components in RMM this would be modeled after? I'm not familiar enough with RMM to understand the scope of this suggestion.

Thank you for your help

@harrism
Copy link
Member

harrism commented Jan 31, 2024

This issue describes a): rapidsai/rmm#826

Basically the function local static works on Linux but not on windows because each binary gets a it's own instance and so they won't be the same across DLLs.

Although I suppose adding a duplicate of the incompatibility reason doesn't make it more incompatible, it's just more of the same.

I still think it should be tested in libcudf first.

For b) the answer is to model it after get/set_current_device_resource(), which is the function with the function local static.

But mostly I was suggesting that libcudf should implement it the way libcudf wants it to be for libcudf.

Thank you too!

@bdice
Copy link
Contributor

bdice commented Jan 31, 2024

I just discussed this idea a bit more with @abellina, @mattahrens, @GregoryKimball, and @vuule.

We're leaning towards adding a parameter for host_mr to the Parquet reading APIs. We're leaning away from using a global/static host allocator because we want this behavior to be friendly/simple if dealing with multiple threads like Spark does. Also callers may want to use different host allocators.

The host_mr would be optional, and would use a non-pooled allocator by default. This avoids some of the concerns about determining a default host pool size, and how that would be handled by Python cudf and other consumers of the library. The host_mr should allow any of: a pinned host allocator (which is effectively the current behavior), a pooling pinned host allocator, or possibly even an adapter that implements an "opportunistically pinned pool" (Spark's desired use case; this would attempt allocating pinned memory, and fall back to allocating paged memory rather than failing to allocate).

@abellina is going to investigate a bit further and pursue an implementation for review.

@harrism
Copy link
Member

harrism commented Feb 1, 2024

Sounds good. Because rmm::mr::pinned_host_memory_resource is not a device_memory_resource, you will need to use the new rmm::device_async_resource_ref for the host_mr parameter in the Parquet reading APIs. We will be transitioning to that in all of RAPIDS, and in the interim, this will make the parameter compatible with both legacy device_memory_resource/host_memory_resource MRs AND the newer MRs like pinned_host_memory_resource that just implement the cuda::async_memory_resource policy.

For an example, see #14873

@harrism
Copy link
Member

harrism commented Feb 1, 2024

One thing to be aware of for deciding your initial pool size is that if you make it too small, you will get unnecessary fragmentation. The reason is that the RMM pool MR grows by just allocating a new chunk. This isn't TOO bad because it uses a geometric growth strategy, but allocations that don't fit will cause new allocations that can't be merged with the previous pool chunks, hence fragmentation.

See https://github.com/harrism/rmm/blob/6bb0ef2973821c2a1b8f952298046e7b406dc9d2/include/rmm/mr/device/pool_memory_resource.hpp#L352

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Feb 1, 2024

Thank you for the meeting today and thank you @bdice for sharing this summary.

  • Adding a host_mr parameter to the Parquet reading APIs would be consistent with the way we use the mr parameter today. For this approach I would love to sketch out the design changes to hostdevice_vector and pinned_host_vector that would be required to make use of an input host_mr.
  • Using a static memory resource in pinned_host_vector as in the prototype by @abellina seems to have the advantage of not adding additional plumbing to the IO modules that use the hostdevice_vector utility.

I'm curious also to hear from @vuule. Plus @nvdbaranec and @etseidl if you would like to weigh in.

@vuule
Copy link
Contributor

vuule commented Feb 1, 2024

The introduction of the host_mr option would require substantial code changes to pass it to all places where the hostdevice_vectors are created. And this would still be limited the PQ reader, until we do the same for other components.
I don't think we know at this point if this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT), with a global default, and recently included stream parameters in most APIs.

I have a separate concern about having a host_mr parameter that is not an equivalent to the existing mr param. The pinned mem allocator does not change the output for the user, it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.

@nvdbaranec
Copy link
Contributor

I am more of a fan of having a global function that sets an rmm allocator to be used for pinned allocations. For now we could limit it to cuIO only (which would really mean just hostdevice_vector). Something similar to the way we initialize rmm (rmm::mr::set_current_device_resource(resource.get());)

Maybe cudf::io::set_current_pinned_memory_resource

@bdice
Copy link
Contributor

bdice commented Feb 2, 2024

The pinned mem allocator does not change the output for the user, it's purely an internal optimization

This is a good observation. My primary concern was about ensuring thread safety for the global/static approach (what happens if that allocator is changed while in use / before its previous allocations are freed?). If that's not a problem, then I'm okay with using a global/static pinned (pool) host memory resource. I thought the host_mr argument might give us more flexibility for our potential long term needs, but I don't want to overengineer for that case given how little we rely on pinned memory today.

@vuule
Copy link
Contributor

vuule commented Feb 2, 2024

Just measured the pinned memory peak use in the Parquet reader. In the benchmarks, which create tables with 512MB of data, the largest peak pinned memory use I saw was about 3.8MB, with integer columns. Many cases use around 1MB. So we should be able to make great use of a <1GB pool even with many threads.

@harrism
Copy link
Member

harrism commented Feb 2, 2024

But isn't 512MB much smaller than real world use cases?

@harrism
Copy link
Member

harrism commented Feb 2, 2024

BTW after @vuule and @nvdbaranec pointed it out I also think starting with a global config for an allocator that should be used internally in cuIO is a much better place to start than plumbing it in everywhere for flexibility.

@abellina
Copy link
Contributor Author

abellina commented Feb 2, 2024

The introduction of the host_mr option would require substantial code changes to pass it to all places where the hostdevice_vectors are created. And this would still be limited the PQ reader, until we do the same for other components. I don't think we know at this point if this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT), with a global default, and recently included stream parameters in most APIs.

I have a separate concern about having a host_mr parameter that is not an equivalent to the existing mr param. The pinned mem allocator does not change the output for the user, it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.

@vuule could spark-rapids still provide its own memory pool for this? We wouldn't want a default pinned allocator to initialize, then uninitialized because it is going to be replaced by our own allocator.

@vuule
Copy link
Contributor

vuule commented Feb 2, 2024

But isn't 512MB much smaller than real world use cases?

Sure, but the measurements show that pinned memory requirement are less than 1% of the device memory required to read a PQ file. Even when fully using a 50GB GPU we won't fill a 1GB pinned pool.

@harrism
Copy link
Member

harrism commented Feb 6, 2024

Oh, I guess I don't know how parquet reading works. :)

rapids-bot bot pushed a commit that referenced this issue Mar 7, 2024
…mo of pooled-pinned allocation. (#15079)

This PR adds a new interface to cuIO which controls where host memory allocations come from. It adds two core functions:

Addresses #14314

```
rmm::host_async_resource_ref set_host_memory_resource(rmm::host_async_resource_ref mr);
rmm::host_async_resource_ref get_host_memory_resource();
```

`cudf::io::hostdevice_vector` was currently implemented in terms of a `thrust::host_vector<>` that explicitly uses an allocator called `pinned_host_vector`.  I copied that and made a new class called `rmm_host_vector` which takes any host_resource_ref.  This probably makes `pinned_host_vector` obsolete.  

Parquet benchmarks have a new commandline option which lets you toggle between 3 modes:

```
--cuio_host_mem pinned              (the default, an unpooled, pinned memory source)
--cuio_host_mem pinned_pool         (the pooled/pinned resource)
```

The ultimate intent here is to reduce the cpu-side overhead of the setup code that comes before the decode kernels in the parquet reader.  The wins are pretty significant for our faster kernels (that is, where we are less dominated by gpu time)

Edit: Updated to use newly minted resource ref types from rmm itself.  I also switched the type to be `host_async_resource_ref` even though in this case the user (`thrust::host_vector`) doesn't explicitly go through the async path.  In addition, the pageable memory path (an experimental feature) has been removed.

Pinned
```
| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |     25x | 20.443 ms | 0.45% | 20.438 ms | 0.45% |      26268890178 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     26x | 19.571 ms | 0.42% | 19.565 ms | 0.42% |      27440146729 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |     28x | 18.150 ms | 0.18% | 18.145 ms | 0.18% |      29587789525 |       602.424 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     29x | 17.306 ms | 0.37% | 17.300 ms | 0.37% |      31032523423 |       597.181 MiB |        14.403 MiB |
```


Pooled/pinned
```
| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |    117x | 17.258 ms | 0.50% | 17.254 ms | 0.50% |      31115706389 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     31x | 16.413 ms | 0.43% | 16.408 ms | 0.43% |      32719609450 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |    576x | 14.885 ms | 0.58% | 14.881 ms | 0.58% |      36077859564 |       602.519 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     36x | 14.069 ms | 0.48% | 14.065 ms | 0.48% |      38171646940 |       597.243 MiB |        14.403 MiB |
```

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15079
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Blocked Cannot progress due to external reasons feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
Status: In Progress
Status: To be revisited
Development

No branches or pull requests

6 participants