Evaluate further serialization performance improvements #106
Comments
A move from 1.5-2 GB/s to 3 GB/s is still pretty nice. Are there ways that we can deliver this to users soon while we talk about memory pools and such? How should a novice user set things appropriately? Can we set it for them when we create a new worker process?
Setting …
Alright, I think I've found the reason for that; please take a look at numpy/numpy#14177 (comment).
I did some more benchmarking with NumPy 1.16 and 1.17 (I made sure that hugepage support was compiled in for the latter). I also added a simple idea of how we could set up a NumPy pool to copy data from the device to the host; of course we would need a somewhat more complex pool manager, but it's worth considering. https://gist.github.com/pentschev/7b348f8409ba35341a3f646093e61336
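The gist above has the real code; below is only a minimal sketch of the pool idea, assuming the device object exposes .nbytes and a .copy_to_host(host_array) method the way rmm.DeviceBuffer is used later in this thread.

import collections
import numpy

class HostBufferPool:
    # Very small host-side pool: reuse NumPy byte buffers of a given size
    # so the slow host allocation is only paid once per size.
    def __init__(self):
        self._free = collections.defaultdict(list)  # nbytes -> [free arrays]

    def get(self, nbytes):
        free = self._free[nbytes]
        return free.pop() if free else numpy.empty((nbytes,), dtype="u1")

    def release(self, buf):
        self._free[buf.nbytes].append(buf)

def device_to_host(dev_buf, pool):
    # dev_buf is assumed to expose .nbytes and .copy_to_host(host_array),
    # e.g. an rmm.DeviceBuffer as used elsewhere in this thread.
    host = pool.get(dev_buf.nbytes)
    dev_buf.copy_to_host(host)
    return host

A real pool manager would also need to bound its size and handle arbitrary shapes/dtypes, which is exactly the complication mentioned above.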
It would be good to reassess how long this is taking. While there haven't been a lot of low-level changes, there are some notable ones, like ensuring hugepages are used (numpy/numpy#14216). There have been a lot of high-level changes in Dask, Dask-CUDA, and RAPIDS since this issue was raised. For example, CUDA objects have become …
As CuPy has a pinned memory pool, I tried comparing it to typical Python allocations (which appear to be using hugepages as well) and NumPy hugepage allocations. Unfortunately I didn't find a significant difference among them; they all transferred at ~8.33 GB/s. I would be curious to know if I'm just doing something wrong here.
In [1]: import numpy
...: import cupy
...: import rmm
In [2]: rmm.reinitialize(pool_allocator=True,
...: initial_pool_size=int(30 * 2**30))
...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
In [3]: a = numpy.asarray(memoryview(2**29 * b"ab"))
...:
...: a_hugepages = numpy.copy(a)
...:
...: a_pinned = numpy.asarray(
...: cupy.cuda.alloc_pinned_memory(a.nbytes),
...: dtype=a.dtype
...: )[:a.nbytes]
...: a_pinned[...] = a
...:
...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)
In [4]: %timeit a_cuda.copy_from_host(a)
117 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit a_cuda.copy_from_host(a_hugepages)
120 ms ± 72.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit a_cuda.copy_from_host(a_pinned)
120 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cc @leofang (who may also have thoughts here 😉)
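As a quick sanity check on that ~8.33 GB/s figure: the buffer above is 2**30 bytes (1 GiB) and each copy takes roughly 120 ms.

nbytes = 2**30      # memoryview(2**29 * b"ab") above is 1 GiB
seconds = 120e-3    # ~120 ms per copy, from the %timeit results
print(nbytes / seconds / 2**30)  # ≈ 8.33, i.e. the quoted figure in GiB/s
print(nbytes / seconds / 1e9)    # ≈ 8.95 in decimal GB/s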
@jakirkham What's your expectation for this test? Maybe you were hoping to see the first two copies outperform the one from pinned memory? I don't have rmm set up in my env, so I instead used CuPy's default pool, ran this test, and I also got the same timing for the three different copies.
I would have expected the copy using pinned memory to be faster, though it seems comparable to using hugepages, which just use larger page sizes. Yeah, my guess is this isn't RMM-specific.
Could it be that pinned memory was actually implemented using hugepages too? Perhaps this is proprietary information...? 😁
I don't think so. My understanding is that pinned memory doesn't have pages in the traditional sense (it cannot be paged out), whereas hugepages are still pages (and can be paged out); they are just larger.
Based on a suggestion from @kkraus14 offline, I tried building and running the … It didn't quite build out of the box. This appears to be due to some changes that require CUDA 11.0, which I don't have on this machine, and to the use of a C compiler where a C++ compiler should be used. I was able to fix these with this small patch. When I ran the …
The DGX-1 uses PCIe Gen3 x16, which according to Wikipedia has a theoretical maximum throughput of …
To get an apples-to-apples comparison, I modified the Python code to work with the same amount of memory (as shown below).
In [1]: import numpy
...: import cupy
...: import rmm
In [2]: rmm.reinitialize(pool_allocator=True,
...: initial_pool_size=int(30 * 2**30))
...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
In [3]: a = numpy.asarray(memoryview(32000000 * bytearray(b"a")))
...:
...: a_hugepages = numpy.copy(a)
...:
...: a_pinned = numpy.asarray(
...: cupy.cuda.alloc_pinned_memory(a.nbytes),
...: dtype=a.dtype
...: )[:a.nbytes]
...: a_pinned[...] = a
...:
...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)
In [4]: %timeit a_cuda.copy_from_host(a)
3.07 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit a_cuda.copy_from_host(a_hugepages)
3.06 ms ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit a_cuda.copy_from_host(a_pinned)
3.05 ms ± 377 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit a_cuda.copy_to_host(a)
2.71 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit a_cuda.copy_to_host(a_hugepages)
2.69 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: %timeit a_cuda.copy_to_host(a_pinned)
2.69 ms ± 892 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Crunching the numbers, we are seeing …
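Crunching those numbers from the timings above (32,000,000-byte buffer, ~3.06 ms host-to-device, ~2.69 ms device-to-host):

nbytes = 32_000_000
htod_s = 3.06e-3   # %timeit result for copy_from_host above
dtoh_s = 2.69e-3   # %timeit result for copy_to_host above
print(nbytes / htod_s / 1e9)  # ≈ 10.5 GB/s host -> device
print(nbytes / dtoh_s / 1e9)  # ≈ 11.9 GB/s device -> host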
The other thing of interest here is allocation time. Allocation on the host is pretty slow, while device allocations are notably faster. It is also worth noting that allocating pinned memory is an order of magnitude slower than allocating memory through Python builtin objects (like …).
In [1]: import numpy
...: import cupy
...: import rmm
In [2]: rmm.reinitialize(pool_allocator=True,
...: initial_pool_size=int(30 * 2**30))
...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
In [3]: %timeit bytearray(32000000)
1.42 ms ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit numpy.ones((32000000,), dtype="u1")
1.43 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %%timeit
...: nbytes = 32000000
...: a_pinned = numpy.asarray(
...: cupy.cuda.alloc_pinned_memory(nbytes),
...: dtype="u1"
...: )[:nbytes]
...:
...:
11.6 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit cupy.ones((32000000,), dtype="u1")
355 µs ± 63.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit: I would add that writing naive code in Cython to allocate the memory from … takes about the same amount of time:
from libc.stdint cimport uint8_t
from libc.stdlib cimport malloc, free
cpdef ones(size_t nbytes):
cdef uint8_t* data = <uint8_t*>malloc(nbytes)
cdef size_t i
cdef uint8_t v = 1
for i in range(nbytes):
data[i] = v
free(data)
In [3]: %timeit ones(32000000)
1.52 ms ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
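One way to amortize that pinned-allocation cost (just a sketch, not something benchmarked here) is to route CuPy's pinned allocations through its pinned-memory pool, so repeated requests of the same size can reuse earlier allocations instead of paying the expensive host allocation each time.

import numpy
import cupy

# Send CuPy's pinned-memory requests through a pool.
pinned_pool = cupy.cuda.PinnedMemoryPool()
cupy.cuda.set_pinned_memory_allocator(pinned_pool.malloc)

def pinned_empty(nbytes):
    # Same wrapping trick as above: view the pinned allocation as a NumPy array.
    mem = cupy.cuda.alloc_pinned_memory(nbytes)
    return numpy.ndarray((nbytes,), dtype="u1", buffer=mem)

buf = pinned_empty(32_000_000)  # first allocation of this size: pays the full cost
del buf                         # released back to the pool, not to the OS
buf = pinned_empty(32_000_000)  # should now be served from the pool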
I'm having some difficulty reproducing the pinned-memory results for proving out near-peak host->device transfers. This is on an Azure V100 with RAPIDS 0.14, running inside Docker. Our ultimate goal is setting something up for a streaming workload where we feed these buffers at network line rate, so ideally 1-10 GB/s, and we're happy to reuse the buffers, etc. As is, we're only getting ~2 GB/s on a toy version, and these cards should do ~16 GB/s over PCIe in each direction, AFAICT. Thoughts?
Edit: 96 ms for 800 MB => ~8 GB/s :) Though there are still mysteries, as I'm not seeing numbers like the ones above. The steps were (see also the sketch after this comment):
Setup CPU data
Setup CPU pinned buffers
Benchmark CPU -> GPU
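A minimal sketch of that kind of benchmark, assuming CuPy for both the pinned host buffer and the device array; this is only an illustration of the three steps named above, not the original notebook code.

import time
import numpy
import cupy

nbytes = 800_000_000  # ~800 MB, matching the figure quoted above

# Setup CPU data
src = numpy.ones((nbytes,), dtype="u1")

# Setup CPU pinned buffers (allocated once and reused)
pinned = numpy.ndarray((nbytes,), dtype="u1",
                       buffer=cupy.cuda.alloc_pinned_memory(nbytes))
pinned[...] = src

# Benchmark CPU -> GPU
gpu = cupy.empty((nbytes,), dtype="u1")
n_iter = 10
start = time.perf_counter()
for _ in range(n_iter):
    gpu.set(pinned)                    # host -> device copy from the pinned buffer
cupy.cuda.Stream.null.synchronize()
elapsed = (time.perf_counter() - start) / n_iter
print(f"{nbytes / elapsed / 1e9:.2f} GB/s host -> device")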
Would you be willing to raise that as a new issue, @lmeyerov? 🙂
... rmm, cudf?
Maybe RMM? We can always transfer the issue if we decide it belongs somewhere else 🙂
Going to go ahead and close this, as it seems we are getting as much out of this as we can currently.
I should add that we made a number of other performance improvements related to spilling upstream, listed below: dask/distributed#3960
@jakirkham Out of curiosity I changed my benchmark script slightly (to not use …):
import numpy, cupy, ctypes
from cupyx.time import repeat
a = numpy.asarray(memoryview(4 * 2**29 * b"ab"))
a_hugepages = numpy.copy(a)
a_pinned = numpy.ndarray(a.shape,
buffer=cupy.cuda.alloc_pinned_memory(a.nbytes),
dtype=a.dtype)
a_pinned[...] = a
a_cuda = cupy.cuda.memory.alloc(a.nbytes)
assert a.nbytes == a_hugepages.nbytes == a_pinned.nbytes
print(a.nbytes)
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_hugepages.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_pinned.ctypes.data), a.nbytes), n_repeat=100))
Output: …
But I am not always able to get this performance across runs; occasionally they're on par. (Could be that I'm not the only user on the system, but this alone would not explain the variation...)
-- Can there be some sort of random byte/page-boundary misalignment and/or compressed/null page handling? And anything Docker-specific?
-- I missed the first comment on enabling …
-- I'm still fuzzy on how far off from peak we are for our particular config (Azure V100s), if there's a good way to check, especially given the asymmetric nature.
Just a note on further optimization that might be worth looking at here: there's an old NumPy PR (numpy/numpy#8783) which adds the ability to cache large allocations. Not sure how much it helps, but figured it is worth being aware of.
Device to host serialization currently runs at around 1.5-2.0 GB/s. The major bottleneck is host memory allocation.
All serialization is now done by copying a Numba device array back to the host as a NumPy array. Recently, hugepage support was introduced to NumPy, and we should see benefits automatically if /sys/kernel/mm/transparent_hugepage/enabled is set to madvise or always, but I was only able to see benefits when it's set to the latter. Even with that, host memory is being allocated at about 5 GB/s on a DGX-1 and copied at about 10 GB/s, which works out to about 3 GB/s for copying data back to the host (since both operations happen in sequence). Some details were discussed in #98, starting at #98 (comment).
One alternative is to hold a host memory pool that we can transfer data into. That would require some custom memory copying function, since Numba requires the destination NumPy array to have the same format (shape, dtype, etc.) as the device array, making it impossible to keep a pool of arrays of arbitrary formats.
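For reference, a small helper sketch to check which transparent-hugepage mode is active on a machine; the bracketed entry in that file is the current mode.

def transparent_hugepage_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    # The file looks like "always [madvise] never"; the bracketed entry is active.
    with open(path) as f:
        modes = f.read().split()
    for mode in modes:
        if mode.startswith("["):
            return mode.strip("[]")
    return None

print(transparent_hugepage_mode())  # e.g. "madvise" or "always"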
cc @mrocklin @madsbk