
Evaluate further serialization performance improvements #106

Closed
pentschev opened this issue Aug 2, 2019 · 22 comments
@pentschev
Member

Device to host serialization currently runs at around 1.5-2.0 GB/s. The major bottleneck is host memory allocation.

All serialization is now done by copying a Numba device array back to host as a NumPy array. Recently, hugepage support was introduced to NumPy and we should see benefits automatically if /sys/kernel/mm/transparent_hugepage/enabled is set to madvise or always, but I was only able to see benefits when it's set to the latter. Even with that, host memory is being allocated at about 5 GB/s on a DGX-1 and the copy itself runs at about 10 GB/s, which works out to an effective rate of roughly 3 GB/s for copying data back to host (since both operations happen in sequence). Some details were discussed in #98, starting at #98 (comment).
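For reference, a quick back-of-the-envelope check of why two sequential stages at 5 GB/s and 10 GB/s land near 3 GB/s overall:

# Effective rate of allocate-then-copy in sequence: the time per byte adds up,
# so the combined rate is the harmonic combination of the two stages.
alloc_rate, copy_rate = 5.0, 10.0              # GB/s, as measured above
effective = 1 / (1 / alloc_rate + 1 / copy_rate)
print(f"{effective:.2f} GB/s")                 # ~3.33 GB/s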

One alternative is to hold a host memory pool that we can transfer data into. That would require some custom memory copying logic, since Numba requires the destination NumPy array to have the same format (shape, dtype, etc.) as the device array, which makes it impossible to keep a pool of arrays for arbitrary formats.
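A minimal sketch of what such a pool could look like (hypothetical names; it sidesteps the format restriction by pooling flat byte buffers and re-viewing them with the requested shape/dtype):

import numpy as np

class HostBufferPool:
    # Hypothetical sketch: pool flat 1-D uint8 host buffers and reinterpret
    # them with whatever shape/dtype the device array has, so a single pool
    # can serve arrays of arbitrary formats.
    def __init__(self):
        self._free = {}  # nbytes -> list of flat uint8 arrays

    def get(self, shape, dtype):
        dtype = np.dtype(dtype)
        nbytes = int(np.prod(shape, dtype="int64")) * dtype.itemsize
        buffers = self._free.setdefault(nbytes, [])
        flat = buffers.pop() if buffers else np.empty(nbytes, dtype="u1")
        return flat, flat.view(dtype).reshape(shape)

    def put(self, flat):
        self._free.setdefault(flat.nbytes, []).append(flat)

# Usage with a Numba device array `dev_ary` (assumed C-contiguous):
#     flat, host_view = pool.get(dev_ary.shape, dev_ary.dtype)
#     dev_ary.copy_to_host(host_view)
#     ...
#     pool.put(flat)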

cc @mrocklin @madsbk

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

A move from 1.5-2 GB/s to 3 GB/s is still pretty nice. Are there ways that we can deliver this to users soon while we talk about memory pools and such? How should a novice user set things appropriately? Can we set it for them when we create a new worker process?

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

cc @randerzander

@pentschev
Member Author

> A move from 1.5-2 GB/s to 3 GB/s is still pretty nice. Are there ways that we can deliver this to users soon while we talk about memory pools and such? How should a novice user set things appropriately? Can we set it for them when we create a new worker process?

Setting /sys/kernel/mm/transparent_hugepage/enabled to always requires root access and may have unintended consequences (e.g., much higher resident memory usage by the OS and other processes), so there's no way for users or Dask to set that easily. I will investigate further why madvise doesn't seem to be working as expected.
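For what it's worth, a small snippet (assuming the standard sysfs path) to report which transparent hugepage policy is actually in effect; reading it requires no root:

from pathlib import Path

def thp_policy(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    # The active policy is the bracketed entry, e.g. "always [madvise] never".
    text = Path(path).read_text()
    return text[text.index("[") + 1 : text.index("]")]

print(thp_policy())  # "always", "madvise", or "never"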

@pentschev
Member Author

Alright, I think I've found the reason for that; please take a look at numpy/numpy#14177 (comment).

@pentschev
Member Author

I did some more benchmarking with NumPy 1.16 and 1.17 (I made sure that hugepage support was compiled in for the latter). I also added a simple idea of how we could set up a NumPy pool to copy data from the device to host; of course we would need a somewhat more complex pool manager, but it's worth considering.

https://gist.github.com/pentschev/7b348f8409ba35341a3f646093e61336
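Not the gist itself, but a minimal sketch of the underlying idea (sizes and names are arbitrary assumptions): time a device-to-host copy into a freshly allocated array versus a reused, preallocated host buffer.

import time
import numpy as np
from numba import cuda

d_ary = cuda.to_device(np.ones(2**28, dtype="u1"))   # 256 MiB device array
reused = np.empty(d_ary.shape, dtype=d_ary.dtype)    # preallocated host buffer
d_ary.copy_to_host(reused)                           # warm-up

for label, dest in [("fresh allocation", None), ("reused buffer", reused)]:
    start = time.perf_counter()
    if dest is None:
        d_ary.copy_to_host()        # allocates a new NumPy array every call
    else:
        d_ary.copy_to_host(dest)    # reuses the preallocated host buffer
    elapsed = time.perf_counter() - start
    print(f"{label}: {reused.nbytes / elapsed / 2**30:.2f} GiB/s")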

@jakirkham
Member

It would be good to reassess how long this is taking. While there haven't been a lot of low-level changes, there are some notable ones, like ensuring hugepages are used ( numpy/numpy#14216 ). There have also been a lot of high-level changes in Dask, Dask-CUDA, and RAPIDS since this issue was raised. For example, CUDA objects have become "dask" serializable ( dask/distributed#3482 ) ( rapidsai/cudf#4153 ), which Dask-CUDA leverages for spilling ( #256 ). Distributed learned how to serialize collections of CUDA objects regardless of size ( dask/distributed#3689 ), which has simplified things in Dask-CUDA further ( #307 ). A bug fix to Distributed's spilling logic ( dask/distributed#3639 ) and better memoryview serialization ( dask/distributed#3743 ) have allowed us to perform fewer serialization passes ( #309 ). We've also generalized, streamlined, and improved the robustness of serialization in RAPIDS through multiple PRs, most recently rapidsai/cudf#5139.

I believe there are probably more high-level improvements we can make here. We can also still make low-level improvements, like using pinned memory ( rapidsai/rmm#260 ) and/or combining different RMM memory resources (like UVM generally and device memory for UCX communication). Additionally, things like packing/unpacking ( rapidsai/cudf#5025 ) would allow us to transfer a single buffer (instead of multiple) between host and device.
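As a rough illustration of the RMM side of that (the exact set of options has shifted between RMM releases, so treat these flags as an assumption rather than a recommendation):

import rmm

# Back allocations with managed (UVM) memory and suballocate from a pool,
# one of the low-level combinations mentioned above.
rmm.reinitialize(
    managed_memory=True,
    pool_allocator=True,
    initial_pool_size=2**30,  # 1 GiB starting pool (arbitrary)
)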

@jakirkham
Member

jakirkham commented Jul 21, 2020

As CuPy has a pinned memory pool, I tried comparing it to typical Python allocations (which appear to be using hugepages as well) and NumPy hugepages allocations. Unfortunately I didn't find a significant difference amongst them. They all transferred at ~8.33 GB/s. I'd be curious to know if I'm just doing something wrong here.

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: a = numpy.asarray(memoryview(2**29 * b"ab")) 
   ...:  
   ...: a_hugepages = numpy.copy(a) 
   ...:  
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(a.nbytes), 
   ...:     dtype=a.dtype 
   ...: )[:a.nbytes] 
   ...: a_pinned[...] = a 
   ...:  
   ...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)                                

In [4]: %timeit a_cuda.copy_from_host(a)                                        
117 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit a_cuda.copy_from_host(a_hugepages)                              
120 ms ± 72.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit a_cuda.copy_from_host(a_pinned)                                 
120 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
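For reference, that throughput figure follows directly from the timings above (GB here meaning GiB, i.e. 2**30 bytes):

# a.nbytes is 2**30 bytes (1 GiB) and each copy takes ~120 ms.
print(a.nbytes / 0.120 / 2**30)  # ~8.33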

cc @leofang (who may also have thoughts here 😉)

@leofang
Member

leofang commented Jul 27, 2020

> Unfortunately I didn't find a significant difference amongst them.

@jakirkham What's your expectation for this test? Maybe you were hoping to see the first two copies outperform the one from pinned memory?

I don't have rmm set up in my env, so I instead used CuPy's default pool, ran this test, and I also got the same timing for the three different copies.

@jakirkham
Member

I would have expected copying to pinned memory to be faster. Though it seems comparable to using hugepages, which just uses larger page sizes.

Yeah my guess is this isn't RMM specific. copy_from_host is a pretty thin wrapper around cudaMemcpyAsync.
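Roughly, a sketch of what that boils down to, reusing a and a_cuda from the benchmark above (CuPy's CUDA runtime bindings are used here purely for illustration):

import cupy

# Essentially a single cudaMemcpy(Async) from the host pointer into the
# device pointer; DeviceBuffer.copy_from_host does the same via the CUDA API.
cupy.cuda.runtime.memcpy(
    a_cuda.ptr,        # destination device pointer
    a.ctypes.data,     # source host pointer
    a.nbytes,          # number of bytes to copy
    cupy.cuda.runtime.memcpyHostToDevice,
)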

@leofang
Member

leofang commented Jul 27, 2020

Could it be that pinned memory was actually implemented using hugepages too? Perhaps this is proprietary information...? 😁

@jakirkham
Member

I don't think so. My understanding of pinned memory is that it doesn't have pages in the traditional sense (as it cannot be paged out), whereas hugepages still have pages (and can be paged out); they are just larger.

@jakirkham
Member

jakirkham commented Jul 27, 2020

Based on a suggestion from @kkraus14 offline, I tried building and running the bandwidthTest from CUDA Samples on a DGX-1.

It didn't quite build out-of-the-box. This appears to be due to some changes that require CUDA 11.0, which I don't have on this machine, and the usage of a C compiler where a C++ compiler should be used. I was able to fix these with a small patch.

When I ran the bandwidthTest binary produced, I got the following results:

$ ../../bin/x86_64/linux/release/bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-SXM2-32GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			11.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			12.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			732.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The DGX-1 uses PCIe Gen3 x16, which according to Wikipedia has a theoretical maximum throughput of 15.75 GB/s. What we are measuring here is 11.1 GB/s host-to-device and 12.5 GB/s device-to-host, which seems reasonable based on the prior discussion.
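A quick check of how close that is to the theoretical maximum (simple arithmetic on the numbers above):

theoretical = 15.75  # GB/s, PCIe Gen3 x16
print(f"H2D: {11.1 / theoretical:.0%}, D2H: {12.5 / theoretical:.0%}")
# -> H2D: 70%, D2H: 79%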

To get an apples-to-apples comparison, I modified the Python code to work with the same amount of memory (as shown below).

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: a = numpy.asarray(memoryview(32000000 * bytearray(b"a"))) 
   ...:  
   ...: a_hugepages = numpy.copy(a) 
   ...:  
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(a.nbytes), 
   ...:     dtype=a.dtype 
   ...: )[:a.nbytes] 
   ...: a_pinned[...] = a 
   ...:  
   ...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)                                

In [4]: %timeit a_cuda.copy_from_host(a)                                        
3.07 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit a_cuda.copy_from_host(a_hugepages)                              
3.06 ms ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit a_cuda.copy_from_host(a_pinned)                                 
3.05 ms ± 377 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit a_cuda.copy_to_host(a)                                          
2.71 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit a_cuda.copy_to_host(a_hugepages)                                
2.69 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit a_cuda.copy_to_host(a_pinned)                                   
2.69 ms ± 892 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

Crunching the numbers, we are seeing 9.74 GB/s host-to-device and 11.1 GB/s device-to-host. This is a bit more than 10% slower than bandwidthTest, which is actually pretty good. Note again that it does not seem to matter whether hugepages or pinned memory is used.
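For reference, here is how those figures fall out of the best timings above (treating GB as 2**30 bytes):

nbytes = 32_000_000
print(nbytes / 3.06e-3 / 2**30)  # host-to-device: ~9.74
print(nbytes / 2.69e-3 / 2**30)  # device-to-host: ~11.08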

@jakirkham
Member

jakirkham commented Jul 28, 2020

The other thing of interest here is allocation time. Host allocations are pretty slow, while device allocations are notably faster. It is also worth noting that allocating pinned memory is an order of magnitude slower than allocating memory through Python builtin objects (like bytearray) or NumPy, both of which also trigger page faults as part of these benchmarks. This can be seen in the code below:

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: %timeit bytearray(32000000)                                             
1.42 ms ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit numpy.ones((32000000,), dtype="u1")                             
1.43 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit 
   ...: nbytes = 32000000 
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(nbytes), 
   ...:     dtype="u1" 
   ...: )[:nbytes] 
   ...:  
   ...:                                                                         
11.6 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit cupy.ones((32000000,), dtype="u1")                              
355 µs ± 63.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Edit: I would add that naive Cython code that allocates the memory with malloc and assigns a value to each element takes a similar amount of time to bytearray and NumPy (if anything, those go a bit faster). So in general I don't think there is much more to gain here.

from libc.stdint cimport uint8_t
from libc.stdlib cimport malloc, free

cpdef ones(size_t nbytes):
    # Allocate nbytes of (unpinned) host memory and touch every byte,
    # forcing page faults much like the bytearray/NumPy benchmarks above.
    cdef uint8_t* data = <uint8_t*>malloc(nbytes)

    cdef size_t i
    cdef uint8_t v = 1
    for i in range(nbytes):
        data[i] = v

    free(data)

In [3]: %timeit ones(32000000)                                                  
1.52 ms ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
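One way to sidestep the slow pinned allocations, not benchmarked here but worth noting as a sketch, is to amortize them with CuPy's pinned memory pool so repeated transfers reuse page-locked buffers:

import cupy

# Route pinned allocations through a pool; subsequent
# cupy.cuda.alloc_pinned_memory(nbytes) calls draw from (and return
# buffers to) the pool instead of paying the allocation cost every time.
pinned_pool = cupy.cuda.PinnedMemoryPool()
cupy.cuda.set_pinned_memory_allocator(pinned_pool.malloc)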

@lmeyerov

lmeyerov commented Jul 28, 2020

I'm having some difficulty reproducing the pinned memory results to prove out near-peak host->device transfers. This is on an Azure V100 with RAPIDS 0.14, running through Docker. Our ultimate goal is setting something up for a streaming workload where we feed these buffers at network line rate, so ideally 1-10 GB/s, and we're happy to reuse the buffers, etc. As-is, we're only getting ~2 GB/s on a toy version, and these cards should do ~16 GB/s over PCIe each way AFAICT.

Thoughts?

Edit: 96 ms for 800 MB => ~8 GB/s :) Though it's still a mystery why I'm not seeing numbers like those above.

Setup CPU data


import cudf, cupy as cp, numpy as np, rmm

if True:
    RMM_INIT_SIZE = 2 << 32
    RMM_ALLOCATOR = "managed"
    RMM_POOL = True
    RMM_ENABLE_LOGGING = False
    
    rmm.reinitialize(pool_allocator=True,
                     initial_pool_size=RMM_INIT_SIZE)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator) 
### where df['x'] is 100M sample gdf data
x = df['x'].to_array()
xb = x.view('uint8')

Setup CPU pinned buffers

# NOTE: frombuffer returns a view over xb (no new allocation), not a fresh copy
x_hugepages = np.frombuffer(xb, dtype=xb.dtype)

#WARNING: x_pinned is padded bigger than xb (1073741824 > 800000000)
x_pinned = np.asarray(cp.cuda.alloc_pinned_memory(len(xb)), dtype=xb.dtype)
x_pinned[0:len(xb)] = xb

x_cuda = rmm.DeviceBuffer(size=x_pinned.nbytes) #include padding over xb

Benchmark CPU -> GPU

%%time
x_cuda.copy_from_host(xb)
=> 90.3ms

%%time
x_cuda.copy_from_host(x_hugepages)
=> 90.6ms

%%time
x_cuda.copy_from_host(x_pinned)
=> 122ms

@jakirkham
Member

Would you be willing to raise that as a new issue, @lmeyerov? 🙂

@lmeyerov

... rmm, cudf?

@jakirkham
Member

Maybe RMM? We can always transfer the issue if we decide it belongs somewhere else 🙂

@jakirkham
Member

Going to go ahead and close this as it seems we are getting as much out of this as we can currently.

@jakirkham
Member

I should add that we made a number of other performance improvements related to spilling upstream, which are listed below.

dask/distributed#3960
dask/distributed#3961
dask/distributed#3973
dask/distributed#3980

@leofang
Member

leofang commented Jul 28, 2020

@jakirkham Out of curiosity I changed my benchmark script slightly (to not use %timeit), and enlarged the size to 4GB. I was able to see better performance with pinned memory.

import numpy, cupy, ctypes
from cupyx.time import repeat


a = numpy.asarray(memoryview(4 * 2**29 * b"ab"))
a_hugepages = numpy.copy(a)
a_pinned = numpy.ndarray(a.shape,
    buffer=cupy.cuda.alloc_pinned_memory(a.nbytes),
    dtype=a.dtype)
a_pinned[...] = a
a_cuda = cupy.cuda.memory.alloc(a.nbytes)

assert a.nbytes == a_hugepages.nbytes == a_pinned.nbytes
print(a.nbytes)
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_hugepages.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_pinned.ctypes.data), a.nbytes), n_repeat=100))

Output:

$ CUDA_VISIBLE_DEVICES=1 python test_hugepage.py 
4294967296
/home/leofang/cupy/cupyx/time.py:56: FutureWarning: cupyx.time.repeat is experimental. The interface can change in the future.
  util.experimental('cupyx.time.repeat')
copy_from_host      :    CPU:463085.093 us   +/-174.826 (min:462799.658 / max:463892.217) us     GPU-0:463165.147 us   +/-174.704 (min:462879.517 / max:463972.321) us
copy_from_host      :    CPU:464054.178 us   +/-158.124 (min:463666.467 / max:464470.700) us     GPU-0:464135.328 us   +/-158.027 (min:463749.512 / max:464549.164) us
copy_from_host      :    CPU:354186.861 us   +/-23.496 (min:354148.254 / max:354238.851) us     GPU-0:354191.216 us   +/-23.393 (min:354152.100 / max:354242.249) us

But I am not always able to get this performance across runs. Occasionally they're on par. (Could be that I'm not the only user on the system, but this alone would not explain the variation...)
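Converting those timings to throughput for reference (rough arithmetic, sizes in GiB):

nbytes = 4 * 2**30
for label, seconds in [("pageable", 0.463), ("hugepages", 0.464), ("pinned", 0.354)]:
    print(label, round(nbytes / seconds / 2**30, 1), "GiB/s")
# pageable ~8.6, hugepages ~8.6, pinned ~11.3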

@lmeyerov

lmeyerov commented Jul 28, 2020

-- Could there be some sort of random byte/page boundary misalignment and/or compressed/null page handling? And anything Docker-specific?

-- I missed the first comment on enabling transparent_hugepage, which would explain why hugepages weren't giving speedups in my runs (though it sounds like that wasn't expected anyway based on the others?), but I don't think that explains pinned memory not speeding up. Will test tomorrow.

-- I'm still fuzzy on how far off from peak we are for our particular config (Azure V100s), if there's a good way to check, especially given the asymmetric nature of the transfers.

@jakirkham
Member

Just a note on a further optimization that might be worth looking at here: there's an old NumPy PR ( numpy/numpy#8783 ) which adds the ability to cache large allocations. Not sure how much it helps, but I figure it is worth being aware of.
