
Evaluate further serialization performance improvements #106

Closed
pentschev opened this issue Aug 2, 2019 · 22 comments
@pentschev
Member

Device to host serialization currently runs at around 1.5-2.0 GB/s. The major bottleneck is host memory allocation.

All serialization is now done by copying a Numba device array back to host as a NumPy array. Recently, hugepage support was introduced to NumPy and we should see benefits automatically if /sys/kernel/mm/transparent_hugepage/enabled is set to madvise or always, but I was only able to see benefits when it's set to the latter. Even with that, host memory is being allocated at about 5 GB/s on a DGX-1 and the copy itself runs at about 10 GB/s, which works out to an effective rate of roughly 3 GB/s for copying data back to host (since both operations happen in sequence). Some details were discussed in #98, starting at #98 (comment).
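For reference, a quick back-of-the-envelope check of why two sequential stages at 5 GB/s and 10 GB/s land near 3 GB/s overall:

# Effective rate of allocate-then-copy in sequence: the time per byte adds up,
# so the combined rate is the harmonic combination of the two stages.
alloc_rate, copy_rate = 5.0, 10.0              # GB/s, as measured above
effective = 1 / (1 / alloc_rate + 1 / copy_rate)
print(f"{effective:.2f} GB/s")                 # ~3.33 GB/s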

One alternative is to hold a host memory pool that we can transfer data into. That would require some custom memory copying logic, since Numba requires the destination NumPy array to have the same format (shape, dtype, etc.) as the device array, which makes it impossible to keep a pool of arrays for arbitrary formats.
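A minimal sketch of what such a pool could look like (hypothetical names; it sidesteps the format restriction by pooling flat byte buffers and re-viewing them with the requested shape/dtype):

import numpy as np

class HostBufferPool:
    # Hypothetical sketch: pool flat 1-D uint8 host buffers and reinterpret
    # them with whatever shape/dtype the device array has, so a single pool
    # can serve arrays of arbitrary formats.
    def __init__(self):
        self._free = {}  # nbytes -> list of flat uint8 arrays

    def get(self, shape, dtype):
        dtype = np.dtype(dtype)
        nbytes = int(np.prod(shape, dtype="int64")) * dtype.itemsize
        buffers = self._free.setdefault(nbytes, [])
        flat = buffers.pop() if buffers else np.empty(nbytes, dtype="u1")
        return flat, flat.view(dtype).reshape(shape)

    def put(self, flat):
        self._free.setdefault(flat.nbytes, []).append(flat)

# Usage with a Numba device array `dev_ary` (assumed C-contiguous):
#     flat, host_view = pool.get(dev_ary.shape, dev_ary.dtype)
#     dev_ary.copy_to_host(host_view)
#     ...
#     pool.put(flat)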

cc @mrocklin @madsbk

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

A move from 1.5-2 GB/s to 3 GB/s is still pretty nice. Are there ways that we can deliver this to users soon while we talk about memory pools and such? How should a novice user set things appropriately? Can we set it for them when we create a new worker process?

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

cc @randerzander

@pentschev
Member Author

> A move from 1.5-2 GB/s to 3 GB/s is still pretty nice. Are there ways that we can deliver this to users soon while we talk about memory pools and such? How should a novice user set things appropriately? Can we set it for them when we create a new worker process?

Setting /sys/kernel/mm/transparent_hugepage/enabled to always requires root access and may have unintended consequences (e.g., much higher resident memory usage by the OS and other processes), so there's no way for users or Dask to set that easily. I will investigate further why madvise doesn't seem to be working as expected.
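For what it's worth, a small snippet (assuming the standard sysfs path) to report which transparent hugepage policy is actually in effect; reading it requires no root:

from pathlib import Path

def thp_policy(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    # The active policy is the bracketed entry, e.g. "always [madvise] never".
    text = Path(path).read_text()
    return text[text.index("[") + 1 : text.index("]")]

print(thp_policy())  # "always", "madvise", or "never"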

@pentschev
Member Author

Alright, I think I've found the reason for that; please take a look at numpy/numpy#14177 (comment).

@pentschev
Member Author

I did some more benchmarking with NumPy 1.16 and 1.17 (I made sure that hugepage support was compiled in for the latter). I also added a simple idea of how we could set up a NumPy pool to copy data from the device to host; of course we would need a somewhat more complex pool manager, but it's worth considering.

https://gist.github.com/pentschev/7b348f8409ba35341a3f646093e61336
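Not the gist itself, but a minimal sketch of the underlying idea (sizes and names are arbitrary assumptions): time a device-to-host copy into a freshly allocated array versus a reused, preallocated host buffer.

import time
import numpy as np
from numba import cuda

d_ary = cuda.to_device(np.ones(2**28, dtype="u1"))   # 256 MiB device array
reused = np.empty(d_ary.shape, dtype=d_ary.dtype)    # preallocated host buffer
d_ary.copy_to_host(reused)                           # warm-up

for label, dest in [("fresh allocation", None), ("reused buffer", reused)]:
    start = time.perf_counter()
    if dest is None:
        d_ary.copy_to_host()        # allocates a new NumPy array every call
    else:
        d_ary.copy_to_host(dest)    # reuses the preallocated host buffer
    elapsed = time.perf_counter() - start
    print(f"{label}: {reused.nbytes / elapsed / 2**30:.2f} GiB/s")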

@jakirkham
Member

It would be good to reassess how long this is taking. While there haven't been a lot of low-level changes, there are some notable ones, like ensuring hugepages are used ( numpy/numpy#14216 ). There have also been a lot of high-level changes in Dask, Dask-CUDA, and RAPIDS since this issue was raised. For example, CUDA objects have become "dask" serializable ( dask/distributed#3482 ) ( rapidsai/cudf#4153 ), which Dask-CUDA leverages for spilling ( #256 ). Distributed learned how to serialize collections of CUDA objects regardless of size ( dask/distributed#3689 ), which has simplified things in Dask-CUDA further ( #307 ). A bug fix to Distributed's spilling logic ( dask/distributed#3639 ) and better memoryview serialization ( dask/distributed#3743 ) have allowed us to perform fewer serialization passes ( #309 ). We've also generalized, streamlined, and improved the robustness of serialization in RAPIDS through multiple PRs, most recently rapidsai/cudf#5139.

I believe there are probably more high-level improvements we can make here. We can also still make low-level improvements, like using pinned memory ( rapidsai/rmm#260 ) and/or combining different RMM memory resources (like UVM generally and device memory for UCX communication). Additionally, things like packing/unpacking ( rapidsai/cudf#5025 ) would allow us to transfer a single buffer (instead of multiple) between host and device.
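As a rough illustration of the RMM side of that (the exact set of options has shifted between RMM releases, so treat these flags as an assumption rather than a recommendation):

import rmm

# Back allocations with managed (UVM) memory and suballocate from a pool,
# one of the low-level combinations mentioned above.
rmm.reinitialize(
    managed_memory=True,
    pool_allocator=True,
    initial_pool_size=2**30,  # 1 GiB starting pool (arbitrary)
)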

@jakirkham
Member

jakirkham commented Jul 21, 2020

As CuPy has a pinned memory pool, I tried comparing it to typical Python allocations (which appear to be using hugepages as well) and NumPy hugepages allocations. Unfortunately I didn't find a significant difference amongst them. They all transferred at ~8.33 GB/s. I'd be curious to know if I'm just doing something wrong here.

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: a = numpy.asarray(memoryview(2**29 * b"ab")) 
   ...:  
   ...: a_hugepages = numpy.copy(a) 
   ...:  
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(a.nbytes), 
   ...:     dtype=a.dtype 
   ...: )[:a.nbytes] 
   ...: a_pinned[...] = a 
   ...:  
   ...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)                                

In [4]: %timeit a_cuda.copy_from_host(a)                                        
117 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit a_cuda.copy_from_host(a_hugepages)                              
120 ms ± 72.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit a_cuda.copy_from_host(a_pinned)                                 
120 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
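For reference, that throughput figure follows directly from the timings above (GB here meaning GiB, i.e. 2**30 bytes):

# a.nbytes is 2**30 bytes (1 GiB) and each copy takes ~120 ms.
print(a.nbytes / 0.120 / 2**30)  # ~8.33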

cc @leofang (who may also have thoughts here 😉)

@leofang
Member

leofang commented Jul 27, 2020

> Unfortunately I didn't find a significant difference amongst them.

@jakirkham What's your expectation for this test? Maybe you were hoping to see the first two copies outperform the one from pinned memory?

I don't have rmm set up in my env, so I instead used CuPy's default pool, ran this test, and I also got the same timing for the three different copies.

@jakirkham
Member

I would have expected copying to pinned memory to be faster. Though it seems comparable to using hugepages, which just uses larger page sizes.

Yeah my guess is this isn't RMM specific. copy_from_host is a pretty thin wrapper around cudaMemcpyAsync.
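Roughly, a sketch of what that boils down to, reusing a and a_cuda from the benchmark above (CuPy's CUDA runtime bindings are used here purely for illustration):

import cupy

# Essentially a single cudaMemcpy(Async) from the host pointer into the
# device pointer; DeviceBuffer.copy_from_host does the same via the CUDA API.
cupy.cuda.runtime.memcpy(
    a_cuda.ptr,        # destination device pointer
    a.ctypes.data,     # source host pointer
    a.nbytes,          # number of bytes to copy
    cupy.cuda.runtime.memcpyHostToDevice,
)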

@leofang
Member

leofang commented Jul 27, 2020

Could it be that pinned memory was actually implemented using hugepages too? Perhaps this is proprietary information...? 😁

@jakirkham
Member

I don't think so. My understanding of pinned memory is that it doesn't have pages in the traditional sense (as it cannot be paged out), whereas hugepages still have pages (and can be paged out); they are just larger.

@jakirkham
Member

jakirkham commented Jul 27, 2020

Based on a suggestion from @kkraus14 offline, I tried building and running the bandwidthTest from CUDA Samples on a DGX-1.

It didn't quite build out-of-the-box. This appears to be due to some changes that require CUDA 11.0, which I don't have on this machine, and the usage of a C compiler where a C++ compiler should be used. I was able to fix these with a small patch.

When I ran the bandwidthTest binary produced, I got the following results:

$ ../../bin/x86_64/linux/release/bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-SXM2-32GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			11.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			12.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			732.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The DGX-1 uses PCIe Gen3 x16, which according to Wikipedia has a theoretical maximum throughput of 15.75 GB/s. What we are measuring here is 11.1 GB/s host-to-device and 12.5 GB/s device-to-host, which seems reasonable based on the prior discussion.
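A quick check of how close that is to the theoretical maximum (simple arithmetic on the numbers above):

theoretical = 15.75  # GB/s, PCIe Gen3 x16
print(f"H2D: {11.1 / theoretical:.0%}, D2H: {12.5 / theoretical:.0%}")
# -> H2D: 70%, D2H: 79%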

To get an apples-to-apples comparison, I modified the Python code to work with the same amount of memory (as shown below).

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: a = numpy.asarray(memoryview(32000000 * bytearray(b"a"))) 
   ...:  
   ...: a_hugepages = numpy.copy(a) 
   ...:  
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(a.nbytes), 
   ...:     dtype=a.dtype 
   ...: )[:a.nbytes] 
   ...: a_pinned[...] = a 
   ...:  
   ...: a_cuda = rmm.DeviceBuffer(size=a.nbytes)                                

In [4]: %timeit a_cuda.copy_from_host(a)                                        
3.07 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit a_cuda.copy_from_host(a_hugepages)                              
3.06 ms ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit a_cuda.copy_from_host(a_pinned)                                 
3.05 ms ± 377 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit a_cuda.copy_to_host(a)                                          
2.71 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit a_cuda.copy_to_host(a_hugepages)                                
2.69 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit a_cuda.copy_to_host(a_pinned)                                   
2.69 ms ± 892 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

Crunching the numbers, we are seeing 9.74 GB/s host-to-device and 11.1 GB/s device-to-host. This is a bit more than 10% slower than bandwidthTest, which is actually pretty good. Note again that it does not seem to matter whether hugepages or pinned memory is used.
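For reference, here is how those figures fall out of the best timings above (treating GB as 2**30 bytes):

nbytes = 32_000_000
print(nbytes / 3.06e-3 / 2**30)  # host-to-device: ~9.74
print(nbytes / 2.69e-3 / 2**30)  # device-to-host: ~11.08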

@jakirkham
Member

jakirkham commented Jul 28, 2020

The other thing of interest here is allocation time. Host allocations are pretty slow, while device allocations are notably faster. It is also worth noting that allocating pinned memory is an order of magnitude slower than allocating memory through Python builtin objects (like bytearray) or NumPy, both of which also trigger page faults as part of these benchmarks. This can be seen in the code below:

In [1]: import numpy 
   ...: import cupy 
   ...: import rmm                                                              

In [2]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(30 * 2**30)) 
   ...: cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)                         

In [3]: %timeit bytearray(32000000)                                             
1.42 ms ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit numpy.ones((32000000,), dtype="u1")                             
1.43 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit 
   ...: nbytes = 32000000 
   ...: a_pinned = numpy.asarray( 
   ...:     cupy.cuda.alloc_pinned_memory(nbytes), 
   ...:     dtype="u1" 
   ...: )[:nbytes] 
   ...:  
   ...:                                                                         
11.6 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit cupy.ones((32000000,), dtype="u1")                              
355 µs ± 63.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Edit: I would add that naive Cython code that allocates the memory with malloc and assigns a value to each element takes a similar amount of time to bytearray and NumPy (if anything, those go a bit faster). So in general I don't think there is much more to gain here.

from libc.stdint cimport uint8_t
from libc.stdlib cimport malloc, free

cpdef ones(size_t nbytes):
    # Allocate nbytes of (unpinned) host memory and touch every byte,
    # forcing page faults much like the bytearray/NumPy benchmarks above.
    cdef uint8_t* data = <uint8_t*>malloc(nbytes)

    cdef size_t i
    cdef uint8_t v = 1
    for i in range(nbytes):
        data[i] = v

    free(data)

In [3]: %timeit ones(32000000)                                                  
1.52 ms ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
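One way to sidestep the slow pinned allocations, not benchmarked here but worth noting as a sketch, is to amortize them with CuPy's pinned memory pool so repeated transfers reuse page-locked buffers:

import cupy

# Route pinned allocations through a pool; subsequent
# cupy.cuda.alloc_pinned_memory(nbytes) calls draw from (and return
# buffers to) the pool instead of paying the allocation cost every time.
pinned_pool = cupy.cuda.PinnedMemoryPool()
cupy.cuda.set_pinned_memory_allocator(pinned_pool.malloc)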

@lmeyerov

lmeyerov commented Jul 28, 2020

I'm having some difficulty reproducing the pinned memory results to prove out near-peak host->device transfers. This is on an Azure V100 with RAPIDS 0.14, running through Docker. Our ultimate goal is setting something up for a streaming workload where we feed these buffers at network line rate, so ideally 1-10 GB/s, and we're happy to reuse the buffers, etc. As-is, we're only getting ~2 GB/s on a toy version, and these cards should do ~16 GB/s over PCIe each way AFAICT.

Thoughts?

Edit: 96 ms for 800 MB => ~8 GB/s :) Though it's still a mystery why I'm not seeing numbers like those above.

Setup CPU data


import cudf, cupy as cp, numpy as np, rmm

if True:
    RMM_INIT_SIZE = 2 << 32
    RMM_ALLOCATOR = "managed"
    RMM_POOL = True
    RMM_ENABLE_LOGGING = False
    
    rmm.reinitialize(pool_allocator=True,
                     initial_pool_size=RMM_INIT_SIZE)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator) 
### where df['x'] is 100M sample gdf data
x = df['x'].to_array()
xb = x.view('uint8')

Setup CPU pinned buffers

# NOTE: frombuffer returns a view over xb (no new allocation), not a fresh copy
x_hugepages = np.frombuffer(xb, dtype=xb.dtype)

#WARNING: x_pinned is padded bigger than xb (1073741824 > 800000000)
x_pinned = np.asarray(cp.cuda.alloc_pinned_memory(len(xb)), dtype=xb.dtype)
x_pinned[0:len(xb)] = xb

x_cuda = rmm.DeviceBuffer(size=x_pinned.nbytes) #include padding over xb

Benchmark CPU -> GPU

%%time
x_cuda.copy_from_host(xb)
=> 90.3ms

%%time
x_cuda.copy_from_host(x_hugepages)
=> 90.6ms

%%time
x_cuda.copy_from_host(x_pinned)
=> 122ms

@jakirkham
Member

Would you be willing to raise that as a new issue, @lmeyerov? 🙂

@lmeyerov

... rmm, cudf?

@jakirkham
Member

Maybe RMM? We can always transfer the issue if we decide it belongs somewhere else 🙂

@jakirkham
Member

Going to go ahead and close this as it seems we are getting as much out of this as we can currently.

@jakirkham
Member

I should add that we made a number of other performance improvements related to spilling upstream, which are listed below.

dask/distributed#3960
dask/distributed#3961
dask/distributed#3973
dask/distributed#3980

@leofang
Member

leofang commented Jul 28, 2020

@jakirkham Out of curiosity I changed my benchmark script slightly (to not use %timeit), and enlarged the size to 4GB. I was able to see better performance with pinned memory.

import numpy, cupy, ctypes
from cupyx.time import repeat


a = numpy.asarray(memoryview(4 * 2**29 * b"ab"))
a_hugepages = numpy.copy(a)
a_pinned = numpy.ndarray(a.shape,
    buffer=cupy.cuda.alloc_pinned_memory(a.nbytes),
    dtype=a.dtype)
a_pinned[...] = a
a_cuda = cupy.cuda.memory.alloc(a.nbytes)

assert a.nbytes == a_hugepages.nbytes == a_pinned.nbytes
print(a.nbytes)
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_hugepages.ctypes.data), a.nbytes), n_repeat=100))
print(repeat(a_cuda.copy_from_host, (ctypes.c_void_p(a_pinned.ctypes.data), a.nbytes), n_repeat=100))

Output:

$ CUDA_VISIBLE_DEVICES=1 python test_hugepage.py 
4294967296
/home/leofang/cupy/cupyx/time.py:56: FutureWarning: cupyx.time.repeat is experimental. The interface can change in the future.
  util.experimental('cupyx.time.repeat')
copy_from_host      :    CPU:463085.093 us   +/-174.826 (min:462799.658 / max:463892.217) us     GPU-0:463165.147 us   +/-174.704 (min:462879.517 / max:463972.321) us
copy_from_host      :    CPU:464054.178 us   +/-158.124 (min:463666.467 / max:464470.700) us     GPU-0:464135.328 us   +/-158.027 (min:463749.512 / max:464549.164) us
copy_from_host      :    CPU:354186.861 us   +/-23.496 (min:354148.254 / max:354238.851) us     GPU-0:354191.216 us   +/-23.393 (min:354152.100 / max:354242.249) us

But I am not always able to get this performance across runs. Occasionally they're on par. (Could be that I'm not the only user on the system, but this alone would not explain the variation...)
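Converting those timings to throughput for reference (rough arithmetic, sizes in GiB):

nbytes = 4 * 2**30
for label, seconds in [("pageable", 0.463), ("hugepages", 0.464), ("pinned", 0.354)]:
    print(label, round(nbytes / seconds / 2**30, 1), "GiB/s")
# pageable ~8.6, hugepages ~8.6, pinned ~11.3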

@lmeyerov

lmeyerov commented Jul 28, 2020

-- Could there be some sort of random byte/page boundary misalignment and/or compressed/null page handling? And anything Docker-specific?

-- I missed the first comment on enabling transparent_hugepage, which would explain why hugepages weren't giving speedups in my runs (though it sounds like that wasn't expected anyway based on the others?), but I don't think that explains pinned memory not speeding up. Will test tomorrow.

-- I'm still fuzzy on how far off from peak we are for our particular config (Azure V100s), if there's a good way to check, especially given the asymmetric nature of the transfers.

@jakirkham
Member

Just a note on a further optimization that might be worth looking at here: there's an old NumPy PR ( numpy/numpy#8783 ) which adds the ability to cache large allocations. Not sure how much it helps, but I figure it is worth being aware of.
