
[QST] Unclear how to saturate PCI for H->D and D->H #451

Closed
lmeyerov opened this issue Jul 28, 2020 · 26 comments
Labels
question Further information is requested

Comments

@lmeyerov

lmeyerov commented Jul 28, 2020

What is your question?

Following up on a cudf-dask experiment on peak IO, we tried to reproduce it in an Azure env and failed to achieve similar speedups. (This is for proving out line-rate stream processing.) Any pointers, and any thoughts on why the repro failed?

Original

rapidsai/dask-cuda#106

... examples seeing results in µs instead of ms ...

I'm having some difficulty reproducing the pinned-memory results for proving out near-peak host->device transfers. This is on an Azure V100 w/ RAPIDS 0.14, indirected via docker. Our ultimate goal is setting something up for a streaming workload where we feed these at network line rate, and we're happy to reuse these buffers etc. As-is, we're only getting ~2 GB/s on a toy version, and these cards should do ~16 GB/s over PCIe each way, afaict.

Thoughts?

Edit: 96ms for 800MB => ~8 GB/s? Though still a mystery, as we're not seeing numbers like those above, and I think the Azure cards are rated for 16-32 GB/s; can't tell. (V100s, nc6s_v3)

Setup CPU data


import cudf, cupy as cp, numpy as np, rmm

RMM_INIT_SIZE = 2 << 32    # 8 GiB initial pool
RMM_ALLOCATOR = "managed"  # managed (UVM) vs. default device memory; see discussion below
RMM_POOL = True
RMM_ENABLE_LOGGING = False

rmm.reinitialize(pool_allocator=RMM_POOL,
                 managed_memory=(RMM_ALLOCATOR == "managed"),
                 initial_pool_size=RMM_INIT_SIZE,
                 logging=RMM_ENABLE_LOGGING)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

### where df['x'] is 100M sample int64 gdf data (construction shown in a later comment)
x = df['x'].to_array()
xb = x.view('uint8')

Setup CPU pinned buffers

# pageable copy path (np.frombuffer only creates a view over xb; it does not itself allocate hugepages)
x_hugepages = np.frombuffer(xb, dtype=xb.dtype)

# WARNING: x_pinned is padded bigger than xb (1073741824 > 800000000)
x_pinned = np.asarray(cp.cuda.alloc_pinned_memory(len(xb)), dtype=xb.dtype)
x_pinned[0:len(xb)] = xb

x_cuda = rmm.DeviceBuffer(size=x_pinned.nbytes)  # include padding over xb

Benchmark CPU -> GPU

%%time
x_cuda.copy_from_host(xb)
=> 90.3ms

%%time
x_cuda.copy_from_host(x_hugepages)
=> 90.6ms

%%time
x_cuda.copy_from_host(x_pinned)
=> 122ms
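
(For reference, converting those timings to effective bandwidth over the 800,000,000-byte payload; the x_pinned case actually moves the padded buffer, so its true rate is somewhat higher than this:)

nbytes = 800_000_000
for label, ms in [("xb (pageable)", 90.3), ("x_hugepages", 90.6), ("x_pinned", 122.0)]:
    print(label, nbytes / (ms / 1e3) / 1e9, "GB/s")
# => roughly 8.9, 8.8, and 6.6 GB/s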
@lmeyerov added the ? - Needs Triage and question labels Jul 28, 2020
@harrism
Member

harrism commented Jul 30, 2020

I'm not sure I understand what "the repro" is or what the code is trying to do. In general, the performance of memory accesses and copies has nothing to do with RMM; RMM only allocates the memory.

@lmeyerov
Author

lmeyerov commented Jul 30, 2020

The goal is getting RAPIDS to move data at PCI bus speeds, such as for achieving line rate on streaming workloads. The original thread found memory type to be an issue here and explores aspects of that.

I think the request to cross-post here is that this isn't clearly in any one repo and may touch on the allocator for part of the solution (e.g., solving the variance and the x_pinned perf), and folks here may have insight into the ultimate goal. Current benchmarks seem to reach only ~25%-50% of what the hardware is rated for, with high variance across runs; pinned memory is all over the place, and we don't know why. The original thread experimented with different H<->D allocation types, and I'm guessing there may also be flags needed around alignment, priority, type, and whatever else may be stalling at the HW/OS level.

I can't speak to the repro of others. For mine, x comes from cudf.DataFrame({'x': cudf.Series([x for x in range(n)], dtype='int64')}), on an Azure NC6s_v3 (V100), running RAPIDS 0.14 in docker (source activate rapids in https://hub.docker.com/r/graphistry/graphistry-forge-base). The rest is as above. Conversely, happy to experiment if you have thoughts.

@kkraus14
Contributor

kkraus14 commented Jul 31, 2020

To be clear, copy_from_host is a super thin wrapper around cudaMemcpyAsync and we can't do anything to speed that up.

If you're only getting ~2 GB/s, that's because you're including the time to allocate the host memory, which is limited by the host kernel and typically caps things at around 2 - 2.5 GB/s.
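
(A minimal sketch of separating the host allocation cost from the copy itself; the buffer size and names here are placeholders, not from the thread above:)

import time
import numpy as np, rmm

nbytes = 800_000_000

t0 = time.perf_counter()
host = np.empty(nbytes, dtype='uint8')
host[:] = 1                                 # touch the pages so the allocation cost is paid here
t1 = time.perf_counter()

dev = rmm.DeviceBuffer(size=nbytes)
t2 = time.perf_counter()
dev.copy_from_host(host)                    # pageable H->D copy only
t3 = time.perf_counter()

print("host alloc + first touch:", nbytes / (t1 - t0) / 1e9, "GB/s equivalent")
print("H->D copy only:          ", nbytes / (t3 - t2) / 1e9, "GB/s")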

If you're getting ~8 GB/s, then that's the limitation of the system you're on. It seems like you're in the cloud, which adds a virtualization layer that typically impacts performance and makes the PCIe topology / traffic a bit of an unknown.

John's test in the linked PR was on a bare metal DGX-1 where he was the only user while running which gives us no virtualization and an understood topology (full 16x PCIe 3.0 lane from CPU to GPU).

Closing this as there's no action we can take here.

@lmeyerov
Author

lmeyerov commented Jul 31, 2020

@kkraus14 This sounds like Nvidia thinks Azure V100s can only do 2 GB/s. I'm ok investigating further, such as trying alternate policies, but I'm not sure why the rush to leave top-of-line cloud GPUs doing no more than 2 GB/s.

@harrism
Member

harrism commented Jul 31, 2020

That is not what NVIDIA thinks at all. It's also not what Keith wrote. He wrote that the 2GB/s case is probably counting host memory allocation time in the throughput measurement.

Your next step should probably be to run your test on bare metal where PCIe is definitely not virtualized. If you can't achieve the same performance between bare metal and Azure, then it's either a configuration difference, a virtualization cost, or something similar.

@lmeyerov
Author

lmeyerov commented Jul 31, 2020

Sorry, the above left a pretty bad taste in my mouth. This immediately impacts one of our big co-customers trying to prove out speed at scale on a bare-bones "testing the wheels" setup (so not even the real version), so I'm trying to think through how to be constructive vs. closing without solving.

Ideas:

-- Allocation: I'm going to try flipping hugepages stuff. Anything else to experiment with? (note that the "setup" step already had some preallocation.)

-- Virtualization: I can try running the benchmark in the host to eliminate docker as a concern, though ofc still Azure stuff. I don't think multitenancy is an issue here, I ran multiple times.

-- PCI expectation: It seems worth identifying the expected rate. The host lspci reports NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1), with nothing on asymmetric speed: any preferred PCI benchmark? I'll revisit the math on my side.

After a real effort, it sounds like nothing obvious stands out to folks on this thread, so it may be time to ping other Nvidia folks or Azure GPU folks. (I'm still fuzzy on whether others have achieved 75%+ utilization with RAPIDS on either of these issue threads.)

@kkraus14
Contributor

-- Allocation: I'm going to try flipping hugepages stuff. Anything else to experiment with? (note that the "setup" step already had some preallocation.)

@jakirkham experimented with this and it made no material difference in his testing. He also tested pinned vs. unpinned host memory, which had minimal impact as well. The only thing I believe he didn't get around to testing is whether gdrcopy would make a material difference.

-- Virtualization: I can try running the benchmark in the host to eliminate docker as a concern, though ofc still Azure stuff. I don't think multitenancy is an issue here, I ran multiple times.

Docker doesn't run the hypervisor here, Azure does, and my guess is that the Azure hypervisor virtualizing the GPU + system is what is causing the performance degradation you're seeing. I'm by no means a cloud systems expert though.

-- PCI expectation: It seems worth identifying the expected rate. The host lspci reports NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1), with nothing on asymmetric speed: any preferred PCI benchmark? I'll revisit the math on my side.

Even if lspci reports a 16x lane, it's likely a 16x lane through a PLX chip to the CPU, and you don't necessarily have visibility into the other devices on the PLX, which could be sharing the PCIe bandwidth with your GPU.
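
(For what it's worth, a couple of standard commands show what topology the guest sees; whether the hypervisor hides a bridge is another matter:)

nvidia-smi topo -m                     # link type / affinity between GPU and CPU as seen by the guest
lspci -tv | grep -i -B2 -A2 nvidia     # PCI tree; a bridge/PLX hop may show up as a parent device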

After a real effort, it sounds like nothing obvious stands out to folks on this thread, so it may be time to ping other Nvidia folks or Azure GPU folks. (I'm still fuzzy on whether others have achieved 75%+ utilization with RAPIDS on either of these issue threads.)

Apologies if I came off as impolite in the previous reply, but ultimately this isn't an RMM issue as we just wrap standard CUDA APIs for these things.

@harrism
Member

harrism commented Jul 31, 2020

@lmeyerov Here's what I get on my local (local as in the PC that my feet are resting on as I type this) V100:

(base) mharris@canopus:~/NVIDIA_CUDA-10.2_Samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro GV100
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			540.0

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Suggest you run the same test, as it eliminates RAPIDS as a variable. You can run it on your local footrest PC first, and then run on an equivalent cloud instance with various configuration options, and compare.

This CUDA sample is included in the CUDA toolkit. I built it with

cd ~/NVIDIA_CUDA-10.2_Samples/1_Utilities/bandwidthTest
make
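
(If useful, bandwidthTest also takes memory/mode flags for sweeping transfer sizes; these flag names are from the sample's --help and worth double-checking on your build:)

./bandwidthTest --memory=pinned --mode=shmoo     # sweep transfer sizes with pinned host memory
./bandwidthTest --memory=pageable --htod         # pageable host->device only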

@lmeyerov
Author

Yeah I'm going to start with investigating PCI tests for a sense of target.

I found bandwidthTest in the cuda toolkit via other forums, so will start from verifying host+container speeds: this should narrow down whether it is azure (and we can chase with them) or somewhere on the nvidia-docker/rapids path.

I haven't found a clear spec sheet for Azure V100 expected perf, we'll see

@kkraus14
Contributor

I haven't found a clear spec sheet for Azure V100 expected perf, we'll see

I'd guess you're in a bit of moving-sand territory here: the hardware doesn't change, but the virtualization layer on top of the GPU / CPU / PCIe bus / memory / etc. changes over time, which can change the performance impact vs. bare metal.

@jakirkham
Member

jakirkham commented Jul 31, 2020

@jakirkham experimented with this and it made no material difference in his testing.

Just to clarify, I think the machine I was testing on used hugepages in all cases (system config). I would expect using hugepages does help. (Though I also don't know much about what Azure is doing with virtualization.)
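
(For anyone checking their own box, the usual places to look for hugepage / THP status are:)

cat /sys/kernel/mm/transparent_hugepage/enabled   # THP policy: always / madvise / never
grep Huge /proc/meminfo                           # explicit hugepage counters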

@lmeyerov
Author

Hm, in-docker bandwidthTest gives H->D 10.0-10.3 GB/s; will keep digging.


 Device 0: Tesla V100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			10.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			11.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			734.7

-- Docker: the bandwidthTest container is a bit different from the original benchmark, due to bandwidthTest failing without CUDA 11 (vs. the CUDA 10.2 original).

-- Hardware + Host: wasn't able to make progress on theoretical/expected numbers, as bandwidthTest / nvcc were tricky to set up on the current instances.

For next steps, I'm thinking to drill into achieved vs expected:

(A) revisit the RAPIDS test to double-check the returned GB/s, to understand the gap (if any) between Python in docker and bandwidthTest

(B) get expected host numbers from a fresh host-level V100 bandwidthTest + maybe ping someone at Azure for their expectations, to see about issues at the Azure / docker level

(C) try the huge pages thing as part of (A)

@kkraus14
Contributor

@lmeyerov one other thing to try: don't use the "managed" allocator, as UVM can add overhead as well.
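
(A minimal sketch of the non-managed setup; parameter names are from the RMM 0.14 Python API and worth double-checking against your version:)

import rmm, cupy as cp

rmm.reinitialize(pool_allocator=True,
                 managed_memory=False,    # plain device memory instead of UVM
                 initial_pool_size=2 << 32)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)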

@kkraus14 kkraus14 reopened this Jul 31, 2020
@harrism
Member

harrism commented Jul 31, 2020

Ah right. And the performance depends on where the data is initially and where you are trying to copy it / access it.

@lmeyerov
Author

lmeyerov commented Jul 31, 2020

Yeah, switching from managed -> default (unmanaged?) goes from ~123ms -> ~119ms, so it helps but isn't the main thing.

Can you be a bit more concrete on 'where you are trying to access it'? See below for current.

Overall, it seems to be 6.77 GB/s for moving 800MB in a CUDA 10.2 docker under the x_pinned RAPIDS Python code variant, and still about the same when doubling to 1.6GB. Still trying to get the test envs uniform (we have limited quota and no local hw, sorry =/), but on another Azure V100, the docker bandwidthTest gave 10.2 GB/s, which suggests the Python env is leaving about a third of the bandwidth unused.

RE:src location, I'm copying the original reference:

n = 100000000
df = cudf.DataFrame({'x': cudf.Series([x for x in range(n)], dtype='int64')})
x = df['x'].to_array()
xb = x.view('uint8')
x_pinned = np.asarray(cp.cuda.alloc_pinned_memory(len(xb)), dtype=xb.dtype)
x_pinned[0:len(xb)] = xb
x_cuda = rmm.DeviceBuffer(size=x_pinned.nbytes) #include padding over xb

%%time
x_cuda.copy_from_host(x_pinned)

Note one oddity: len(xb) => 800... while len(x_pinned) => 1342... . I guessed that np.asarray or cp.cuda.alloc_pinned_memory does automatic padding (cupy's pinned memory pool appears to round allocations up) or introduces a mask somewhere, and was unable to get it to line up. If I include the padded size, Python's 6.77 GB/s is instead 8.77 GB/s, which is that much closer to bandwidthTest's 10.2 GB/s (though I suspect the hardware can do 12+ GB/s, so still trying to understand the hw roofline).
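
(One way to keep the arithmetic honest, sketched with the x_pinned/xb names from above plus an x_cuda_exact buffer introduced here: time a copy whose source is an exact-size view of the pinned buffer, so the bytes moved match the bytes counted:)

x_pinned_view = x_pinned[:len(xb)]              # contiguous view, still backed by pinned memory
x_cuda_exact = rmm.DeviceBuffer(size=len(xb))   # no padding on the device side either

%timeit -n 20 x_cuda_exact.copy_from_host(x_pinned_view)
# GB/s = len(xb) / measured_seconds / 1e9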

@kkraus14
Contributor

Can you use %timeit instead of just %%time to try to reduce noise / factor out cold start related issues?

If bandwidthTest is reporting 10.2 GB/s that's likely the source of truth we need to work with from the rmm side.

@lmeyerov
Author

%timeit -n 20 x_cuda.copy_from_host(x_pinned)

t_ms = 126
print(
    1000 * (len(x_pinned) / 1000000.0) / t_ms,  # MB/s based on the padded pinned size
    1000 * (len(xb) / 1000000.0) / t_ms)        # MB/s based on the 800 MB payload

=>

126 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)

8521.760507936508
6349.206349206349

@harrism
Member

harrism commented Jul 31, 2020

I don't know what that copy_from_host maps to, but I guess it must be more than a cudaMemcpy.

@kkraus14 removed the ? - Needs Triage label Jul 31, 2020
@kkraus14
Contributor

Implementation of copy_from_host: https://github.com/rapidsai/rmm/blob/branch-0.15/python/rmm/_lib/device_buffer.pyx#L172-L200

Implementation of copy_host_to_ptr: https://github.com/rapidsai/rmm/blob/branch-0.15/python/rmm/_lib/device_buffer.pyx#L374-L423

Maybe the bit of control flow around using the buffer protocol and checking error states is slowing things down? We'd need to get into some low-level profiling at this point, but this should all be basically free.

Maybe try larger buffers to amortize more of these costs and see if the numbers improve? (Rough sketch below.)
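
(A rough sketch of such a sweep; the sizes, iteration counts, and names are mine, not from the thread:)

import time
import numpy as np, cupy as cp, rmm

for nbytes in (1_000_000, 10_000_000, 100_000_000, 800_000_000):
    src = np.asarray(cp.cuda.alloc_pinned_memory(nbytes), dtype='uint8')[:nbytes]
    src[:] = 1                                  # touch the pinned pages
    dst = rmm.DeviceBuffer(size=nbytes)
    dst.copy_from_host(src)                     # warm-up
    t0 = time.perf_counter()
    for _ in range(10):
        dst.copy_from_host(src)                 # blocks until the copy completes
    dt = (time.perf_counter() - t0) / 10
    print(nbytes, "bytes:", nbytes / dt / 1e9, "GB/s")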

@harrism
Member

harrism commented Jul 31, 2020

Copy_ptr_to_host synchronizes the whole device!

So if the benchmark runs in a loop, it is counting device sync.

Not sure whether bandwidthTest does the same.
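
(If it helps isolate that, a sketch of timing just the async transfer with CUDA events on an explicit stream, using the cupy runtime API and the x_pinned/x_cuda/xb names from earlier:)

stream = cp.cuda.Stream(non_blocking=True)
start, stop = cp.cuda.Event(), cp.cuda.Event()

start.record(stream)
cp.cuda.runtime.memcpyAsync(x_cuda.ptr, x_pinned.ctypes.data, len(xb),
                            cp.cuda.runtime.memcpyHostToDevice, stream.ptr)
stop.record(stream)
stop.synchronize()                              # waits on this stream's event, not the whole device
ms = cp.cuda.get_elapsed_time(start, stop)
print(len(xb) / (ms / 1e3) / 1e9, "GB/s")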

@jakirkham
Member

Yeah as we would run into this issue otherwise.

@lmeyerov
Author

lmeyerov commented Jul 31, 2020

I believe we do have the whole device, so the pinned write would be the only delta for the GPU

  • bandwidthTest: https://github.com/NVIDIA/cuda-samples/blob/master/Samples/bandwidthTest/bandwidthTest.cu#L692

  • For the streaming users triggering all this, the scenario is roughly 5+ MB chunks every ~1ms coming off of 40+ Gb/s commodity networks, with or without CPU processing (can be GPU if RAPIDS handles it). The ultimate interest is doing the same in commercial clouds + more premium hw, hence Azure was an OK start; if there's something smarter, happy to look at that too. I was going to benchmark microbatch streams next (rough sketch below): I started with 800MB because bulk seemed the easier case / an upper bound for proving RAPIDS can saturate PCIe.
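
(Rough sketch of the microbatch benchmark I have in mind; the chunk size and iteration count are placeholders: reuse one pinned staging buffer and one device buffer, and copy ~5 MB per iteration:)

import time
import numpy as np, cupy as cp, rmm

CHUNK = 5_000_000                               # ~5 MB per microbatch (placeholder)
ITERS = 1000

staging = np.asarray(cp.cuda.alloc_pinned_memory(CHUNK), dtype='uint8')[:CHUNK]
staging[:] = 1
dev = rmm.DeviceBuffer(size=CHUNK)

t0 = time.perf_counter()
for _ in range(ITERS):
    # real pipeline: NIC -> staging (CPU side), then staging -> dev
    dev.copy_from_host(staging)
dt = time.perf_counter() - t0
print(ITERS * CHUNK / dt / 1e9, "GB/s sustained H->D in ~5 MB chunks")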

@lmeyerov
Author

Also interesting:

@harrism
Member

harrism commented Oct 19, 2020

@lmeyerov no updates in a while, is this still an issue?

@lmeyerov
Author

The particular project is on pause, so I haven't been pushing.

I still suspect we're only at 25%-50% of what the hw can do, but without confirmation from Azure GPU staff on what to expect, I don't see how to make progress.

@harrism
Member

harrism commented Oct 19, 2020

Let's close and if you determine that this is still an RMM issue and not an Azure issue, please reopen.

@harrism harrism closed this as completed Oct 19, 2020