
Gpu optimizations #743

Closed - wants to merge 34 commits

Conversation

@rhenry-nv (Contributor) commented Oct 10, 2020

Description

This PR represents a single branch containing all of the GPU optimizations done. It mainly exists to give a holistic view of the performance improvements over the main branch. It is being split into several smaller PRs and will be closed once those are merged into master.

Performance numbers are from a proxy model. The times are the total time to decode an input file, with the file split into different batch sizes. Total time is reported because comparing time per batch is misleading when multiple streams are in use. All times were collected on a Titan V.

Times using one stream.

| Batch | Initial time (s) | Current time (s) | Runtime reduction (fraction) | Speedup factor |
|------:|-----------------:|-----------------:|-----------------------------:|---------------:|
| 1 | 204.41 | 92.15 | 0.549190353 | 2.218231145 |
| 2 | 136.77 | 66.29 | 0.515317687 | 2.06320712 |
| 4 | 84.76 | 41.23 | 0.513567721 | 2.055784623 |
| 8 | 50.32 | 25.05 | 0.50218601 | 2.008782435 |
| 16 | 29.94 | 15 | 0.498997996 | 1.996 |
| 32 | 18.51 | 9.26 | 0.499729876 | 1.998920086 |
| 64 | 12.09 | 6.03 | 0.501240695 | 2.004975124 |
| 128 | 8.19 | 4.13 | 0.495726496 | 1.983050847 |
| 256 | 6.02 | 3.03 | 0.496677741 | 1.98679868 |

Times with two streams

| Batch | Initial time (s) | Current time (s) | Runtime reduction (fraction) | Speedup factor |
|------:|-----------------:|-----------------:|-----------------------------:|---------------:|
| 1 | 156.74 | 67.18 | 0.571392114 | 2.333134862 |
| 2 | 106.13 | 49.07 | 0.537642514 | 2.162828612 |
| 4 | 65.1 | 30.18 | 0.53640553 | 2.157057654 |
| 8 | 38.46 | 18.04 | 0.530941238 | 2.131929047 |
| 16 | 22.68 | 10.73 | 0.526895944 | 2.113699907 |
| 32 | 14.31 | 6.65 | 0.535290007 | 2.151879699 |
| 64 | 9.51 | 4.42 | 0.535226078 | 2.15158371 |
| 128 | 6.66 | 3.15 | 0.527027027 | 2.114285714 |
| 256 | 5.05 | 2.34 | 0.536633663 | 2.158119658 |

List of changes:

  • Computes sinusoidal embeddings on the GPU when compiled with CUDA instead of doing the work on the CPU. This causes small float differences in the final result for some sentences.
  • Reduces the number of times indices need to be transferred to the GPU when updating transformer and RNN states
  • Issues cuBLAS and cuSPARSE calls to the per-thread default stream instead of the global default stream
  • Adds an operator to fuse bias addition with ReLU during inference, so a GEMM is no longer issued to broadcast the bias across the tensor (a sketch of such a fused kernel follows this list)
  • Caches the triangle mask in the transformer during inference to reduce communication
  • Batches the retrieval of n-best logits from the device so one memcpy is issued to get all logits instead of one memcpy per logit
  • Adds a new operator that fuses bias addition with the skip connection and layer normalization
  • Removes a memcpy from ProdBatched when the number of batches in the input matrices is the same, by using cublasStridedBatchedGemm
  • Reduces the number of memcpys in ProdBatched from 3 to 1 in the general case
  • Adds a cache-by-name function to the graph
  • Moves the lemmaHasFactorGroup tensor to the GPU for factored vocabulary models to allow lemmas to be masked out on the GPU in one kernel for group 0
  • Ports the top-k implementation from NVIDIA's FasterTransformer to nth_element.cu and makes changes so that it is compatible with Marian
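
For illustration only, here is a minimal sketch of the kind of fused bias-plus-ReLU elementwise kernel described above. The kernel and launcher names are made up for this example and are not the operator actually added in this PR.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused bias + ReLU kernel: one elementwise pass instead of a
// broadcasted bias add followed by a separate ReLU pass.
__global__ void biasAddRelu(float* out, const float* in, const float* bias,
                            int rows, int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < rows * cols) {
    float v = in[idx] + bias[idx % cols];  // broadcast the bias across rows
    out[idx] = v > 0.f ? v : 0.f;          // ReLU fused into the same pass
  }
}

// Illustrative launcher; Marian's real operator is wired into the expression graph.
void launchBiasAddRelu(float* out, const float* in, const float* bias,
                       int rows, int cols, cudaStream_t stream) {
  int total   = rows * cols;
  int threads = 256;
  int blocks  = (total + threads - 1) / threads;
  biasAddRelu<<<blocks, threads, 0, stream>>>(out, in, bias, rows, cols);
}
```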

Added dependencies:

  1. cub - used by the new topk implementation
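
As a rough illustration of how cub is typically used (this is not the PR's actual top-k kernel, just a minimal block-wide reduction of the kind the FasterTransformer-derived top-k builds on):

```cuda
#include <cub/cub.cuh>

// Illustrative only: each block of 256 threads reduces 256 input values to a
// single per-block maximum using cub::BlockReduce, the kind of block-level
// primitive a fused top-k kernel is built from. Launch with 256 threads/block.
__global__ void blockMax(const float* in, float* out) {
  using BlockReduce = cub::BlockReduce<float, 256>;
  __shared__ typename BlockReduce::TempStorage temp;

  float v = in[blockIdx.x * blockDim.x + threadIdx.x];
  float m = BlockReduce(temp).Reduce(v, cub::Max());

  if (threadIdx.x == 0) out[blockIdx.x] = m;  // result is only valid in thread 0
}
```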

How to test

I ran the regression tests. They will fail due to some differences in float computations. I am not sure of the protocol for updating the tests.

OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.5.0
nvcc version: 10.1.243

cmake command:

cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on

Checklist

  • I have tested the code manually
  • I have run regression tests - Some regression tests fail from float differences
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This has each thread issue calls to its own stream instead of the global default stream when Marian is compiled to use a default stream per thread
…y ReLU for inference. When upgrading to CUDA 11, the bias and ReLU can be fused into the matrix multiply with cublasLt.
…uces a new operator to compute the sinusoidal embeddings on the GPU instead of doing CPU work. This is the reason for the differences, since there are small differences in the float results for the positional embeddings. It also caches the triangle mask for the transformer when doing inference to reduce device-host communication. This change should not affect the transformer output.
…ion and layer normalization for inference. Duplicates the layerNorm kernel since the LayerNormGrad kernel would no longer correspond to the LayerNorm kernel if I extended the LayerNorm kernel.
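
For context on the per-thread-default-stream change above, the mechanism is roughly as follows. This is a minimal sketch under assumptions; the function name is illustrative and not Marian's actual wrapper.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: direct a cuBLAS handle at the per-thread default stream so that each
// CPU thread's GEMMs run on its own stream instead of the single legacy default
// stream, which would serialize work across threads.
void usePerThreadStream(cublasHandle_t handle) {
  // cudaStreamPerThread is a special stream handle provided by the CUDA runtime;
  // work submitted to it goes to the calling thread's own default stream.
  cublasSetStream(handle, cudaStreamPerThread);
}
```

Plain kernel launches on the default stream pick up the same per-thread behaviour when the code is compiled with nvcc's `--default-stream per-thread` option.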
@ykim362 (Member) commented Oct 10, 2020

Thanks @rhenry-nv ! It would be great if you could attach the performance improvement numbers.

@rhenry-nv (Contributor, Author)

@ykim362 done! This is in draft status since there is another WIP optimization I plan to add for factored vocab models. With preliminary testing on the WIP commit, the end-to-end speedup over the initial code is 2.1x for batch 1 and 1.28x for batch 256, using a proxy transformer model.

I also need to sort out the failing checks. I can compile locally but will look into replicating the exact build steps so I can fix the checks.

@DavidLangworthy

Nice gains!

@snukky (Member) commented Oct 12, 2020

I ran the regression tests. They will fail due to some differences in float computations. I am not sure of the protocol for updating the tests.

I will take care of updating the regression tests if needed. Please note that the test outputs have been generated on the old GTX Titan Blacks that we still use to run regression tests, and it seems that no tests fail with your changes on these GPUs. The float differences that you are observing may be due to a different GPU type, which I think is expected for some tests. Do the same regression tests fail with the master branch on your Titan V? I don't have access to a Titan V at the moment, but I will test your changes on newer GPUs.

@rhenry-nv (Contributor, Author) commented Oct 12, 2020

Yes, the same tests fail on my Titan V with the master branch. As a workaround, I increased the diff sensitivity in the Python script to something like 0.2 to avoid spamming with failing regressions.

The output changes on my Titan V starting from commit 5b889da, and the differences show up in the following regression tests (these are the ones I was referring to):

marian-dev/regression-tests/tests/models/transformer/test_nbest.sh
marian-dev/regression-tests/tests/models/ape/test_nbest.sh
marian-dev/regression-tests/tests/training/validation/test_compare_decoding_with_transscript_output.sh.log

A small number of sentences have slight output differences.

@XapaJIaMnu (Contributor)

I needed to add add_definitions(-DTHRUST_IGNORE_DEPRECATED_CPP_DIALECT=1) in order to compile nth_element.cu with CUDA 11.0.

@kpu (Member) commented Oct 20, 2020

@XapaJIaMnu with gcc 5.4?

@XapaJIaMnu (Contributor)

@kpu gcc 8.4.0
Also, Cub version redefinition:

In file included from /usr/local/cuda/include/cub/util_namespace.cuh:36,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/config.h:76,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/execution_policy.h:33,
[build]                  from /usr/local/cuda/include/thrust/iterator/detail/device_system_tag.h:23,
[build]                  from /usr/local/cuda/include/thrust/iterator/iterator_traits.h:111,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:31,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/core/alignment.h:21,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h:30,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/device/dispatch/dispatch_histogram.cuh:48,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/device/device_histogram.cuh:41,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/cub.cuh:52,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/topk.cuh:26,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/translator/nth_element.cu:9:
[build] /usr/local/cuda/include/cub/version.cuh:46: error: "CUB_VERSION" redefined [-Werror]
[build]  #define CUB_VERSION 100909
[build]  
[build] In file included from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/util_namespace.cuh:36,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/util_arch.cuh:37,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/config.cuh:35,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/cub.cuh:37,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/topk.cuh:26,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/translator/nth_element.cu:9:
[build] /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/version.cuh:46: note: this is the location of the previous definition
[build]  #define CUB_VERSION 101100

@rhenry-nv (Contributor, Author)

I have not compiled with CUDA 11 but the redefinition is because CUDA 11 now ships with CUB causing competing versions to exist.

I was not aware that Marian compiled with CUDA 11. I will fix these issues in a future commit.

@XapaJIaMnu (Contributor)

@rhenry-nv we support up to CUDA 11.1, although not officially.

@rhenry-nv (Contributor, Author)

I see. I will fix when I jump back into the code.

Sorry about this!

src/3rd_party/topk.cuh (outdated review thread, resolved)
@emjotde (Member) commented Oct 26, 2020

Just a heads-up, there is a bigger merge coming from me that adds FP16 training. It might or might not touch many of the things changed here. It should appear within a few days.

…se call instead of deprecated gemmi. Implements cublasLt affine operation with optional relu fusion. Not fully tested.
@emjotde (Member) commented Nov 4, 2020

This is marked as a draft, so there is still active work going on, right?
Because of that I have been ignoring it, but I assume you want intermediate feedback?

@emjotde (Member) commented Nov 4, 2020

BTW, the merge I mentioned above is in branch mjd/fp16.3.

@@ -1,3 +1,8 @@
/* All or part of this file was contributed by NVIDIA under license:
@emjotde (Member), Nov 4, 2020

I don't mean to diminish NVIDIA's contribution, but would NVIDIA be fine with removing the "All or" phrasing from this message?

On the other hand, maybe I am overreacting; what are others' opinions on this? @snukky @ykim362 @kpu?

Member

And @ykim362 just told me that technically we are also supposed to add a Microsoft notice like that everywhere. So when Intel starts adding theirs we will have a full screen of notices before we get to any code :)

Member

@kpu so... full screen of contribution notices it is? :)

@rhenry-nv (Contributor, Author)

@emjotde I also had reservations about including this notice in some files where I made small changes. I was told that I had to include those notices with that phrasing, even where I made one-line changes. I will ask again whether I can at least modify the wording.

Sorry about this :(

Member

No reason to be sorry, I know how it works.

@emjotde (Member) commented Nov 4, 2020

@rhenry-nv Are you planning to keep adding new features to this PR, or should we try to get in what's here currently? There is so much to review and test that I am a bit afraid it will overwhelm us in a single PR.

rhenry-nv marked this pull request as ready for review on November 4, 2020, 21:47
@rhenry-nv (Contributor, Author)

This is marked as a draft, so there is still active work going on, right?
Because of that I have been ignoring it, but I assume you want intermediate feedback?

Yes, the intention was to get intermediate feedback but I should've been clearer about that.

@rhenry-nv Are you planning to keep adding new features to this PR, or should we try to get in what's here currently? There is so much to review and test that I am a bit afraid it will overwhelm us in a single PR.

I can try to split into separate PRs or just stop adding stuff to this PR. There are a few more changes that I have not added in here that I can make a separate PR for. Let me know what would be the best way to ensure that you're not overwhelmed.

@emjotde (Member) commented Nov 4, 2020

I am partial to splitting this up. I looked through it a bit, and there are at least a couple of separate "themes". I am seeing:

  • small refactoring changes that give you a lot of bang for the buck, like reusing index vectors, per-thread streams, and similar things. These could be one very non-controversial PR
  • Fused operators. That also looks alright, especially when we manage to push the forks a bit further down into the expression operators.
  • All kinds of other "hacks", like various kinds of memoization and caching, which seem to duplicate existing solutions or add more of something that we actually intended to get rid of.
  • The top-k operator and beam search also seem like a separate item to me; we should also unify this with the existing top-k code.

I will spend some more time finding my way through these changes and then set up a call, if that is OK?
Anyway, thanks a lot for this; there are a lot of good changes in here at first glance.

@rhenry-nv (Contributor, Author) commented Nov 4, 2020

@emjotde These themes sound pretty much correct, and it makes sense to split like this.

Regarding the "hacks" - I wasn't sure of the best way to do some of the memoization (especially moving the lemmaHasFactorGroup vector to the GPU and caching the triangle mask). I would love it if you could tell me the proper way to do these over a call so that they can be fixed :)

How about I split the PR after we call? That way, I can incorporate your high level feedback while doing a second pass over the code and leave details to the individual PRs.

@emjotde (Member) commented Nov 4, 2020

No worries about the "hacks". We will figure this out together once we split this up. I currently don't even have a good answer for that. I will dig a bit deeper into the changes and take some more notes, and then we can have a first call.

@rhenry-nv (Contributor, Author) commented Nov 5, 2020

I'm going to keep pushing to this branch with the understanding that I will split this PR after our call. My most recent commit adds some support for CUDA 11 so that will probably be a 5th theme.

…ith causes a misaligned read when the bias is not a multiple of 8 (this occurs during slicing). This case is handled specially. This bug will be fixed in a subsequent release of CUDA.
@rhenry-nv (Contributor, Author)

Closing since this branch is obsolete

rhenry-nv closed this on Mar 5, 2021
7 participants