
Gpu optimizations #743

Closed - wants to merge 34 commits

Conversation

@rhenry-nv (Contributor) commented Oct 10, 2020

Description

This PR represents a single branch containing all of the GPU optimizations done. It mainly exists to give a holistic view of the performance improvements over the main branch. It is being split into several smaller PRs and will be closed once those are merged into master.

Performance numbers are from a proxy model. The times are the total time to decode an input file, with the file split into different batch sizes. Total time is reported because comparing time per batch is misleading when multiple streams are in use. All times were collected on a Titan V.

Times using one stream.

| Batch | Initial time (s) | Current time (s) | Runtime reduction (fraction) | Speedup factor |
|------:|-----------------:|-----------------:|-----------------------------:|---------------:|
| 1 | 204.41 | 92.15 | 0.549190353 | 2.218231145 |
| 2 | 136.77 | 66.29 | 0.515317687 | 2.06320712 |
| 4 | 84.76 | 41.23 | 0.513567721 | 2.055784623 |
| 8 | 50.32 | 25.05 | 0.50218601 | 2.008782435 |
| 16 | 29.94 | 15 | 0.498997996 | 1.996 |
| 32 | 18.51 | 9.26 | 0.499729876 | 1.998920086 |
| 64 | 12.09 | 6.03 | 0.501240695 | 2.004975124 |
| 128 | 8.19 | 4.13 | 0.495726496 | 1.983050847 |
| 256 | 6.02 | 3.03 | 0.496677741 | 1.98679868 |

Times with two streams

| Batch | Initial time (s) | Current time (s) | Runtime reduction (fraction) | Speedup factor |
|------:|-----------------:|-----------------:|-----------------------------:|---------------:|
| 1 | 156.74 | 67.18 | 0.571392114 | 2.333134862 |
| 2 | 106.13 | 49.07 | 0.537642514 | 2.162828612 |
| 4 | 65.1 | 30.18 | 0.53640553 | 2.157057654 |
| 8 | 38.46 | 18.04 | 0.530941238 | 2.131929047 |
| 16 | 22.68 | 10.73 | 0.526895944 | 2.113699907 |
| 32 | 14.31 | 6.65 | 0.535290007 | 2.151879699 |
| 64 | 9.51 | 4.42 | 0.535226078 | 2.15158371 |
| 128 | 6.66 | 3.15 | 0.527027027 | 2.114285714 |
| 256 | 5.05 | 2.34 | 0.536633663 | 2.158119658 |

List of changes:

  • Computes sinusoidal embeddings on the GPU when compiled with CUDA instead of doing the work on the CPU. This causes small float differences in the final result for some sentences.
  • Reduces the number of times indices need to be transferred to the GPU when updating transformer and RNN states
  • Issues cuBLAS and cuSPARSE calls to the per-thread default stream instead of the global default stream
  • Adds an operator to fuse bias addition with ReLU during inference, so a GEMM is no longer issued to broadcast the bias across the tensor (a sketch of such a fused kernel follows this list)
  • Caches the triangle mask in the transformer during inference to reduce communication
  • Batches the retrieval of n-best logits from the device so one memcpy is issued to get all logits instead of one memcpy per logit
  • Adds a new operator that fuses bias addition with the skip connection and layer normalization
  • Removes a memcpy from ProdBatched when the number of batches in the input matrices is the same, by using cublasStridedBatchedGemm
  • Reduces the number of memcpys in ProdBatched from 3 to 1 in the general case
  • Adds a cache-by-name function to the graph
  • Moves the lemmaHasFactorGroup tensor to the GPU for factored vocabulary models to allow lemmas to be masked out on the GPU in one kernel for group 0
  • Ports the top-k implementation from NVIDIA's FasterTransformer to nth_element.cu and makes changes so that it is compatible with Marian
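
For illustration only, here is a minimal sketch of the kind of fused bias-plus-ReLU elementwise kernel described above. The kernel and launcher names are made up for this example and are not the operator actually added in this PR.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused bias + ReLU kernel: one elementwise pass instead of a
// broadcasted bias add followed by a separate ReLU pass.
__global__ void biasAddRelu(float* out, const float* in, const float* bias,
                            int rows, int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < rows * cols) {
    float v = in[idx] + bias[idx % cols];  // broadcast the bias across rows
    out[idx] = v > 0.f ? v : 0.f;          // ReLU fused into the same pass
  }
}

// Illustrative launcher; Marian's real operator is wired into the expression graph.
void launchBiasAddRelu(float* out, const float* in, const float* bias,
                       int rows, int cols, cudaStream_t stream) {
  int total   = rows * cols;
  int threads = 256;
  int blocks  = (total + threads - 1) / threads;
  biasAddRelu<<<blocks, threads, 0, stream>>>(out, in, bias, rows, cols);
}
```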

Added dependencies:

  1. cub - used by the new topk implementation
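
As a rough illustration of how cub is typically used (this is not the PR's actual top-k kernel, just a minimal block-wide reduction of the kind the FasterTransformer-derived top-k builds on):

```cuda
#include <cub/cub.cuh>

// Illustrative only: each block of 256 threads reduces 256 input values to a
// single per-block maximum using cub::BlockReduce, the kind of block-level
// primitive a fused top-k kernel is built from. Launch with 256 threads/block.
__global__ void blockMax(const float* in, float* out) {
  using BlockReduce = cub::BlockReduce<float, 256>;
  __shared__ typename BlockReduce::TempStorage temp;

  float v = in[blockIdx.x * blockDim.x + threadIdx.x];
  float m = BlockReduce(temp).Reduce(v, cub::Max());

  if (threadIdx.x == 0) out[blockIdx.x] = m;  // result is only valid in thread 0
}
```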

How to test

I ran the regression tests. They will fail due to some differences in float computations. I am not sure of the protocol for updating the tests.

OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.5.0
nvcc version: 10.1.243

cmake command:

cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on

Checklist

  • I have tested the code manually
  • I have run regression tests - Some regression tests fail from float differences
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This has each thread issue calls to its own stream instead of the global default stream when Marian is compiled to use a default stream per thread
…y ReLU for inference. When upgrading to CUDA 11, the bias and ReLU can be fused into the matrix multiply with cublasLt.
…uces a new operator to compute the sinusoidal embeddings on the GPU instead of doing CPU work. This is the reason for the differences, since there are small differences in the float results for the positional embeddings. It also caches the triangle mask for the transformer when doing inference to reduce device-host communication. This change should not affect the transformer output.
…ion and layer normalization for inference. Duplicates the layerNorm kernel since the LayerNormGrad kernel would no longer correspond to the LayerNorm kernel if I extended the LayerNorm kernel.
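
For context on the per-thread-default-stream change above, the mechanism is roughly as follows. This is a minimal sketch under assumptions; the function name is illustrative and not Marian's actual wrapper.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: direct a cuBLAS handle at the per-thread default stream so that each
// CPU thread's GEMMs run on its own stream instead of the single legacy default
// stream, which would serialize work across threads.
void usePerThreadStream(cublasHandle_t handle) {
  // cudaStreamPerThread is a special stream handle provided by the CUDA runtime;
  // work submitted to it goes to the calling thread's own default stream.
  cublasSetStream(handle, cudaStreamPerThread);
}
```

Plain kernel launches on the default stream pick up the same per-thread behaviour when the code is compiled with nvcc's `--default-stream per-thread` option.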
@ykim362 (Member) commented Oct 10, 2020

Thanks @rhenry-nv ! It would be great if you could attach the performance improvement numbers.

@rhenry-nv (Contributor, Author)

@ykim362 done! This is in draft status since there is another WIP optimization I plan to add for factored vocab models. With preliminary testing on the WIP commit, the end-to-end speedup over the initial code is 2.1x for batch 1 and 1.28x for batch 256, using a proxy transformer model.

I also need to sort out the failing checks. I can compile locally but will look into replicating the exact build steps so I can fix the checks.

@DavidLangworthy

Nice gains!

@snukky (Member) commented Oct 12, 2020

I ran the regression tests. They will fail due to some differences in float computations. I am not sure of the protocol for updating the tests.

I will take care of updating the regression tests if needed. Please note that the test outputs have been generated on the old GTX Titan Blacks that we still use to run regression tests, and it seems that no tests fail with your changes on these GPUs. The float differences that you are observing may be due to a different GPU type, which I think is expected for some tests. Do the same regression tests fail with the master branch on your Titan V? I don't have access to a Titan V at the moment, but I will test your changes on newer GPUs.

@rhenry-nv (Contributor, Author) commented Oct 12, 2020

Yes, the same tests fail on my Titan V with the master branch. As a workaround, I increased the diff sensitivity in the Python script to something like 0.2 to avoid spamming with failing regressions.

The output changes on my Titan V starting from commit 5b889da, and the differences show up in the following regression tests (these are the ones I was referring to):

marian-dev/regression-tests/tests/models/transformer/test_nbest.sh
marian-dev/regression-tests/tests/models/ape/test_nbest.sh
marian-dev/regression-tests/tests/training/validation/test_compare_decoding_with_transscript_output.sh.log

A small number of sentences have slight output differences.

@XapaJIaMnu (Contributor)

I needed to add add_definitions(-DTHRUST_IGNORE_DEPRECATED_CPP_DIALECT=1) in order to compile nth_element.cu with CUDA 11.0.

@kpu (Member) commented Oct 20, 2020

@XapaJIaMnu with gcc 5.4?

@XapaJIaMnu (Contributor)

@kpu gcc 8.4.0
Also, Cub version redefinition:

In file included from /usr/local/cuda/include/cub/util_namespace.cuh:36,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/config.h:76,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/execution_policy.h:33,
[build]                  from /usr/local/cuda/include/thrust/iterator/detail/device_system_tag.h:23,
[build]                  from /usr/local/cuda/include/thrust/iterator/iterator_traits.h:111,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:31,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/core/alignment.h:21,
[build]                  from /usr/local/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h:30,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/device/dispatch/dispatch_histogram.cuh:48,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/device/device_histogram.cuh:41,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/cub.cuh:52,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/topk.cuh:26,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/translator/nth_element.cu:9:
[build] /usr/local/cuda/include/cub/version.cuh:46: error: "CUB_VERSION" redefined [-Werror]
[build]  #define CUB_VERSION 100909
[build]  
[build] In file included from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/util_namespace.cuh:36,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/util_arch.cuh:37,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/config.cuh:35,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/cub.cuh:37,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/3rd_party/topk.cuh:26,
[build]                  from /home/s1031254/marian-dev-8bitgpu/src/translator/nth_element.cu:9:
[build] /home/s1031254/marian-dev-8bitgpu/src/3rd_party/cub/cub/version.cuh:46: note: this is the location of the previous definition
[build]  #define CUB_VERSION 101100

@rhenry-nv (Contributor, Author)

I have not compiled with CUDA 11 but the redefinition is because CUDA 11 now ships with CUB causing competing versions to exist.

I was not aware that Marian compiled with CUDA 11. I will fix these issues in a future commit.

@XapaJIaMnu (Contributor)

@rhenry-nv we support up to CUDA 11.1, although not officially.

@rhenry-nv (Contributor, Author)

I see. I will fix when I jump back into the code.

Sorry about this!

src/3rd_party/topk.cuh (outdated review thread, resolved)
@emjotde (Member) commented Oct 26, 2020

Just a heads-up, there is a bigger merge coming from me that adds FP16 training. It might or might not touch many of the things changed here. It should appear within a few days.

…se call instead of deprecated gemmi. Implements cublasLt affine operation with optional relu fusion. Not fully tested.
@emjotde (Member) commented Nov 4, 2020

This is marked as a draft, so there is still active work going on, right?
Because of that I have been ignoring it, but I assume you want intermediate feedback?

@emjotde (Member) commented Nov 4, 2020

BTW, the merge I mentioned above is in branch mjd/fp16.3.

@@ -1,3 +1,8 @@
/* All or part of this file was contributed by NVIDIA under license:
@emjotde (Member), Nov 4, 2020

I don't mean to diminish NVIDIA's contribution, but would NVIDIA be fine with removing the "All or" phrasing from this message?

On the other hand, maybe I am overreacting; what are others' opinions on this? @snukky @ykim362 @kpu?

Member

And @ykim362 just told me that technically we are also supposed to add a Microsoft notice like that everywhere. So when Intel starts adding theirs we will have a full screen of notices before we get to any code :)

Member

@kpu so... full screen of contribution notices it is? :)

@rhenry-nv (Contributor, Author)

@emjotde I also had reservations about including this notice in some files where I made small changes. I was told that I had to include those notices with that phrasing, even where I made one-line changes. I will ask again whether I can at least modify the wording.

Sorry about this :(

Member

No reason to be sorry, I know how it works.

@emjotde (Member) commented Nov 4, 2020

@rhenry-nv Are you planning to keep adding new features to this PR, or should we try to get in what's here currently? There is so much to review and test that I am a bit afraid it will overwhelm us in a single PR.

rhenry-nv marked this pull request as ready for review on November 4, 2020, 21:47
@rhenry-nv (Contributor, Author)

This is marked as a draft, so there is still active work going on, right?
Because of that I have been ignoring it, but I assume you want intermediate feedback?

Yes, the intention was to get intermediate feedback but I should've been clearer about that.

@rhenry-nv Are you planning to keep adding new features to this PR, or should we try to get in what's here currently? There is so much to review and test that I am a bit afraid it will overwhelm us in a single PR.

I can try to split into separate PRs or just stop adding stuff to this PR. There are a few more changes that I have not added in here that I can make a separate PR for. Let me know what would be the best way to ensure that you're not overwhelmed.

@emjotde (Member) commented Nov 4, 2020

I am partial to splitting this up. I looked through it a bit, and there are at least a couple of separate "themes". I am seeing:

  • small refactoring changes that give you a lot of bang for the buck, like reusing index vectors, per-thread streams, and similar things. These could be one very non-controversial PR
  • Fused operators. That also looks alright, especially when we manage to push the forks a bit further down into the expression operators.
  • All kinds of other "hacks", like various kinds of memoization and caching, which seem to duplicate existing solutions or add more of something that we actually intended to get rid of.
  • The top-k operator and beam search also seem like a separate item to me; we should also unify this with the existing top-k code.

I will spend some more time finding my way through these changes and then set up a call, if that is OK?
Anyway, thanks a lot for this; there are a lot of good changes in here at first glance.

@rhenry-nv (Contributor, Author) commented Nov 4, 2020

@emjotde These themes sound pretty much correct, and it makes sense to split like this.

Regarding the "hacks" - I wasn't sure of the best way to do some of the memoization (especially moving the lemmaHasFactorGroup vector to the GPU and caching the triangle mask). I would love it if you could tell me the proper way to do these over a call so that they can be fixed :)

How about I split the PR after we call? That way, I can incorporate your high level feedback while doing a second pass over the code and leave details to the individual PRs.

@emjotde (Member) commented Nov 4, 2020

No worries about the "hacks". We will figure this out together once we split this up. I currently don't even have a good answer for that. I will dig a bit deeper into the changes and take some more notes, and then we can have a first call.

@rhenry-nv (Contributor, Author) commented Nov 5, 2020

I'm going to keep pushing to this branch with the understanding that I will split this PR after our call. My most recent commit adds some support for CUDA 11 so that will probably be a 5th theme.

…ith causes a misaligned read when the bias is not a multiple of 8 (this occurs during slicing). This case is handled specially. This bug will be fixed in a subsequent release of CUDA.
@rhenry-nv (Contributor, Author)

Closing since this branch is obsolete

rhenry-nv closed this on Mar 5, 2021
7 participants