Gpu optimizations #743
Conversation
…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This has each thread issue calls to its own stream instead of the global default stream when Marian is compiled to use a default stream per thread.
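A minimal sketch of the per-thread default stream behavior this relies on (illustrative only, not Marian's actual code; the kernel and buffer names are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <thread>

__global__ void scale(float* p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] *= 2.f;
}

void launchFromThread(float* devPtr, int n) {
  // cudaStreamPerThread names the per-thread default stream explicitly;
  // with `nvcc --default-stream per-thread`, plain stream 0 behaves the same.
  scale<<<(n + 255) / 256, 256, 0, cudaStreamPerThread>>>(devPtr, n);
  cudaStreamSynchronize(cudaStreamPerThread);
}

int main() {
  const int n = 1 << 20;
  float *a, *b;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));
  // Each host thread's launches go to its own stream, so the two kernels
  // can overlap instead of serializing on the legacy default stream.
  std::thread t1(launchFromThread, a, n), t2(launchFromThread, b, n);
  t1.join(); t2.join();
  cudaFree(a); cudaFree(b);
  return 0;
}
```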
…y RELU for inference. When upgrading to CUDA 11, the bias and ReLU can be fused into the matrix multiply with cublasLt.
…logits instead of one memcpy per logit.
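A hypothetical illustration of the batching idea (the function and parameter names are invented for the sketch): pack the logit blocks contiguously on the device, then issue a single device-to-host copy instead of one per logit.

```cpp
#include <cuda_runtime.h>
#include <vector>

void copyAllLogits(const std::vector<const float*>& deviceLogits,
                   size_t perLogitFloats, float* pinnedHostDst,
                   float* deviceScratch, cudaStream_t stream) {
  // Pack each logit block into one contiguous scratch buffer (D2D copies
  // are cheap relative to D2H latency; a gather kernel would do as well).
  for (size_t i = 0; i < deviceLogits.size(); ++i)
    cudaMemcpyAsync(deviceScratch + i * perLogitFloats, deviceLogits[i],
                    perLogitFloats * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);
  // One D2H transfer for everything instead of deviceLogits.size() of them.
  cudaMemcpyAsync(pinnedHostDst, deviceScratch,
                  deviceLogits.size() * perLogitFloats * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
}
```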
…uces a new operator to compute the sinusoidal embeddings on the GPU instead of doing the work on the CPU. This is the source of the output differences, since the float results for the positional embeddings differ slightly. It also caches the triangle mask for the transformer when doing inference to reduce device-host communication. This change should not affect the transformer output.
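For reference, a minimal sketch of such a GPU sinusoidal-embedding kernel (illustrative; the layout and names here are assumptions, not Marian's actual operator):

```cpp
#include <math.h>

// One block per position; assumes dimEmb is even.
__global__ void sinusoidalEmbedding(float* out, int dimEmb, int startPos) {
  int pos = startPos + blockIdx.x;
  float* row = out + blockIdx.x * dimEmb;
  for (int i = threadIdx.x; i < dimEmb / 2; i += blockDim.x) {
    // Frequency 10000^(-2i/d), the standard transformer schedule.
    float freq = expf(-logf(10000.f) * (2.f * i / dimEmb));
    float angle = pos * freq;
    row[2 * i]     = sinf(angle);  // even dimensions: sine
    row[2 * i + 1] = cosf(angle);  // odd dimensions: cosine
  }
}
// Launch, e.g.: sinusoidalEmbedding<<<numPositions, 256>>>(d_out, 512, 0);
```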
…ion and layer normalization for inference. Duplicates the LayerNorm kernel, since the LayerNormGrad kernel would no longer correspond to the LayerNorm kernel if I extended the LayerNorm kernel.
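As a rough picture of what a standalone inference-only LayerNorm kernel looks like (a sketch under assumed names; the actual Marian kernel differs), one block normalizes one row with a shared-memory reduction:

```cpp
// Launch: layerNormInference<<<rows, 256, 256 * sizeof(float)>>>(...);
// blockDim.x must be a power of two for the tree reduction below.
__global__ void layerNormInference(float* out, const float* in,
                                   const float* gamma, const float* beta,
                                   int cols, float eps) {
  extern __shared__ float shared[];
  const float* row = in + blockIdx.x * cols;
  float* rowOut = out + blockIdx.x * cols;

  // Mean of the row.
  float sum = 0.f;
  for (int i = threadIdx.x; i < cols; i += blockDim.x) sum += row[i];
  shared[threadIdx.x] = sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) shared[threadIdx.x] += shared[threadIdx.x + s];
    __syncthreads();
  }
  float mean = shared[0] / cols;
  __syncthreads();

  // Variance of the row.
  float sq = 0.f;
  for (int i = threadIdx.x; i < cols; i += blockDim.x) {
    float d = row[i] - mean;
    sq += d * d;
  }
  shared[threadIdx.x] = sq;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) shared[threadIdx.x] += shared[threadIdx.x + s];
    __syncthreads();
  }
  float invStd = rsqrtf(shared[0] / cols + eps);

  // Normalize, scale, and shift.
  for (int i = threadIdx.x; i < cols; i += blockDim.x)
    rowOut[i] = gamma[i] * (row[i] - mean) * invStd + (beta ? beta[i] : 0.f);
}
```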
Thanks @rhenry-nv! It would be great if you could attach the performance improvement numbers.
@ykim362 done! This is in draft status since there is another WIP optimization I plan to add for factored vocab models. End to end with preliminary testing on the WIP commit is 2.1x for batch 1 and 1.28x for batch 256 over the initial code using a proxy transformer model. I also need to sort out the failing checks. I can compile locally but will look into replicating the exact build steps so I can fix the checks.
Nice gains!
…CPU backend. Should be more performant + the code is more readable.
I will take care of updating regression tests if needed. Please note that the test outputs were generated on the old GTX Titan Blacks that we still use to run regression tests, and it seems that no tests fail with your changes on those GPUs. The float differences that you are observing may be due to a different GPU type, which I think is expected for some tests. Do the same regression tests fail with the master branch on your Titan V? I don't have access to a Titan V at the moment, but I will test your changes on newer GPUs.
Yes, the same tests fail on my Titan V from the master branch. As a workaround, I increased the diff sensitivity in the Python script to something like 0.2 to avoid spamming with failing regressions. The output changes on my Titan V starting from commit 5b889da, and the differences show up in the following regression test (this is the one I was referring to): marian-dev/regression-tests/tests/models/transformer/test_nbest.sh. A small number of sentences have slight output differences.
…or to the GPU. Implements an inference operator to mask out the lemmas of a factored vocab on GPU.
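Conceptually (a hypothetical sketch; the real operator and its data layout may differ), the masking sets the logits of lemmas lacking the current factor group to negative infinity so they drop out of the softmax:

```cpp
#include <math.h>  // INFINITY

// lemmaHasFactorGroup: numLemmas x numGroups byte table, moved to the GPU.
__global__ void maskLemmas(float* lemmaLogits,
                           const unsigned char* lemmaHasFactorGroup,
                           int numLemmas, int numGroups, int factorGroup) {
  int lemma = blockIdx.x * blockDim.x + threadIdx.x;
  if (lemma < numLemmas &&
      !lemmaHasFactorGroup[lemma * numGroups + factorGroup])
    lemmaLogits[lemma] = -INFINITY;  // excluded from the subsequent softmax
}
// Launch: maskLemmas<<<(numLemmas + 255) / 256, 256>>>(d_logits, d_table,
//                                                      numLemmas, numGroups, g);
```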
I need
@XapaJIaMnu with gcc 5.4?
@kpu gcc 8.4.0
I have not compiled with CUDA 11, but the redefinition is because CUDA 11 now ships with CUB, causing competing versions to exist. I was not aware that Marian compiled with CUDA 11. I will fix these issues in a future commit.
@rhenry-nv we support up to CUDA 11.1, although not officially.
I see. I will fix this when I jump back into the code. Sorry about this!
Just a heads-up, there is a bigger merge coming from me that adds FP16 training. It might or might not touch a lot of the things that have changes here. It should appear within a few days.
…se call instead of deprecated gemmi. Implements a cublasLt affine operation with optional ReLU fusion. Not fully tested.
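An abbreviated sketch of the epilogue-based fusion in the cublasLt API (descriptor setup only; handle creation, matrix layouts, and error checking are omitted, and the helper name is invented):

```cpp
#include <cublasLt.h>

// Configure a matmul descriptor so the GEMM itself applies bias (and
// optionally ReLU) as an epilogue, avoiding separate kernels.
void setupFusedAffine(cublasLtMatmulDesc_t desc, const float* bias, bool relu) {
  cublasLtEpilogue_t epilogue =
      relu ? CUBLASLT_EPILOGUE_RELU_BIAS : CUBLASLT_EPILOGUE_BIAS;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &bias, sizeof(bias));
  // The subsequent cublasLtMatmul(...) call then computes
  // D = relu(alpha * A * B + beta * C + bias) in a single fused kernel.
}
```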
This is marked as a draft, so there is still active work going on, right?
BTW, the merge I mentioned above is in branch
@@ -1,3 +1,8 @@
/* All or part of this file was contributed by NVIDIA under license:
And @ykim362 just told me that technically we are also supposed to add a Microsoft notice like that everywhere. So when Intel starts adding theirs we will have a full screen of notices before we get to any code :)
@kpu so... full screen of contribution notices it is? :)
@emjotde I also had reservations about including this notice in some files where I made small changes. I was told that I had to include those notices with that phrasing even where I made one-line changes. I will ask again if I can at least modify the wording.
Sorry about this :(
No reason to be sorry, I know how it works.
@rhenry-nv Are you planning to keep adding new features to this PR, or should we try to get in what's here currently? There is so much stuff to review and test that I am a bit afraid it will overwhelm us in a single PR.
Yes, the intention was to get intermediate feedback but I should've been clearer about that.
I can try to split it into separate PRs or just stop adding stuff to this PR. There are a few more changes that I have not added here that I can make a separate PR for. Let me know what would be the best way to ensure that you're not overwhelmed.
I am partial to splitting this up. I looked through it a bit, and there are at least a couple of separate "themes". I am seeing:
I will spend some more time finding my way through these changes and then set up a call, if that is OK?
@emjotde Regarding the "hacks" - I wasn't sure of the best way to do some of the memoization (especially moving the lemmaHasFactorGroup vector to the GPU and the triangle mask). I would love it if you could tell me the proper way to do these over a call so that they can be fixed :) How about I split the PR after we call? That way, I can incorporate your high-level feedback while doing a second pass over the code and leave details to the individual PRs.
No worries about the "hacks". We will figure this out together once we split this up. I currently don't even have a good answer for that. I will dig a bit deeper into the changes and take some more notes, then we can have a first call.
I'm going to keep pushing to this branch with the understanding that I will split this PR after our call. My most recent commit adds some support for CUDA 11 so that will probably be a 5th theme. |
…ith causes a misaligned read when the bias is not a multiple of 8 (this occurs during slicing). This case is handled specially. The bug will be fixed in a subsequent release of CUDA.
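An illustrative guard for the workaround described above (a sketch under assumptions: the "multiple of 8" is read here as an 8-float alignment requirement on the bias pointer, which a sliced tensor can violate; the exact requirement is the library's):

```cpp
#include <cstdint>

// Only take the fused bias epilogue path when the bias pointer is aligned;
// otherwise fall back to an unfused matmul plus a separate bias add.
bool biasSafeForFusedEpilogue(const float* bias) {
  return reinterpret_cast<std::uintptr_t>(bias) % (8 * sizeof(float)) == 0;
}
// if (!biasSafeForFusedEpilogue(bias)) { /* unfused fallback path */ }
```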
…l. Fixes a bug in AddFactorMaxes.
Closing since this branch is obsolete.
Description
This PR represents a single branch containing all of the GPU optimizations. It mainly serves to give a holistic view of the performance improvements over the main branch. It is being split into several smaller PRs and will be closed once those are merged into master.
Performance numbers come from a proxy model. The times are the total time to decode an input file, with the file split into different batch sizes; this was done since comparing time per batch is misleading when multiple streams are in use. All times were collected on a Titan V.
Times using one stream.
Times with two streams.
List of changes:
Added dependencies:
How to test
I ran the regression tests. They will fail due to small differences in float computations. I am not sure of the protocol for updating the tests.
OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.5.0
nvcc version: 10.1.243
cmake command:
```
cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on
```
Checklist