beam search #186
Hi,
This will translate 64 sentences at once, preload 100 mini-batches, and bucket by source sentence length for batch packing. When you use the CPU, you can set (the = is required) to translate sentences in parallel.
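To illustrate what that kind of mini-batch/maxi-batch bucketing does, here is a rough Python sketch (a hypothetical helper, not Marian's actual code or options):

```python
def bucketed_mini_batches(sentences, mini_batch=64, maxi_batch=100):
    # Preload mini_batch * maxi_batch sentences, sort the buffer by source
    # length, then cut it into mini-batches so each batch holds sentences of
    # similar length (less padding, denser batch packing).
    buffer_size = mini_batch * maxi_batch
    for i in range(0, len(sentences), buffer_size):
        chunk = sorted(sentences[i:i + buffer_size], key=len)
        for j in range(0, len(chunk), mini_batch):
            yield chunk[j:j + mini_batch]
```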
Actually, let me correct that. When using multiple GPUs, they will be used in parallel to translate multiple mini-batches.
On GPU, can you be more specific?
We have a GPU and a CPU mode. The GPU mode runs mostly on the GPU; we only record the indices that select the hypotheses for the next step in CPU memory. The top-k search happens on the GPU. This is one of the reasons we are so much faster at translation than anyone else.
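The GPU top-k step described above boils down to something like the following (a minimal PyTorch sketch for illustration; Marian's actual implementation is C++/CUDA, and the names here are made up):

```python
import torch

def beam_step(log_probs, beam_scores, beam_size):
    # log_probs:   (batch * beam_size, vocab) next-token log-probabilities
    # beam_scores: (batch * beam_size,) cumulative hypothesis scores
    vocab = log_probs.size(-1)
    # Add hypothesis scores to the expanded vocabulary scores and flatten
    # all beams of a sentence into one row, all on the GPU.
    scores = (beam_scores.unsqueeze(1) + log_probs).view(-1, beam_size * vocab)
    top_scores, top_idx = scores.topk(beam_size, dim=-1)  # top-k on the GPU
    beam_idx = top_idx // vocab    # which hypothesis each winner extends
    token_idx = top_idx % vocab    # which token it appends
    # Only these small index tensors need to go back to CPU memory to
    # rearrange the hypotheses for the next step.
    return top_scores, beam_idx, token_idx
```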
Do you need any more information or can I close this?
We are also doing the top-k search on GPU, but obviously we are missing something.
What toolkit are you using? Some comparison numbers would be good, out of curiosity.
Hi Hieu,
it would be good to put some flesh on the bone with numbers and details. Then maybe we can exchange tips.
Sure. For DE-to-EN training on 4 GPUs I am at about 30K tok/sec (src or tgt, about the same, since I use SentencePiece).
What kind of model?
@hieuhoang It would also be nice if the exchange were not one-sided :)
transformer_base
Rather, extending Hieu's comment with a small dose of snarkiness :) I believe our implementation of transformer-base is sub-optimal at training time. There is still more to do. At least scaling across GPUs is decent enough.
One tip: did you implement an "accumulated gradient" feature? That is, we compute the loss on 2 mini-batches and update the parameters once every 2.
Yeah, we call it
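For reference, gradient accumulation in the sense described above comes down to roughly this (a PyTorch-style sketch; the helper and parameter names are illustrative, not either toolkit's actual interface):

```python
def train_with_accumulation(model, optimizer, batches, accumulate=2):
    # Compute the loss on `accumulate` mini-batches, then do one parameter
    # update, which behaves like training on a single larger batch.
    optimizer.zero_grad()
    for step, batch in enumerate(batches, start=1):
        loss = model(batch) / accumulate  # scale so the update matches one big batch
        loss.backward()                   # gradients add up in .grad across batches
        if step % accumulate == 0:
            optimizer.step()              # single update every `accumulate` batches
            optimizer.zero_grad()
```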
Oh, is that how you get faster? In that case we need to update benchmarks :)
I'm just gonna leave this training gauntlet here: https://arxiv.org/pdf/1806.00187.pdf
No no!!! :)
Yeah, we used PyTorch distributed too, so if I need it I will also implement multi-node, but there is no use for it for now.
"Faster" as in faster than before. @kpu So, how is multi-node sync training going? :) |
I am still stuck, so I need to ask again.
Real, with decoder self-attention replaced by an average attention network. The rest is the same. RNN students are worse in terms of BLEU.
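For context, the core of an average attention network replaces the decoder's self-attention weights with a uniform cumulative average over previous positions, so incremental decoding only needs a running sum per step. A minimal PyTorch sketch of that core (the gating and feed-forward parts are omitted):

```python
import torch

def average_attention(x):
    # x: (batch, time, dim) decoder-side inputs; position t attends uniformly
    # to positions 1..t, i.e. a cumulative average along the time axis.
    time = x.size(1)
    counts = torch.arange(1, time + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
    return torch.cumsum(x, dim=1) / counts
```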
Hmmm, AAN is not that fast for us and BLEU performance degrades. We need to rework the batched beam, but it will never be that fast in PyTorch.
That's our whole story: meta-algorithms in C++ beat Python :)
I just looked at the paper; the one where it says 12s is a normal transformer, and the one with the AAN takes 7s.
https://docs.google.com/spreadsheets/d/1wZQegK-9CKY378eAWRlahg23Fq155WTm4TQ8ikf8_6E/edit#gid=0
The 7s is wall time minus "empty", i.e. the actual time without start-up (empty).
@vince62s It should be a good deal faster now after the weekend coding session, possibly by a factor of 1.5-2.
Hi guys,
quick question for you: are you using multiple threads when decoding with beam search, to parallelize across segments?