
beam search #186

Closed · vince62s opened this issue Jun 5, 2018 · 29 comments

vince62s commented Jun 5, 2018

Hi guys,

Quick question for you: are you using multiple threads when decoding with beam search, to parallelize over segments?

emjotde (Member) commented Jun 5, 2018

Hi,
On the GPU, no. There you can adjust the batch size, for instance with:

--mini-batch 64 --maxi-batch 100 --mini-batch-sort src

This will translate 64 sentences at once, preload 100 mini-batches and bucket by source sentence length for batch packing.

When you use the CPU, you can set (the = is required)

--cpu-threads=N

to translate sentences in parallel.
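
For illustration, here is a minimal Python sketch of the maxi-batch bucketing idea described above (not Marian's code, which is C++; function and variable names are made up): read a large chunk of sentences, sort it by source length so each mini-batch contains sentences of similar length, and remember the original positions so outputs can be restored to input order.

    def bucketed_mini_batches(sentences, mini_batch=64, maxi_batch=100):
        """Yield (original_indices, batch) pairs, bucketed by source length."""
        maxi_size = mini_batch * maxi_batch              # sentences preloaded at once
        for start in range(0, len(sentences), maxi_size):
            chunk = list(enumerate(sentences[start:start + maxi_size], start))
            # Sort the whole maxi-batch by source length so each mini-batch
            # contains sentences of similar length (less padding).
            chunk.sort(key=lambda pair: len(pair[1].split()))
            for i in range(0, len(chunk), mini_batch):
                batch = chunk[i:i + mini_batch]
                yield [idx for idx, _ in batch], [sent for _, sent in batch]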

emjotde (Member) commented Jun 5, 2018

Actually, let me correct that: when using multiple GPUs, they will be used in parallel for translating multiple mini-batches.

vince62s (Author) commented Jun 5, 2018

On GPU, can you be more specific?
You batch sentences for the encoder/decoder/generator,
but is the beam search itself, over the output, done on the CPU or the GPU, and is it parallel?

emjotde (Member) commented Jun 5, 2018

We have a GPU and a CPU mode. In GPU mode almost everything happens on the GPU; we only record the indices that select hypotheses for the next step in CPU memory. The top-k search happens on the GPU. This is one of the reasons we are so much faster at translation than anyone else.
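
As a rough illustration of the kind of step being described (a generic batched beam-search step in PyTorch, not Marian's C++ implementation; all names are assumptions): the top-k over the expanded scores stays on the GPU, and only the small index tensors are copied back to the CPU for bookkeeping.

    import torch

    def beam_step(log_probs, beam_scores, beam_size):
        """One generic batched beam-search step, done on the GPU.

        log_probs   -- [batch * beam, vocab] next-token log-probabilities
        beam_scores -- [batch, beam] cumulative scores of current hypotheses
        """
        batch, vocab = beam_scores.size(0), log_probs.size(-1)

        # Add cumulative scores to the expansions and flatten beam x vocab.
        scores = beam_scores.unsqueeze(-1) + log_probs.view(batch, beam_size, vocab)
        scores = scores.view(batch, beam_size * vocab)

        # Top-k search stays on the GPU.
        best_scores, best_ids = scores.topk(beam_size, dim=-1)

        # Only these small index tensors go to CPU memory, to record which
        # hypotheses are selected for the next step.
        parent_beam = (best_ids // vocab).cpu()   # surviving parent hypothesis
        next_token = (best_ids % vocab).cpu()     # token that extends it
        return best_scores, parent_beam, next_token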

emjotde (Member) commented Jun 7, 2018

Do you need any more information or can I close this?

vince62s (Author) commented Jun 7, 2018

We are also doing the top-k search on the GPU, but obviously we are missing something.
You're doing a great job.
Cheers.

vince62s closed this as completed Jun 7, 2018
hieuhoang (Collaborator) commented:
What toolkit are you using? Some comparison numbers would be good, out of curiosity.

vince62s (Author) commented Jun 7, 2018

Hi Hieu,
I am on https://github.com/Ubiqus/OpenNMT-py
We made it multi-GPU, and training is as fast as Marian (on 4 GPUs).
We are about to commit AAN; we still need to fix two things:

  • caching in the decoder
  • batched beam search, which is too slow right now (see the sketch below)

But even with those, I doubt I can match your decoding numbers.
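
A generic sketch of the per-step state reordering that batched beam search needs (illustrative PyTorch only, not OpenNMT-py's or Marian's code; all names are made up): after the top-k selection, every cached decoder tensor is re-indexed by the surviving beam indices, so the next step runs as one big batched operation instead of a Python loop over hypotheses.

    import torch

    def reorder_cache(cache, beam_ids, beam_size):
        """Keep only the surviving hypotheses in every cached decoder tensor.

        cache    -- dict of tensors shaped [batch * beam, ...]
                    (e.g. per-layer keys/values or AAN running averages)
        beam_ids -- [batch, beam] parent-hypothesis indices from the top-k step
        """
        batch = beam_ids.size(0)
        # Turn per-sentence beam indices into flat row indices.
        offsets = torch.arange(batch, device=beam_ids.device).unsqueeze(1) * beam_size
        flat_ids = (beam_ids + offsets).view(-1)                 # [batch * beam]
        return {name: t.index_select(0, flat_ids) for name, t in cache.items()}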

hieuhoang (Collaborator) commented:
It would be good to put some flesh on the bones with numbers and details. Then maybe we can exchange tips.

vince62s (Author) commented Jun 7, 2018

Sure. For DE-EN training on 4 GPUs I am at about 30K tok/sec (src or tgt, about the same, since I use SentencePiece).
For decoding it's too slow, so I prefer to wait for our integration of caching and batched beam search; then I'll tell you.

emjotde (Member) commented Jun 7, 2018

What kind of model?

emjotde (Member) commented Jun 7, 2018

@hieuhoang It would also be nice if the exchange were not one-sided :)

vince62s (Author) commented Jun 7, 2018

transformer_base.
Are you asking Hieu to give me numbers, or do you need more info from me? :)

emjotde (Member) commented Jun 7, 2018

Rather extending Hieu's comment with a small dose of snarkiness :)

I believe our implementation of transformer-base is sub-optimal at training time. Still more to do. At least scaling across GPUs is decent enough.

vince62s (Author) commented Jun 7, 2018

One tip: did you implement a gradient accumulation feature?
Basically it emulates several GPUs on one: for instance, if I set accum=2 on 4 GPUs, it acts like 8 GPUs.
The Transformer is really sensitive to the global batch size; see my issue here: tensorflow/tensor2tensor#444

So we compute the loss on 2 mini-batches and update the parameters once every 2.
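
For reference, a minimal PyTorch sketch of that scheme (illustrative only; model, optimizer, data_loader and accum are assumed names): the loss is back-propagated on accum mini-batches before a single optimizer step, so 4 GPUs with accum=2 see the same effective global batch size as 8 GPUs.

    def train_with_accumulation(model, optimizer, data_loader, accum=2):
        """Update parameters once every `accum` mini-batches."""
        optimizer.zero_grad()
        for step, batch in enumerate(data_loader, start=1):
            loss = model(batch)           # forward pass on one mini-batch
            (loss / accum).backward()     # scale so gradients average over accum batches
            if step % accum == 0:
                optimizer.step()          # one update covering accum mini-batches
                optimizer.zero_grad()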

kpu (Member) commented Jun 7, 2018

Yeah, we call it --optimizer-delay

emjotde (Member) commented Jun 7, 2018

Oh, is that how you get faster? In that case we need to update benchmarks :)

kpu (Member) commented Jun 7, 2018

I'm just gonna leave this training gauntlet here: https://arxiv.org/pdf/1806.00187.pdf

vince62s (Author) commented Jun 7, 2018

No no!!! :)
It was just a BLEU performance trick, but I see everyone is doing the same thing.
Why are you saying we are faster? I think we are about the same for training, but you're blazingly faster at decoding.

vince62s (Author) commented Jun 7, 2018

Yeah, we used PyTorch distributed too, so if I need it I will also implement multi-node, but there's no use for it for now.

emjotde (Member) commented Jun 7, 2018

"Faster" as in faster than before.

@kpu So, how is multi-node sync training going? :)

vince62s (Author) commented:
I am still stuck, so I need to ask again.
When it says 12 seconds of wall time for decoding 3K segments under the line "base-transformer-aan",
is it a real Transformer that decodes that fast, or an RNN student distilled from a Transformer teacher?

emjotde (Member) commented Jun 15, 2018

A real one, with decoder self-attention replaced by an average attention network. The rest is the same. RNN students are worse in terms of BLEU.
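
For reference, the reason AAN decoding is cheap per step: the attention context is just a cumulative average of the previous decoder inputs, so it can be updated in O(1) from a small cache instead of attending over the whole prefix. A minimal sketch follows (generic PyTorch, not Marian's implementation; the gating layer and FFN of the full AAN are omitted):

    def aan_step(x_t, cache):
        """Update the average-attention context for one decoding step.

        x_t   -- [batch * beam, dim] decoder input at the current step
        cache -- dict with 'sum' ([batch * beam, dim], zeros) and 'step' (0)
                 before the first step.
        """
        cache['sum'] = cache['sum'] + x_t
        cache['step'] += 1
        # Cumulative average over all positions so far, O(1) per step.
        return cache['sum'] / cache['step']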

vince62s (Author) commented:
Hmmm, AAN is not that fast for us, and BLEU performance degrades. We need to rework the batched beam search, but it will never be that fast in PyTorch.

emjotde (Member) commented Jun 15, 2018

That's our whole story: meta-algorithms in C++ beat Python :)

emjotde (Member) commented Jun 15, 2018

Just looked at the paper: the one where it says 12s is a normal Transformer; the one with the AAN takes 7s.

vince62s (Author) commented:
https://docs.google.com/spreadsheets/d/1wZQegK-9CKY378eAWRlahg23Fq155WTm4TQ8ikf8_6E/edit#gid=0
Line 46 says 12 seconds (7 is the empty timing, right?).
Anyway, it's just not the same order of magnitude I'm looking at...

emjotde (Member) commented Jun 15, 2018

7s is the wall time minus "empty"; that's the actual translation time without start-up (the "empty" run).

emjotde (Member) commented Jun 18, 2018

@vince62s Should be a good deal faster now after the weekend coding session. Possibly by a factor of 1.5-2.
