the role of mask in attention operation #3

Marcovaldong · 2019-11-01T03:21:14Z

I am reading the Torch implementation, your implementation, and the PyTorch implementation. Your implementation and the Torch one both apply a mask in the attention operation, but the PyTorch one does not. Is the role of the mask to select only the valid positions? Without a mask, how would the performance and the results be affected?

I am training the PyTorch implementation on a handwriting dataset, and the decoded output contains a lot of repetition, as shown below. Is this because I did not use a mask in the attention operation?

groundtruth:  the^fragile^nature
prediction:  the^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragi

Pay20Y · 2019-11-01T03:25:22Z

Yes, I ran experiments with both the mask on the feature map and the masked softmax, and both are effective. However, I'm not sure whether the missing mask is what causes the repetition errors.
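
To make the "masked softmax" idea concrete, here is a minimal, framework-free sketch. It is not code from any of the three implementations discussed here; the function name `masked_softmax`, the 0/1 mask convention (1 = real image content, 0 = padding), and the example scores are all assumptions chosen for illustration.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, restricted to positions where `mask` is 1.

    scores: list of floats, one attention energy per feature-map position.
    mask:   list of 0/1 flags (assumed convention: 1 = valid, 0 = padding).
    """
    if len(scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("mask must keep at least one position")
    # Subtract the largest valid score for numerical stability.
    peak = max(valid)
    # Padded positions contribute exactly zero to the normalisation.
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: four positions, the last two are padding.
scores = [2.0, 1.0, 3.0, 3.0]
mask = [1, 1, 0, 0]

unmasked = masked_softmax(scores, [1, 1, 1, 1])
masked = masked_softmax(scores, mask)
print("without mask:", [round(w, 3) for w in unmasked])
print("with mask:   ", [round(w, 3) for w in masked])
```

In this toy example the two padded positions take roughly 80% of the attention weight when no mask is applied, and exactly zero when it is. In PyTorch the same effect is commonly obtained by filling the padded scores with `-inf` (for example with `masked_fill`) before calling `softmax`. Whether this explains the repeated output above is not established by the sketch.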

Marcovaldong · 2019-11-01T04:26:47Z

Thanks for your reply. I'll look into how to add the mask to the PyTorch implementation.

Pay20Y pushed a commit that referenced this issue Dec 3, 2019

Merge pull request #3 from mikylucky/fix-multiprocessing

6be8bed

Fix multiprocessing on Windows
