## Effective Approaches to Attention-based Neural Machine Translation

TLDR; The authors test variations of the attention mechanism on Neural Machine Translation (NMT) tasks. They propose both "global" (attending over all source words) and "local" (attending over a subset of source words) attention models. They evaluate their approach on WMT'14 and WMT'15 English <-> German translation and achieve state-of-the-art results.

#### Key Points

- The softmax input is the "attentional hidden state", computed as h~_t = tanh(W_c * [c_t; h_t]), where c_t is the context vector summarizing the source sentence and h_t is the decoder hidden state. How c_t is computed depends on the attention approach.
- Global attention score types score(h_t, h_s) (see the sketch after this list):
  - dot: h_t^T * h_s
  - general: h_t^T * W_a * h_s
  - concat: v_a^T * tanh(W_a * [h_t; h_s]) (Bahdanau-style)
- Local attention idea: the decoder computes an aligned position p_t and attends only over the source hidden states in the window [p_t - D, p_t + D], where D is a hyperparameter.
- Training details
  - 50k vocab
  - 4 layers, 1000-dimensional embeddings, LSTM cell, unidirectional encoder, gradient norm clipping at 5; SGD with decay schedule; dropout 0.2.
  - UNK replacement (gives +1.9 BLEU)
- For global attention, the dot score (the simplest choice) seems to perform best.
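
A minimal NumPy sketch (not the authors' implementation) of the three global score functions, the attentional hidden state h~_t = tanh(W_c * [c_t; h_t]), and the local-attention window. All dimensions, function names, and the toy usage at the bottom are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score(h_t, H_s, kind, W_a=None, v_a=None):
    """Alignment scores between decoder state h_t (d,) and source states H_s (S, d)."""
    if kind == "dot":        # h_t^T h_s
        return H_s @ h_t
    if kind == "general":    # h_t^T W_a h_s
        return H_s @ (W_a @ h_t)
    if kind == "concat":     # v_a^T tanh(W_a [h_t; h_s])
        pairs = np.concatenate([np.tile(h_t, (len(H_s), 1)), H_s], axis=1)
        return np.tanh(pairs @ W_a.T) @ v_a
    raise ValueError(kind)

def global_attention(h_t, H_s, W_c, kind="dot", **kw):
    """Attend over all source states; return the attentional hidden state h~_t."""
    a_t = softmax(score(h_t, H_s, kind, **kw))          # alignment weights over source words
    c_t = a_t @ H_s                                      # context vector
    return np.tanh(W_c @ np.concatenate([c_t, h_t]))    # h~_t = tanh(W_c [c_t; h_t])

def local_attention(h_t, H_s, W_c, p_t, D, kind="dot", **kw):
    """Attend only over source states in the window [p_t - D, p_t + D]."""
    lo, hi = max(0, p_t - D), min(len(H_s), p_t + D + 1)
    return global_attention(h_t, H_s[lo:hi], W_c, kind, **kw)

# Toy usage with made-up dimensions (d = hidden size, S = source length).
d, S = 4, 7
rng = np.random.default_rng(0)
h_t, H_s = rng.normal(size=d), rng.normal(size=(S, d))
W_c = rng.normal(size=(d, 2 * d))
h_tilde_global = global_attention(h_t, H_s, W_c, kind="dot")
h_tilde_local = local_attention(h_t, H_s, W_c, p_t=3, D=2, kind="dot")
```

The sketch leaves out the paper's "predictive" variant of local attention, where p_t is predicted from h_t and the alignment weights are additionally scaled by a Gaussian centered at p_t.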