2 Motivation

2.1 The Problem

Recurrent models compute each hidden state from the previous one. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, since memory constraints limit batching across examples.

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation, with the latter also improving model performance. The fundamental constraint of sequential computation, however, remains.

2.2 The Proposed Solution

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

3 Technical Approach

3.1 Self-Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
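This definition can be sketched directly in NumPy. The function name and the choice of dot product as the compatibility function are illustrative (the dot product is the variant the paper uses):

```python
import numpy as np

def attention(query, keys, values):
    # Compatibility of the query with each key: here, a dot product.
    scores = keys @ query                  # shape: (num_keys,)
    # Softmax turns the scores into positive weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The output is the weighted sum of the values.
    return weights @ values                # shape: (d_v,)
```

Because the weights sum to 1, the output is a convex combination of the value vectors, pulled toward the values whose keys best match the query.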

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
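A small numeric illustration of why the dot products are scaled by 1/sqrt(d_k). The specific score values are made up for the demonstration; the point is that multiplying scores by sqrt(d_k), as unscaled dot products of d_k-dimensional vectors effectively do, drives the softmax toward a one-hot distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512
# Three compatibility scores; unscaled dot products grow roughly with sqrt(d_k).
scores = np.array([1.0, 0.0, -1.0]) * np.sqrt(d_k)

unscaled = softmax(scores)               # nearly one-hot: tiny gradients
scaled = softmax(scores / np.sqrt(d_k))  # softer, trainable distribution
```

In the nearly one-hot regime, the softmax output barely changes when the scores change, so gradients through it vanish; dividing by sqrt(d_k) keeps the scores in a range where the softmax is still sensitive.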

Multi-head attention expands the model’s ability to focus on different positions. In the example above, z1 contains a little bit of every other word’s encoding, but with a single head it could be dominated by the actual word itself.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
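A minimal NumPy sketch of multi-head attention. The parameter layout (a list of per-head projection triples plus an output projection) is an assumption for illustration, not the paper's exact parameterization, but the structure — project, attend per head, concatenate, project back — follows the quoted description:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params):
    """X: (seq_len, d_model). params["heads"] is a list of (W_q, W_k, W_v)
    triples, each matrix (d_model, d_k); params["W_o"] is
    (num_heads * d_k, d_model)."""
    heads = []
    for W_q, W_k, W_v in params["heads"]:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        # Each head computes scaled dot-product attention in its own subspace.
        A = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len) weights
        heads.append(A @ V)
    # Concatenate the heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ params["W_o"]

# Illustrative usage with random weights (dimensions are arbitrary).
rng = np.random.default_rng(0)
d_model, num_heads, d_k, seq_len = 8, 2, 4, 5
params = {
    "heads": [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
              for _ in range(num_heads)],
    "W_o": rng.standard_normal((num_heads * d_k, d_model)),
}
out = multi_head_attention(rng.standard_normal((seq_len, d_model)), params)
```

Because each head has its own projections, each head's weight matrix A can concentrate on different positions, which is what lets the model attend to several representation subspaces jointly.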

References

[1] Vaswani A., Shazeer N., Parmar N., et al. Attention Is All You Need.

[2] The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/

[3] Language Translation with Transformer. https://pytorch.org/tutorials/beginner/translation_transformer.html

[4] The Annotated Transformer. http://nlp.seas.harvard.edu/2018/04/03/attention.html

[5] Sequence-to-Sequence Modeling with nn.Transformer and torchtext. https://pytorch.org/tutorials/beginner/transformer_tutorial.html