diff --git a/docs/_config.yml b/docs/_config.yml
index e46afeae8..5cea31658 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -688,6 +688,11 @@ ru:
         - path: ru/week01/01-1.md
         - path: ru/week01/01-2.md
         - path: ru/week01/01-3.md
+    - path: ru/week12/12.md
+      sections:
+        - path: ru/week12/12-1.md
+        - path: ru/week12/12-2.md
+        - path: ru/week12/12-3.md
 
 ################################## Vietnamese ##################################
 vi:
diff --git a/docs/en/week12/12-3.md b/docs/en/week12/12-3.md
index cdeab6fd6..6b7149760 100644
--- a/docs/en/week12/12-3.md
+++ b/docs/en/week12/12-3.md
@@ -284,7 +284,7 @@ Throughout the training of a transformer, many hidden representations are genera
 
 We will now see the blocks of transformers discussed above in a far more understandable format, code!
 
-The first module we will look at the multi-headed attention block. Depenending on query, key, and values entered into this block, it can either be used for self or cross attention.
+The first module we will look at is the multi-headed attention block. Depending on the query, key, and values passed into this block, it can be used for either self-attention or cross-attention.
 
 
 ```python
@@ -392,7 +392,7 @@ Recall that self attention by itself does not have any recurrence or convolution
 
 $$
 \begin{aligned}
-E(p, 2)    &= \sin(p / 10000^{2i / d}) \\
+E(p, 2i)    &= \sin(p / 10000^{2i / d}) \\
 E(p, 2i+1) &= \cos(p / 10000^{2i / d})
 \end{aligned}
 $$
diff --git a/docs/ru/index.md b/docs/ru/index.md
index 1dff0c5a6..197308036 100644
--- a/docs/ru/index.md
+++ b/docs/ru/index.md
@@ -325,6 +325,23 @@ lang: ru
         <a href="https://youtu.be/DL7iew823c0">🎥</a>
       </td>
     </tr>
+<!-- =============================== WEEK 15 =============================== -->
+    <tr>
+      <td rowspan="2" align="center"><a href="{{site.baseurl}}/ru/week15/15">⑮</a></td>
+      <td rowspan="2">Практикум</td>
+      <td><a href="{{site.baseurl}}/ru/week15/15-1">Вывод для энергетических моделей со скрытыми переменными</a></td>
+      <td rowspan="1">
+        <a href="https://github.com/Atcold/pytorch-Deep-Learning/blob/master/slides/12%20-%20EBM.pdf">🖥️</a>
+        <a href="https://youtu.be/sbhr2wjU1-I">🎥</a>
+      </td>
+    </tr>
+    <tr>
+      <td><a href="{{site.baseurl}}/ru/week15/15-2">Обучение энергетических моделей со скрытыми переменными</a></td>
+      <td rowspan="1">
+        <a href="https://github.com/Atcold/pytorch-Deep-Learning/blob/master/slides/12%20-%20EBM.pdf">🖥️</a>
+        <a href="https://youtu.be/XLSb1Cs1Jao">🎥</a>
+      </td>
+    </tr>
   </tbody>
 </table>
 
diff --git a/docs/ru/week12/12-1.md b/docs/ru/week12/12-1.md
new file mode 100644
index 000000000..0ccab0ef6
--- /dev/null
+++ b/docs/ru/week12/12-1.md
@@ -0,0 +1,431 @@
+---
+lang: ru
+lang-ref: ch.12-1
+title: Глубокое обучение для обработки естественного языка
+lecturer: Mike Lewis
+authors: Jiayu Qiu, Yuhong Zhu, Lyuang Fu, Ian Leefmans
+date: 20 Apr 2020
+translation-date: 01 Dec 2020
+translator: Evgeniy Pak
+---
+
+
+<!--
+## [Overview](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=44s)
+
+* Amazing progress in recent years:
+  - Humans prefer machine translation to human translators for some languages
+  - Super-human performance on many question answering datasets
+  - Language models generate fluent paragraphs (e.g Radford et al. 2019)
+*  Minimal specialist techniques needed per task, can achieve these things with fairly generic models
+-->
+
+## [Обзор](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=44s)
+
+* Поразительный прогресс за последние несколько лет:
+  - Люди предпочитают машинный перевод человеческому для некоторых языков
+  - Сверхчеловеческое качество на многих наборах данных для ответов на вопросы
+  - Модели языка генерируют связные абзацы (например, Radford и др., 2019)
+* Для каждой задачи нужен лишь минимум специализированных техник: всего этого можно достичь при помощи довольно общих моделей
+
+
+<!--
+## Language Models
+
+* Language models assign a probability to a text:
+  $p(x_0, \cdots, x_n)$
+* Many possible sentences so we can’t just train a classifier
+* Most popular method is to factorize distribution using chain rule:
+
+$$p(x_0,...x_n) = p(x_0)p(x_1 \mid x_0) \cdots p(x_n \mid x_{n-1})$$
+-->
+
+## Модели языка
+
+* Модели языка присваивают тексту вероятность:
+  $p(x_0, \cdots, x_n)$
+* Возможных предложений слишком много, поэтому мы не можем просто обучить классификатор
+* Наиболее популярный метод заключается в факторизации распределения с помощью цепного правила:
+
+$$p(x_0,...x_n) = p(x_0)p(x_1 \mid x_0) \cdots p(x_n \mid x_0, \cdots, x_{n-1})$$
+
+<!--
+## Neural Language Models
+
+Basically we input the text into a neural network, the neural network will map all this context onto a vector. This vector represents the next word and we have some big word embedding matrix. The word embedding matrix contains a vector for every possible word the model can output. We then compute similarity by dot product of the context vector and each of the word vectors. We'll get a likelihood of predicting the next word, then train this model by maximum likelihood. The key detail here is that we don't deal with words directly, but we deal with things called sub-words or characters.
+
+$$p(x_0 \mid x_{0, \cdots, n-1}) = \text{softmax}(E f(x_{0, \cdots, n-1}))$$
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig1.jpg">
+  <center>  Fig.1: Neural language model</center>
+</figure>
+-->
+
+## Нейронные модели языка
+
+По сути, мы подаём текст в нейронную сеть, и она отображает весь контекст в вектор. Этот вектор представляет следующее слово, и у нас есть некоторая большая матрица характеристик (эмбеддингов) слов. Эта матрица содержит по вектору для каждого слова, которое может выдать модель. Затем мы вычисляем сходство как скалярное произведение контекстного вектора и вектора каждого слова. Так мы получаем вероятность предсказания следующего слова, после чего обучаем модель методом максимального правдоподобия. Ключевая деталь здесь: мы работаем не со словами напрямую, а с единицами, называемыми подсловами, или с символами.
+
+$$p(x_n \mid x_{0, \cdots, n-1}) = \text{softmax}(E f(x_{0, \cdots, n-1}))$$
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig1.jpg">
+  <center> Рис.1: Нейронная модель языка</center>
+</figure>
+
+<!--
+### Convolutional Language Models
+
+* The first neural language model
+* Embed each word as a vector, which is a lookup table to the embedding matrix, so the word will get the same vector no matter what context it appears in
+* Apply same feed forward network at each time step
+* Unfortunately, fixed length history means it can only condition on bounded context
+* These models do have the upside of being very fast
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig2.jpg">
+  <center>  Fig.2: Convolutional language model</center>
+</figure>
+-->
+
+
+### Свёрточные модели языка
+
+* Первая нейронная модель языка
+* Представляет каждое слово вектором, который берётся из матрицы характеристик как из таблицы поиска, так что слово получает один и тот же вектор независимо от контекста, в котором оно появляется
+* Применяет одну и ту же сеть прямого распространения на каждом временном шаге
+* К сожалению, история фиксированной длины означает, что модель может учитывать только ограниченный контекст
+* У этих моделей есть преимущество быстродействия
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig2.jpg">
+  <center> Рис.2: Свёрточная модель языка</center>
+</figure>
+
+
+<!--
+### Recurrent Language Models
+
+* The most popular approach until a couple years ago.
+* Conceptually straightforward: every time step we maintain some state (received from the previous time step), which represents what we've read so far. This is combined with current word being read and used at later state. Then we repeat this process for as many time steps as we need.
+* Uses unbounded context: in principle the title of a book would affect the hidden states of last word of the book.
+* Disadvantages:
+  - The whole history of the document reading is compressed into fixed-size vector at each time step, which is the bottleneck of this model
+  - Gradients tend to vanish with long contexts
+  - Not possible to parallelize over time-steps, so slow training
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig3.jpg">
+  <center>  Fig.3: Recurrent language model</center>
+</figure>
+-->
+
+### Рекуррентные модели языка
+
+* Наиболее популярный подход вплоть до недавних лет
+* Концептуально просты: на каждом временном шаге мы поддерживаем некоторое состояние (полученное с предыдущего временного шага), которое представляет всё прочитанное до сих пор. Оно комбинируется с текущим читаемым словом и используется на последующих шагах. Затем мы повторяем этот процесс столько временных шагов, сколько нам необходимо.
+* Пользуется неограниченным контекстом: в принципе название книги повлияет на скрытое состояние последнего слова в книге.
+* Недостатки:
+  - Вся история чтения документа сжимается в вектор фиксированной размерности на каждом временном шаге, что является узким местом этой модели
+  - Градиенты имеют тенденцию затухать при длинном контексте
+  - Нет возможности распараллеливания по временным шагам, отсюда медленное обучение
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig3.jpg">
+  <center> Рис.3: Рекуррентная модель языка</center>
+</figure>
+
+<!--
+### [Transformer Language Models](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=828s)
+
+* Most recent model used in NLP
+* Revolutionized penalty
+* Three main stages
+    * Input stage
+    * $n$ times transformer blocks (encoding layers) with different parameters
+    * Output stage
+* Example with 6 transformer modules (encoding layers) in the original transformer paper:
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig4.jpg">
+  <center>  Fig.4:Transformer language model </center>
+</figure>
+
+Sub-layers are connected by the boxes labelled "Add&Norm". The "Add" part means it is a residual connection, which helps in stopping the gradient from vanishing. The norm here denotes layer normalization.
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig5.jpg">
+  <center>  Fig.5: Encoder Layer </center>
+</figure>
+
+It should be noted that transformers share weights across time-steps.
+-->
+
+### [Модель языка трансформер](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=828s)
+
+* Новейшая модель, используемая в обработке естественного языка
+* Революционный штраф
+* Три основных этапа
+    * Входной этап
+    * $n$ блоков трансформеров (кодирующих слоёв) с различными параметрами
+    * Выходной этап
+* Пример с 6 модулями трансформера (кодирующими слоями) из оригинальной статьи о трансформерах:
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig4.jpg">
+  <center> Рис.4: Модель языка трансформер </center>
+</figure>
+
+Подслои соединяются посредством блоков, отмеченных "Add&Norm". Часть "Add" означает остаточное соединение, которое помогает предотвратить затухание градиента. "Norm" здесь обозначает нормализацию слоя.
+
+<figure>
+  <img src="{{site.baseurl}}/images/week12/12-1/fig5.jpg">
+  <center> Рис.5: Кодирующий слой</center>
+</figure>
+
+Следует отметить, что трансформеры используют общие веса на всех временных шагах.
+
+
+<!--
+# Multi-headed attention
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig6.png">
+<center>  Fig.6: Multi-headed Attention </center>
+</figure>
+
+
+For the words we are trying to predict, we compute values called **query(q)**. For all the previous words use to predict we call them **keys(k)**. Query is something that tells about the context, such as previous adjectives. Key is like a label containing information about the current word such as whether it's an adjective or not. Once q is computed, we can derive the distribution of previous words ($p_i$):
+
+$$p_i = \text{softmax}(q,k_i)$$
+
+Then we also compute quantities called **values(v)** for the previous words. Values represent the content of the words.
+
+Once we have the values, we compute the hidden states by maximizing the attention distribution:
+
+ $$h_i = \sum_{i}{p_i v_i}$$
+
+We compute the same thing with different queries, values, and keys multiple times in parallel. The reason for this is that we want to predict the next word using different things. For example, when we predict the word "unicorns" using three previous words "These" "horned" and "silver-white". We know it is a unicorn by "horned" "silver-white". However, we can know it is plural "unicorns" by "These". Therefore, we probably want to use all these three words to know what the next word should be. Multi-headed attention is a way of letting each word look at multiple previous words.
+
+One big advantage about the multi-headed attention is that it is very parallelisable. Unlike RNNs, it computes all heads of the multi-head attention modules and all the time-steps at once. One problem of computing all time-steps at once is that it could look at futures words too, while we only want to condition on previous words. One solution to that is what is called **self-attention masking**. The mask is an upper triangular matrix that have zeros in the lower triangle and negative infinity in the upper triangle. The effect of adding this mask to the output of the attention module is that every word  to the left has a much higher attention score than words to the right, so the model in practice only focuses on previous words. The application of the mask is crucial in language model because it makes it mathematically correct, however, in text encoders, bidirectional context can be helpful.
+
+One detail to make the transformer language model work is to add the positional embedding to the input. In language, some properties like the order are important to interpret. The technique used here is learning separate embeddings at different time-steps and adding these to the input, so the input now is the summation of word vector and the positional vector. This gives the order information.
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig7.png">
+<center>  Fig.7: Transformer Architecture </center>
+</figure>
+
+**Why the model is so good:**
+
+1. It gives direct connections between each pair of words. Each word can directly access the hidden states of the previous words, mitigating vanishing gradients. It learns very expensive function very easily
+2. All time-steps are computed in parallel
+3. Self-attention is quadratic (all time-steps can attend to all others), limiting maximum sequence length
+-->
+
+
+# Многоголовое внимание
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig6.png">
+<center> Рис.6: Многоголовое внимание</center>
+</figure>
+
+
+Для слов, которые мы пытаемся предсказать, мы вычисляем величины, называемые **query(q)** (запросы). Для всех предыдущих слов, используемых для предсказания, вычисляются величины, называемые **keys(k)** (ключи). Запрос - это то, что говорит о контексте, например о предыдущих прилагательных. Ключ - это нечто вроде метки, содержащей информацию о текущем слове, например, является ли оно прилагательным. После вычисления q мы можем получить распределение по предыдущим словам ($p_i$):
+
+$$p_i = \text{softmax}(q,k_i)$$
+
+Затем мы также вычисляем величины, называемые **values(v)** для предыдущих слов. Значения представляют содержимое слов.
+
+Как только мы получили значения, вычисляем скрытые состояния как сумму значений, взвешенную распределением внимания:
+
+ $$h = \sum_{i}{p_i v_i}$$
+
+Мы вычисляем то же самое с различными запросами, значениями и ключами много раз параллельно. Причина в том, что мы хотим предсказывать следующее слово, опираясь на разные признаки. Например, мы предсказываем слово "единороги" по трём предыдущим словам: "Эти", "рогатые" и "серебристо-белые". Мы знаем, что это единорог, по словам "рогатые" и "серебристо-белые". Однако о множественном числе "единороги" мы узнаём по слову "Эти". Поэтому мы, вероятно, захотим использовать все три слова, чтобы определить, каким должно быть следующее. Многоголовое внимание - это способ позволить каждому слову посмотреть на несколько предыдущих.
+
+Одним из больших преимуществ многоголового внимания является его хорошая параллелизуемость. В отличие от RNN, оно вычисляет все головы модулей многоголового внимания и все временные шаги за раз. Одна из проблем одновременного вычисления всех временных шагов заключается в том, что модель может смотреть и на будущие слова, в то время как мы хотим учитывать только предыдущие. Одно из решений этой проблемы - так называемое **маскирование self-attention**. Маска - это верхнетреугольная матрица, имеющая нули в нижнем треугольнике и минус бесконечность в верхнем. Эффект добавления этой маски к выходу модуля внимания состоит в том, что каждое слово слева получает гораздо более высокую оценку внимания, чем слова справа, поэтому модель на практике фокусируется только на предыдущих словах. Применение маски имеет решающее значение в модели языка, поскольку оно делает её математически корректной, однако в кодировщиках текста двунаправленный контекст может быть полезен.
+
+Одна деталь, необходимая для работы модели языка трансформер, - добавление позиционных характеристик ко входу. В языке некоторые свойства, такие как порядок слов, важны для интерпретации. Используемая здесь техника заключается в обучении отдельных характеристик для различных временных шагов и добавлении их ко входу, так что теперь вход является суммой вектора слова и позиционного вектора. Это даёт информацию о порядке.
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig7.png">
+<center> Рис.7: Архитектура трансформера</center>
+</figure>
+
+**Почему модель так хороша:**
+
+1. Она даёт прямые соединения между каждой парой слов. Каждое слово может напрямую получить доступ к скрытым состояниям предыдущих слов, что смягчает затухание градиентов. Она очень легко выучивает очень дорогие функции.
+2. Все временные шаги вычисляются параллельно
+3. Self-attention квадратично (все временные шаги могут следить за всеми другими), ограничивая максимальную длину последовательности
+
+
+<!--
+## [Some tricks (especially for multi-head attention and positional encoding) and decoding Language Models](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=1975s)
+-->
+
+## [Некоторые приёмы (особенно для многоголового внимания и позиционного кодирования) и декодирование моделей языка](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=1975s)
+
+<!--
+### Trick 1: Extensive use of layer normalization to stabilize training is really helpful
+
+- Really important for transformers
+-->
+
+### Приём 1: Широкое применение нормализации слоёв действительно полезно для стабилизации обучения
+
+- Действительно важно для трансформеров
+
+<!--
+### Trick 2: Warm-up + Inverse-square root training schedule
+
+- Make use of learning rate schedule: in order to make the transformers work well, you have to make your learning rate decay linearly from zero to thousandth steps.
+-->
+### Приём 2: Разогрев (warm-up) + расписание скорости обучения по обратному квадратному корню
+
+- Используйте расписание скорости обучения: чтобы трансформеры работали хорошо, скорость обучения должна линейно возрастать от нуля в течение первых тысяч шагов (разогрев), а затем убывать обратно пропорционально квадратному корню из номера шага.
+
+<!--
+### Trick 3: Careful initialization
+
+- Really helpful for a task like machine translation
+-->
+
+### Приём 3: Тщательная инициализация
+
+- Действительно полезна для таких задач, как машинный перевод
+
+<!--
+### Trick 4: Label smoothing
+
+- Really helpful for a task like machine translation
+
+The following are the results from some methods mentioned above. In these tests, the metric on the right called `ppl` was perplexity (the lower the `ppl` the better).
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig8.png">
+<center>  Fig.8: Model Performance Comparison </center>
+</figure>
+
+You could see that when transformers were introduced, the performance was greatly improved.
+-->
+
+### Приём 4: Сглаживание меток
+
+- Действительно полезно для таких задач, как машинный перевод
+
+Ниже приведены результаты некоторых методов, упомянутых выше. В этих тестах метрика справа, обозначенная `ppl`, - это перплексия (чем меньше `ppl`, тем лучше).
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig8.png">
+<center> Рис.8: Сравнение производительности моделей </center>
+</figure>
+
+Как видно, с появлением трансформеров качество значительно улучшилось.
+
+<!--
+## Some important facts of Transformer Language Models
+
+ - Minimal inductive bias
+ - All words directly connected, which will mitigate vanishing gradients.
+ - All time-steps computed in parallel.
+
+
+Self attention is quadratic (all time-steps can attend to all others), limiting maximum sequence length.
+
+- As self attention is quadratic, its expense grows linearly in practice, which could cause a problem.
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig9.png">
+<center>  Fig.9: Transformers *vs.* RNNs </center>
+</figure>
+-->
+
+## Некоторые важные факты о моделях языка трансформерах
+
+ - Минимальное индуктивное смещение.
+ - Все слова напрямую связаны, что смягчает затухание градиентов.
+ - Все временные шаги вычисляются параллельно.
+
+
+Self-attention квадратично (все временные шаги могут следить за всеми другими), ограничивая максимальную длину последовательности.
+
+- Поскольку self-attention квадратично, его стоимость растёт квадратично с длиной последовательности, что на практике может вызывать проблемы.
+
+<figure>
+<img src="{{site.baseurl}}/images/week12/12-1/fig9.png">
+<center> Рис.9: Трансформеры *против* RNN </center>
+</figure>
+
+<!--
+### Transformers scale up very well
+
+1. Unlimited training data, even far more than you need
+2. GPT 2 used 2 billion parameters in 2019
+3. Recent models use up to 17B parameters in 2020
+-->
+
+### Трансформеры очень хорошо масштабируются
+
+1. Неограниченные обучающие данные, даже больше, чем вам нужно
+2. GPT 2 использовала 2 миллиарда параметров в 2019
+3. Последние модели используют до 17 млрд параметров в 2020
+
+<!--
+## Decoding Language Models
+
+We can now train a probability distribution over text - now essentially we could get exponentially many possible outputs, so we can’t compute the maximum. Whatever choice you make for your first word could affect all the other decisions.
+Thus, given that, the greedy decoding was introduced as follows.
+-->
+
+## Декодирование моделей языка
+
+Теперь мы можем обучить вероятностное распределение по тексту, но возможных выходов, по сути, экспоненциально много, поэтому мы не можем вычислить максимум. Какой бы выбор вы ни сделали для первого слова, он может повлиять на все остальные решения. Учитывая это, было предложено жадное декодирование, описанное ниже.
+
+<!--
+### Greedy Decoding does not work
+
+We take most likely word at each time step. However, no guarantee this gives most likely sequence because if you have to make that step at some point, then you get no way of back-tracking your search to undo any previous sessions.
+-->
+
+### Жадное декодирование не работает
+
+Мы берём наиболее вероятное слово на каждом временном шаге. Однако нет никаких гарантий, что такой подход даст наиболее вероятную последовательность: если вы сделали этот шаг в какой-то момент, у вас уже нет возможности вернуться назад в поиске и отменить предыдущие решения.
+
+<!--
+### Exhaustive search also not possible
+
+It requires computing all possible sequences and because of the complexity of $O(V^T)$, it will be too expensive
+-->
+
+### Полный перебор также невозможен
+
+Он требует вычисления всех возможных последовательностей, и, поскольку сложность составляет $O(V^T)$, это будет слишком дорого.
+
+
+<!--
+## Comprehension Questions and Answers
+
+1. What is the benefit of multi-headed attention as opposed to a single-headed attention model?
+
+    * To predict the next word you need to observe multiple separate things, in other words attention can be placed on multiple previous words in trying to understand the context necessary to predict the next word.
+
+2. How do transformers solve the informational bottlenecks of CNNs and RNNs?
+
+    * Attention models allow for direct connection between all words allowing for each word to be conditioned on all previous words, effectively removing this bottleneck.
+
+3. How do transformers differ from RNNs in the way they exploit GPU parallelization?
+
+    * The multi-headed attention modules in transformers are highly parallelisable whereas RNNs are not and therefore cannot take advantage of GPU technology. In fact transformers compute all time steps at once in single forward pass.
+-->
+
+## Вопросы и ответы для понимания
+1. В чём преимущество многоголового внимания по сравнению с моделью одноголового внимания?
+    * Чтобы предсказать следующее слово, вам нужно наблюдать несколько различных вещей, другими словами внимание можно сосредоточить на нескольких предыдущих словах, пытаясь понять контекст, необходимый для предсказания следующего слова
+
+2. Как трансформеры устраняют информационные узкие места CNN и RNN?
+    * Модели внимания позволяют установить прямую связь между всеми словами, позволяя каждому слову быть обусловленным всеми предыдущими, эффективно устраняя это узкое место.
+
+3. Чем трансформеры отличаются от RNN в смысле использования параллелизации GPU?
+    * Модули многоголового внимания в трансформерах хорошо параллелизуемы, тогда как RNN - нет, и поэтому рекуррентные сети не могут использовать преимущества GPU. По факту трансформеры вычисляют все временные шаги за раз за один прямой проход.
+
diff --git a/docs/ru/week12/12-2.md b/docs/ru/week12/12-2.md
new file mode 100644
index 000000000..27e206918
--- /dev/null
+++ b/docs/ru/week12/12-2.md
@@ -0,0 +1,638 @@
+---
+lang: ru
+lang-ref: ch.12-2
+title: Декодирование моделей языка
+lecturer: Mike Lewis
+authors: Trevor Mitchell, Andrii Dobroshynskyi, Shreyas Chandrakaladharan, Ben Wolfson
+date: 20 Apr 2020
+translation-date: 03 Dec 2020
+translator: Evgeniy Pak
+---
+
+
+<!-- ## [Beam Search](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+
+Beam search is another technique for decoding a language model and producing text. At every step, the algorithm keeps track of the $k$ most probable (best) partial translations (hypotheses). The score of each hypothesis is equal to its log probability.
+
+The algorithm selects the best scoring hypothesis.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Beam_Decoding.png" width="60%"/><br>
+<b>Fig. 1</b>: Beam Decoding
+</center>
+
+How deep does the beam tree branch out ?
+
+The beam tree continues until it reaches the end of sentence token. Upon outputting the end of sentence token, the hypothesis is finished.
+
+Why (in NMT) do very large beam sizes often results in empty translations?
+
+At training time, the algorithm often does not use a beam, because it is very expensive. Instead it uses auto-regressive factorization (given previous correct outputs, predict the $n+1$ first words). The model is not exposed to its own mistakes during training, so it is possible for “nonsense” to show up in the beam.
+
+Summary: Continue beam search until all $k$ hypotheses produce end token or until the maximum decoding limit T is reached. -->
+
+## [Лучевой поиск](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+
+Лучевой поиск - это ещё одна техника декодирования модели языка и генерации текста. На каждом шаге алгоритм отслеживает $k$ наиболее вероятных (наилучших) частичных переводов (гипотез). Оценка каждой гипотезы равна логарифму её вероятности.
+
+Алгоритм выбирает гипотезу с наилучшей оценкой.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Beam_Decoding.png" width="60%"/><br>
+<b>Рис. 1</b>: Лучевое декодирование
+</center>
+
+Как глубоко разветвляется лучевое дерево?
+
+Лучевое дерево продолжается, пока не достигнет токена конца предложения. После вывода токена конца предложения гипотеза считается завершённой.
+
+Почему (в нейронном машинном переводе) очень большие размеры луча часто приводят к пустым переводам?
+
+Во время обучения алгоритм часто не использует луч, поскольку это очень дорого. Вместо этого используется авторегрессивная факторизация (по заданным предыдущим корректным выходам предсказываются первые $n+1$ слов). Модель не сталкивается с собственными ошибками в процессе обучения, так что в луче может появиться "бессмыслица".
+
+Сводка: продолжайте лучевой поиск, пока все $k$ гипотез не породят конечный токен или пока не будет достигнут максимальный предел декодирования T.
+
+
+<!-- ### Sampling
+
+We may not want the most likely sequence. Instead we can sample from the model distribution.
+
+However, sampling from the model distribution poses its own problem. Once a "bad" choice is sampled, the model is in a state it never faced during training, increasing the likelihood of continued "bad" evaluation. The algorithm can therefore get stuck in horrible feedback loops. -->
+
+### Семплирование
+
+Нам может быть не нужна наиболее вероятная последовательность. Вместо этого мы можем семплировать из распределения модели.
+
+Однако семплирование из распределения модели порождает свои проблемы. После "плохого" выбора модель оказывается в состоянии, с которым никогда не сталкивалась в процессе обучения, что повышает вероятность дальнейших "плохих" оценок. Поэтому алгоритм может застрять в ужасных циклах обратной связи.
+
+
+<!-- ### Top-K Sampling
+
+A pure sampling technique where you truncate the distribution to the $k$ best and then renormalise and sample from the distribution.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Top_K_Sampling.png" width="60%"/><br>
+<b>Fig. 2</b>: Top K Sampling
+</center> -->
+
+
+### Топ-K семплирование
+
+Чистая техника семплирования, при которой вы усекаете распределение до $k$ наилучших вариантов, затем перенормируете его и семплируете из полученного распределения.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Top_K_Sampling.png" width="60%"/><br>
+<b>Рис. 2</b>: Топ-K семплирование
+</center>
+
+
+<!-- #### Question: Why does Top-K sampling work so well?
+
+This technique works well because it essentially tries to prevent falling off of the manifold of good language when we sample something bad by only using the head of the distribution and chopping off the tail. -->
+
+
+#### Вопрос: Почему Топ-K семплирование работает так хорошо?
+
+Этот метод работает хорошо, поскольку он по сути пытается предотвратить выход за пределы многообразия хорошего языка, когда мы выбираем что-то плохое, используя только головную часть распределения и обрезая хвостовую часть.
+
+
+<!-- ## Evaluating Text Generation
+
+Evaluating the language model requires simply log likelihood of the held-out data. However, it is difficult to evaluate text. Commonly word overlap metrics with a reference (BLEU, ROUGE etc.) are used, but they have their own issues. -->
+
+
+## Оценка генерации текста
+
+Для оценки модели языка достаточно вычислить логарифм правдоподобия на отложенных данных. Однако оценить сгенерированный текст сложно. Обычно используются метрики совпадения слов с эталоном (BLEU, ROUGE и др.), но у них есть свои проблемы.
+
+
+<!-- ## Sequence-To-Sequence Models -->
+
+## Sequence-To-Sequence модели
+
+
+<!-- ### Conditional Language Models
+
+Conditional Language Models are not useful for generating random samples of English, but they are useful for generating a text given an input.
+
+Examples:
+
+- Given a French sentence, generate the English translation
+- Given a document, generate a summary
+- Given a dialogue, generate the next response
+- Given a question, generate the answer -->
+
+### Обусловленные модели языка
+
+Обусловленные модели языка не подходят для генерации случайных семплов на английском, но они полезны для генерации текста по заданному входу.
+
+Примеры:
+
+- По заданному предложению на французском сгенерируйте английский перевод
+- По заданному документу сгенерируйте краткое изложение
+- По заданному диалогу сгенерируйте следующий ответ
+- По заданному вопросу сгенерируйте ответ
+
+
+<!-- ### Sequence-To-Sequence Models
+
+Generally, the input text is encoded. This resulting embedding is known as a "thought vector", which is then passed to the decoder to generate tokens word by word.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_Models.png" width="60%"/><br>
+<b>Fig. 3</b>: Thought Vector
+</center> -->
+
+
+### Sequence-To-Sequence модели
+
+Обычно входной текст кодируется. Эта результирующая характеристика известна как "thought vector" и затем передаётся декодеру для генерации токенов слово за словом.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_Models.png" width="60%"/><br>
+<b>Рис. 3</b>: Thought Vector
+</center>
+
+
+<!-- ### Sequence-To-Sequence Transformer
+
+The sequence-to-sequence variation of transformers has 2 stacks:
+
+1. Encoder Stack – Self-attention isn't masked so every token in the input can look at every other token in the input
+
+2. Decoder Stack – Apart from using attention over itself, it also uses attention over the complete inputs
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_transformers.png" width="60%"/><br>
+<b>Fig. 4</b>: Sequence to Sequence Transformer
+</center>
+
+Every token in the output has direct connection to every previous token in the output, and also to every word in the input. The connections make the models very expressive and powerful. These transformers have made improvements in translation score over previous recurrent and convolutional models. -->
+
+
+### Sequence-To-Sequence трансформер
+
+Sequence-to-sequence варианты трансформеров имеют 2 стека:
+
+1. Стек кодировщик – Self-attention не маскируется, так что каждый входной токен может смотреть на любой другой токен входа
+
+2. Стек декодировщик – Помимо использования внимания на себе, он также использует внимание по всему входу
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_transformers.png" width="60%"/><br>
+<b>Рис. 4</b>: Sequence to Sequence трансформер
+</center>
+
+Каждый токен на выходе имеет прямую связь с каждым предыдущим выходным токеном, а также с каждым входным словом. Связи делают модели очень выразительными и мощными. Эти трансформеры улучшили оценку в машинном переводе по сравнению с предыдущими рекуррентными и свёрточными моделями.
+
+
+<!-- ## [Back-translation](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+
+When training these models, we typically rely on large amounts of labelled text. A good source of data is from European Parliament proceedings - the text is manually translated into different languages which we then can use as inputs and outputs of the model. -->
+
+
+## [Обратный перевод](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+
+При обучении этих моделей мы обычно полагаемся на большие объёмы размеченного текста. Хороший источник данных - протоколы заседаний Европейского Парламента: тексты вручную переведены на разные языки, и мы можем использовать их как входы и выходы модели.
+
+
+<!-- ### Issues
+
+- Not all languages are represented in the European parliament, meaning that we will not get translation pair for all languages we might be interested in. How do we find text for training in a language we can't necessarily get the data for?
+- Since models like transformers do much better with more data, how do we use monolingual text efficiently, *i.e.* no input / output pairs?
+
+Assume we want to train a model to translate German into English. The idea of back-translation is to first train a reverse model of English to German
+
+- Using some limited bi-text we can acquire same sentences in 2 different languages
+- Once we have an English to German model, translate a lot of monolingual words from English to German.
+
+Finally, train the German to English model using the German words that have been 'back-translated' in the previous step. We note that:
+
+- It doesn't matter how good the reverse model is - we might have noisy German translations but end up translating to clean English.
+- We need to learn to understand English well beyond the data of English / German pairs (already translated) - use large amounts of monolingual English -->
+
+
+### Проблемы
+
+- Не все языки представлены в Европейском Парламенте, а это значит, что мы не получим пары переводов для всех языков, которые могут нас интересовать. Как найти текст для обучения на языке, для которого мы не обязательно можем получить данные?
+- Поскольку такие модели, как трансформеры, работают намного лучше при большем количестве данных, как эффективно использовать монолингвистический текст, *т.е.* текст без пар вход / выход?
+
+Предположим, мы хотим обучить модель для перевода с немецкого языка на английский. Идея обратного перевода заключается в том, что сперва обучаем обратную модель с английского на немецкий.
+
+- Используя некоторый ограниченный парный текст, мы можем получать одинаковые предложения на двух различных языках
+- Как только мы имеем модель перевода с английского языка на немецкий, переводим множество монолингвистических слов с английского на немецкий.
+
+Наконец, обучаем модель перевода с немецкого языка на английский, используя немецкие слова, которые были 'обратно переведены' на предыдущем шаге. Отметим, что:
+
+- Неважно, насколько хороша обратная модель - мы можем иметь зашумлённые немецкие переводы, но в итоге получаем перевод на чистый английский.
+- Нам нужно обучить модель понимать английский далеко за пределами данных английских/немецких пар (уже переведённых) - использовать большие объёмы монолингвистических данных на английском языке
+
+
+<!-- ### Iterated Back-translation
+
+- We can iterate the procedure of back-translation in order to generate even more bi-text data and reach much better performance - just keep training using monolingual data.
+- Helps a lot when not a lot of parallel data -->
+
+
+### Итеративный обратный перевод
+
+- Мы можем итерировать процедуру обратного перевода, чтобы генерировать ещё больше параллельных текстовых данных и достичь намного лучшей производительности - просто продолжая обучаться на монолингвистических данных.
+- Сильно помогает, когда немного параллельных данных
+
+
+<!-- ## Massive multilingual MT
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-language-mt.png" width="60%"/><br>
+<b>Fig. 5</b>: Multilingual MT
+</center>
+
+- Instead of trying to learn a translation from one language to another, try to build a neural net to learn multiple language translations.
+- Model is learning some general language-independent information.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-mt-results.gif" width="60%"/><br>
+<b>Fig. 6</b>: Multilingual NN Results
+</center>
+
+Great results especially if we want to train a model to translate to a language that does not have a lot of available data for us (low resource language). -->
+
+
+## Массивный мультиязычный машинный перевод
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-language-mt.png" width="60%"/><br>
+<b>Рис. 5</b>: мультиязычный машинный перевод
+</center>
+
+- Вместо того, чтобы пытаться обучить модель переводить с одного языка на другой, попытаться создать нейронную сеть для обучения переводам на несколько языков.
+- Модель изучает некоторую общую языко-независимую информацию.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-mt-results.gif" width="60%"/><br>
+<b>Рис. 6</b>: Результаты мультиязычной нейронной сети
+</center>
+
+Отличные результаты, особенно если мы хотим обучить модель переводить на язык, для которого у нас мало доступных данных (низкоресурсный язык).
+
+
+<!-- ## Unsupervised Learning for NLP
+
+There are huge amounts of text without any labels and little of supervised data. How much can we learn about the language by just reading unlabelled text? -->
+
+
+## Обучение без учителя для обработки естественного языка
+
+Есть большое количество текстовых данных без какой-либо разметки и немного размеченных данных. Как много мы можем изучить о языке, просто читая неразмеченный текст?
+
+
+<!-- ### `word2vec`
+
+Intuition - if words appear close together in the text, they are likely to be related, so we hope that by just looking at unlabelled English text, we can learn what they mean.
+
+- Goal is to learn vector space representations for words (learn embeddings)
+
+Pretraining task - mask some word and use neighbouring words to fill in the blanks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-masking.gif" width="60%"/><br>
+<b>Fig. 7</b>: word2vec masking visual
+</center>
+
+For instance, here, the idea is that "horned" and "silver-haired" are more likely to appear in the context of "unicorn" than some other animal.
+
+Take the words and apply a linear projection
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-embeddings.png" width="60%"/><br>
+<b>Fig. 8</b>:  word2vec embeddings
+</center>
+
+Want to know
+
+$$
+p(\texttt{unicorn} \mid \texttt{These silver-haired ??? were previously unknown})
+$$
+
+$$
+p(x_n \mid x_{-n}) = \text{softmax}(\text{E}f(x_{-n}))
+$$
+
+Word embeddings hold some structure
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/embeddings-structure.png" width="60%"/><br>
+<b>Fig. 9</b>: Embedding structure example
+</center>
+
+- The idea is if we take the embedding for "king" after training and add the embedding for "female" we will get an embedding very close to that of "queen"
+- Shows some meaningful differences between vectors -->
+
+
+### `word2vec`
+
+Интуиция - если слова появляются близко друг к другу в тексте, они, вероятно, связаны между собой, поэтому мы надеемся, что, просто посмотрев на неразмеченный английский текст, мы можем узнать, что они значат.
+
+- Целью является изучить векторное пространство представлений слов (изучить характеристики)
+
+Задача предобучения - замаскируем некоторые слова и используем соседние слова, чтобы заполнить пробелы.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-masking.gif" width="60%"/><br>
+<b>Рис. 7</b>: word2vec визуализация маскировки
+</center>
+
+Например, здесь, идея заключается в том, что "рогатый" и "седовласый" более вероятно появятся в контексте "единорога", чем какого-то другого животного.
+
+Возьмём слова и применим линейную проекцию
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-embeddings.png" width="60%"/><br>
+<b>Рис. 8</b>:  word2vec характеристики
+</center>
+
+Хотим узнать
+
+$$
+p(\texttt{единороги} \mid \texttt{Эти седовласые ??? были ранее неизвестны})
+$$
+
+$$
+p(x_n \mid x_{-n}) = \text{softmax}(\text{E}f(x_{-n}))
+$$
+
+Характеристики слов придерживаются некоторой структуры
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/embeddings-structure.png" width="60%"/><br>
+<b>Рис. 9</b>: Пример структуры характеристик
+</center>
+
+- Идея заключается в том, что если мы возьмём характеристический вектор для слова "король" после обучения и добавим характеристический вектор для слова "женский" мы получим характеристику, очень близкую к слову "королева"
+- Показывает некоторые выразительные разности между векторами
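+
+Схематично эту арифметику над векторами можно записать так (обозначение $\boldsymbol{e}_w$ для характеристического вектора слова $w$ введено здесь для иллюстрации; равенство приближённое):
+
+$$
+\boldsymbol{e}_{\text{король}} + \boldsymbol{e}_{\text{женский}} \approx \boldsymbol{e}_{\text{королева}}
+$$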
+
+
+<!-- #### Question: Are the word representation dependent or independent of context?
+
+Independent and have no idea how they relate to other words -->
+
+#### Вопрос: Представления слов зависимы или независимы от контекста?
+
+Независимы и не имеют представления о том, как они связаны с другими словами
+
+
+<!-- #### Question: What would be an example of a situation that this model would struggle in?
+
+Interpretation of words depends strongly on context. So in the instance of ambiguous words - words that may have multiple meanings - the model will struggle since the embeddings vectors won't capture the context needed to correctly understand the word. -->
+
+#### Вопрос: Какой может быть пример ситуации, затруднительный для данной модели?
+
+Интерпретация слов сильно зависит от контекста. Поэтому на примерах двусмысленных слов - слов, которые могут иметь множество значений - модель будет затрудняться, поскольку характеристические вектора не будут захватывать контекст, необходимый для корректного понимания слова.
+
+<!-- ### GPT
+
+To add context, we can train a conditional language model. Then given this language model, which predicts a word at every time step, replace each output of model with some other feature.
+
+- Pretraining - predict next word
+- Fine-tuning - change to a specific task. Examples:
+  - Predict whether noun or adjective
+  - Given some text comprising an Amazon review, predict the sentiment score for the review
+
+This approach is good because we can reuse the model. We pretrain one large model and can fine tune to other tasks. -->
+
+### GPT
+
+Чтобы добавить контекст, мы можем обучить обусловленную модель языка. Затем, по заданной модели языка, которая предсказывает слово на каждом временном шаге, заменяем каждый выход модели на некоторую другую характеристику.
+
+- Предобучение - предсказываем следующее слово
+- Тонкая настройка - заменяем на специфическую задачу. Например:
+  - Предсказываем где существительное, а где прилагательное
+  - По заданному некоторому тексту, содержащему обзор с Amazon, предсказать оценку настроения для обзора
+
+Этот подход хорош, поскольку мы можем использовать модель повторно. Мы предобучаем одну большую модель и можем тонко настраивать её для других задач.
+
+
+<!-- ### ELMo
+
+GPT only considers leftward context, which means the model can't depend on any future words - this limits what the model can do quite a lot.
+
+Here the approach is to train _two_ language models
+
+- One on the text left to right
+- One on the text right to left
+- Concatenate the output of the two models in order to get the word representation. Now can condition on both the rightward and leftward context.
+
+This is still a "shallow" combination, and we want some more complex interaction between the left and right context. -->
+
+
+### ELMo
+
+GPT рассматривает только левосторонний контекст, то есть модель не может зависеть от каких-либо будущих слов - это довольно сильно ограничивает возможности модели.
+
+Подход заключается в обучении _двух_ моделей языка
+
+- Одну на тексте слева направо
+- Одну на тексте справа налево
+- Конкатенируем выходы двух моделей, чтобы получить представление слова. Теперь можно обусловливать на обоих: правостороннем и левостороннем контексте.
+
+Это до сих пор "поверхностная" комбинация, и мы хотим некоторое более сложное взаимодействие между левым и правым контекстом.
+
+
+<!-- ### BERT
+
+BERT is similar to word2vec in the sense that we also have a fill-in-a-blank task. However, in word2vec we had linear projections, while in BERT there is a large transformer that is able to look at more context. To train, we mask 15% of the tokens and try to predict the blank.
+
+Can scale up BERT (RoBERTa):
+
+- Simplify BERT pre-training objective
+- Scale up the batch size
+- Train on large amounts of GPUs
+- Train on even more text
+
+Even larger improvements on top of BERT performance - on question answering task performance is superhuman now. -->
+
+
+### BERT
+
+BERT похож на word2vec в том смысле, что у нас также есть задача заполнения пробелов. Однако в word2vec у нас были линейные проекции, в то время как в BERT есть большой трансформер, который может учитывать больше контекста. Для обучения мы маскируем 15% токенов и пытаемся предсказать пробелы.
+
+Можем увеличить масштаб BERT (RoBERTa):
+
+- Упростим задачу предобучения BERT
+- Увеличим размер батча
+- Обучим на большом количестве GPUs
+- Обучим на ещё большем количестве текста
+
+Ещё большие улучшения по сравнению с BERT - в задаче ответов на вопросы производительность теперь сверхчеловеческая.
+
+
+<!-- ## [Pre-training for NLP](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+
+Let us take a quick look at different self-supervised pre training approaches that have been researched for NLP.
+
+- XLNet:
+
+  Instead of predicting all the masked tokens conditionally independently, XLNet predicts masked tokens auto-regressively in random order
+
+- SpanBERT
+
+   Mask spans (sequence of consecutive words) instead of tokens
+
+- ELECTRA:
+
+  Rather than masking words we substitute tokens with similar ones.  Then we solve a binary classification problem by trying to predict whether the tokens have been substituted or not.
+
+- ALBERT:
+
+  A Lite Bert: We modify BERT and make it lighter by tying the weights across layers. This reduces the parameters of the model and the computations involved. Interestingly, the authors of ALBERT did not have to compromise much on accuracy.
+
+- XLM:
+
+  Multilingual BERT: Instead of feeding such English text, we feed in text from multiple languages. As expected, it learned cross lingual connections better.
+
+The key takeaways from the different models mentioned above are
+
+- Lot of different pre-training objectives work well!
+
+- Crucial to model deep, bidirectional interactions between words
+
+- Large gains from scaling up pre-training, with no clear limits yet
+
+
+Most of the models discussed above are engineered towards solving the text classification problem. However, in order to solve text generation problem, where we generate output sequentially much like the `seq2seq` model, we need a slightly different approach to pre training. -->
+
+
+## [Предобучение для обработки естественного языка](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+
+Давайте кратко рассмотрим различные подходы к предобучению с самообучением (self-supervised), исследованные для обработки естественного языка.
+
+- XLNet:
+
+  Вместо того, чтобы предсказывать все замаскированные токены условно-независимо, XLNet предсказывает замаскированные токены авторегрессивно в случайном порядке
+
+- SpanBERT
+
+   Маскирует диапазон (последовательность слов) вместо токенов
+
+- ELECTRA:
+
+  Вместо маскировки слов, мы заменяем токены на похожие. Затем мы решаем задачу бинарной классификации, пытаясь предсказать, где были заменены токены.
+
+- ALBERT:
+
+  Облегчённый BERT (A Lite BERT): мы модифицируем BERT и облегчаем его, связывая веса между слоями. Это уменьшает количество параметров модели и объём вычислений. Интересно, что авторам ALBERT не пришлось сильно жертвовать точностью.
+
+- XLM:
+
+  Мультиязычный BERT: вместо подачи только английского текста мы подаём текст на множестве языков. Как и ожидалось, модель лучше изучает межъязыковые связи.
+
+Ключевые выводы из различных моделей, упомянутых выше:
+
+- Много различных задач предобучения работают хорошо!
+
+- Критически важно моделировать глубокие двунаправленные взаимодействия между словами
+
+- Большой выигрыш от увеличения масштабов предобучения, до сих пор без чётких ограничений.
+
+
+Большинство моделей, обсуждённых выше, разработаны для решения задачи классификации текста. Однако для решения задачи генерации текста, где мы генерируем выход последовательно, во многом как в `seq2seq` модели, нам нужен немного другой подход к предобучению.
+
+
+<!-- #### Pre-training for Conditional Generation: BART and T5
+
+BART: pre-training `seq2seq` models by de-noising text
+
+In BART, for pretraining we take a sentence and corrupt it by masking tokens randomly. Instead of predicting the masking tokens (like in the BERT objective), we feed the entire corrupted sequence and try to predict the entire correct sequence.
+
+This `seq2seq` pretraining approach give us flexibility in designing our corruption schemes. We can shuffle the sentences, remove phrases, introduce new phrases, etc.
+
+BART was able to match RoBERTa on SQUAD and GLUE tasks. However, it was the new SOTA on summarization, dialogue and abstractive QA datasets. These results reinforce our motivation for BART, being better at text generation tasks than BERT/RoBERTa. -->
+
+
+#### Предобучение для обусловленной генерации: BART и T5
+
+BART: предобучение `seq2seq` модели посредством очищения текста от шумов
+
+В BART для предобучения мы берём предложение и искажаем его, маскируя токены случайным образом. Вместо предсказания замаскированных токенов (как в задаче BERT), мы подаём целую искажённую последовательность и пытаемся предсказать исходную правильную последовательность целиком.
+
+Этот `seq2seq` подход к предобучению даёт нам гибкость в проектировании схем искажения. Мы можем перемешивать предложения, удалять фразы, вставлять новые фразы и т. д.
+
+BART сопоставим с RoBERTa на задачах SQUAD и GLUE. Однако он стал новым SOTA на наборах данных для суммаризации, диалогов и абстрактивных ответов на вопросы. Эти результаты подкрепляют нашу мотивацию для BART: он лучше справляется с задачами генерации текста, чем BERT/RoBERTa.
+
+
+<!-- ### Some open questions in NLP
+
+- How should we integrate world knowledge
+- How do we model long documents?  (BERT-based models typically use 512 tokens)
+- How do we best do multi-task learning?
+- Can we fine-tune with less data?
+- Are these models really understanding language? -->
+
+
+### Некоторые открытые вопросы в обработке естественного языка (NLP)
+
+- Как нам интегрировать мировые знания
+- Как нам моделировать длинные документы?  (модели на основе BERT обычно используют 512 токенов)
+- Как нам лучше всего выполнять многозадачное обучение?
+- Можем ли мы выполнять тонкую настройку с меньшим количеством данных?
+- Эти модели на самом деле понимают язык?
+
+
+<!-- ### Summary
+
+- Training models on lots of data beats explicitly modelling linguistic structure.
+
+From a bias variance perspective, Transformers are low bias (very expressive) models. Feeding these models lots of text is better than explicitly modelling linguistic structure (high bias). Architectures should be compressing sequences through bottlenecks
+
+- Models can learn a lot about language by predicting words in unlabelled text. This turns out to be a great unsupervised learning objective. Fine tuning for specific tasks is then easy
+
+- Bidirectional context is crucial -->
+
+
+### Резюме
+
+- Обучение моделей на большом количестве данных лучше, чем явное моделирование лингвистической структуры
+
+С точки зрения компромисса между смещением и дисперсией, трансформеры - малосмещённые (очень выразительные) модели. Подавать этим моделям большое количество текста лучше, чем явно моделировать лингвистическую структуру (сильное смещение). Архитектуры должны сжимать последовательности, пропуская их через узкие места
+
+- Модели могут изучить много о языке, предсказывая слова в неразмеченном тексте. Это оказывается отличной задачей обучения без учителя. Тонкая настройка для специфических задач после этого проста.
+
+- Двунаправленность контекста критична
+
+
+<!-- ### Additional Insights from questions after class:
+
+What are some ways to quantify 'understanding language’? How do we know that these models are really understanding language?
+
+"The trophy did not fit into the suitcase because it was too big”: Resolving the reference of ‘it’ in this sentence is tricky for machines. Humans are good at this task. There is a dataset consisting of such difficult examples and humans achieved 95% performance on that dataset. Computer programs were able to achieve only around 60% before the revolution brought about by Transformers. The modern Transformer models are able to achieve more than 90% on that dataset. This suggests that these models are not just memorizing / exploiting the data but learning concepts and objects through the statistical patterns in the data.
+
+Moreover, BERT and RoBERTa achieve superhuman performance on SQUAD and Glue. The textual summaries generated by BART look very real to humans (high BLEU scores). These facts are evidence that the models do understand language in some way. -->
+
+
+### Дополнительные идеи из вопросов после лекции:
+
+Какими способами можно измерить 'понимание языка’? Как мы можем узнать, что эти модели действительно понимают язык?
+
+"Трофей не поместился в чемодан, поскольку он был очень большим”: Разрешить ссылку ‘он’ в этом предложении сложно для машин. Люди хороши в этой задаче. Есть выборка данных, состоящая из подобных сложных примеров, и люди достигают 95% точности на этой выборке. Компьютерные программы были способны достичь лишь около 60% до революции, совершённой трансформерами. Современные модели-трансформеры способны достигать больше 90% на этой выборке данных. Это позволяет предположить, что эти модели не просто запоминают / эксплуатируют данные, но изучают концепции и объекты посредством статистических шаблонов в данных.
+
+Более того, BERT и RoBERTa достигают сверхчеловеческой производительности на SQUAD и Glue. Текстовые сводки, сгенерированные BART, смотрятся очень реалистично для людей (высокие оценки BLEU). Эти факты свидетельствуют о том, что модели в каком-то смысле понимают язык.
+
+
+<!-- #### Grounded Language
+
+Interestingly, the lecturer (Mike Lewis, Research Scientist, FAIR) is working on a concept called ‘Grounded Language’. The aim of that field of research is to build conversational agents that are able to chit-chat or negotiate. Chit-chatting and negotiating are abstract tasks with unclear objectives as compared to text classification or text summarization. -->
+
+
+#### Приземлённый язык
+
+Интересно, что лектор (Майк Льюис, учёный-исследователь, FAIR) работает над концепцией, называемой ‘Grounded Language’. Цель этой области исследований - создать разговорных агентов, которые будут способны болтать или вести переговоры. Болтовня и переговоры - абстрактные задачи с нечёткими целями по сравнению с классификацией текста или резюмированием текста.
+
+
+<!-- #### Can we evaluate whether the model already has world knowledge?
+
+‘World Knowledge’ is an abstract concept. We can test models, at the very basic level, for their world knowledge by asking them simple questions about the concepts we are interested in.  Models like BERT, RoBERTa and T5 have billions of parameters. Considering these models are trained on a huge corpus of informational text like Wikipedia, they would have memorized facts using their parameters and would be able to answer our questions. Additionally, we can also think of conducting the same knowledge test before and after fine-tuning a model on some task. This would give us a sense of how much information the model has ‘forgotten’. -->
+
+
+#### Можем ли мы оценить, когда модель уже обладает мировыми знаниями?
+
+‘Мировые знания’ это абстрактная концепция. Мы можем тестировать модели на очень базовом уровне на их мировые знания, спрашивая их простые вопросы о концепциях, которые нам интересны. Модели как BERT, RoBERTa и T5 имеют миллиарды параметров. Учитывая, что эти модели обучаются на большом своде информационного текста, как Википедия, они запомнили бы факты, используя их параметры и смогли бы ответить на наши вопросы. Более того, мы можем также подумать о проведении того же самого теста знаний до и после тонкой настройки модели для какой-либо задачи. Это даст нам представление о том, как много информации "забыла" модель.
diff --git a/docs/ru/week12/12-3.md b/docs/ru/week12/12-3.md
new file mode 100644
index 000000000..3e95e9603
--- /dev/null
+++ b/docs/ru/week12/12-3.md
@@ -0,0 +1,886 @@
+---
+lang: ru
+lang-ref: ch.12-3
+title: Внимание и Трансформер
+lecturer: Alfredo Canziani
+authors: Francesca Guiso, Annika Brundyn, Noah Kasmanoff, and Luke Martin
+date: 21 Apr 2020
+translation-date: 05 Dec 2020
+translator: Evgeniy Pak
+---
+
+
+<!-- ## [Attention](https://www.youtube.com/watch?v=f01J0Dri-6k&t=69s)
+
+We introduce the concept of attention before talking about the Transformer architecture. There are two main types of attention: self attention *vs.* cross attention, within those categories, we can have hard *vs.* soft attention.
+
+As we will later see, transformers are made up of attention modules, which are mappings between sets, rather than sequences, which means we do not impose an ordering to our inputs/outputs. -->
+
+
+## [Внимание](https://www.youtube.com/watch?v=f01J0Dri-6k&t=69s)
+
+Введём концепцию внимания перед тем, как говорить об архитектуре Трансформера. Есть два основных типа внимания: self attention и перекрёстное внимание (cross attention); внутри каждой из этих категорий можно выделить жёсткое и мягкое внимание.
+
+Как мы увидим позже, трансформеры составлены из модулей внимания, которые являются отображениями между множествами, а не последовательностями, а значит, мы не навязываем порядок нашим входам/выходам.
+
+
+<!-- ### Self Attention (I)
+
+Consider a set of $t$ input $\boldsymbol{x}$'s:
+
+$$
+\lbrace\boldsymbol{x}_i\rbrace_{i=1}^t = \lbrace\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t\rbrace
+$$
+
+where each $\boldsymbol{x}_i$ is an $n$-dimensional vector. Since the set has $t$ elements, each of which belongs to $\mathbb{R}^n$, we can represent the set as a matrix $\boldsymbol{X}\in\mathbb{R}^{n \times t}$.
+
+With self-attention, the hidden representation $h$ is a linear combination of the inputs:
+
+$$
+\boldsymbol{h} = \alpha_1 \boldsymbol{x}_1 + \alpha_2 \boldsymbol{x}_2 + \cdots +  \alpha_t \boldsymbol{x}_t
+$$
+
+Using the matrix representation described above, we can write the hidden layer as the matrix product:
+
+$$
+\boldsymbol{h} = \boldsymbol{X} \boldsymbol{a}
+$$
+
+where $\boldsymbol{a} \in \mathbb{R}^t$ is a column vector with components $\alpha_i$.
+
+Note that this differs from the hidden representation we have seen so far, where the inputs are multiplied by a matrix of weights.
+
+Depending on the constraints we impose on the vector $\vect{a}$, we can achieve hard or soft attention. -->
+
+
+### Self Attention (I)
+
+Рассмотрим множество $t$ входов $\boldsymbol{x}$'s:
+
+$$
+\lbrace\boldsymbol{x}_i\rbrace_{i=1}^t = \lbrace\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t\rbrace
+$$
+
+где каждый $\boldsymbol{x}_i$ есть $n$-мерный вектор. Поскольку в множестве есть $t$ элементов, каждый из которых принадлежит $\mathbb{R}^n$, мы можем представить множество как матрицу $\boldsymbol{X}\in\mathbb{R}^{n \times t}$.
+
+При self-attention внутреннее представление $h$ является линейной комбинацией входов:
+
+$$
+\boldsymbol{h} = \alpha_1 \boldsymbol{x}_1 + \alpha_2 \boldsymbol{x}_2 + \cdots +  \alpha_t \boldsymbol{x}_t
+$$
+
+Используя матрицу представлений, описанную выше, мы можем записать внутренний слой, как произведение матриц:
+
+$$
+\boldsymbol{h} = \boldsymbol{X} \boldsymbol{a}
+$$
+
+где $\boldsymbol{a} \in \mathbb{R}^t$ - вектор-столбец с компонентами $\alpha_i$.
+
+Отметим, что это отличается от внутреннего представления, которое мы видели до сих пор, где входы умножались на матрицу весов.
+
+В зависимости от ограничений налагаемых на вектор $\vect{a}$, мы получаем жёсткое или мягкое внимание.
+
+
+<!-- #### Hard Attention
+
+With hard-attention, we impose the following constraint on the alphas: $\Vert\vect{a}\Vert_0 = 1$. This means $\vect{a}$ is a one-hot vector. Therefore, all but one of the coefficients in the linear combination of the inputs equals zero, and the hidden representation reduces to the input $\boldsymbol{x}_i$ corresponding to the element $\alpha_i=1$. -->
+
+
+#### Жёсткое внимание
+
+При жёстком внимании, мы налагаем следующие ограничения на альфы: $\Vert\vect{a}\Vert_0 = 1$. Это значит $\vect{a}$ является унитарным вектором. Следовательно все, кроме одного, коэффициенты в линейной комбинации входов равняются нулю, и внутренние представления сокращаются до входа $\boldsymbol{x}_i$, соответствующего элементу $\alpha_i=1$.
+
+
+<!-- #### Soft Attention
+
+With soft attention, we impose that $\Vert\vect{a}\Vert_1 = 1$. The hidden representations is a linear combination of the inputs where the coefficients sum up to 1. -->
+
+
+#### Мягкое внимание
+
+При мягком внимании, мы налагаем ограничение $\Vert\vect{a}\Vert_1 = 1$. Внутреннее представление является линейной комбинацией входов, где сумма коэффициентов равна единице.
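+
+Небольшой пример для $t = 3$ (числа выбраны произвольно, только для иллюстрации). При жёстком внимании с $\boldsymbol{a} = (0, 1, 0)^{\top}$ и при мягком внимании с $\boldsymbol{a} = (0.7, 0.2, 0.1)^{\top}$ получаем соответственно:
+
+$$
+\begin{aligned}
+\boldsymbol{h}_{\text{hard}} &= 0 \cdot \boldsymbol{x}_1 + 1 \cdot \boldsymbol{x}_2 + 0 \cdot \boldsymbol{x}_3 = \boldsymbol{x}_2 \\
+\boldsymbol{h}_{\text{soft}} &= 0.7\, \boldsymbol{x}_1 + 0.2\, \boldsymbol{x}_2 + 0.1\, \boldsymbol{x}_3
+\end{aligned}
+$$
+
+В первом случае ненулевая компонента ровно одна, во втором коэффициенты положительны и $0.7 + 0.2 + 0.1 = 1$.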
+
+
+<!-- ### Self Attention (II)
+
+Where do the $\alpha_i$ come from?
+
+We obtain the vector $\vect{a} \in \mathbb{R}^t$ in the following way:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\boldsymbol{X}^{\top}\boldsymbol{x})
+$$
+
+Where $\beta$ represents the inverse temperature parameter of the $\text{soft(arg)max}(\cdot)$. $\boldsymbol{X}^{\top}\in\mathbb{R}^{t \times n}$ is the transposed matrix representation of the set $\lbrace\boldsymbol{x}_i \rbrace\_{i=1}^t$, and $\boldsymbol{x}$ represents a generic $\boldsymbol{x}_i$ from the set. Note that the $j$-th row of $X^{\top}$ corresponds to an element $\boldsymbol{x}_j\in\mathbb{R}^n$, so the $j$-th row of $\boldsymbol{X}^{\top}\boldsymbol{x}$ is the scalar product of $\boldsymbol{x}_j$ with each $\boldsymbol{x}_i$ in $\lbrace \boldsymbol{x}_i \rbrace\_{i=1}^t$.
+
+The components of the vector $\vect{a}$ are also called "scores" because the scalar product between two vectors tells us how aligned or similar two vectors are. Therefore, the elements of $\vect{a}$ provide information about the similarity of the overall set to a particular $\boldsymbol{x}_i$.
+
+The square brackets represent an optional argument. Note that if $\arg\max(\cdot)$ is used, we get a one-hot vector of alphas, resulting in hard attention. On the other hand, $\text{soft(arg)max}(\cdot)$ leads to soft attention. In each case, the components of the resulting vector $\vect{a}$ sum to 1.
+
+Generating $\vect{a}$ this way gives a set of them, one for each $\boldsymbol{x}_i$. Moreover, each $\vect{a}_i \in \mathbb{R}^t$ so we can stack the alphas in a matrix $\boldsymbol{A}\in \mathbb{R}^{t \times t}$.
+
+Since each hidden state is a linear combination of the inputs $\boldsymbol{X}$ and a vector $\vect{a}$, we obtain a set of $t$ hidden states, which we can stack into a matrix $\boldsymbol{H}\in \mathbb{R}^{n \times t}$.
+
+$$
+\boldsymbol{H}=\boldsymbol{XA}
+$$ -->
+
+
+### Self Attention (II)
+
+Where do the $\alpha_i$ come from?
+
+We obtain the vector $\vect{a} \in \mathbb{R}^t$ in the following way:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\boldsymbol{X}^{\top}\boldsymbol{x})
+$$
+
+Here $\beta$ is the inverse temperature parameter of the $\text{soft(arg)max}(\cdot)$. $\boldsymbol{X}^{\top}\in\mathbb{R}^{t \times n}$ is the transposed matrix representation of the set $\lbrace\boldsymbol{x}_i \rbrace\_{i=1}^t$, and $\boldsymbol{x}$ represents a generic $\boldsymbol{x}_i$ from the set. Note that the $j$-th row of $X^{\top}$ corresponds to an element $\boldsymbol{x}_j\in\mathbb{R}^n$, so the $j$-th component of $\boldsymbol{X}^{\top}\boldsymbol{x}$ is the scalar product of $\boldsymbol{x}_j$ with the chosen $\boldsymbol{x}$.
+
+The components of the vector $\vect{a}$ are also called "scores" because the scalar product between two vectors tells us how aligned or similar they are. Therefore, the elements of $\vect{a}$ provide information about the similarity of each element of the set to a particular $\boldsymbol{x}_i$.
+
+The square brackets denote an optional argument. Note that if $\arg\max(\cdot)$ is used, we get a one-hot vector of alphas, resulting in hard attention. On the other hand, $\text{soft(arg)max}(\cdot)$ leads to soft attention. In each case, the components of the resulting vector $\vect{a}$ sum to 1.
+
+Generating $\vect{a}$ this way gives a set of them, one for each $\boldsymbol{x}_i$. Moreover, each $\vect{a}_i \in \mathbb{R}^t$, so we can stack the alphas as the columns of a matrix $\boldsymbol{A}\in \mathbb{R}^{t \times t}$.
+
+Since each hidden state is a linear combination of the inputs $\boldsymbol{X}$ weighted by a vector $\vect{a}$, we obtain a set of $t$ hidden states, which we can stack into a matrix $\boldsymbol{H}\in \mathbb{R}^{n \times t}$.
+
+$$
+\boldsymbol{H}=\boldsymbol{XA}
+$$
+
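The construction of $\boldsymbol{A}$ and $\boldsymbol{H}=\boldsymbol{XA}$ can be sketched in a few lines of NumPy. This is an illustration under assumptions: the helper `softargmax`, the sizes and the random inputs are not from the lecture, and the column convention (column $i$ of `A` is $\vect{a}_i$) follows the formulas above.

```python
import numpy as np

def softargmax(z, beta=1.0, axis=0):
    """Soft(arg)max with inverse temperature beta along the given axis."""
    z = beta * z
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, t = 4, 5                                   # assumed sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, t))                   # columns are x_1, ..., x_t

scores = X.T @ X                              # (t, t); column i holds X^T x_i

A_soft = softargmax(scores, axis=0)           # soft attention: column i is a_i
A_hard = np.eye(t)[:, scores.argmax(axis=0)]  # hard attention: one-hot columns

H_soft = X @ A_soft                           # (n, t) hidden states
H_hard = X @ A_hard                           # each column is a copy of one input
```

Every column of either `A` sums to 1, which matches the statement that the components of each $\vect{a}$ sum to 1.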
+
+<!-- ## [Key-value store](https://www.youtube.com/watch?v=f01J0Dri-6k&t=1056s)
+
+A key-value store is a paradigm designed for storing (saving), retrieving (querying) and managing associative arrays (dictionaries / hash tables).
+
+For example, say we wanted to find a recipe to make lasagne. We have a recipe book and search for "lasagne" - this is the query. This query is checked against all possible keys in your dataset - in this case, this could be the titles of all the recipes in the book. We check how aligned the query is with each title to find the maximum matching score between the query and all the respective keys. If our output is the argmax function - we retrieve the single recipe with the highest score. Otherwise, if we use a soft argmax function, we would get a probability distribution and can retrieve in order from the most similar content to less and less relevant recipes matching the query.
+
+Basically, the query is the question. Given one query, we check this query against every key and retrieve all matching content. -->
+
+
+## [Key-value store](https://www.youtube.com/watch?v=f01J0Dri-6k&t=1056s)
+
+A key-value store is a paradigm designed for storing (saving), retrieving (querying) and managing associative arrays (dictionaries / hash tables).
+
+For example, say we want to find a recipe for lasagne. We have a recipe book and search for "lasagne": this is the query. The query is checked against all possible keys in the dataset; in this case, these could be the titles of all the recipes in the book. We check how well the query aligns with each title to find the maximum matching score between the query and all the respective keys. If our output function is argmax, we retrieve the single recipe with the highest score. Otherwise, if we use a soft argmax function, we get a probability distribution and can retrieve recipes in order, from the most similar content to less and less relevant ones.
+
+In essence, the query is the question. Given one query, we check it against every key and retrieve all matching content.
+
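The recipe lookup can be mimicked with a toy soft key-value store. Everything concrete here is hypothetical: the three titles, the hand-made 3-dimensional key features and the value of `beta` are chosen only to make the argmax and soft argmax behaviour visible.

```python
import numpy as np

def softargmax(z, beta=1.0):
    """Soft(arg)max of a 1-D score vector with inverse temperature beta."""
    z = beta * (z - z.max())
    e = np.exp(z)
    return e / e.sum()

titles = ["lasagne", "carbonara", "tiramisu"]
# Hypothetical key features per title: (pasta, cheese, sweetness)
K = np.array([[0.9, 0.8, 0.0],
              [0.9, 0.5, 0.0],
              [0.0, 0.3, 1.0]]).T         # columns are keys
q = np.array([1.0, 1.0, 0.0])             # query: a cheesy pasta dish

scores = K.T @ q                          # alignment of the query with every key

# Hard retrieval: the single best-matching title
best = titles[int(scores.argmax())]

# Soft retrieval: a distribution over titles, usable as a ranking
probs = softargmax(scores, beta=5.0)
ranking = [titles[i] for i in np.argsort(-probs)]
```

Here `best` is the single retrieved recipe, while `ranking` orders all recipes from most to least relevant to the query.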
+
+<!-- ### Queries, keys and values
+
+$$
+\begin{aligned}
+\vect{q} &= \vect{W_q x} \\
+\vect{k} &= \vect{W_k x} \\
+\vect{v} &= \vect{W_v x}
+\end{aligned}
+$$
+
+Each of the vectors $\vect{q}, \vect{k}, \vect{v}$ can simply be viewed as rotations of the specific input $\vect{x}$. Where $\vect{q}$ is just $\vect{x}$ rotated by $\vect{W_q}$, $\vect{k}$ is just $\vect{x}$ rotated by $\vect{W_k}$ and similarly for $\vect{v}$. Note that this is the first time we are introducing "learnable" parameters. We also do not include any non-linearities since attention is completely based on orientation.
+
+In order to compare the query against all possible keys, $\vect{q}$ and $\vect{k}$ must have the same dimensionality, *i.e.* $\vect{q}, \vect{k} \in \mathbb{R}^{d'}$.
+
+However, $\vect{v}$ can be of any dimension. If we continue with our lasagne recipe example - we need the query to have the dimension as the keys, *i.e.* the titles of the different recipes that we're searching through. The dimension of the corresponding recipe retrieved, $\vect{v}$, can be arbitrarily long though. So we have that $\vect{v} \in \mathbb{R}^{d''}$.
+
+For simplicity, here we will make the assumption that everything has dimension $d$, i.e.
+
+$$
+d' = d'' = d
+$$
+
+So now we have a set of $\vect{x}$'s, a set of queries, a set of keys and a set of values. We can stack these sets into matrices each with $t$ columns since we stacked $t$ vectors; each vector has height $d$.
+
+$$
+\{ \vect{x}_i \}_{i=1}^t \rightsquigarrow \{ \vect{q}_i \}_{i=1}^t, \, \{ \vect{k}_i \}_{i=1}^t, \, \, \{ \vect{v}_i \}_{i=1}^t \rightsquigarrow \vect{Q}, \vect{K}, \vect{V} \in \mathbb{R}^{d \times t}
+$$
+
+We compare one query $\vect{q}$ against the matrix of all keys $\vect{K}$:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\vect{K}^{\top} \vect{q}) \in \mathbb{R}^t
+$$
+
+Then the hidden layer is going to be the linear combination of the columns of $\vect{V}$ weighted by the coefficients in $\vect{a}$:
+
+$$
+\vect{h} = \vect{V} \vect{a} \in \mathbb{R}^d
+$$
+
+Since we have $t$ queries, we'll get $t$ corresponding $\vect{a}$ weights and therefore a matrix $\vect{A}$ of dimension $t \times t$.
+
+$$
+\{ \vect{q}_i \}_{i=1}^t \rightsquigarrow \{ \vect{a}_i \}_{i=1}^t, \rightsquigarrow \vect{A} \in \mathbb{R}^{t \times t}
+$$
+
+Therefore in matrix notation we have:
+
+$$
+\vect{H} = \vect{VA} \in \mathbb{R}^{d \times t}
+$$
+
+As an aside, we typically set $\beta$ to:
+
+$$
+\beta = \frac{1}{\sqrt{d}}
+$$
+
+This is done to keep the temperature constant across different choices of dimension $d$ and so we divide by the square root of the number of dimensions $d$. (Think what is the length of the vector $\vect{1} \in \R^d$.)
+
+For implementation, we can speed up computation by stacking all the $\vect{W}$'s into one tall $\vect{W}$ and then calculate $\vect{q}, \vect{k}, \vect{v}$ in one go:
+
+$$
+\begin{bmatrix}
+\vect{q} \\
+\vect{k} \\
+\vect{v}
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q} \\
+\vect{W_k} \\
+\vect{W_v}
+\end{bmatrix} \vect{x} \in \mathbb{R}^{3d}
+$$
+
+There is also the concept of "heads". Above we have seen an example with one head but we could have multiple heads. For example, say we have $h$ heads, then we have $h$ $\vect{q}$'s, $h$ $\vect{k}$'s and $h$ $\vect{v}$'s and we end up with a vector in $\mathbb{R}^{3hd}$:
+
+$$
+\begin{bmatrix}
+\vect{q}^1 \\
+\vect{q}^2 \\
+\vdots \\
+\vect{q}^h \\
+\vect{k}^1 \\
+\vect{k}^2 \\
+\vdots \\
+\vect{k}^h \\
+\vect{v}^1 \\
+\vect{v}^2 \\
+\vdots \\
+\vect{v}^h
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q}^1 \\
+\vect{W_q}^2 \\
+\vdots \\
+\vect{W_q}^h \\
+\vect{W_k}^1 \\
+\vect{W_k}^2 \\
+\vdots \\
+\vect{W_k}^h \\
+\vect{W_v}^1 \\
+\vect{W_v}^2 \\
+\vdots \\
+\vect{W_v}^h
+\end{bmatrix} \vect{x} \in \R^{3hd}
+$$
+
+However, we can still transform the multi-headed values to have the original dimension $\R^d$ by using a $\vect{W_h} \in \mathbb{R}^{d \times hd}$. This is just one possible way to implement the key-value store. -->
+
+
+### Queries, keys and values
+
+$$
+\begin{aligned}
+\vect{q} &= \vect{W_q x} \\
+\vect{k} &= \vect{W_k x} \\
+\vect{v} &= \vect{W_v x}
+\end{aligned}
+$$
+
+Each of the vectors $\vect{q}, \vect{k}, \vect{v}$ can be viewed as a rotation of the specific input $\vect{x}$: $\vect{q}$ is just $\vect{x}$ rotated by $\vect{W_q}$, $\vect{k}$ is just $\vect{x}$ rotated by $\vect{W_k}$, and similarly for $\vect{v}$. Note that this is the first time we are introducing "learnable" parameters. We also do not include any non-linearities, since attention is based entirely on orientation.
+
+In order to compare the query against all possible keys, $\vect{q}$ and $\vect{k}$ must have the same dimensionality, *i.e.* $\vect{q}, \vect{k} \in \mathbb{R}^{d'}$.
+
+However, $\vect{v}$ can be of any dimension. Continuing with our lasagne recipe example, the query must have the same dimension as the keys, *i.e.* the titles of the different recipes we are searching through. The dimension of the corresponding retrieved recipe, $\vect{v}$, can be arbitrarily large, though. So we have $\vect{v} \in \mathbb{R}^{d''}$.
+
+For simplicity, here we will assume that everything has dimension $d$, *i.e.*
+
+$$
+d' = d'' = d
+$$
+
+So now we have a set of $\vect{x}$'s, a set of queries, a set of keys and a set of values. We can stack these sets into matrices, each with $t$ columns since we stack $t$ vectors; each vector has height $d$.
+
+$$
+\{ \vect{x}_i \}_{i=1}^t \rightsquigarrow \{ \vect{q}_i \}_{i=1}^t, \, \{ \vect{k}_i \}_{i=1}^t, \, \, \{ \vect{v}_i \}_{i=1}^t \rightsquigarrow \vect{Q}, \vect{K}, \vect{V} \in \mathbb{R}^{d \times t}
+$$
+
+We compare one query $\vect{q}$ against the matrix of all keys $\vect{K}$:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\vect{K}^{\top} \vect{q}) \in \mathbb{R}^t
+$$
+
+Then the hidden layer is the linear combination of the columns of $\vect{V}$ weighted by the coefficients in $\vect{a}$:
+
+$$
+\vect{h} = \vect{V} \vect{a} \in \mathbb{R}^d
+$$
+
+Since we have $t$ queries, we get $t$ corresponding weight vectors $\vect{a}$ and therefore a matrix $\vect{A}$ of dimension $t \times t$.
+
+$$
+\{ \vect{q}_i \}_{i=1}^t \rightsquigarrow \{ \vect{a}_i \}_{i=1}^t, \rightsquigarrow \vect{A} \in \mathbb{R}^{t \times t}
+$$
+
+Therefore, in matrix notation we have:
+
+$$
+\vect{H} = \vect{VA} \in \mathbb{R}^{d \times t}
+$$
+
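The query, key and value equations above can be traced end to end with random matrices. This is a minimal sketch under assumptions: the input dimension `n`, the sizes `d` and `t`, the random weights and the `softargmax` helper are illustrative, and $\beta = 1/\sqrt{d}$ is used as in the text.

```python
import numpy as np

def softargmax(z, beta=1.0, axis=0):
    """Soft(arg)max with inverse temperature beta along the given axis."""
    z = beta * z
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d, t = 6, 4, 5                      # input dim, q/k/v dim, sequence length (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, t))            # columns are the inputs x_i
W_q = rng.normal(size=(d, n))          # stand-ins for the learnable matrices
W_k = rng.normal(size=(d, n))
W_v = rng.normal(size=(d, n))

Q, K, V = W_q @ X, W_k @ X, W_v @ X    # each in R^{d x t}
beta = 1.0 / np.sqrt(d)

# One query against all keys
a_0 = softargmax(K.T @ Q[:, 0], beta=beta)   # in R^t
h_0 = V @ a_0                                # in R^d

# All queries at once: column i of A is the attention vector for q_i
A = softargmax(K.T @ Q, beta=beta, axis=0)   # in R^{t x t}
H = V @ A                                    # in R^{d x t}
```

The first column of `H` coincides with `h_0`, so the matrix form is just the single-query computation repeated for every query.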
+As an aside, we typically set $\beta$ to:
+
+$$
+\beta = \frac{1}{\sqrt{d}}
+$$
+
+This is done to keep the temperature constant across different choices of the dimension $d$, which is why we divide by the square root of the number of dimensions. (Think about what the length of the vector $\vect{1} \in \R^d$ is.)
+
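The hint about $\vect{1} \in \R^d$ can be checked numerically: its length is $\sqrt{d}$, and scores between random vectors grow at the same rate. The sketch below assumes standard-normal queries and keys, which is an illustrative choice and not a claim about trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = (4, 64, 1024)

# The length of the all-ones vector grows like sqrt(d)
ones_norm = {d: float(np.linalg.norm(np.ones(d))) for d in dims}

def score_std(d, trials=2000):
    """Empirical standard deviation of q.k for standard-normal q and k in R^d."""
    q = rng.normal(size=(trials, d))
    k = rng.normal(size=(trials, d))
    return float((q * k).sum(axis=1).std())

raw_std = {d: score_std(d) for d in dims}                    # grows roughly like sqrt(d)
scaled_std = {d: raw_std[d] / np.sqrt(d) for d in dims}      # stays close to 1
```

Multiplying the scores by $\beta = 1/\sqrt{d}$ therefore keeps their spread, and hence the sharpness of the soft(arg)max, roughly independent of $d$.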
+For implementation, we can speed up computation by stacking all the $\vect{W}$'s into one tall $\vect{W}$ and then calculating $\vect{q}, \vect{k}, \vect{v}$ in one go:
+
+$$
+\begin{bmatrix}
+\vect{q} \\
+\vect{k} \\
+\vect{v}
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q} \\
+\vect{W_k} \\
+\vect{W_v}
+\end{bmatrix} \vect{x} \in \mathbb{R}^{3d}
+$$
+
+There is also the concept of "heads". Above we have seen an example with one head, but we could have multiple heads. For example, say we have $h$ heads; then we have $h$ $\vect{q}$'s, $h$ $\vect{k}$'s and $h$ $\vect{v}$'s, and we end up with a vector in $\mathbb{R}^{3hd}$:
+
+$$
+\begin{bmatrix}
+\vect{q}^1 \\
+\vect{q}^2 \\
+\vdots \\
+\vect{q}^h \\
+\vect{k}^1 \\
+\vect{k}^2 \\
+\vdots \\
+\vect{k}^h \\
+\vect{v}^1 \\
+\vect{v}^2 \\
+\vdots \\
+\vect{v}^h
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q}^1 \\
+\vect{W_q}^2 \\
+\vdots \\
+\vect{W_q}^h \\
+\vect{W_k}^1 \\
+\vect{W_k}^2 \\
+\vdots \\
+\vect{W_k}^h \\
+\vect{W_v}^1 \\
+\vect{W_v}^2 \\
+\vdots \\
+\vect{W_v}^h
+\end{bmatrix} \vect{x} \in \R^{3hd}
+$$
+
+However, we can still transform the multi-headed values back to the original dimension $\R^d$ by using a $\vect{W_h} \in \mathbb{R}^{d \times hd}$. This is just one possible way to implement the key-value store.
+
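The stacking trick for several heads can be sketched as one tall matrix followed by a reshape. The sizes, the random weights and the choice to apply `W_h` directly to the stacked values are assumptions made for illustration; in a full attention block `W_h` would act on the per-head attention outputs.

```python
import numpy as np

d, h = 4, 3                                   # dimension and number of heads (assumed)
rng = np.random.default_rng(0)
x = rng.normal(size=d)

W_q = rng.normal(size=(h, d, d))              # one (d x d) matrix per head
W_k = rng.normal(size=(h, d, d))
W_v = rng.normal(size=(h, d, d))

# One tall matrix in R^{3hd x d}: all query heads, then key heads, then value heads
W = np.concatenate([W_q.reshape(h * d, d),
                    W_k.reshape(h * d, d),
                    W_v.reshape(h * d, d)])

qkv = W @ x                                   # a single product, vector in R^{3hd}
q, k, v = qkv.reshape(3, h, d)                # each of shape (h, d): one row per head

# Map the h stacked d-dimensional vectors back to R^d
W_h = rng.normal(size=(d, h * d))
out = W_h @ v.reshape(h * d)
```

Row `j` of `q` equals `W_q[j] @ x`, so the single large product gives the same result as $3h$ separate ones.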
+
+<!-- ## [The Transformer](https://www.youtube.com/watch?v=f01J0Dri-6k&t=2114s)
+
+Expanding on our knowledge of attention in particular, we now interpret the fundamental building blocks of the transformer. In particular, we will take a forward pass through a basic transformer, and see how attention is used in the standard encoder-decoder paradigm and compares to the sequential architectures of RNNs. -->
+
+
+## [The Transformer](https://www.youtube.com/watch?v=f01J0Dri-6k&t=2114s)
+
+Building on our knowledge of attention, we now interpret the fundamental building blocks of the transformer. In particular, we will take a forward pass through a basic transformer and see how attention is used in the standard encoder-decoder paradigm and how it compares to the sequential architectures of RNNs.
+
+
+<!-- ### Encoder-Decoder Architecture
+
+We should be familiar with this terminology. It is shown most prominently during autoencoder demonstrations, and is prerequisite understanding up to this point. To summarize, an input is fed through an encoder and decoder which impose some sort of bottleneck on the data, forcing only the most important information through. This information is stored in the output of the encoder block, and can be used for a variety of unrelated tasks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure1.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 1:</b> Two example diagrams of an autoencoder. The model on the left shows how an autoencoder can be design with two affine transformations + activations, where the image on the right replaces this single "layer" with an arbitrary module of operations.
+</center>
+
+Our "attention" is drawn to the autoencoder layout as shown in the model on the right and will now take a look inside, in the context of transformers. -->
+
+
+### Encoder-Decoder Architecture
+
+We should be familiar with this terminology. It is shown most prominently during autoencoder demonstrations, and understanding it is a prerequisite at this point. To summarize, an input is fed through an encoder and a decoder, which impose some sort of bottleneck on the data, forcing only the most important information through. This information is stored in the output of the encoder block and can be used for a variety of unrelated tasks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure1.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 1:</b> Two example diagrams of an autoencoder. The model on the left shows how an autoencoder can be designed with two affine transformations + activations, whereas the image on the right replaces this single "layer" with an arbitrary module of operations.
+</center>
+
+Our "attention" is drawn to the autoencoder layout shown in the model on the right, and we will now take a look inside it, in the context of transformers.
+
+
+<!-- ### Encoder Module
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure2.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 2:</b> The transformer encoder, which accepts at set of inputs $\vect{x}$, and outputs a set of hidden representations $\vect{h}^\text{Enc}$.
+</center>
+
+The encoder module accepts a set of inputs, which are simultaneously fed through the self attention block and bypasses it to reach the `Add, Norm` block. At which point, they are again simultaneously passed through the 1D-Convolution and another `Add, Norm` block, and consequently outputted as the set of hidden representation. This set of hidden representation is then either sent through an arbitrary number of encoder modules *i.e.* more layers), or to the decoder. We shall now discuss these blocks in more detail. -->
+
+
+### Encoder Module
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure2.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 2:</b> The transformer encoder, which accepts a set of inputs $\vect{x}$ and outputs a set of hidden representations $\vect{h}^\text{Enc}$.
+</center>
+
+The encoder module accepts a set of inputs, which are simultaneously fed through the self-attention block and routed around it to reach the `Add, Norm` block. From there, they are again simultaneously passed through the 1D-convolution and around it to another `Add, Norm` block, and are then output as the set of hidden representations. This set of hidden representations is then either sent through an arbitrary number of further encoder modules (*i.e.* more layers) or to the decoder. We shall now discuss these blocks in more detail.
+
+
+<!-- ### Self-attention
+
+The self-attention model is a normal attention model. The query, key, and value are generated from the same item of the sequential input. In tasks that try to model sequential data, positional encodings are added prior to this input. The output of this block is the attention-weighted values. The self-attention block accepts a set of inputs, from $1, \cdots , t$, and outputs $1, \cdots, t$ attention weighted values which are fed through the rest of the encoder.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure3.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 3:</b> The self-attention block. The sequence of inputs is shown as a set along the 3rd dimension, and concatenated.
+</center> -->
+
+
+### Self-attention
+
+The self-attention model is a normal attention model. The query, key, and value are generated from the same item of the sequential input. In tasks that try to model sequential data, positional encodings are added prior to this input. The output of this block is the attention-weighted values. The self-attention block accepts a set of inputs, indexed $1, \cdots, t$, and outputs $t$ attention-weighted values, which are fed through the rest of the encoder.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure3.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 3:</b> The self-attention block. The sequence of inputs is shown as a set along the 3rd dimension, and concatenated.
+</center>
+
+
+<!-- #### Add, Norm
+
+The add norm block has two components. First is the add block, which is a residual connection, and layer normalization. -->
+
+
+#### Add, Norm
+
+The `Add, Norm` block has two components: the add block, which is a residual connection, followed by layer normalization.
+
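A rough NumPy sketch of the `Add, Norm` step is given below. It assumes a row-per-position layout `(t, d_model)` as in PyTorch, omits the learnable gain and bias that `nn.LayerNorm` has, and uses a random array in place of a real sub-layer output.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize every position (row) to zero mean and unit variance over its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_norm(x, sublayer_out):
    """Residual connection (add) followed by layer normalization (norm)."""
    return layer_norm(x + sublayer_out)

t, d_model = 5, 8                               # assumed sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(t, d_model))               # input to the sub-layer
sublayer_out = rng.normal(size=(t, d_model))    # stand-in for an attention output

out = add_norm(x, sublayer_out)
```

The residual path lets the input bypass the sub-layer, and the normalization keeps every position's features on a comparable scale.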
+
+<!-- #### 1D-convolution
+
+Following this step, a 1D-convolution (aka a position-wise feed forward network) is applied. This block consists of two dense layers. Depending on what values are set, this block allows you to adjust the dimensions of the output $\vect{h}^\text{Enc}$. -->
+
+
+#### 1D-convolution
+
+Following this step, a 1D-convolution (aka a position-wise feed-forward network) is applied. This block consists of two dense layers. Depending on what values are set, this block allows you to adjust the dimensions of the output $\vect{h}^\text{Enc}$.
+
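The phrase "position-wise" means the same two dense layers are applied to every position independently, which is what a 1D convolution with kernel size 1 does. The sketch below assumes a ReLU between the two layers and arbitrary sizes; neither is stated in the text above.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two dense layers with a ReLU in between (assumed), applied along the last axis."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

t, d_model, d_hidden, d_out = 5, 8, 16, 8       # assumed sizes; d_out sets the output width
rng = np.random.default_rng(0)
X = rng.normal(size=(t, d_model))               # one row per position
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_out)), np.zeros(d_out)

# Whole sequence at once
out_all = position_wise_ffn(X, W1, b1, W2, b2)

# One position at a time, sharing the same weights
out_each = np.stack([position_wise_ffn(X[i], W1, b1, W2, b2) for i in range(t)])
```

Both routes give the same result, and changing `d_out` changes the dimension of the block's output.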
+
+<!-- ### Decoder Module
+
+The transformer decoder follows a similar procedure as the encoder. However, there is one additional sub-block to take into account. Additionally, the inputs to this module are different.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure5.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 4:</b> A friendlier explanation of the decoder.
+</center> -->
+
+
+### Decoder Module
+
+The transformer decoder follows a procedure similar to the encoder's. However, there is one additional sub-block to take into account. In addition, the inputs to this module are different.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure5.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 4:</b> A friendlier explanation of the decoder.
+</center>
+
+
+<!-- #### Cross-attention
+
+The cross attention follows the query, key, and value setup used for the self-attention blocks.  However, the inputs are a little more complicated. The input to the decoder is a data point $\vect{y}\_i$, which is then passed through the self attention and add norm blocks, and finally ends up at the cross-attention block. This serves as the query for cross-attention, where the key and value pairs are the output $\vect{h}^\text{Enc}$, where this output was calculated with all past inputs $\vect{x}\_1, \cdots, \vect{x}\_{t}$. -->
+
+
+#### Cross-attention
+
+Cross-attention follows the query, key, and value setup used for the self-attention blocks. However, the inputs are a little more complicated. The input to the decoder is a data point $\vect{y}\_i$, which is passed through the self-attention and `Add, Norm` blocks and finally ends up at the cross-attention block. This serves as the query for cross-attention, where the key and value pairs are the output $\vect{h}^\text{Enc}$, which was calculated with all past inputs $\vect{x}\_1, \cdots, \vect{x}\_{t}$.
+
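The only change from self-attention is where the queries come from, which a short sketch can make concrete. The sizes, the random stand-ins for $\vect{h}^\text{Enc}$ and the decoder states, and the helper `softargmax` are assumptions for illustration; the column convention matches the formulas earlier on this page.

```python
import numpy as np

def softargmax(z, beta=1.0, axis=0):
    """Soft(arg)max with inverse temperature beta along the given axis."""
    z = beta * z
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, t_x, t_y = 4, 6, 3                      # model dim, source length, target length (assumed)
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(d, t_x))          # stand-in for the encoder output h^Enc
Y = rng.normal(size=(d, t_y))              # stand-in for decoder states after self-attention

W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q = W_q @ Y                                # queries come from the decoder side
K = W_k @ H_enc                            # keys come from the encoder output
V = W_v @ H_enc                            # values come from the encoder output

A = softargmax(K.T @ Q, beta=1.0 / np.sqrt(d), axis=0)   # (t_x, t_y)
H_dec = V @ A                                             # (d, t_y)
```

Column $i$ of `A` says how much each source position contributes to the representation of target position $i$.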
+
+<!-- ## Summary
+
+A set, $\vect{x}\_1$ to $\vect{x}\_{t}$ is fed through the encoder. Using self-attention and some more blocks, an output representation, $\lbrace\vect{h}^\text{Enc}\rbrace_{i=1}^t$ is obtained, which is fed to the decoder. After applying self-attention to it, cross attention is applied. In this block, the query corresponds to a representation of a symbol in the target language $\vect{y}\_i$, and the key and values are from the source language sentence ($\vect{x}\_1$ to $\vect{x}\_{t}$). Intuitively, cross attention finds which values in the input sequence are most relevant to constructing $\vect{y}\_t$, and therefore deserve the highest attention coefficients. The output of this cross attention is then fed through another 1D-convolution sub-block, and we have $\vect{h}^\text{Dec}$. For the specified target language, it is straightforward from here to see how training will commence, by comparing $\lbrace\vect{h}^\text{Dec}\rbrace_{i=1}^t$ to some target data. -->
+
+
+## Summary
+
+A set, $\vect{x}\_1$ to $\vect{x}\_{t}$, is fed through the encoder. Using self-attention and some more blocks, an output representation $\lbrace\vect{h}^\text{Enc}\rbrace_{i=1}^t$ is obtained, which is fed to the decoder. After self-attention is applied to the decoder input, cross-attention is applied. In this block, the query corresponds to a representation of a symbol in the target language, $\vect{y}\_i$, and the keys and values come from the source-language sentence ($\vect{x}\_1$ to $\vect{x}\_{t}$). Intuitively, cross-attention finds which values in the input sequence are most relevant to constructing $\vect{y}\_i$ and therefore deserve the highest attention coefficients. The output of this cross-attention is then fed through another 1D-convolution sub-block, and we have $\vect{h}^\text{Dec}$. For the specified target language, it is straightforward from here to see how training proceeds, by comparing $\lbrace\vect{h}^\text{Dec}\rbrace_{i=1}^t$ to some target data.
+
+
+<!-- ### Word Language Models
+
+There are a few important facts we left out before to explain the most important modules of a transformer, but will need to discuss them now to understand how transformers can achieve state-of-the-art results in language tasks. -->
+
+
+### Word Language Models
+
+There are a few important facts we left out earlier in order to explain the most important modules of a transformer, but we need to discuss them now to understand how transformers can achieve state-of-the-art results in language tasks.
+
+
+<!-- #### Positional encoding
+
+Attention mechanisms allow us to parallelize the operations and greatly accelerate a model's training time,  but loses sequential information. The positional encoding feature enables allows us to capture this context. -->
+
+
+#### Positional encoding
+
+Attention mechanisms allow us to parallelize the operations and greatly reduce a model's training time, but they lose sequential information. The positional encoding feature allows us to capture this context.
+
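A common choice, and the one given by the sinusoidal formula $E(p, 2i) = \sin(p / 10000^{2i/d})$, $E(p, 2i+1) = \cos(p / 10000^{2i/d})$ used in this course's code summary, can be sketched as follows. The function name, the sizes and the requirement that `d` be even are assumptions of this sketch.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encodings: row p is the encoding of position p, d is assumed even."""
    assert d % 2 == 0
    p = np.arange(max_len)[:, None]               # positions, shape (max_len, 1)
    i = np.arange(d // 2)[None, :]                # frequency index, shape (1, d/2)
    angles = p / np.power(10000.0, 2 * i / d)     # p / 10000^(2i/d)
    E = np.zeros((max_len, d))
    E[:, 0::2] = np.sin(angles)                   # even feature indices 2i
    E[:, 1::2] = np.cos(angles)                   # odd feature indices 2i+1
    return E

E = positional_encoding(max_len=50, d=16)
```

Adding row `p` of `E` to the embedding at position `p` gives every position a distinct signature, so the otherwise order-agnostic attention can tell positions apart.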
+
+<!-- #### Semantic Representations
+
+Throughout the training of a transformer, many hidden representations are generated. To create an embedding space similar to the one used by the word-language model example in PyTorch, the output of the cross-attention, will provide a semantic representation of the word $x_i$, at which point further experimentation can be performed over this dataset. -->
+
+
+#### Semantic Representations
+
+Throughout the training of a transformer, many hidden representations are generated. To create an embedding space similar to the one used by the word-language model example in PyTorch, the output of the cross-attention provides a semantic representation of the word $x_i$, at which point further experimentation can be performed over this dataset.
+
+
+<!-- ### Code Summary
+
+We will now see the blocks of transformers discussed above in a far more understandable format, code!
+
+The first module we will look at is the multi-headed attention block. Depending on the query, key, and values entered into this block, it can be used for either self-attention or cross-attention.
+
+
+```python
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model, num_heads, p, d_input=None):
+        super().__init__()
+        self.num_heads = num_heads
+        self.d_model = d_model
+        if d_input is None:
+            d_xq = d_xk = d_xv = d_model
+        else:
+            d_xq, d_xk, d_xv = d_input
+        # Embedding dimension of model is a multiple of number of heads
+        assert d_model % self.num_heads == 0
+        self.d_k = d_model // self.num_heads
+        # These are still of dimension d_model. To split into number of heads
+        self.W_q = nn.Linear(d_xq, d_model, bias=False)
+        self.W_k = nn.Linear(d_xk, d_model, bias=False)
+        self.W_v = nn.Linear(d_xv, d_model, bias=False)
+        # Outputs of all sub-layers need to be of dimension d_model
+        self.W_h = nn.Linear(d_model, d_model)
+```
+
+
+Initialization of multi-headed attention class. If a `d_input` is provided, this becomes cross attention. Otherwise, self-attention. The query, key, value setup is constructed as a linear transformation of the input `d_model`.
+
+
+```python
+def scaled_dot_product_attention(self, Q, K, V):
+    batch_size = Q.size(0)
+    k_length = K.size(-2)
+
+    # Scaling by d_k so that the soft(arg)max doesnt saturate
+    Q = Q / np.sqrt(self.d_k)  # (bs, n_heads, q_length, dim_per_head)
+    scores = torch.matmul(Q, K.transpose(2,3))  # (bs, n_heads, q_length, k_length)
+
+    A = nn_Softargmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)
+
+    # Get the weighted average of the values
+    H = torch.matmul(A, V)  # (bs, n_heads, q_length, dim_per_head)
+
+    return H, A
+```
+
+Return hidden layer corresponding to encodings of values after scaled by the attention vector. For book-keeping purposes (which values in the sequence were masked out by attention?) A is also returned.
+
+```python
+def split_heads(self, x, batch_size):
+    return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
+```
+
+Split the last dimension into (`heads` × `depth`). Return after transpose to put in shape (`batch_size` × `num_heads` × `seq_length` × `d_k`)
+
+```python
+def group_heads(self, x, batch_size):
+    # Parentheses allow the method chain to continue on the next line
+    return (x.transpose(1, 2).contiguous()
+            .view(batch_size, -1, self.num_heads * self.d_k))
+```
+
+Combines the attention heads together, to get correct shape consistent with batch size and sequence length.
+
+```python
+def forward(self, X_q, X_k, X_v):
+    batch_size, seq_length, dim = X_q.size()
+    # After transforming, split into num_heads
+    Q = self.split_heads(self.W_q(X_q), batch_size)
+    K = self.split_heads(self.W_k(X_k), batch_size)
+    V = self.split_heads(self.W_v(X_v), batch_size)
+    # Calculate the attention weights for each of the heads
+    H_cat, A = self.scaled_dot_product_attention(Q, K, V)
+    # Put all the heads back together by concat
+    H_cat = self.group_heads(H_cat, batch_size)  # (bs, q_length, dim)
+    # Final linear layer
+    H = self.W_h(H_cat)  # (bs, q_length, dim)
+    return H, A
+```
+
+The forward pass of multi headed attention.
+
+Given an input is split into q, k, and v, at which point these values are fed through a scaled dot product attention mechanism, concatenated and fed through a final linear layer. The last output of the attention block is the attention found, and the hidden representation that is passed through the remaining blocks.
+
+Although the next block shown in the transformer/encoder's is the Add,Norm, which is a function already built into PyTorch. As such, it is an extremely simple implementation, and does not need it's own class. Next is the 1-D convolution block. Please refer to previous sections for more details.
+
+Now that we have all of our main classes built (or built for us), we now turn to an encoder module.
+
+```python
+class EncoderLayer(nn.Module):
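+    def __init__(self, d_model, num_heads, conv_hidden_dim, p=0.1):
+        super().__init__()  # must run before submodules are assigned
+        self.mha = MultiHeadAttention(d_model, num_heads, p)
+        # forward() uses self.cnn; `CNN` and its signature are assumed to match the linked notebook
+        self.cnn = CNN(d_model, conv_hidden_dim)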
+        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+
+    def forward(self, x):
+        attn_output, _ = self.mha(x, x, x)
+        out1 = self.layernorm1(x + attn_output)
+        cnn_output = self.cnn(out1)
+        out2 = self.layernorm2(out1 + cnn_output)
+        return out2
+```
+
+In the most powerful transformers, an arbitarily large number of these encoders are stacked on top of one another.
+
+Recall that self attention by itself does not have any recurrence or convolutions, but that's what allows it to run so quickly. To make it sensitive to position we provide positional encodings. These are calculated as follows:
+
+
+$$
+\begin{aligned}
+E(p, 2i)   &= \sin(p / 10000^{2i / d}) \\
+E(p, 2i+1) &= \cos(p / 10000^{2i / d})
+\end{aligned}
+$$
+
+
+As to not take up too much room on the finer details, we will point you to https://github.com/Atcold/pytorch-Deep-Learning/blob/master/15-transformer.ipynb for the full code used here.
+
+
+An entire encoder, with N stacked encoder layers, as well as position embeddings, is written out as
+
+
+```python
+class Encoder(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim,
+            input_vocab_size, maximum_position_encoding, p=0.1):
+        super().__init__()
+        self.num_layers = num_layers  # read by forward()
+        self.embedding = Embeddings(d_model, input_vocab_size,
+                                    maximum_position_encoding, p)
+        self.enc_layers = nn.ModuleList()
+        for _ in range(num_layers):
+            self.enc_layers.append(EncoderLayer(d_model, num_heads,
+                                                ff_hidden_dim, p))
+    def forward(self, x):
+        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)
+        for i in range(self.num_layers):
+            x = self.enc_layers[i](x)
+        return x  # (batch_size, input_seq_len, d_model)
+``` -->
+
+
+### Code Summary
+
+We will now see the blocks of transformers discussed above in a far more understandable format: code!
+
+The first module we will look at is the multi-headed attention block. Depending on the query, key, and values entered into this block, it can be used for either self attention or cross attention.
+
+
+```python
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model, num_heads, p, d_input=None):
+        super().__init__()
+        self.num_heads = num_heads
+        self.d_model = d_model
+        if d_input is None:
+            d_xq = d_xk = d_xv = d_model
+        else:
+            d_xq, d_xk, d_xv = d_input
+        # Embedding dimension of model is a multiple of number of heads
+        assert d_model % self.num_heads == 0
+        self.d_k = d_model // self.num_heads
+        # These are still of dimension d_model. To split into number of heads
+        self.W_q = nn.Linear(d_xq, d_model, bias=False)
+        self.W_k = nn.Linear(d_xk, d_model, bias=False)
+        self.W_v = nn.Linear(d_xv, d_model, bias=False)
+        # Outputs of all sub-layers need to be of dimension d_model
+        self.W_h = nn.Linear(d_model, d_model)
+```
+
+
+This initializes the multi-headed attention class. If `d_input` is provided, the block acts as cross attention; otherwise it acts as self attention. The query, key, and value projections are set up as linear transformations of the input into `d_model` dimensions.
+
+
+```python
+def scaled_dot_product_attention(self, Q, K, V):
+    batch_size = Q.size(0)
+    k_length = K.size(-2)
+
+    # Scaling by sqrt(d_k) so that the soft(arg)max doesn't saturate
+    Q = Q / np.sqrt(self.d_k)  # (bs, n_heads, q_length, dim_per_head)
+    scores = torch.matmul(Q, K.transpose(2,3))  # (bs, n_heads, q_length, k_length)
+
+    A = nn_Softargmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)
+
+    # Get the weighted average of the values
+    H = torch.matmul(A, V)  # (bs, n_heads, q_length, dim_per_head)
+
+    return H, A
+```
+
+This returns the hidden layer corresponding to the encodings of the values after they are scaled by the attention vector. For book-keeping purposes (which values in the sequence were masked out by attention?), A is also returned.
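+
+As a compact restatement of the function above (a sketch in the code's own notation, where $d_k$ stands for `self.d_k`), each head computes
+
+$$
+\begin{aligned}
+A &= \text{soft(arg)max}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) \\
+H &= A V
+\end{aligned}
+$$
+
+with the soft(arg)max taken over the last (key) dimension, so every row of $A$ sums to one and $H$ is a weighted average of the rows of $V$.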
+
+```python
+def split_heads(self, x, batch_size):
+    return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
+```
+
+This splits the last dimension into (`heads` × `depth`) and returns the result after transposing it into the shape (`batch_size` × `num_heads` × `seq_length` × `d_k`).
+
+```python
+def group_heads(self, x, batch_size):
+    # Parentheses let the chained call continue onto the next line
+    return (x.transpose(1, 2).contiguous()
+             .view(batch_size, -1, self.num_heads * self.d_k))
+```
+
+This combines the attention heads back together to obtain the correct shape, consistent with the batch size and sequence length.
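+
+To make the shapes concrete (a sketch: $b$, $s$, and $h$ are shorthand introduced here for `batch_size`, the sequence length, and `num_heads`), the two helpers undo one another:
+
+$$
+(b, s, d_\text{model}) \xrightarrow{\text{split}} (b, h, s, d_k) \xrightarrow{\text{group}} (b, s, h \cdot d_k), \qquad d_k = d_\text{model} / h
+$$
+
+For instance, with `d_model=32` and `num_heads=2`, each head works with $d_k = 16$ dimensions, and $h \cdot d_k = 32$ recovers `d_model`.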
+
+```python
+def forward(self, X_q, X_k, X_v):
+    batch_size, seq_length, dim = X_q.size()
+    # After transforming, split into num_heads
+    Q = self.split_heads(self.W_q(X_q), batch_size)
+    K = self.split_heads(self.W_k(X_k), batch_size)
+    V = self.split_heads(self.W_v(X_v), batch_size)
+    # Calculate the attention weights for each of the heads
+    H_cat, A = self.scaled_dot_product_attention(Q, K, V)
+    # Put all the heads back together by concat
+    H_cat = self.group_heads(H_cat, batch_size)  # (bs, q_length, dim)
+    # Final linear layer
+    H = self.W_h(H_cat)  # (bs, q_length, dim)
+    return H, A
+```
+
+This is the forward pass of multi-headed attention.
+
+The inputs are projected into q, k, and v and split across heads; these are fed through a scaled dot-product attention mechanism, the heads are concatenated, and the result is passed through a final linear layer. The block returns the hidden representation, which is passed on to the remaining blocks, together with the attention weights.
+
+The next block shown in the transformer encoder is Add & Norm, which relies on functions already built into PyTorch. As such, its implementation is extremely simple and does not need its own class. Next is the 1-D convolution block. Please refer to previous sections for more details.
+
+Now that we have all of our main classes built (or built for us), we turn to an encoder module.
+
+```python
+class EncoderLayer(nn.Module):
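+    def __init__(self, d_model, num_heads, conv_hidden_dim, p=0.1):
+        super().__init__()
+        self.mha = MultiHeadAttention(d_model, num_heads, p)
+        # 1-D convolution block used in forward(); `CNN` is assumed to be the class from the linked notebook
+        self.cnn = CNN(d_model, conv_hidden_dim)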
+        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+
+    def forward(self, x):
+        attn_output, _ = self.mha(x, x, x)
+        out1 = self.layernorm1(x + attn_output)
+        cnn_output = self.cnn(out1)
+        out2 = self.layernorm2(out1 + cnn_output)
+        return out2
+```
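+
+Read as equations (a restatement of `forward` above, where MHA and CNN are shorthand introduced here for `self.mha` and `self.cnn`, and MHA denotes only the first of the two values the attention block returns), the two Add & Norm steps are
+
+$$
+\begin{aligned}
+\text{out}_1 &= \text{LayerNorm}(x + \text{MHA}(x, x, x)) \\
+\text{out}_2 &= \text{LayerNorm}(\text{out}_1 + \text{CNN}(\text{out}_1))
+\end{aligned}
+$$
+
+Each sub-block therefore adds its output to its own input (the residual connection) before normalizing, which is why every sub-block must return tensors of dimension `d_model`.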
+
+In the most powerful transformers, an arbitrarily large number of these encoder layers are stacked on top of one another.
+
+Recall that self attention by itself does not have any recurrence or convolutions, but that's what allows it to run so quickly. To make it sensitive to position, we provide positional encodings. These are calculated as follows:
+
+
+$$
+\begin{aligned}
+E(p, 2i)    &= \sin(p / 10000^{2i / d}) \\
+E(p, 2i+1) &= \cos(p / 10000^{2i / d})
+\end{aligned}
+$$
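+
+As a quick worked example (the values $d = 4$ and $p = 1$ are chosen here purely for illustration), the index $i$ takes the values 0 and 1, so the denominators are $10000^{0} = 1$ and $10000^{2/4} = 100$:
+
+$$
+\begin{aligned}
+E(1, 0) &= \sin(1) \approx 0.8415 \\
+E(1, 1) &= \cos(1) \approx 0.5403 \\
+E(1, 2) &= \sin(0.01) \approx 0.0100 \\
+E(1, 3) &= \cos(0.01) \approx 0.99995
+\end{aligned}
+$$
+
+Low-index dimensions thus change quickly with the position $p$, while high-index dimensions change slowly.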
+
+
+So as not to take up too much room with the finer details, we point you to https://github.com/Atcold/pytorch-Deep-Learning/blob/master/15-transformer.ipynb for the full code used here.
+
+
+An entire encoder, with N stacked encoder layers as well as position embeddings, is written out as:
+
+
+```python
+class Encoder(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim,
+            input_vocab_size, maximum_position_encoding, p=0.1):
+        super().__init__()
+        self.num_layers = num_layers  # read by forward()
+        self.embedding = Embeddings(d_model, input_vocab_size,
+                                    maximum_position_encoding, p)
+        self.enc_layers = nn.ModuleList()
+        for _ in range(num_layers):
+            self.enc_layers.append(EncoderLayer(d_model, num_heads,
+                                                ff_hidden_dim, p))
+    def forward(self, x):
+        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)
+        for i in range(self.num_layers):
+            x = self.enc_layers[i](x)
+        return x  # (batch_size, input_seq_len, d_model)
+```
+
+
+<!-- ## Example Use
+
+There are many tasks for which an encoder alone is enough. In the accompanying notebook, we see how an encoder can be used for sentiment analysis.
+
+Using the IMDb review dataset, we can output from the encoder a latent representation of a sequence of text, and train this encoding process with binary cross entropy, corresponding to a positive or negative movie review.
+
+Again we leave out the nuts and bolts and direct you to the notebook, but here are the most important architectural components used in the transformer:
+
+
+
+```python
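+class TransformerClassifier(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, conv_hidden_dim,
+                 input_vocab_size, num_answers):
+        super().__init__()
+        # maximum_position_encoding=10000 is an assumed value; check the notebook
+        self.encoder = Encoder(num_layers, d_model, num_heads, conv_hidden_dim,
+                               input_vocab_size, maximum_position_encoding=10000)
+        self.dense = nn.Linear(d_model, num_answers)
+
+    def forward(self, x):
+        x = self.dense(self.encoder(x))    # (batch_size, seq_len, num_answers)
+        return torch.max(x, dim=1).values  # max over the sequence: (batch_size, num_answers)
+
+model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2,
+                         conv_hidden_dim=128, input_vocab_size=50002, num_answers=2)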
+```
+This model is then trained in the typical fashion. -->
+
+
+## Example Use
+
+There are many tasks for which an encoder alone is enough. In the accompanying notebook, we see how an encoder can be used for sentiment analysis.
+
+Using the IMDb review dataset, we can output from the encoder a latent representation of a sequence of text, and train this encoding process with binary cross entropy, corresponding to a positive or negative movie review.
+
+Again we leave out the nuts and bolts and direct you to the notebook, but here are the most important architectural components used in the transformer:
+
+
+
+```python
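+class TransformerClassifier(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, conv_hidden_dim,
+                 input_vocab_size, num_answers):
+        super().__init__()
+        # maximum_position_encoding=10000 is an assumed value; check the notebook
+        self.encoder = Encoder(num_layers, d_model, num_heads, conv_hidden_dim,
+                               input_vocab_size, maximum_position_encoding=10000)
+        self.dense = nn.Linear(d_model, num_answers)
+
+    def forward(self, x):
+        x = self.dense(self.encoder(x))    # (batch_size, seq_len, num_answers)
+        return torch.max(x, dim=1).values  # max over the sequence: (batch_size, num_answers)
+
+model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2,
+                         conv_hidden_dim=128, input_vocab_size=50002, num_answers=2)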
+```
+This model is then trained in the typical fashion.
diff --git a/docs/ru/week12/12.md b/docs/ru/week12/12.md
new file mode 100644
index 000000000..6bea6d8f9
--- /dev/null
+++ b/docs/ru/week12/12.md
@@ -0,0 +1,30 @@
+---
+lang: ru
+lang-ref: ch.12
+title: Неделя 12
+translation-date: 01 Dec 2020
+translator: Evgeniy Pak
+---
+
+
+<!-- ## Lecture part A -->
+## Lecture part A
+
+<!-- In this section we discuss the various architectures used in NLP applications, beginning with CNNs, RNNs, and eventually covering the state-of-the-art architecture, transformers. We then discuss the various modules that comprise transformers and how they make transformers advantageous for NLP tasks. Finally, we discuss tricks that allow transformers to be trained effectively. -->
+
+In this section we discuss the various architectures used in NLP applications, beginning with CNNs and RNNs and eventually covering the state-of-the-art architecture, transformers. We then discuss the various modules that comprise transformers and how they make transformers advantageous for NLP tasks. Finally, we discuss tricks that allow transformers to be trained effectively.
+
+
+<!-- ## Lecture part B -->
+## Lecture part B
+
+<!-- In this section we introduce beam search as a middle ground between greedy decoding and exhaustive search. We consider the case of wanting to sample from the generative distribution (*i.e.* when generating text) and introduce "top-k" sampling. Subsequently, we introduce sequence to sequence models (with a transformer variant) and backtranslation. We then introduce unsupervised learning approaches for learning embeddings and discuss word2vec, GPT, and BERT. -->
+
+In this section we introduce beam search as a middle ground between greedy decoding and exhaustive search. We consider the case of wanting to sample from the generative distribution (*i.e.* when generating text) and introduce "top-k" sampling. Subsequently, we introduce sequence-to-sequence models (with a transformer variant) and backtranslation. We then introduce unsupervised learning approaches for learning embeddings and discuss word2vec, GPT, and BERT.
+
+<!-- ## Practicum -->
+## Practicum
+
+<!-- We introduce attention, focusing on self-attention and its hidden layer representations of the inputs. Then, we introduce the key-value store paradigm and discuss how to represent queries, keys, and values as rotations of an input. Finally, we use attention to interpret the transformer architecture, taking a forward pass through a basic transformer, and comparing the encoder-decoder paradigm to sequential architectures. -->
+
+We introduce attention, focusing on self-attention and its hidden layer representations of the inputs. Then, we introduce the key-value store paradigm and discuss how to represent queries, keys, and values as rotations of an input. Finally, we use attention to interpret the transformer architecture, taking a forward pass through a basic transformer, and comparing the encoder-decoder paradigm to sequential architectures.