# Transformer

A transformer is a sequence-to-sequence neural network architecture proposed in the 2017 paper *Attention Is All You Need*. It relies on a self-attention mechanism that lets every token attend directly to every other token in the sequence, rather than passing information along step by step. This ability to use long-range context allows it to outperform Recurrent Neural Networks (RNNs) such as LSTMs and GRUs.

## The Model

The components are explained below.

### Positional Encoding

Because transformers process all tokens in parallel rather than one at a time, the model has no inherent notion of word order. To fix this, a positional encoding is added to each token's embedding, giving the model information about where each token sits in the sequence.
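
The original paper uses fixed sinusoidal encodings, where each dimension of the encoding is a sinusoid of a different wavelength. Below is a minimal NumPy sketch (the function and variable names are illustrative, and it assumes `d_model` is even):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```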

### Multi-Head Attention

Multi-head attention runs several Scaled Dot-Product Attention operations ("heads") in parallel, each with its own learned linear projections of the input. The outputs of the heads are concatenated and passed through a final linear projection. This lets the model attend to different kinds of relationships in different representation subspaces at the same time.
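
As a sketch of how the heads fit together, here is a minimal NumPy version. It assumes the `scaled_dot_product_attention` function sketched in the next subsection, and the weight-matrix names (`W_q`, `W_k`, `W_v`, `W_o`) are illustrative:

```python
import numpy as np

def multi_head_attention(x: np.ndarray, weights: dict, num_heads: int) -> np.ndarray:
    """x: (seq_len, d_model). weights holds learned projection matrices
    W_q, W_k, W_v, W_o, each of shape (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # dimension of each head

    # Project the input into queries, keys, and values, then split into heads.
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ weights["W_q"])   # (num_heads, seq_len, d_head)
    k = split_heads(x @ weights["W_k"])
    v = split_heads(x @ weights["W_v"])

    # Run scaled dot-product attention independently on each head
    # (see the sketch in the next subsection).
    heads = [scaled_dot_product_attention(q[h], k[h], v[h])
             for h in range(num_heads)]

    # Concatenate the heads and apply the final output projection.
    concat = np.concatenate(heads, axis=-1)  # (seq_len, d_model)
    return concat @ weights["W_o"]
```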

#### Scaled Dot-Product Attention

Scaled Dot-Product Attention operates on three matrices: Q (queries), K (keys), and V (values), each derived from a learned linear transformation of the input. Q is multiplied by the transpose of K to give a matrix of scores that roughly corresponds to how strongly each token relates to each other token in the sequence. The scores are divided by the square root of the key dimension (the "scaled" in the name) and passed through a softmax, so that each row becomes a set of weights summing to 1. Multiplying the resulting weight matrix by V produces the output: each row is a weighted sum of the value vectors, emphasizing the tokens that the corresponding query attends to most.
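
In the paper's notation, the whole operation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

A minimal NumPy sketch of this step (the function name matches the one assumed in the multi-head example above):

```python
import numpy as np

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray,
                                 v: np.ndarray) -> np.ndarray:
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # pairwise token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ v                             # weighted sum of value vectors
```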