From AI Dungeon Wiki
Revision as of 23:09, 2 September 2020 by Devon not duck (talk | contribs) (Link was bad)

A Transformer is a type of sequence-to-sequence neural network that was proposed in the 2017 paper Attention Is All You Need. It relies on a self-attention mechanism that lets it use a larger portion of the context, which allows it to outperform recurrent neural networks such as LSTMs and GRUs on many sequence tasks.

The Model

The transformer model

The figure above shows the overall architecture of the model. Its components are explained below.

Positional Encoding

Because a transformer processes all tokens in parallel rather than one after another, the model has no built-in notion of token order. To fix this, a positional encoding (a vector that depends on the token's position in the sequence) is added to each token's embedding so the model can keep track of where each token is.
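As a rough illustration, here is a minimal sketch of one way to build such an encoding. It assumes the fixed sinusoidal scheme described in Attention Is All You Need; other schemes (such as learned position embeddings) exist. The function names and the toy embeddings are made up for this example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector for index t to the embedding of token t."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# Toy input (assumption): three identical token embeddings of size 4.
embeddings = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
encoded = add_positions(embeddings)
```

Although the three input embeddings are identical, the rows of encoded differ from one another, which is what lets the later layers tell the positions apart.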

Multi-Head Attention

The Multi-Head Attention component

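Multi-Head Attention is a component that runs several Scaled Dot-Product Attention components (called heads) in parallel, with pre- and post-processing around them. Before the heads, the inputs pass through separate linear transformations, so each head works on its own projection of Q, K, and V. Afterwards, the outputs of all heads are concatenated and passed through a final linear transformation, which turns the large set of per-head results back into a single matrix of the model's size.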
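The flow through this component can be sketched in plain Python as follows. This is only an illustrative sketch: the helper names, the sizes (2 heads, model size 4), and the random weights are assumptions for the example, not values from a trained model, and the single-head attention function is a compact stand-in for Scaled Dot-Product Attention.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Map a list of scores to weights in 0-1 that sum to 1."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) projection matrices, one triple per head."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        # Pre-processing: project the input into this head's Q, K and V.
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        head_outputs.append(attention(q, k, v))
    # Post-processing: concatenate the heads token by token, then project.
    concat = []
    for t in range(len(x)):
        row = []
        for head in head_outputs:
            row.extend(head[t])
        concat.append(row)
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Placeholder weights; a real model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
x = random_matrix(3, d_model, rng)  # 3 tokens, each of size d_model
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(d_model, d_model, rng)
out = multi_head_attention(x, heads, w_o)
```

Each head produces a 3 x 2 matrix here, so concatenating the two heads gives 3 x 4, and the final transformation keeps the output the same shape as the input.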

Scaled Dot-Product Attention

The Scaled Dot-Product Attention component

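Scaled Dot-Product Attention uses the matrices Q, K, and V. Q is Query, K is Key, and V is Value. These are each derived from a linear transformation of the input. Q is multiplied by the transpose of K to give a matrix of scores that roughly corresponds to each token's relation to every other token in the sequence. The scores are divided by the square root of the key dimension d_k (the "scaled" part) and passed through a softmax, which turns each row into weights in the range 0-1 that sum to 1. These weights are then multiplied by V, so the output for each token is a weighted sum of the value vectors, weighted by how important each other token is to it. In short, the output is softmax(QK^T / sqrt(d_k)) V.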
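The steps above can be sketched in a few lines of plain Python. This is a minimal sketch for illustration, not an optimized implementation: the helper names and the toy Q, K, and V matrices are assumptions for the example, and the optional masking step used in the decoder is left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Map a list of scores to weights in 0-1 that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    d_k = len(k[0])
    # Score every query against every key.
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k), then normalize each row to the range 0-1.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted sum of the rows of V.
    return matmul(weights, v), weights

# Toy matrices (assumption): 2 tokens with 2-dimensional queries, keys and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(q, k, v)
```

In this toy case each query matches its own key best, so each row of weights puts more weight on the token's own position and each output row lies closer to that token's own value vector.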