Transformer


A transformer is a type of sequence-to-sequence neural network model that was proposed in the 2017 paper Attention Is All You Need. It relies on a self-attention mechanism that lets it draw on a larger portion of the context, allowing it to outperform recurrent neural networks (RNNs) such as LSTMs and GRUs.

The Model

The transformer model

The components are explained below.

Positional Encoding

Because transformers process all tokens in parallel rather than one at a time, the model has no built-in notion of the order of things. To fix this, a positional encoding is added to each token's embedding, marking the token's position in the sequence so the model can keep track of word order.
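
For reference, here is a minimal NumPy sketch of the sinusoidal positional encoding used in the original paper. The function name is an illustrative choice, and it assumes d_model is even.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, np.newaxis]    # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]   # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + positional_encoding(seq_len, d_model)
```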

Multi-Head Attention

The multi-head attention component

Multi-head attention is a component that runs several Scaled Dot-Product Attention operations ("heads") in parallel, with pre- and post-processing. Each head first projects the input through its own learned linear transformations, then applies Scaled Dot-Product Attention; the results of all heads are concatenated and passed through a final linear projection. This lets each head attend to a different kind of relationship between tokens.
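
As a rough illustration, here is a minimal NumPy sketch of the above. The weight-matrix names (W_q, W_k, W_v, W_o) and the per-head loop are illustrative choices, and scaled_dot_product_attention is the helper sketched in the next section.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """Run one attention head per (W_q, W_k, W_v) triple, concatenate
    the head outputs, and apply the output projection W_o.

    x:             input matrix, shape (seq_len, d_model)
    W_q, W_k, W_v: lists of per-head projections, each (d_model, d_k)
    W_o:           output projection, shape (num_heads * d_k, d_model)
    """
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv  # pre-processing: per-head projections
        heads.append(scaled_dot_product_attention(Q, K, V))  # see next section
    concat = np.concatenate(heads, axis=-1)  # post-processing: concatenate heads
    return concat @ W_o                      # final linear projection
```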

Scaled Dot-Product Attention

The scaled dot-product attention component

Scaled Dot-Product Attention uses three matrices: Q (Query), K (Key), and V (Value). These are each derived from a learned linear transformation of the input. Q is multiplied by the transpose of K to give a matrix of scores that roughly corresponds to each token's relevance to each other token in the sequence. The scores are divided by the square root of the key dimension (the "scaled" part of the name) and passed through a softmax so that each row of weights sums to 1. The weight matrix is then multiplied by V, producing for each token a weighted sum of the value vectors as the output.
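
A minimal NumPy sketch of this computation, matching the paper's formula Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V (variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: each row of the output sums to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # token-to-token relevance scores
    weights = softmax(scores)        # normalize each row into attention weights
    return weights @ V               # weighted sum of value vectors per token
```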