Revision as of 04:06, 3 September 2020

A Transformer is a type of sequence-to-sequence Neural Network that was proposed in the 2017 paper Attention Is All You Need. It relies on a self-attention mechanism that allows it to use a larger portion of the context, allowing it to outperform Recurrent Neural Networks such as LSTMs and GRUs.

The Model

The transformer model

The model works as shown above. Its components are explained below.

Positional Encoding

Because transformers process all tokens in parallel rather than in sequence, the model has no built-in concept of word order. To fix this, a positional encoding is added: each token's position in the sequence is encoded and added to that token's representation, so the model can keep track of order.
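The original paper uses a fixed sinusoidal encoding, where each position is mapped to a vector of sines and cosines at different frequencies and added to the token embeddings. A minimal NumPy sketch (the function name is illustrative; an even model dimension is assumed):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per pair of dims
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings:
# X = embeddings + positional_encoding(len(tokens), d_model)
```

Because the encoding is added element-wise, the embedding dimension and the encoding dimension must match.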

Multi-Head Attention

The multihead attention component

Multi-Head Attention is a component that runs several Scaled Dot-Product Attention "heads" in parallel, with linear pre- and post-processing: the input is projected separately for each head, the heads' results are concatenated, and a final linear layer maps the concatenation back to the model dimension.
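The project-attend-concatenate-project pipeline can be sketched as follows (a simplified self-attention version with one projection matrix per head; function and parameter names are illustrative, and the per-head softmax attention it calls is defined in the next section):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention (see the next section).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: one (d_model, d_head) projection per head; Wo: output projection.
    heads = [attention(X @ wq, X @ wk, X @ wv)       # each head attends independently
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo       # concatenate, then project back
```

Splitting the projections across heads lets each head attend to different aspects of the sequence while keeping the total computation comparable to a single full-width attention.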

Scaled Dot-Product Attention

The scaled Dot-Product attention component

Scaled Dot-Product Attention uses three matrices: Q (Query), K (Key), and V (Value), each derived from a linear transformation of the input. Q and K are multiplied to give a matrix of scores that roughly corresponds to each token's relation to every other token in the sequence. The scores are scaled and passed through a softmax, mapping them to the range 0–1, and the result is multiplied by V. Each output row is therefore a weighted sum of the value vectors, weighted by how relevant each token is to the one being processed.