A transformer is a type of sequence-to-sequence [[neural network]] model that was proposed in the 2017 paper [https://arxiv.org/abs/1706.03762 Attention Is All You Need]. It relies on a self-attention mechanism that lets it use a larger portion of the context, allowing it to outperform recurrent neural networks (RNNs) such as LSTMs and GRUs.
 
==The Model==
 
[[File:Transformer model picture.png|none|The transformer model]]
 
The components are explained below.
 
===Positional Encoding===
 
Because transformers work in parallel rather than in sequence, the model has no concept of the order of things. To fix this, there is a positional encoding, which adds the location of each [[token]] to each token's data so the model can keep track of it.
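
A minimal sketch of one common scheme for this, the sinusoidal positional encoding described in [https://arxiv.org/abs/1706.03762 Attention Is All You Need]. The function name and the use of NumPy are illustrative assumptions, not part of any particular library:

<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    # Each pair of dimensions shares a frequency: 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                    # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])         # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])         # odd dimensions use cosine
    return encoding

# The encoding is simply added to the token embeddings before the first layer:
# inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
</syntaxhighlight>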
 
===Multi-Head Attention===
 
[[File:Multihead attention.png|The multihead attention component]]
 
Multi-head attention is a component that uses Scaled Dot-Product Attention with pre- and post-processing. Its primary purpose is to run multiple Scaled Dot-Product Attention components in parallel, concatenate their results, and transform the combined output.
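
A minimal sketch of the idea, assuming NumPy. The randomly initialised projection matrices are purely for illustration; a real model learns these weights during training:

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, seed=0):
    """x: (seq_len, d_model). Returns (seq_len, d_model). Assumes d_model % num_heads == 0."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head has its own projections for Q, K and V
        # (random here only to keep the sketch self-contained).
        w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_head))   # scaled dot-product attention
        heads.append(weights @ v)                      # (seq_len, d_head)
    concat = np.concatenate(heads, axis=-1)            # (seq_len, d_model)
    w_o = rng.standard_normal((d_model, d_model))      # final output projection
    return concat @ w_o

out = multi_head_attention(np.random.default_rng(1).standard_normal((10, 64)), num_heads=8)
print(out.shape)   # (10, 64)
</syntaxhighlight>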
  
 
====Scaled Dot-Product Attention====
 
Scaled Dot-Product Attention uses three matrices Q, K, and V: Q is the query, K is the key, and V is the value. Each is derived from a matrix transformation of the input. Q is multiplied by the transpose of K to give a matrix of scores that roughly corresponds to how strongly each [[token]] relates to each other token in the sequence. The scores are divided by the square root of the key dimension (the "scaled" part) and passed through a softmax so that each row lies in the range 0–1 and sums to 1. Multiplying the result by V then gives the output: each row is a weighted sum of the value vectors, weighted by how relevant the corresponding tokens are.
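
A minimal sketch of the computation, assuming NumPy; the helper name and the random test inputs are illustrative only:

<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # token-to-token relation scores
    scores = scores - scores.max(axis=-1, keepdims=True)     # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                                       # weighted sum of the value vectors

# Tiny example: 5 tokens, 16-dimensional queries/keys, 32-dimensional values.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((2, 5, 16))
V = rng.standard_normal((5, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 32)
</syntaxhighlight>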
[[Category:Artificial intelligence]]
