Understanding ChatGPT, Transformers and Attention — questions, notes, answers

Filip Drapejkowski
8 min read · Jan 18, 2023

After reading a few papers and articles on Transformers, some things were still unclear to me. Here is the list; I will add answers over time.

RNNs, LSTMs — why even mention them?

As far as I can tell, you can ignore every sentence that mentions RNNs or LSTMs. They were used historically, but since “Attention is all you need” there is no point in focusing on them anymore when studying transformer networks.

Residual Connections, Encoder, Decoder

  1. We can see terms “Decoder” and “Encoder” all around the place.
  2. We can see residual connections around attention blocks.
  3. A skip connection implies that the input and output of a particular layer (or block of layers) have the same shape.
  4. Keeping the input and output shape the same is great: it not only prevents the vanishing gradient problem in very deep networks, it also makes our work as ML engineers so much easier. We don’t have to recompute shapes all the time, parallelization is easier, and visualization of features works better. We enhance or transform the feature representation instead of building it from scratch.
  5. Autoencoders have a bottleneck. It is often used as an embedding vector, e.g. in convolutional networks for FaceID. So the middle layer, at the end of the encoder, should ideally be very small (e.g. 1024x1 or 32x32x1; in the case of U-Net it’s 28x28x1024, which is not VERY small, but still small). We call this layer the bottleneck. Everything before it is called the Encoder, everything after it is called the Decoder.
    Btw, don’t confuse the embedding mentioned above with the embedding layer at the beginning of a transformer network; these are related ideas, but not the same.
  6. U-Nets (which are usually considered autoencoders) also have something like a skip/residual connection, but it doesn’t run within every layer/block; rather it runs from block i to block N-i (0 to N, 1 to N-1, 2 to N-2, etc.). The operation is ‘copy and crop’ (and concatenate).
  7. OK, the question is: how can they have an encoder and a decoder, and residual connections, at the same time?! How can the representation get gradually smaller with every layer of the Encoder and gradually larger with every layer of the Decoder, AND have residual connections at the same time?
  8. My current answer is: Encoder and Decoder in NLP are called that way mostly for historic reasons, or maybe because for language translation tasks you would have 1 encoder per input language and 1 decoder per output language, so in theory you could switch languages by swapping either the encoder or the decoder. But there is no actual bottleneck and no ‘hourglass’ architecture there.
Example autoencoder with the ‘hourglass’ architecture. The bottleneck is of shape 1151x1. I have no idea what the layer with 10 and h stands for, but it’s not important for this consideration. The left part is the encoder, the right part is the decoder.
U-Net. The representation in the bottleneck is 32x32x1024 (~1 million values); near the input it used to be 570×570×64 (~20 million values). The left part is the ENCODER, the right part is the DECODER.
Residual connection. The addition can be done without any padding if F(x) and x are of the same shape, e.g. an input of size 64x64x128 and an output also of shape 64x64x128.
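
To make the ‘same input and output shape’ point concrete, here is a minimal sketch of a residual block in PyTorch (my own illustration, not taken from any of the papers): F(x) keeps the shape, so x can simply be added back.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # F(x) keeps the channel count and spatial size, so x + F(x) is valid.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.f(x)  # the residual / skip connection

x = torch.randn(1, 128, 64, 64)   # batch, channels, height, width
y = ResidualBlock(128)(x)
print(y.shape)                    # torch.Size([1, 128, 64, 64]), same as the input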

Meaning of “Output (shifted right)”

Transformer network diagram from “Attention is all you need” paper.

From the paper alone, I couldn’t figure out what the “Outputs (shifted right)” caption stands for.

So, let’s say we have a sentence with N tokens (and that 1 token = 1 word).
“A random sentence in English was written by some random guy and translated to Spanish using google translate.” Google says it means “Una oración al azar en inglés fue escrita por un chico al azar y traducida al español usando el traductor de Google.” in Spanish.

So for i = 0, Inputs = “A”, Outputs (shifted right) = nothing (in practice just a start-of-sequence token), Output probabilities = ‘Una’ with, let’s say, 98% probability. The remaining 2% is distributed over all other tokens in the vocabulary (e.g. 1.2% goes to ‘Un’).

For i = 5, Inputs = “A random sentence in English was”, Outputs (shifted right) = “Una oración al azar en”. Output probabilities: ‘inglés’ (84%) and some other words (their probabilities sum up to 16%).

We are concatenating the tokens that we have translated in previous steps, so that the network knows what has already been translated. (To be precise, the source-side Inputs go through the encoder in full at every step; it is only the decoder input, Outputs (shifted right), that grows token by token.)

Even though doing this step by step looks really slow, during training we feed the ground truth instead of the actual outputs of the previous step (this is known as teacher forcing, and it lets us run all positions in parallel). Even if the network predicts something wrong (“Un” instead of “Una”), or something equally good as the ground truth (a translation that makes sense but uses different words), Outputs (shifted right) will be taken from the labels.
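
A minimal sketch of how the decoder input can be built from the target during training (the token IDs below are made up; a real model gets them from a tokenizer):

# Toy illustration of "Outputs (shifted right)" / teacher forcing.
BOS, EOS = 1, 2                       # start- and end-of-sequence token IDs (assumed)

target = [17, 43, 8, 91, EOS]         # "Una oración al azar ..." as made-up token IDs
decoder_input = [BOS] + target[:-1]   # the same sequence, shifted right by one

# At position i the decoder may only look at decoder_input[:i+1] (enforced by
# masking) and is trained to predict target[i]:
for i in range(len(target)):
    print(f"sees {decoder_input[:i+1]} -> should predict {target[i]}")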

Dot product as a measure of… vector similarity?

Often people talk about dot product as ‘vector similarity measure’ (in the context of dot product of Query and Key vectors in attention). But let’s look at some example dot products:

Dot product of [5,5,5],[5,5,5] is 75.
Dot product of [1,2,3],[3,0,-1] is 0.
Dot product of [77,89,111],[0,0,0] is 0.
Dot product of [1,1,1],[1,1,1] is 3.

It seems that identical vectors can have low or high dot products. The ‘vector similarity’ notion comes from geometry:

‘ Two nonzero vectors are perpendicular, or orthogonal, if and only if their dot product is equal to zero.’ and ‘The dot product of two parallel vectors is equal to the product of the magnitude of the two vectors.’.

But it seems to me that using this term in the context of attention is misleading: as the examples above show, the dot product mixes up direction (how aligned two vectors are) with magnitude (how long they are), so two identical vectors can score low or high depending only on their length.
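
For illustration, here is a quick check of those numbers, comparing the raw dot product with cosine similarity, which normalizes the magnitude away (this is just my own sanity check, not something from the paper):

import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product of the length-normalized vectors.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return float("nan")           # undefined for the zero vector
    return float(np.dot(a, b) / (na * nb))

pairs = [
    ([5, 5, 5],     [5, 5, 5]),
    ([1, 2, 3],     [3, 0, -1]),
    ([77, 89, 111], [0, 0, 0]),
    ([1, 1, 1],     [1, 1, 1]),
]

for a, b in pairs:
    a, b = np.array(a), np.array(b)
    print(a.tolist(), b.tolist(), "dot:", int(np.dot(a, b)), "cos:", cos_sim(a, b))

# The two pairs of identical vectors have dot products 75 and 3, yet both have
# cosine similarity 1.0: the raw dot product also encodes the vectors' lengths.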

Query, Key… VALUE. What do we need value for?

The authors of “Attention is all you need” define these 3 matrices and matrix-multiply the inputs by them; then, after some computations, we get the attention.

So we take the query for the i-th token and compute its dot product with every key in the whole input sentence. And then, magically, for nobody knows what reason, we also mix in the value (maybe scaled by some weight).
Why? Where does it come from?

Let’s establish a general rule: we can multiply values from the previous layer (or the inputs in general) by some random bunch of weights (a.k.a. a matrix) and, provided that we have a skip connection, this will most likely improve our accuracy. (The skip connection is needed to avoid vanishing gradients.)

So you know, if you are not sure about something when designing your network architecture, if something is unstable or weird, or the shapes don’t fit, just multiply the dangling things by some parameters: add a fully connected layer or a ‘pointwise convolution’.
In the worst case the whole thing will need more memory and run slower while the scores stay the same. OK, maybe the train accuracy goes up and the test accuracy goes down due to some overfitting… But you know what I mean: introducing more parameters just in case is usually OK.
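
In the spirit of that rule, a tiny sketch (my own, not from the paper) of ‘multiply by some parameters and keep a skip connection’, here as a pointwise linear layer applied to a sequence of token features:

import torch
import torch.nn as nn

d_model = 512                        # feature size per token (assumed)
proj = nn.Linear(d_model, d_model)   # "some random bunch of weights"

x = torch.randn(2, 10, d_model)      # a batch of 2 sentences, 10 tokens each
y = x + proj(x)                      # extra parameters plus a skip connection
print(y.shape)                       # torch.Size([2, 10, 512]), shape unchanged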

Knowing that rule, we could replace the revolutionary attention layer with a plain old fully connected layer and see how much the accuracy drops. The whole network should still work, the shapes of the tensors should stay the same, only the accuracy would drop significantly (but not to zero). This would be a great idea for an experiment. Oh, somebody already did it: https://arxiv.org/pdf/2005.13895.pdf .

Guess what: the accuracy actually went up. (Imagine my face when I found that paper.) Note: they replaced only some of the attention blocks with feed-forward layers, not all of them.

Ok, let’s unpack this mess in a separate paragraph, because I’m getting a headache.

PS. Of course, by ‘accuracy’ I mean the general sense of the word, e.g. BLEU or some other metric popular in the NLP community.

Query, Key, Value — 2nd attempt.

Remember that diagram? I made it simpler for us by blurring out the parts that are not interesting for us right now.

OK, it seems that in fact we have two types of attention layer in the “Attention is all you need” paper.

Type one has the same inputs used to compute Q, K and V. I think you could totally get away with simply using Q and K; the whole attention ‘magic’ would happen there. And the Value? It’s just some weights. Not important. It could be a separate fully connected or pointwise layer.

Type two (the one in the decoder) has K and V originating from the encoder side (the Input Embedding branch), while Q originates from our old friend “Outputs (shifted right)”. So I guess it’s just a clever way to incorporate the info “what has been translated so far” into the representation.

It could probably still be done as a separate linear layer. But this way it’s elegant: we have a single layer that can do multiple things, and the final figure in the paper looks cleaner. Who knows, maybe it’s also super convenient for some optimizations on GPUs?

Update on Q, K, V:

This video is really, really good. It clearly shows that Q, K and V in the ‘self-attention’ variant are created in the same way, and that you could probably omit V without too much of a penalty. The thing is, when you compute Q*K^T and apply softmax, you lose a lot of information. The resulting matrix says how each token (token = word/letter/subword/embedding) relates to every other token (thanks to softmax), but you may actually lose the information about the token itself. So the value is something similar to a skip connection: it makes sure the original input is not discarded.
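
To see where Q, K and V enter, here is a minimal self-attention sketch (the dimensions and the single-head, no-mask setup are my own simplifications):

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_tokens = 8, 5                 # tiny made-up sizes
x = torch.randn(n_tokens, d_model)       # one sentence, already embedded

# Three plain linear projections of the same input (the "self-attention" case).
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / d_model ** 0.5        # scaled dot products: how tokens relate
weights = F.softmax(scores, dim=-1)      # each row sums to 1
out = weights @ V                        # weighted mix of value vectors
print(weights.shape, out.shape)          # torch.Size([5, 5]) torch.Size([5, 8])

Without the last matrix multiplication, all you would have is the 5x5 relation matrix; multiplying by V is what brings the content of the tokens themselves back into the output.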

How to understand embedding layer

An embedding layer (not the bottleneck layer of ‘real’ autoencoders, but the input layer of language models, e.g. transformers) converts the one-hot encoded vocabulary into a convenient format.

It may or may not include some optimizations (a lookup table), used especially at inference time, but in general it’s just a normal linear layer. Its weights are trained jointly with the other layers during model optimization, and it is usually initialized with random weights.

You could initialize it with weights from word2vec, trained with CBOW or Skip-gram, if you have a very small dataset, but you certainly don’t have to.

See https://discuss.pytorch.org/t/how-does-nn-embedding-work/88518/20 (answer by https://discuss.pytorch.org/u/vdw).
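
A small sketch of the ‘it’s just a linear layer’ claim, using PyTorch’s nn.Embedding (the sizes are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 7, 4
emb = nn.Embedding(vocab_size, d_model)   # just a trainable weight matrix

tokens = torch.tensor([4, 0, 6])          # token indices
a = emb(tokens)                           # lookup: rows 4, 0 and 6 of the weights

one_hot = F.one_hot(tokens, vocab_size).float()
b = one_hot @ emb.weight                  # the same thing as a matrix multiply
print(torch.allclose(a, b))               # True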

Why do we even need one-hot-encoded vectors or embeddings?

This is pretty basic: in one-hot encoding we have 0s at all positions but one, the one that we point to with an index and which marks the token we are encoding.

Why don’t we just use an integer? Why ‘0 0 0 0 1 0 0’ instead of just ‘4'?

This is because our dictionary assumes that every token is equally distant from every other token.
If 0 = A, 1 = B, … 25 = Z, we don’t want to imply that A is closer to B than to Z.
Side note: ‘A’ is probably more related to ‘a’ than to ‘z’, but this relation is ignored in both approaches.
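
A quick illustration of the difference (my own example, using Euclidean distance):

import numpy as np

# Integer encoding: distances depend on the arbitrary ordering of the alphabet.
a, b, z = 0, 1, 25
print(abs(a - b), abs(a - z))                   # 1 vs 25: A looks much "closer" to B

# One-hot encoding: every pair of distinct tokens is equally far apart.
def one_hot(i, n=26):
    v = np.zeros(n)
    v[i] = 1.0
    return v

print(np.linalg.norm(one_hot(a) - one_hot(b)))  # 1.414...
print(np.linalg.norm(one_hot(a) - one_hot(z)))  # 1.414..., the same distance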
