Thanks for the explanation. At 9:19: shouldn't the order of multiplication be the opposite here? E.g. x1 (vector) * Wq (matrix) = q1 (vector). Otherwise I don't understand how we get the 1x3 dimensionality at the end.
Thanks so much for this video. I’ve gone through a number of videos on transformers and this is much easier to grasp and understand for a non-data scientist like myself.
You know how to explain things. This one is not easy: I can see the amount of work that went into this video, and it was a lot. I hope that your career takes you where you deserve.
This is a very well-made explanation. I hadn't known that the feedforward layers only received one token at a time. Thanks for clearing that up for me! 😁
As far as I am aware, word embeddings have moved on from legacy static embeddings like Word2Vec/GloVe (think of the famous queen = woman + king - man analogy) to BPE and unigram tokenization. This change gave me quite a headache, as most papers do not mention any details of their "word embedding". Letitia, perhaps you could make a video to clarify this a bit for us.
Tomorrow I have my thesis evaluation and I was thinking about watching that video again, but the YouTube algorithm suggested it to me without my searching for anything. Thank you, YouTube algo 😅❤🔥
Thank you very much for the very clear explanations and detailed analysis of the transformer architecture. You're truly the 3Blue1Brown of machine learning!
Wow, this explanation on the difference between RNNs and Transformers at the end is what I was missing! I've always heard that Transformers are great because of parallelization but never really saw why until today, thank you! Great video!
Today I had the problem that I needed to understand how Transformers work. I searched on YouTube and found your video 20 minutes after its release. What perfect timing!
That's great. I'm a little stuck on the special mask token, but I'll keep digging. Good info, and the video is a good explanation: it allows for more experimentation instead of relying on open-source models whose components can look like a black box to noobs like me :)
❤ Letitia, thank you for great visualization and intuition. For inspiration: In the original paper, the decoder utilizes the output of the encoder by running a cross-attention process. Why does GPT not use an encoder? As you've mentioned, the encoder is typically used for classification, while the decoder is for text generation. They are never used in combination. Why is this the case? Missing Intuition: Why does the cross-attention layer inside the decoder take the values from the ENCODER’s output to create the enhanced embeddings (as a weighted mix)? Intuitively, I would use the values from the DECODER.
Thanks for your prompt reply. Hence, understanding the concept and intuition behind feeding the encoder output into the decoder is essential. I found only this one video on encoder-decoder cross-attention: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Dqjq4Gxdhng.htmlsi=gtLzNxAU0pUGyLvk In it, Lennart emphasizes the observation that, based on the original equations, we have the enhanced embeddings calculated as a weighted sum of ENCODER values. Inside of a DECODER, I would rather expect to have the DECODER values pass through. Letitia, I am sure, you will resolve this mystery. 🍀
Awesome video, thank you! I love the idea of you revisiting older topics -- either as a 201 or as a re-introduction. "Attention combines the representation of input vector's value vectors, weighted by the importance score (computed by the query and key vectors)."
At 11:14, the weighted sum is the sum of 3 vectors of 3 elements each, but the result is a vector of 4 elements, which, conveniently, is the same size as the input vector. Could there be a missing step there?
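A small sketch of the shapes involved may help here. A weighted sum of 3-element value vectors is itself a 3-element vector. In the original Transformer paper an output projection (W^O) then maps the attention result back to the model dimension. Whether that is the step left out of the video at this point is only a guess; the numbers and the matrix `W_o` below are made up for illustration.

```python
def weighted_sum(weights, values):
    """Mix value vectors with attention weights; the result keeps the value dimension."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def vec_mat(x, W):
    """Row vector (length n) times matrix (n rows, m columns) -> row vector (length m)."""
    assert len(x) == len(W), "inner dimensions must match"
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical attention weights (they sum to 1) and three 3-element value vectors.
weights = [0.5, 0.25, 0.25]
values = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
]
mixed = weighted_sum(weights, values)   # still 3 elements

# Hypothetical 3x4 output projection that restores the 4-element input size.
W_o = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
out = vec_mat(mixed, W_o)               # 4 elements, same size as the input vector

print(len(mixed), len(out))
```

So going from 3 to 4 elements needs one more matrix multiplication after the weighted sum.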
What a great video. It could still expand more and really sum up every sub-part, connecting each one to a clear visualization or a clear step of what happens with the information at each time step and how its "transformation" progresses over time. So I think you could redo this video and really make it monkey-proof for folks like me. But beware: if you look, for example, at the StatQuest version, it's too slow and too repetitive, and it also does not really capture what goes on inside the Transformer once all the steps are stacked together. Great work!
Thanks for your video. I have a question on the inference process. For example, when I have an input prompt of 2 tokens = {t1, t2}, we will get the output {o1, o2, o3}. We will take only o3 and make the new input sequence {t1, t2, o3}. Then we will get another output {o'1, o'2, o'3, o'4}. Here are my questions. When we use causal masking for attention, is o1 = o'1 and o2 = o'2 and so on? Another question: even though the mask guarantees causal attention, the matrix calculation is still performed, which means the computation is spent anyway. How can we reduce the compute needed in this case?
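On both questions, here is a toy sketch under simplifying assumptions: a single attention layer, identity projections (queries = keys = values = the embeddings), and made-up 2-element embeddings. With causal masking, the output at a position depends only on that position and the ones before it, so appending a token leaves the earlier outputs unchanged. That is why implementations commonly cache the keys and values of past tokens (a "KV cache") and compute only the new position. The sketch skips masked positions instead of computing and then masking them, which gives the same result.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

def causal_attention(xs):
    """Causal self-attention: position i sees only positions 0..i.
    Assumption: identity projections, so queries = keys = values = xs."""
    return [attend(xs[i], xs[: i + 1], xs[: i + 1]) for i in range(len(xs))]

prompt = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical embeddings for t1, t2
new_token = [1.0, 1.0]              # hypothetical embedding for the appended token

short = causal_attention(prompt)
long = causal_attention(prompt + [new_token])
same_prefix = long[:2] == short     # earlier outputs are unchanged

# With cached keys and values, only the new position has to be computed.
cache = prompt + [new_token]
last_only = attend(new_token, cache, cache)

print(same_prefix, last_only)
```

In this toy setting the answer to the first question is yes, and the second shows where the saving comes from: one new row of scores per generated token instead of the full matrix.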
Everything makes sense except multiple attention heads. Each layer has only one set of Q, K, V, O matrices. But 8 attention heads per layer? I want to understand that.
Think about it this way: in one layer, instead of having one head telling you how to pay attention to things, you have 8. In other words, instead of having one person shout at you the things they want you to pay attention to, you have 8 people simultaneously shouting at you. This is beneficial because it has an ensembling effect (the effect of a voting parliament; think of Random Forests, which are an ensemble of Decision Trees). I do not know if this helps, but I thought I'd give explaining this another shot.
Hi! Can I ask how you got the dimension (d)? All I know is that a dimension can be found for square matrices, and the attention formula has the dot product Q•K^T. If we're using 1x3 matrices, we'll get a 1x1 matrix, or 1 dimension, so how do you get 3? Unless it's a 3x1 matrix beforehand, so that we get a 3x3, or 3-dimensional, matrix. Thank you!
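One way to square "one set of Q, K, V, O matrices" with "8 heads": in common implementations the single projection produces one long vector per token, and that vector is sliced into 8 equal chunks, one per head. Each head runs attention on its own chunk, the results are concatenated, and the O matrix mixes them. A minimal sketch of the slicing step, with a made-up 16-element query (`split_heads` is a hypothetical helper, not a library function):

```python
def split_heads(vector, num_heads):
    """Slice one projected vector into equal per-head chunks."""
    assert len(vector) % num_heads == 0, "dimension must divide evenly across heads"
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# Hypothetical 16-element query for one token, as produced by the single W_q.
q = [float(i) for i in range(16)]

heads = split_heads(q, 8)                               # 8 chunks of 2 elements each
merged = [x for head in heads for x in head]            # concatenating restores the vector

print(len(heads), len(heads[0]))
```

So the 8 "people shouting" share one big matrix, but each looks at a different slice of its output.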
Hi, if you mean the mistake at 10:00, then the problem is that I have written matrix times vector when I should have written vector times matrix! (or I could have used column vectors instead of row vectors). Is this what you mean?
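For anyone checking the shapes, here is a small sketch with row vectors in plain Python. The 4-element embedding, the 4x3 matrix, and the key vector are made-up values. A row vector times a matrix, (1x4) times (4x3), gives a 1x3 query. The dot product of a 1x3 query with a 1x3 key is a single number, and the d in the attention formula is the length of those vectors (here 3), not the size of the product.

```python
import math

def vec_mat(x, W):
    """Row vector (length n) times matrix (n rows, m columns) -> row vector (length m)."""
    assert len(x) == len(W), "inner dimensions must match"
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 4-element token embedding and 4x3 query matrix.
x1 = [1.0, 0.0, 2.0, 1.0]
W_q = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

q1 = vec_mat(x1, W_q)        # vector times matrix: (1x4) @ (4x3) -> 1x3
k1 = [1.0, 0.0, 1.0]         # hypothetical 1x3 key vector

d = len(q1)                  # d is the query/key vector length
score = sum(a * b for a, b in zip(q1, k1)) / math.sqrt(d)   # one scalar per (query, key) pair

print(q1, d, score)
```

Writing it the other way round, matrix (4x3) times vector (1x4), fails the inner-dimension check, which is the mismatch pointed out above.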
Time is quadratic, but memory is linear -- see the FlashAttention paper. But the number of parameters is constant -- that's the magic! Thanks for the excellent videos! 👍
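A rough count makes the contrast concrete. This sketch assumes one attention layer with four square d_model x d_model projections (W_q, W_k, W_v, W_o) and ignores biases; `attention_cost` is a hypothetical helper for illustration only.

```python
def attention_cost(seq_len, d_model):
    """Return (number of query-key scores, number of projection parameters)."""
    score_entries = seq_len * seq_len      # one score per (query, key) pair
    parameters = 4 * d_model * d_model     # W_q, W_k, W_v, W_o; biases ignored
    return score_entries, parameters

short = attention_cost(10, 8)
longer = attention_cost(20, 8)

print(short, longer)
```

Doubling the sequence length quadruples the number of scores, while the parameter count does not change.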