
Lecture 12.1 Self-attention 

DLVU
5K subscribers
69K views

ERRATA:
- In slide 23, the indices are incorrect. The index of the key and value should match (j) and the index of the query should be different (i).
- In slide 25, the diagram illustrating how multi-head self-attention is computed is a slight departure from how it's usually done (the implementation in the subsequent slide is correct, but the two are not quite functionally equivalent). See the slides PDF below for an updated diagram.
In this video, we discuss the self-attention mechanism: a very simple and powerful sequence-to-sequence layer that is at the heart of transformer architectures.
annotated slides: dlvu.github.io/sa
Lecturer: Peter Bloem
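
For readers who prefer code, here is a minimal sketch (not the lecture's own implementation) of the simple, parameter-free self-attention described in the video, written in PyTorch and assuming batch-first (batch, t, k) tensors:

import torch
import torch.nn.functional as F

def simple_self_attention(x):
    """Simple (parameter-free) self-attention.

    x: tensor of shape (batch, t, k) -- a batch of sequences of t vectors
    of dimension k. Each output vector is a weighted average over all
    input vectors in the same sequence, with weights from dot products.
    """
    raw = torch.bmm(x, x.transpose(1, 2))   # (batch, t, t) raw weights: all pairwise dot products
    w = F.softmax(raw, dim=2)               # normalize each row to sum to one
    return torch.bmm(w, x)                  # (batch, t, k) weighted sums of the inputs

y = simple_self_attention(torch.randn(4, 5, 16))
print(y.shape)  # torch.Size([4, 5, 16])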

Published: 1 Jun 2024

Comments: 97
@derkontrolleur1904 3 years ago
Finally an actual _explanation_ of self-attention, particularly of the key, value and query that was bugging me a lot. Thanks so much!
@Epistemophilos 2 years ago
Exactly! Thanks Mr. Bloem!
@rekarrkr5109 1 year ago
OMG, me too, I was thinking of relational databases because they were saying "database" and it wasn't making any sense
@Mars.2024 4 days ago
Finally I have an intuitive view of self-attention. Thank you😇
@MrOntologue 7 months ago
Google should rank videos according to the likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches because I have seen and read plenty, and this is the first time the QKV as dictionary vs RDBMs made sense; that confusion had been so bad it literally stopped me thinking every time I had to consider Q, or K, or V and thus prevented me grokking the big idea. I now want to watch/read everything by you.
@ArashKhoeini 1 year ago
This is the best explanation of self-attention I have ever seen! Thank you VERY MUCH!
@constantinfry3087 3 years ago
Wow - only 700 views for probably the best explanation of Transformers I came across so far! Really nice work! Keep it up!!! (FYI: I also read the blog post)
@sohaibzahid1188 2 years ago
A very clear and broken down explanation of self-attention. Definitely deserves much more recognition.
@HiHi-iu8gf 11 months ago
holy shit, been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video. very well explained, very good video :)
@dhruvjain4372 3 years ago
Best explanation out there, highly recommended. Thank you!
@olileveque 1 month ago
Absolutely amazing series of videos! Congrats!
@tizianofranza2956 3 years ago
Saved lots of hours with this simple but awesome explanation of self-attention, thanks a lot!
@nengyunzhang6341 2 years ago
Thank you! This is the best introductory video to self-attention!
@thcheung 2 years ago
The best ever video showing how self-attention works.
@muhammadumerjavaid6663 3 years ago
thanks, man! you packed some really complex concepts in a very short video. going to watch more material that you are producing.
@Ariel-px7hz 1 year ago
This is a really excellent video. I was finding this a very confusing topic but I found it really clarifying.
@farzinhaddadpour7192 1 year ago
I think one of the best videos describing self-attention. Thank you for sharing.
@MonicaRotulo 2 years ago
The best explanation of transformers and self-attention! I am watching all of your videos :)
@sathyanarayanankulasekaran5928 2 years ago
I have gone through 10+ videos on this, but this is the best ...hats off
@josemariabraga3380 2 years ago
This is the best explanation of multi-head self attention I've seen.
@maxcrous 3 years ago
Read the blog post and then found this presentation, what a gift!
@marcolehmann6477 3 years ago
Thank you for the video and the slides. Your explanations are very clear.
@workstuff6094 2 years ago
Literally the BEST explanation of attention and transformer EVER!! Agree with everyone else about why this is not ranked higher :( I'm just glad I found it !
@RioDeDoro 1 year ago
Great lecture! I really appreciated your presentation by starting with simple self-attention, very helpful.
@szilike_10 1 year ago
This is the kind of content that deserves the like, subscribe and share promotion. Thank you for your efforts, keep up!
@martian.07_ 2 years ago
Take my money, you deserve everything, greatest of all time. GOT
@xkalash1 3 years ago
I had to leave a comment, the best explanation of Query, Key, Value I have seen!
@Raven-bi3xn 1 year ago
This is the best video I've seen on attention models. The only thing is that I don't think the explanation of the multi-head part at minute 19 is accurate. What multi-head does is not treat the words "too" and "terrible" differently from the word "restaurant". What it does is, instead of using the same weight for all elements of the embedding vector, as shown at 5:30, calculate 2 weights, one for each half of the embedding vector. So, in other words, we break down the embedding vectors of the input words into small pieces and do self-attention on ALL embedding sub-vectors, as opposed to doing self-attention for the embeddings of "too" and "terrible" differently from the attention of "restaurant".
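As a toy sketch of the "cut the embedding into sub-vectors" view this comment describes (names and shapes are made up for illustration; real multi-head attention also applies learned query/key/value projections per head, which are omitted here):

import torch
import torch.nn.functional as F

def chopped_self_attention(x, heads=2):
    """x: (batch, t, k), with k divisible by `heads`.

    Cuts every embedding vector into `heads` equal chunks, applies plain
    self-attention within each chunk, and concatenates the results back
    into vectors of dimension k.
    """
    b, t, k = x.size()
    s = k // heads
    outputs = []
    for h in range(heads):
        xh = x[:, :, h * s:(h + 1) * s]                           # (b, t, s) chunk for head h
        w = F.softmax(torch.bmm(xh, xh.transpose(1, 2)), dim=2)   # (b, t, t) weights within this chunk
        outputs.append(torch.bmm(w, xh))                          # (b, t, s)
    return torch.cat(outputs, dim=2)                              # (b, t, k)

print(chopped_self_attention(torch.randn(1, 4, 8), heads=2).shape)  # torch.Size([1, 4, 8])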
@imanmossavat9383 2 years ago
This is a very clear explanation. Why does YouTube not recommend it??!
@BenRush 1 year ago
This really is a spectacular explanation.
@AlirezaAroundItaly 1 year ago
Best explanation I found for self-attention and multi-head attention on the internet, thank you sir
@bello3137 1 year ago
very nice explanation of self attention
@fredoliveira7569 8 months ago
Best explanation ever! Congratulations and thank you!
@mohammadyahya78 1 year ago
Fantastic explanation for self-attention
@senthil2sg 11 months ago
Better than the Karpathy explainer video. Enough said!
@user-fd5px2cw9v 5 months ago
Thanks for your sharing! Nice and clear video!
@huitangtt 2 years ago
Best transformer explanation so far !!!
@peregudovoleg 1 year ago
Great video explanation and there is also a good read of this for those interested. Thank you very much professor.
@maganaluis92 1 year ago
This is a great explanation. I have to admit I read your blog thinking the video was just a summary of it, but it's much better than expected. I would appreciate it if you could create lectures in the future on how transformers are used for image recognition; I suspect we are just getting started with self-attention and we'll start seeing more of it in CV.
@davidadewoyin468 1 year ago
This is the best explanation I have ever heard
@junliu7398 1 year ago
Very good course which is easy to understand!
@impracticaldev 1 year ago
Thank you. This is as simple as it can get. Thanks a lot!!!
@MrFunlive 3 years ago
such a great explanation with examples :-) one has to love it. thank you
@zadidhasan4698 9 months ago
You are a great teacher.
@shandou5276 3 years ago
This is incredible and deserves a lot more views! (glad YouTube ranked it high for me to discover it :))
@saurabhmahra4084 8 months ago
Watching this video feels like trying to decipher alien scriptures with a blindfold on.
@rahulmodak6318 3 years ago
Finally found the Best explanation TY.
@clapathy 2 years ago
Thanks for such a nice explanation!
@LukeZhang1 3 years ago
Thanks for the video! It was super helpful
@HeLLSp4Wn123 2 years ago
Thanks, found this very useful !!
@aiapplicationswithshailey3600
So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.
@stephaniemartinez1294 1 year ago
Good sir, I thank ye for this educational video with nice visuals
@ChrisHalden007 11 months ago
Great video. Thanks
@adrielcabral6634 9 months ago
I loved your explanation !!!
@laveenabachani 2 years ago
Thank you so much! This was amazing! Keep it up! This video is so underrated. I will share. :)
@deepu1234able 3 years ago
best explanation ever!
@somewisealien 2 years ago
VU Master's student here revisiting this lecture to help for my thesis. Super easy to get back into after a few months away from the concept. I did deep learning last December and I have to say it's my favourite course of the entire degree, mostly due to the clear and concise explanations given by the lecturers. I have one question though, I'm confused as to how simple self-attention would learn since it essentially doesn't use any parameters? I feel I'm missing something here. Thanks!
@randomdudepostingstuff9696 1 year ago
Excellent, excellent, excellent!
@Markste-in 2 years ago
Best explanation I have seen so far on the topic! One of the few that describe the underlying math and not just show a simple flowchart. The only thing that confuses me: at 6:24 you say W = X^T X, but on your website you show a PyTorch implementation with W = X X^T. Depending on which you use, you get either a [k x k] or a [t x t] matrix?
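A quick shape check for the question above, assuming X holds t token vectors of dimension k (illustrative only, not the lecture's code):

import torch

t, k = 5, 16                   # sequence length, embedding dimension
X = torch.randn(t, k)          # one sequence of t vectors of dimension k

W_tt = X @ X.t()               # (t, t): one raw attention weight per pair of positions
W_kk = X.t() @ X               # (k, k): correlations between embedding dimensions
print(W_tt.shape, W_kk.shape)  # torch.Size([5, 5]) torch.Size([16, 16])

The (t, t) version is the one self-attention needs; whether it is written X X^T or X^T X just depends on whether the sequence runs along the rows or the columns of X.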
@darkfuji196 2 years ago
This is a great explanation, thanks so much! I got really sick of explanations just skipping over most of the details.
@WM_1310 1 year ago
Man, if only I had found this video early on during my academic project, would've probably been able to do a whole lot better in my project. Shame it's already about to end
@TheCrmagic 2 years ago
You are a God.
@wolfisraging 2 years ago
Amazing video!!
@manojkumarthangaraj2122 2 years ago
I know this is the best explanation of transformers I've come across so far. Still, I'm having a problem understanding the key, query and value part. Is there any recommendation where I can learn it completely from the basics? Thanks in advance
@jiaqint961 1 year ago
This is gold.
@karimedx 2 years ago
Man, I was looking for this for a long time, thank you very much for this best explanation, yep it's the best. Btw YouTube recommended this video, I guess this is the power of self-attention in recommender systems.
@geoffreysworkaccount160 2 years ago
This video was raaaaad THANK YOU
@abhishektyagi154 2 years ago
Thank you very much
@kafaayari 2 years ago
Wow, this is unique.
@WahranRai 2 years ago
9:55 if the sequence gets longer, the weights become smaller (softmax with many components): is it better to have shorter sequences?
@VadimSchulz 3 years ago
thank you so much
@balasubramanyamevani7752 1 year ago
The presentation on self-attention was very well put. Thank you for uploading this. I had a doubt @15:56 about how it will suffer from vanishing gradients without the normalization. As dimensionality increases, the overall dot product should be larger. Wouldn't this be a case of exploding gradients? I'd really love some insight on this. EDIT: Listened more carefully again. The vanishing gradient is on the "softmax" operation. Got it now. Great video 🙂
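As a small numerical illustration of that softmax point (hypothetical numbers): when the raw dot products get large, the softmax saturates and its gradients become tiny, which is what scaling by 1/sqrt(k) counteracts.

import torch
import torch.nn.functional as F

raw = torch.tensor([12.0, 3.0, -4.0, 0.5])      # large unnormalized dot products
scaled = raw / torch.sqrt(torch.tensor(256.0))  # divided by sqrt(k), here k = 256

print(F.softmax(raw, dim=0))     # ~[1, 0, 0, 0]: saturated, gradients nearly vanish
print(F.softmax(scaled, dim=0))  # a softer distribution with useful gradients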
@ax5344 3 years ago
I love it when you are talking about the different ways of implementing multi-head attention; there are so many tutorials just glossing over it or taking it for granted, but I would wish to know more details @ 20:30. I came here because your article discussed it, but I did not feel I had a very clear picture. Here, with the video, I still feel unclear. Which one was implemented in the Transformer and which one for BERT? Suppose they cut the original input vector matrix into 8 or 12 chunks, why did I not see in their code the start of each section? I only saw a line dividing the input dimension by the number of heads. That's all. How would the attention head know the input vector idx it needs to work on? Somehow I feel the head needs to know the starting index ...
@dlvu6202 3 years ago
Thanks for your kind words! In the slide you point to, the bottom version is used in every implementation I've seen. The way this "cutting up" is usually done is with a view operation. If I take a vector x of length 128 and do x.view(8, 16), I get a matrix with 8 rows and 16 columns, which I can then interpret as the 8 vectors of length 16 that will go into the 8 different heads. Here is that view() operation in the Huggingface GPT2 implementation: github.com/huggingface/transformers/blob/8719afa1adc004011a34a34e374b819b5963f23b/src/transformers/models/gpt2/modeling_gpt2.py#L208
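A minimal sketch of that view() trick, using the toy sizes from the reply:

import torch

x = torch.randn(128)     # one input vector of length 128
heads = x.view(8, 16)    # reinterpret it as 8 head-vectors of length 16 (no copy is made)
print(heads.shape)       # torch.Size([8, 16]); heads[h] is the slice head h operates on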
@recessiv3 2 years ago
Great video, I just have a question: When we compute the weights that are then multiplied by the value, are these vectors or just a single scalar value? I know we used the dot product to get w so it should be just a single scalar value, but just wanted to confirm. As an example, at 5:33 are the values for w a single value or vectors?
@TubeConscious 2 years ago
Yes, it is a single scalar: the result of the dot product, further normalized by softmax, so the sum of all weights equals one.
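A tiny sketch of that point, with illustrative shapes: each weight is one scalar, and each row of weights sums to one after the softmax.

import torch
import torch.nn.functional as F

t, k = 3, 4
x = torch.randn(t, k)            # 3 input vectors of dimension 4
w = F.softmax(x @ x.t(), dim=1)  # (3, 3): one scalar weight per pair of positions

print(w[0])        # the three scalar weights used to mix the inputs into output 0
print(w[0].sum())  # ~1.0: each row sums to one after the softmax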
@user-oq1rb8vb7y 10 months ago
Thanks for the great explanation! Just one question: if simple self-attention has no parameters, how can we expect it to learn? It is not trainable.
@soumilbinhani8803 8 months ago
Hello, can someone explain this to me: won't the key and the values be the same for each iteration, as we compare it to 5:29? Please help me with this
@turingmachine4122 3 years ago
Hi, thank you for this nice explanation. However, there is one thing that I don't get. How can the self-attention model, for instance in the sentence "John likes his new shoes", compute a high value for "his" and "John"? I mean, we know that they are related, but the embeddings for these words can be very different. Hope you can help me out :)
@mohammadyahya78 1 year ago
Question: at 8:46, may I ask why, since Y is defined as a multiplication, it's purely linear and thus has a non-vanishing gradient (the gradient of a linear operation), while W = softmax(X X^T) is non-linear and thus can cause vanishing gradients? Second, what is the relationship between linearity/non-linearity and vanishing/non-vanishing gradients?
@abdot604 2 years ago
fine, I will subscribe
@ecehatipoglu209 1 year ago
Hi, extremely helpful video here, I really appreciate it, but I have a question. I don't understand how multi-head self-attention works if we do not generate extra parameters for each stack of the self-attention layer; what is the difference in each stack so that we can grasp the different relations of the same word in each layer?
@ecehatipoglu209 1 year ago
Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning somehow dividing the 768 vector so you could have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right?) After that, for each of these 3 attention heads we have 3 sets of parameters [K, Q, V] (superscripted for each attention head). For each attention head our K will be of dimension 256xwhatever, Q will be of dimension 256xwhatever and V will be of dimension 256xwhatever. And this is for one head. Concatenating all learned vectors K, Q and V, we end up with 768xwhatever for each of them, the exact size that we would have with a single attention head. Voila.
@donnap6253 3 years ago
On slide 23, should it not be k_i q_j rather than k_j q_i?
@superhorstful 3 years ago
I totally agree on your opinion
@dlvu6202 3 years ago
You're right. Thanks for the pointer. We'll fix this in any future versions of the video.
@joehsiao6224 2 years ago
@@dlvu6202 Why the change? I think we are querying with current ith input against every other jth input, and the figure looks right to me.
@dlvu6202 2 years ago
@@joehsiao6224 It's semantics really. Since the key and query are derived from the same vector it's up to you which you call the key and which the query, so the figure is fine in the sense that it would technically work without problems. However, given the analogy with the discrete key-value store, it makes most sense to say that the key and value come from the same input vector (i.e. have the same index) and that the query comes from a (potentially) different input.
@joehsiao6224 2 years ago
@@dlvu6202 it makes sense. Thanks for the reply!
@vukrosic6180 1 year ago
I finally understand it jesus christ
@cicik57 1 year ago
How does self-attention make sense on word embeddings, where each word is represented by a random vector, so this self-correlation has no sense?
@edphi 1 year ago
Everything was clear till the query, key and value.. does anyone have a slower video or resource for understanding??
@jrlandau 1 year ago
At 16:43, why is d['b'] = 3 rather than 2?
@dlvu6202 1 year ago
This was a mistake, apologies. We'll fix this in the slides.
@Isomorphist 10 months ago
Is this ASMR?
@juliogodel 3 years ago
This is a spectacular explanation of transformers. Thank you very much!
Up next:
Lecture 12.2 Transformers (18:08)
Lecture 12.3 Famous transformers (BERT, GPT-2, GPT-3) (23:35)