Self Attention in Transformer Neural Networks (with Code!)

122K subscribers

84K views

Let's understand the intuition, math and code of Self Attention in Transformer Neural Networks
ABOUT ME
⭕ Subscribe: ru-vid.com...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
👔 LinkedIn: / ajay-halthor-477974bb
RESOURCES
[1 🔎] Code for video: github.com/ajhalthor/Transfor...
[2 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
[3 🔎] Bidirectional RNN Paper: deeplearning.cs.cmu.edu/F20/d...
PLAYLISTS FROM MY CHANNEL
⭕ ChatGPT Playlist of all other videos: • ChatGPT
⭕ Transformer Neural Networks: • Natural Language Proce...
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know : • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.net/MathML
📕 Calculus: imp.i384100.net/Calculus
📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
📕 Linear Algebra: imp.i384100.net/LinearAlgebra
📕 Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
📕 Python for Everybody: imp.i384100.net/python
📕 MLOps Course: imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): imp.i384100.net/NLP
📕 Machine Learning in Production: imp.i384100.net/MLProduction
📕 Data Science Specialization: imp.i384100.net/DataScience
📕 Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Introduction
0:34 Recurrent Neural Networks Disadvantages
2:12 Motivating Self Attention
3:34 Transformer Overview
7:03 Self Attention in Transformers
7:32 Coding Self Attention

Published: 1 Jul 2024

Comments: 151

@CodeEmporium 1 year ago

If you think I deserve it, please consider liking the video and subscribing for more content like this :)

@meguellatiyounes8659 1 year ago

Do you have any idea how transformers generate new data?

@15jorada 1 year ago

You are amazing man! Of course you deserve it! You are building transformers from the ground up! That's insane!

@vipinsou3170 9 months ago

@meguellatiyounes8659 using the decoder 😮😮😊

@nikkilin4396 4 months ago

It's one of the best videos I have watched. The concepts are explained very well, especially with the code.

@marktahu2932 1 year ago

I have learnt so much between yourself, ChatGPT, and Alexander & Ava Amini at MIT 6.S191. Thank you all.

@tonywang7933 1 year ago

Thank you so much. I searched so many places, and this is the first place where a nice person was finally willing to spend the time to really dig in step by step. I'm going to value this channel as highly as Fireship now.

@CodeEmporium 1 year ago

Thanks for the compliments and glad you are sticking around!

@rainmaker5199 1 year ago

This is great! I've been trying to learn attention but it's hard to get past the abstraction in a lot of the papers that mention it, much clearer this way!

@jeffrey5602 1 year ago

What's important is that for every token generation step we always feed the whole sequence of previously generated tokens into the decoder, not just the last one. So you start with the start token and generate a new token, then feed the start token + the new token into the decoder, basically just appending each generated token to the sequence of decoder inputs. That might not have been clear in the video. Otherwise great work. Love your channel!
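The loop this comment describes can be sketched in a few lines of Python. This is only an illustration and not code from the video: `greedy_decode`, `toy_decoder`, and the `<start>` / `<end>` token strings are hypothetical names, and the toy decoder just replays a fixed sentence where a trained model would make a prediction.

```python
from typing import Callable, List

# Records how many tokens the (toy) decoder was shown at each step.
seen_prefix_lengths: List[int] = []


def toy_decoder(prefix: List[str]) -> str:
    """Hypothetical stand-in for a trained decoder: replays a fixed reply."""
    seen_prefix_lengths.append(len(prefix))
    reply = ["my", "name", "is", "ajay", "<end>"]
    return reply[len(prefix) - 1]


def greedy_decode(next_token: Callable[[List[str]], str],
                  start: str = "<start>",
                  end: str = "<end>",
                  max_len: int = 10) -> List[str]:
    tokens = [start]
    for _ in range(max_len):
        # The decoder is given the WHOLE sequence generated so far,
        # not only the most recent token.
        new_token = next_token(tokens)
        tokens.append(new_token)
        if new_token == end:
            break
    return tokens


decoded = greedy_decode(toy_decoder)
print(decoded)
print(seen_prefix_lengths)  # grows by one token per step
```

The growing prefix lengths are the point: each step sees one more token than the step before it.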

@simonebonato5881 10 months ago

One video to understand them all! Dude thanks I've tried to watch like 10 other videos on transformers and attention, yours was really super clear and much more intuitive!

@CodeEmporium 10 months ago

Thanks so much for this compliment! Means a lot :)

@user-gq5rl1kb7n 1 year ago

I usually don't write comments, but this channel really deserves one! Thank you so much for such a great tutorial. I watched your first video about Transformers and the Attention mechanism, which was really informative, but this one is even more detailed and useful.

@CodeEmporium 1 year ago

Thanks so much for the compliments! This is the first in a series of videos called “Transformers from scratch”. Hope you’ll check the rest of the playlist out.

@SOFTWAREMASTER 1 year ago

I was legit searching for self attention concept vids and thinking that it sucked that you didn't cover it yet. And voila here we are. Thank you so much for uploading!!

@CodeEmporium 1 year ago

Glad I could deliver. Will be uploading more such content shortly :)

@pocco8388 11 months ago

Best content I've ever seen. Thanks for this video.

@rajpulapakura001 7 months ago

This is exactly what I needed! Can't believe self-attention is that simple!

@kotcraftchannelukraine6118 7 months ago

I still do not understand how to perform a backward pass on self-attention.

@noahcasarotto-dinning1575 6 months ago

Best video explaining this that I've seen by far.

@dataflex4440 1 year ago

This has been the most wonderful series on this channel so far.

@CodeEmporium 1 year ago

Thanks a ton! Super glad you enjoyed the series :D

@srijeetful 4 months ago

Extremely well explained. Kudos !!!!

@ParthivShah 15 days ago

Really Appreciate Your Efforts. Love from Gujarat India.

@muskanmahajan04 1 year ago

The best explanation on the internet, thank you!

@CodeEmporium 1 year ago

Thanks so much for the comment. Glad you liked it :)

@ChrisCowherd 9 months ago

Fantastic explanation! Wow! You have a new subscriber. :) Keep up the great work

@prashantlawhatre7007 1 year ago

Waiting for your future videos. This was amazing, especially the masked attention part.

@CodeEmporium 1 year ago

Thanks so much! Will be making more over the coming weeks

@MahirDaiyan7 1 year ago

Great! This is exactly what I was looking for in all of the other videos of yours

@CodeEmporium 1 year ago

Thanks for the comment! There is more to come :)

@softwine91 1 year ago

What can I say, dude! God bless you. This is the only content on the whole of YouTube that really explains the self-attention mechanism in a brilliant way. Thank you very much. I'd like to know if the key, query, and value matrices are updated via backpropagation during the training phase.

@CodeEmporium 1 year ago

Thanks for the kind words. These matrices I mentioned in the code represent the actual data. So no. However, the 3 weight matrices that map a word vector to Q,K,V are indeed updated via backprop. Hope that lil nuance makes sense
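The distinction in this reply (Q, K, V are data, while the three weight matrices are the trainable part) can be sketched in NumPy. This is a rough illustration under stated assumptions, not the video's code: the names `W_q`, `W_k`, `W_v` and the embedding size `d_model = 16` are made up here, and random numbers stand in for both the word vectors and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

L, d_model = 4, 16   # 4 words, e.g. "My name is Ajay"; d_model is an assumption
d_k, d_v = 8, 8      # sizes of the query/key and value vectors

# Input data: one vector per word (embedding plus positional encoding).
x = rng.standard_normal((L, d_model))

# Trainable parameters: these are what backpropagation would update.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

# Q, K, V are not parameters; they are recomputed from x on every forward pass.
q = x @ W_q
k = x @ W_k
v = x @ W_v

print(q.shape, k.shape, v.shape)
```

A different sentence gives a different `x` and therefore different `q`, `k`, `v`, while the weight matrices stay the same until the next training update.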

@picassoofai4061 1 year ago

I definitely agree.

@lawrencemacquarienousagi789 1 year ago

Wonderful work you've done! I really love your video and have studied it twice. Thank you so much!

@CodeEmporium 1 year ago

Thanks so much for watching! More to come :)

@shailajashukla5841 4 months ago

Excellent, how well you explained it. No other video on YouTube explains it like this. Really good job.

@debjanidas5786 2 months ago

search CampusX

@chrisogonas 1 year ago

Awesome! Well illustrated. Thanks

@deepalisharma1327 10 months ago

Thank you for making this concept so easy to understand. Can’t thank you enough 😊

@CodeEmporium 10 months ago

My pleasure. Thank you for watching

@becayebalde3820 8 months ago

This is pure gold man! Transformers are complex but this video really gives me hope.

@pratyushrao7979 5 months ago

What are the prerequisites for this video? Do we need to know about encoder decoder architecture beforehand? The video feels like I jumped right into the middle of something without any context. I'm confused.

@VadimChes 3 months ago

@pratyushrao7979 There are playlists for different topics.

@PraveenHN-zj3ny 3 months ago

Very happy to see Kannada here. Great 😍 Love from Kannadigas

@ayoghes2277 1 year ago

Thanks a lot for making the video!! This deserves more views.

@CodeEmporium 1 year ago

Thanks for watching. Hope you enjoy the rest of the playlist as I code the entire transformer out !

@bradyshaffer3302 1 year ago

Thank you for this very clear and helpful demonstration!

@CodeEmporium 1 year ago

You are so welcome! And be on the lookout for more :)

@amiralioghli8622 9 months ago

Thank you so much for taking the time to code and explain the transformer model in such detail. I followed your series from zero to hero. You are amazing, and if possible, please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely necessary on YouTube for someone! Thanks in advance.

@shivamkaushik6637 1 year ago

With all my heart, you deserve a lot of respect. Thanks for the content. Damn, I missed my metro station because of you.

@CodeEmporium 1 year ago

Hahahaha your words are too kind! Please check the rest of the “Transformers from scratch” playlist for more (it’s fine to miss the metro for education lol)

@JBoy340a 1 year ago

Great walkthrough of the theory and then relating it to the code.

@CodeEmporium 1 year ago

Thanks so much! Will be making more of these over the coming weeks

@junior14536 1 year ago

My god, that was amazing, you have a gift my friend; Love from Brazil :D

@CodeEmporium 1 year ago

Thanks a ton :) Hope you enjoy the channel

@SIADSrikanthB 2 months ago

I really like how you use Kannada language examples in your explanations.

@chessfreak8813 7 months ago

Thanks! You are very deserving and an underdog!

@PaulKinlan 1 year ago

This is brilliant, I've been looking for a bit more hands on demonstration of how the process is structured.

@CodeEmporium 1 year ago

Thanks so much! Happy you liked it :)

@shaktisd 6 months ago

Excellent video. If you can, please make a hello world on self attention, like first showing a PCA representation before self attention and after self attention to show how context impacts the overall embedding.

@FelLoss0 11 months ago

Dear Ajay. Thank you so much for your videos! I have a quick question here. Why did you transpose the values in the softmax function? Also... why did you specify axis=-1? I'm a newbie at this and I'd like to have strong and clear foundations. Have a lovely weekend :D
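One way to see why a transpose and `axis=-1` can appear together in a row-wise softmax is the NumPy sketch below. `softmax_transpose` is a reconstruction of the pattern this question describes and may not match the video's code exactly; `softmax_keepdims` is an alternative added here for comparison.

```python
import numpy as np


def softmax_transpose(x: np.ndarray) -> np.ndarray:
    # np.sum(..., axis=-1) sums along the last axis, giving one total per row
    # with shape (rows,). Dividing a (rows, cols) array by a (rows,) array
    # would line the totals up with the COLUMNS, so the array is transposed
    # first, divided, and then transposed back.
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T


def softmax_keepdims(x: np.ndarray) -> np.ndarray:
    # keepdims=True keeps the totals as shape (rows, 1), which broadcasts
    # across each row directly, so no transpose is needed. Subtracting the
    # row maximum does not change the result and avoids overflow in exp.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])

a = softmax_transpose(scores)
b = softmax_keepdims(scores)
print(a.sum(axis=-1))  # each row of attention weights sums to 1
```

Both versions normalise each row separately, which is what attention needs: every word gets its own probability distribution over the words it attends to.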

@maximilianschlegel3216 1 year ago

This is an incredible video, thank you!

@CodeEmporium 1 year ago

Thanks so much for watching and commenting!

@pulkitmehta1795 1 year ago

Simply wow..

@sockmonkeyadam5414 1 year ago

u have saved me. thank u.

@Slayer-dan 1 year ago

Huge respect ❤️

@CodeEmporium 1 year ago

Thanks so much!

@nandiniloomba 1 year ago

Thank you for teaching this.❤

@CodeEmporium 1 year ago

My pleasure! Hope you enjoy the series

@mamo987 1 year ago

Amazing work! Very glad I subscribed

@CodeEmporium 1 year ago

Thanks so much for commenting!

@AI-xe4fg 1 year ago

Good video, bro. I was studying Transformers this week but was still a little confused before I found your video. Thanks.

@CodeEmporium 1 year ago

Thanks for the kind words. I really appreciate it :)

@jamesjang8389 7 months ago

Amazing video! Thank you😊😊

@CodeEmporium 7 months ago

You are very welcome

@faiazahsan6774 1 year ago

Thank you for explaining in such an easy way. It would be great if you could upload some code on the GCN algorithm.

@CodeEmporium 1 year ago

I shall explore that possibility!

@rajv4509 1 year ago

Absolutely brilliant! Thumba chennagidhay :)

@CodeEmporium 1 year ago

Thanks a ton! Super glad you like this. I hope you like the rest of this series :)

@pranayrungta 1 year ago

Your videos are way better than Stanford lecture cs224n

@CodeEmporium 1 year ago

Words I am not worthy of. Thank you :)

@jazonsamillano 1 year ago

Great video. Thank you very much.

@CodeEmporium 1 year ago

Thanks so much!

@virtualphilosophyjourney8897 6 months ago

In which phase does the model use the pretrained info to decide the output?

@bhavyageethika4560 8 months ago

Why is it d_k for both Q and K in np.random.randn?
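A short shape check suggests one answer to this question: the attention scores are dot products between query and key vectors, and a dot product only exists when both vectors have the same length. The sketch below assumes the video's setup of 4 words with `d_k = 8`; `d_v = 5` is picked here only to show that the value size is free to differ.

```python
import numpy as np

rng = np.random.default_rng(0)

L, d_k, d_v = 4, 8, 5   # d_v = 5 is an illustrative choice, not from the video

q = rng.standard_normal((L, d_k))   # one query vector per word
k = rng.standard_normal((L, d_k))   # one key vector per word, same length as q
v = rng.standard_normal((L, d_v))   # value vectors may use a different size

# (L, d_k) @ (d_k, L) -> (L, L): entry [i, j] is the dot product q[i] . k[j].
scores = q @ k.T

# (L, L) @ (L, d_v) -> (L, d_v): the output size depends only on d_v.
# (Softmax and scaling are left out; they do not change any shape.)
out = scores @ v

print(scores.shape, out.shape)
```

If `q` and `k` had different last dimensions, `q @ k.T` would raise a shape error, so the shared `d_k` is a requirement rather than a coincidence.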

@yonahcitron226 11 months ago

this is amazing!

@CodeEmporium 11 months ago

Thanks a lot!

@paull923 1 year ago

Thx for your efforts!

@CodeEmporium 1 year ago

Super welcome :)

@imagiro1 10 months ago

Got it, thank you very much, but one question: What I still don't understand: We are talking about neural networks, and they are trained. So all the math you show here, how do we (know|make sure) that it actually happens inside the network? You don't train specific regions of the NN to specific tasks (like calculating a dot product), right?

@sriramayeshwanth9789 9 months ago

you made me cry brother

@varungowtham3002 1 year ago

Hello Ajay, I was very happy to learn that you are a Kannadiga! Your videos are turning out very well.

@CodeEmporium 1 year ago

Glad you liked this and thanks for watching! :)

@govindkatyura7485 1 year ago

I have a few doubts. 1. Do we use multiple FFNNs after the attention layer? Suppose we have 100 input words for the encoder; will 100 FFNNs get trained, one for each word? I checked the source code but they were using only one, so I'm confused about how one FFNN can handle multiple embeddings, especially with a batch size. 2. In the decoder, do we also pass multiple inputs, just like in the encoder layer, especially during training?
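On the first doubt, a small NumPy sketch may help. In the original Transformer the feed-forward network is position-wise: one set of weights is applied to every word vector independently, so 100 words do not need 100 networks. The sizes and the names `W1`, `b1`, `W2`, `b2`, `ffn` below are illustrative assumptions, not taken from the video's repository.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, L, d_model, d_ff = 2, 100, 8, 32   # illustrative sizes

# ONE set of feed-forward weights, shared by every position and every sentence.
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)


def ffn(x: np.ndarray) -> np.ndarray:
    """Linear -> ReLU -> Linear, acting on the last axis only."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


x = rng.standard_normal((batch, L, d_model))

# Whole batch in one call: matmul broadcasts over the leading (batch, L) axes.
all_at_once = ffn(x)

# The same network applied to each word vector separately.
one_by_one = np.array([[ffn(x[b, t]) for t in range(L)] for b in range(batch)])

print(all_at_once.shape, np.allclose(all_at_once, one_by_one))
```

Because the weights only touch the last axis, extra words or extra sentences in the batch simply add rows to the same matrix multiplication.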

@dickewurstfinger9093 5 months ago

Really great video, but why do the Q, K, V vectors have dim 8? I know it's random in this video, but what do the values in the vectors say about the word? Or is it just to "identify" a word in a certain space, like in word embeddings, and give it a certain "id"?

@CodeEmporium 5 months ago

The choice of 8 heads in multi head attention is simply the choice of a hyper parameter in the main paper. This might be the number they experimented with that got reasonable results. That said, I am confident you shouldn’t see drastic differences with small fluctuations of this number. Further, I feel like powers of 2 (such as 1,2,4,8,16,32) are usually tried out as these hyper parameters. But as mentioned before, numbers in between may work just as well. I think it’s about having enough heads to capture complexity but not too many for slow processing

@picassoofai4061 1 year ago

Mashallah, man you are a rocket.

@CodeEmporium 1 year ago

Thanks for the kind words :)

@yijingcui7736 6 months ago

this is very helpful

@CodeEmporium 6 months ago

Glad! And thank you!

@creativeuser9086 1 year ago

how do we actually choose the dimensions of Q, K and V? Also, are they parameters that are fixed for each word in the English language, and do we get them from training the model? That part is a little confusing since you just mentioned that Q, V and K are initialized at random, so I assume they have to change in the training of the model.

@kotcraftchannelukraine6118 7 months ago

Q - query, V - value and K - key

@7_bairapraveen928 1 year ago

Why do we need to stabilise the variance of the attention vector with the query and key vectors?
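A quick numerical experiment hints at an answer. When the entries of q and k are roughly unit-variance, each dot product q·k sums `d_k` products, so its variance grows to about `d_k`; dividing by sqrt(`d_k`) brings it back to about 1. Large scores push softmax towards a near one-hot output with very small gradients, which is the reason the original Transformer paper gives for the scaling. The sizes below are arbitrary choices for the demonstration, not values from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_k = 1000, 64   # many vectors so the variance estimate is stable

q = rng.standard_normal((n, d_k))   # unit-variance queries
k = rng.standard_normal((n, d_k))   # unit-variance keys

raw = q @ k.T                # unscaled dot products
scaled = raw / np.sqrt(d_k)  # scaled dot products, as used in attention

# Expect roughly d_k for the raw scores and roughly 1 after scaling.
print(q.var(), k.var(), raw.var(), scaled.var())
```

With raw scores spread this widely, one entry per row tends to dominate the softmax, so the scaled version keeps the attention weights smoother at the start of training.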

@dataflex4440 1 year ago

Brilliant Mate

@CodeEmporium 1 year ago

Thanks a ton! :)

@klam77 1 year ago

"query" , "key" , and "value" terms come from the world of databases! So how do individual words in "My name is Ajay" each map to their own query and key and value semantically? that remains a bit foggy. i know you've shown random numbers in the example, but is there any semantic meaning to it? is this the "embeddings" of the LLM?

@arunganesan1559 1 year ago

Thanks!

@CodeEmporium 1 year ago

Thanks for the donation! And you are very welcome!

@rujutaawate5412 11 months ago

Thanks, @CodeEmporium / Ajay for the great explanation! One quick question- can you please explain how the true values of Q, K, and V are actually computed? I understand that we start with random initialization but do these get updated through something like backpropagation? If you already have a video of this then would be great if you can state the name/redirect! Thanks once again for helping me speed up my AI journey! :)

@CodeEmporium 11 months ago

That's correct, backprop will update these weights. For exact details, you can continue watching this playlist "Transformers From Scratch" where we will build a working transformer. This video was the first in that series. Hope you enjoy it :)

@gabrielnilo6101 1 year ago

I stop the video sometimes and roll it back some seconds to hear you explaining something again and I am like: "No way that this works, this is insane", some explanations on AI techniques are not enough and yours are truly simple and easy to understand, thank you. Do you collab with anyone when making these videos, or is it done all by yourself?

@CodeEmporium 1 year ago

Haha yea. Things aren’t actually super complicated. :) I make these videos on my own. Scripting, coding, research, editing. Fun stuff

@naziadana7885 1 year ago

Thank you very much for this great video! Can you please upload a video on Self Attention code using Graph Convolutional Network (GCN)?!

@CodeEmporium 1 year ago

I’ll look into this at some point. Thanks for the tips.

@li-pingho1441 1 year ago

you save my life!!!!!

@CodeEmporium 1 year ago

It’s what I do best :)

@McMurchie 1 year ago

Hi I noticed this has been added to the transformer playlist, but there are 2 unavailable tracks - do i need them in order to get the full end to end grasp?

@CodeEmporium 1 year ago

You can follow the order of “transformers from scratch” playlist. This should be the first video in the series. Hope this helps and thanks for watching ! (It’s still being created so you can follow along :) )

@ritviktyagi9221 1 year ago

How did we get the values of the q, k and v vectors after initializing them at random? Great video btw. Waiting for more such videos.

@CodeEmporium 1 year ago

The weight matrices that map the original word vectors to these 3 vectors are trainable parameters. So they would be updated by back propagation during training

@ritviktyagi9221 1 year ago

@CodeEmporium Thanks for the clarification

@wishIKnewHowToLove 1 year ago

thx

@CodeEmporium 1 year ago

My pleasure :)

@Slayer-dan 1 year ago

Ustad 🙏

@CodeEmporium 1 year ago

too kind :)

@ayush_stha 1 year ago

In the demonstration, you generated the q, k & v vectors randomly, but in reality, what will the actual source of those values be?

@CodeEmporium 1 year ago

Each of the q,k,v vectors will be a function of each word (or byte pair encoding) in the sentences. I say a “function” of the sentences since to the word vectors, we add position encoding and then convert into q,k,v vectors via feed forward layers. Some of the later videos in this “Transformers from scratch” playlist show some code on exactly how it’s created. So you can check those out for more intel :)

@philhamilton3946 1 year ago

What is the name of the text book you are using?

@klam77 1 year ago

if u watch the vid carefully, the url shows the books are "online" free access bibles of the field.

@ajaytaneja111 1 year ago

Ajay, I don't think the point of capturing the context in terms of words 'after' has significance in language modelling. In language modelling you are predicting only the next word. For a task like machine translation, yes. Thus I don't think bi-directional RNNs have anything better to offer for language modelling than the regular (one-way) RNNs. Let me know what you think.

@anwarulislam6823 1 year ago

How could someone hack my brain wave and convoluted this by evaluate inner voice? May I know this procedure? #Thanks

@SOFTWAREMASTER 1 year ago

Haha ikr. I felt the same. Was looking for a good Self attention video.

@josephpark2093 11 months ago

I watched the video around 3 times but I still don't understand. Why are these awesome videos so unknown?

@jonfe 1 year ago

I still don't understand the difference between Q, K and V. Can someone explain?

@sometimesdchordstrikes...7876 3 months ago

@1:41 here you said that you want the context of the words that will be coming in the future, but in the masking part of the video you said that it would be cheating to know the context of the words that will be coming in the future.
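Both statements can hold at once because they apply to different parts of the model: in the usual encoder-decoder setup, the encoder reads the full input sentence and may attend to every word, while the decoder generates the output one token at a time and is masked so a position cannot see later positions. The NumPy sketch below illustrates the difference with random scores; the variable names and the `np.where` construction of the mask are choices made for this example.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
L = 4                                  # 4 words, e.g. "My name is Ajay"
scores = rng.standard_normal((L, L))   # stand-in for q @ k.T / sqrt(d_k)

# Causal mask: 0 on and below the diagonal, -inf above it (the "future").
mask = np.where(np.tril(np.ones((L, L))) == 1, 0.0, -np.inf)

encoder_attention = softmax(scores)          # every word sees every word
decoder_attention = softmax(scores + mask)   # word i sees only words 0..i

print(np.round(decoder_attention, 2))  # entries above the diagonal are 0
```

Adding -inf before the softmax turns those weights into exact zeros, so the masked rows still sum to 1 over the words that remain visible.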

@NK-ju6ns 1 year ago

I felt the q, k, v parameters were not explained very well. A search analogy would be better for getting an intuition of these parameters than explaining them as what I can offer and what I actually offer.

@SnehaSharma-nl9do 4 months ago

Kannada Represent!! 🖐

@CodeEmporium 4 months ago

Haha! Yes 🙌

@Tomcat342 19 days ago

Ayo yellaru hegiddra?

@ChethanaSomeone 1 year ago

Seriously, are u from karnataka ? your accent is so different dude.

@bkuls 1 year ago

Guru aarama? Nanu kooda Kannada ne!

@CodeEmporium 1 year ago

Doin super well ma guy. Thanks for watching and commenting! :)

@kotcraftchannelukraine6118 7 months ago

You forgot to show the most important thing, how to train self-attention with backpropagation? You forgot about backward pass

@CodeEmporium 7 months ago

This is the first video in a series of videos called “Transformers from scratch”. Later videos show how the entire architecture is trained. Hope you enjoy the videos

@kotcraftchannelukraine6118 7 months ago

@CodeEmporium Thank you, I subscribed