
Attention for RNN Seq2Seq Models (1.25x speed recommended) 

Shusen Wang
10K subscribers
30K views

Next Video: • Self-Attention for RNN ...
Attention was originally proposed by Bahdanau et al. in 2015. Later on, attention found much broader applications in NLP and computer vision. This lecture introduces only attention for RNN sequence-to-sequence models. Viewers are assumed to be familiar with RNN sequence-to-sequence models before watching this video.
Slides: github.com/wan...
Reference:
Bahdanau, Cho, & Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
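For readers who want to map the lecture onto code, below is a minimal NumPy sketch of the attention step for a single decoder state: an additive (Bahdanau-style) alignment score, a softmax over the encoder positions, and a weighted sum of encoder states. The function names, tensor shapes, and parameter names are illustrative assumptions, not code from the lecture or the slides.

    import numpy as np

    def align(H, s, W, v):
        # H: (m, d_h) encoder hidden states h_1..h_m
        # s: (d_s,)   one decoder hidden state
        # W: (d_a, d_h + d_s) and v: (d_a,) are learnable parameters
        m = H.shape[0]
        scores = np.empty(m)
        for i in range(m):
            z = np.concatenate([H[i], s])      # concatenation [h_i; s]
            scores[i] = v @ np.tanh(W @ z)     # additive score v^T tanh(W [h_i; s])
        e = np.exp(scores - scores.max())
        return e / e.sum()                     # softmax: attention weights summing to 1

    def context(H, alpha):
        # context vector c = sum_i alpha_i * h_i, shape (d_h,)
        return alpha @ H

At each decoder step, the context vector computed this way is fed into the decoder cell together with the current decoder state and the next input word, which is what lets the decoder look back at all encoder states instead of relying only on the final one.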

Published: 28 Sep 2024

Comments: 41
@tahamagdy4932 · 2 years ago
Shusen Wang, this was extremely beneficial, absolute masterpiece.
@naraendrareddy273 · 2 years ago
Thanks man, I've done my last minute prep for the exam through this video.
@yugoi6944 · 2 years ago
Thank you for the fruitful lecture! Instead of α_i, using α_{i,j}=align(h_i, s_j) makes the equation easier to see for me. But it's super helpful for beginners like me, thanks again!
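For anyone following along, the double-index notation suggested in this comment can be written out as below, using the lecture's h_i for encoder states, s_j for decoder states, and m encoder steps (purely a restatement of the comment, not taken from the slides):

    \alpha_{i,j} = \mathrm{align}(h_i, s_j), \qquad \sum_{i=1}^{m} \alpha_{i,j} = 1, \qquad c_j = \sum_{i=1}^{m} \alpha_{i,j}\, h_i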
@yugoi6944 · 2 years ago
The same notation is already used in the lecture after next. Sorry for the redundant comment.
@programmer49 · 1 year ago
The best on YouTube, thank you very much
@sinamansourdehghan1195 · 1 year ago
Your explanation was very clear and useful. I strongly recommend this video if you want to understand the concept of the Attention mechanism in RNNs.
@vent_srikar7360 · 1 year ago
Very beautifully and simply explained, GGs
@kotanvich · 2 years ago
Best explanation I've ever seen
@madhu1987ful · 1 year ago
Amazing explanation 👏 Just one question: what are A and A' in this video? The h's correspond to the hidden states of the encoder at different time steps.
@archibaldchain1204 · 2 years ago
I have a question: what is the output of attention and how do you measure the loss?
@joshithmurthy6209 · 10 months ago
Very good explanations, thank you very much
@HashanDananjaya · 1 year ago
Explained nicely. Thank you.
@lancelotdsouza4705 · 2 years ago
Beautifully explained
@abhishekswain2502 · 2 years ago
This is really good ! Thanks !
@RamazanErdemUysal · 1 year ago
In the decoder part of the Seq2Seq with attention model, the decoder uses three inputs. At first it uses c0, s0, and x'1 to predict s'1; here s0 is the latent representation of the encoder and x'1 is the start sign, so s0 and x'1 are different. In the next step it uses c1, s1, and x'2 to predict s'2. Aren't s1 and x'2 the same here? Because s1 is the previous hidden state and x'2 is the predicted word, which is like a result of a probability distribution based on s1. If I am not wrong, it is supposed to use only one of them, or always use s0. Can someone clarify this?
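For readers following this thread, here is a rough sketch of the decoder update being discussed, assuming a plain tanh RNN cell whose new state depends on the current input embedding, the previous state, and the previous context vector; the parameter names A_prime and b are illustrative, not taken from the slides.

    import numpy as np

    def decoder_step(x_t, s_prev, c_prev, A_prime, b):
        # x_t:    embedding of the current decoder input word x'_t
        # s_prev: previous decoder state s_{t-1}
        # c_prev: context vector c_{t-1} computed from s_{t-1} and all encoder states
        z = np.concatenate([x_t, s_prev, c_prev])
        return np.tanh(A_prime @ z + b)        # new decoder state s_t

On the question itself: in the standard formulation both inputs are kept, because x'_t is the embedding of a single discrete word (the previous prediction, or the ground-truth word under teacher forcing), while s_{t-1} is a dense summary of everything decoded so far, so the two are not interchangeable.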
@josephwashington8939 · 3 years ago
You explained it very clearly! Thank you!
@hoang_minh_thanh · 1 year ago
Hi @ShusenWangEng, which template did you use to create these slides? I searched but could not find any slides like this on Overleaf. Thanks
@pawelsubko7277 · 3 years ago
What is x' ? And where do you get x'1 from?
@maxwelikow9119 · 3 years ago
x'1 is the start sign (like an empty space), x'2 is the first word of the decoder, x'3 the second, and so on.
@Toluclassics · 3 years ago
Best attention video!
@iblard · 1 year ago
You mention at 11:19 that x'1 is the start sign; later (15:22) you mention x'2 as obtained in the previous step, but how? You show clearly how to obtain s1 and c1, but not x'2.
@RamazanErdemUysal · 1 year ago
I am also confused about that. Based on my intuition, using s0, c0, and x'1, the generated hidden state s1 at the decoder is used to produce a probability distribution over the vocabulary of possible output tokens, and the likely outcome is x'2. Again, x'2 is used together with s1 and c1 to generate s2. My confusion is that x'2 and s1 carry the same information, since x'2 is generated from s1. Therefore I don't see any reason to use both of them.
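To make the step described in this reply concrete, below is a rough sketch of where x'_2 comes from: a readout layer turns the decoder state into a distribution over the vocabulary, one word is picked from it, and that word's embedding becomes the next decoder input. The readout parameters (W_out, b_out) and the greedy argmax choice are assumptions for illustration; the lecture may leave this step implicit.

    import numpy as np

    def next_input(s_t, W_out, b_out, embedding):
        # s_t: current decoder state; embedding: (vocab_size, d_x) word-embedding matrix
        logits = W_out @ s_t + b_out           # (vocab_size,) scores over the vocabulary
        p = np.exp(logits - logits.max())
        p /= p.sum()                           # softmax probabilities
        idx = int(np.argmax(p))                # greedy pick (beam search is also common)
        return embedding[idx]                  # x'_{t+1}: embedding of the chosen word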
@RyanMcCoppin · 2 years ago
Very clear lecture. Thank you!
@t3dx · 2 years ago
It is not clear to me what the vector v, used for the inner product with the tanh of W applied to [h_i; s_0], corresponds to.
@longdang7791 · 2 years ago
You can review his previous slides about the basics of RNNs. I guess it is the learnable parameter matrix connecting the inputs to the hidden states.
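For this thread, the score in question can be spelled out as follows; this is the additive alignment function as I read the lecture, so treat it as a sketch rather than the exact slide:

    \mathrm{align}(h_i, s_0) = v^{\top} \tanh\!\big(W\,[h_i;\, s_0]\big)

Here W projects the concatenated pair of vectors into an intermediate space and v maps that projection down to a single scalar score; both are learned jointly with the rest of the network, and the m scores are then normalized by a softmax to give the attention weights.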
@avojtech · 1 year ago
How come at about 7:27 s0 is suddenly a vector? In the previous slide you state that s0 = h_m. Useless video...
@anhtotuyet9652 · 2 years ago
The video image quality is too poor; you need to fix it.
@iotsharingdotcom22 · 2 years ago
You need to check your internet quality.
@teetanrobotics5363 · 3 years ago
Amazing content bro, lots of hard work, thank you so much. Please make more AI playlists like NLP, RL, Deep RL, and Meta Learning with these amazing animations.
@alex-m4x4h · 8 months ago
At 19:26, shouldn't the number of weights be m*t + 1, or am I getting it wrong? Because we have c0 as well.
@likeapple1929 · 3 years ago
It would be better if you could relate Q, K, and V to your notation. I'm new to the attention mechanism and I'm getting confused by some of your notation. But the explanation itself is very clear.
@longdang7791 · 2 years ago
Which slide or timestamp are you referring to?
@longdang7791 · 2 years ago
So excited. Great supporting material for the Goodfellow textbook. I am building up my knowledge for the Vision Transformer model.
@FelLoss0 · 1 year ago
You're my hero. Marry me! Hahaha, this is just a comment to let you know that your explanation can easily be the clearest one on YouTube for understanding attention. Keep up the good work! Thanks a mil!
@RAHUDAS · 2 years ago
I was looking for a way to implement the encoder-decoder with attention model without using a for loop at the decoder stage. Is it possible?
@sachavanweeren9578 · 2 years ago
very well explained ... thank you very much
@thanser67 · 2 years ago
Astonishing pedagogic effort, Shusen! That's a lot of work involved in sharing knowledge. Kudos!
@modai7452 · 2 years ago
Excellent video
@Obbe79 · 3 years ago
So good! Thanks
@srinathkumar1452 · 3 years ago
Very well explained!
@-long- · 2 years ago
7:25 I think the concatenation before the linear layer is from the paper by Luong et al. (2015). In Bahdanau et al., the authors applied the linear layers first (to both h and s) and then did the concatenation.
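For reference, the two scoring forms being compared are roughly the following, written with t indexing decoder steps and i indexing encoder steps:

    \text{Bahdanau et al. (2015):}\quad e_{t,i} = v_a^{\top}\tanh\big(W_a\, s_{t-1} + U_a\, h_i\big)

    \text{Luong et al. (2015), concat variant:}\quad e_{t,i} = v_a^{\top}\tanh\big(W_a\,[s_t;\, h_i]\big)

Up to how the parameters are arranged, the two forms are equivalent: applying one matrix to the concatenation [s; h] is the same as applying separate matrices to s and h and adding the results, since W_a [s; h] = W_a^{(s)} s + W_a^{(h)} h when W_a is split into column blocks. A separate difference is that Bahdanau scores against the previous decoder state s_{t-1}, while Luong uses the current one.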