
Transformers explained | The architecture behind LLMs 

AI Coffee Break with Letitia
Subscribers: 45K
Views: 18K

All you need to know about the transformer architecture: How to structure the inputs, attention (Queries, Keys, Values), positional embeddings, residual connections. Bonus: an overview of the difference between Recurrent Neural Networks (RNNs) and transformers.
9:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector). Otherwise we do not get the 1x3 dimensionality at the end. Sorry for messing up the animation!
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Outline:
00:00 Transformers explained
00:47 Text inputs
02:29 Image inputs
03:57 Next word prediction / Classification
06:08 The transformer layer: 1. MLP sublayer
06:47 2. Attention explained
07:57 Attention vs. self-attention
08:35 Queries, Keys, Values
09:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector).
11:26 Multi-head attention
13:04 Attention scales quadratically
13:53 Positional embeddings
15:11 Residual connections and Normalization Layers
17:09 Masked Language Modelling
17:59 Difference to RNNs
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information , Kshitij
Our old Transformer explained 📺 video: • The Transformer neural...
📺 Tokenization explained: • What is tokenization a...
📺 Word embeddings: • How modern search engi...
📽️ Replacing Self-Attention: • Replacing Self-attention
📽️ Position embeddings: • Positional encodings i...
@SerranoAcademy Transformer series: • The Attention Mechanis...
📄 Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
YouTube: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : Sunset n Beachz - Ofshane
Video editing: Nils Trost

Science

Published: 9 Jun 2024

Comments: 95
@YuraCCC 4 months ago
Thanks for the explanation. At 9:19: Shouldn't the order of multiplication be the opposite here? E.g. x1(vector) * Wq(matrix) = q1(vector). Otherwise I don't understand how we get the 1x3 dimensionality at the end.
@AICoffeeBreak 4 months ago
Oh, shoot, messed up the order in the animations there. You are right. Sorry, pinning your comment.
@YuraCCC 4 months ago
No problem, thanks for clarifying that, and thanks again for the great video, @AICoffeeBreak
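The correction in this thread can be checked with a few lines of plain Python. This is only a sketch: the 4-dimensional input and the numbers in `x1` and `Wq` are made-up illustration values, and `vec_mat` is a hypothetical helper, not something from the video.

```python
def vec_mat(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    assert len(x) == len(W), "vector length must equal the number of matrix rows"
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 4-dimensional token embedding x1 (a 1x4 row vector).
x1 = [1.0, 0.0, 2.0, -1.0]

# Hypothetical 4x3 query projection matrix Wq (illustration values only).
Wq = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# (1x4) * (4x3) -> 1x3, which is the order given in the pinned correction.
q1 = vec_mat(x1, Wq)
print(q1)  # [0.0, -1.0, 1.0]
```

With the operands the other way round (a 4x3 matrix times a 1x4 row vector), the inner dimensions do not match, which is why the order shown in the animation cannot yield a 1x3 result.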
@DerPylz 4 months ago
Wow, you've come a long way since your first transformer explained video!
@xyphos915 4 months ago
Wow, this explanation on the difference between RNNs and Transformers at the end is what I was missing! I've always heard that Transformers are great because of parallelization but never really saw why until today, thank you! Great video!
@AICoffeeBreak 4 months ago
Oh, this makes me happy!
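The parallelization point can be illustrated with a toy comparison. This is a hedged sketch with scalar "embeddings": the functions `rnn_states` and `attention_output` and the weights `w_x`, `w_h` are hypothetical and not taken from the video. The RNN state at each step needs the previous state, while each attention output depends only on the inputs, so the positions can be computed in any order (and therefore in parallel).

```python
import math

def rnn_states(xs, w_x=0.5, w_h=0.9):
    """Toy scalar RNN: each state needs the previous one, so the loop is sequential."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def attention_output(i, xs):
    """Toy scalar self-attention for position i: uses only the inputs xs."""
    scores = [xs[i] * x for x in xs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))

xs = [0.5, -1.0, 2.0, 0.1]

# Attention: evaluation order does not matter, no position waits for another.
in_order = [attention_output(i, xs) for i in range(len(xs))]
any_order = {i: attention_output(i, xs) for i in reversed(range(len(xs)))}

# RNN: the states can only be produced one after the other.
states = rnn_states(xs)
```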
@420_gunna 4 months ago
Awesome video, thank you! I love the idea of you revisiting older topics -- either as a 201 or as a re-introduction. "Attention combines the representation of input vector's value vectors, weighted by the importance score (computed by the query and key vectors)."
@AICoffeeBreak 4 months ago
Thanks for your appreciation!
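The quoted sentence maps almost directly onto code. Below is a minimal sketch of scaled dot-product attention for a single query, assuming row vectors of length 3; the helper names `softmax` and `attend` and all numbers are illustrative assumptions, not values from the video.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Importance scores from query and keys, then a weighted sum of the values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

# Hypothetical keys and values for three tokens.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# A query that matches the first key most strongly.
q = [10.0, 0.0, 0.0]
weights, out = attend(q, keys, values)
```

Because the query aligns with the first key, almost all of the weight goes to the first value vector, and `out` ends up close to `values[0]`.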
@DaveJ6515 4 months ago
You know how to explain things. This one is not easy: I can see the amount of work that went into this video, and it was a lot. I hope that your career takes you where you deserve.
@AICoffeeBreak 4 months ago
Thanks for watching and thanks for the kind words. All the best to you as well!
@abhishek-tandon 4 months ago
One of the best videos on transformers that I have ever watched. Views 📈
@AICoffeeBreak 4 months ago
Do you have examples of others you liked?
@MachineLearningStreetTalk 4 months ago
Epic as always 🤌
@AICoffeeBreak 4 months ago
Thanks, Tim!
@mumcarpet109 4 months ago
Your videos have helped visual learners like me so much, thank you.
@AICoffeeBreak 4 months ago
Happy to hear that!
@16876 4 months ago
What a thorough and much-anticipated overview, laid out so coherently. Thank you!
@AICoffeeBreak 4 months ago
Our pleasure! We should have done this video much earlier, considering that our old Transformer Explained is our most watched video to date. 😅
@jonas4223 4 months ago
Today, I had the problem that I needed to understand how Transformers work. I searched on YouTube and found your video 20 minutes after release. What perfect timing!
@AICoffeeBreak 4 months ago
What timing!
@user-th2ec8ms3m 4 months ago
Really well done and easy to follow, thank you.
@AICoffeeBreak 4 months ago
Glad you enjoy it!
@cosmic_reef_17 4 months ago
Thank you very much for the very clear explanations and detailed analysis of the transformer architecture. You're truly the 3blue1brown of machine learning!
@AICoffeeBreak 4 months ago
@l.suurmeijer1382 4 months ago
Absolute banger of a video. Wish I had seen this when I was learning about transformers in uni last year :-)
@AICoffeeBreak 4 months ago
Haha, glad I could help. Even if a bit late.
@connorshorten6311 4 months ago
Awesome! Epic Visuals!
@AICoffeeBreak 4 months ago
Thanks, Connor!
@mccartym86 3 months ago
I think I had at least 10 aha moments watching this, and I've watched many videos on these topics. Incredible job, thank you!
@AICoffeeBreak 3 months ago
Wow, thank you for this wonderful comment!
@volpir4672 4 months ago
That's great. I'm a little stuck on the special mask token, but I'll keep digging. Good info: the video is a good explanation, and it allows for more experimentation instead of relying on open-source models whose components can look like a black box to noobs like me :)
@Thomas-gk42 4 months ago
Understood about 10%, but I like these videos and intuitively feel their usefulness.
@AICoffeeBreak 4 months ago
@muhammedaneesk.a4848 4 months ago
Thanks for the explanation 😊
@AICoffeeBreak 4 months ago
Thanks for watching!
@DatNgo-uk4ft 4 months ago
Great video!! Nice improvement over the original.
@AICoffeeBreak 4 months ago
Glad you think so!
@darylallen2485 2 months ago
Letitia, you're awesome and I look forward to learning more from you.
@jcneto25 4 months ago
Best didactic explanation of Transformers so far. Thank you for sharing it.
@AICoffeeBreak 3 months ago
Wow, thanks! Glad it's helpful.
@SamehSyedAjmal 4 months ago
Thank you for the video! Maybe an explanation on the Mamba Architecture next?
@AICoffeeBreak 3 months ago
The Mamba and SSM beans are roasting as we speak.
@manuelafernandesblancorodr6366 4 months ago
What a wonderful video! Thank you so much for sharing it!
@AICoffeeBreak 4 months ago
Thank you too for this wonderful comment!
@xxlvulkann6743 2 months ago
This is a very well-made explanation. I hadn't known that the feedforward layers only received one token at a time. Thanks for clearing that up for me! 😁
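The "one token at a time" property of the feedforward (MLP) sublayer can be made concrete with a small sketch. Everything here is an assumption for illustration: the 2-dimensional tokens, the weights `W1` and `W2`, the ReLU activation, and the omission of bias terms. The point is only that the same MLP is applied to each token independently, so changing one token cannot change another token's MLP output.

```python
def vec_mat(x, W):
    """Multiply a row vector x by a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp(token, W1, W2):
    """Two-layer feedforward network applied to a single token (biases omitted)."""
    return vec_mat(relu(vec_mat(token, W1)), W2)

# Hypothetical weights: expand 2 -> 4 dimensions, then project back 4 -> 2.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]

sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
outputs = [mlp(tok, W1, W2) for tok in sequence]

# Replace only the middle token and run the same MLP again.
changed = [[1.0, 2.0], [9.0, 9.0], [3.0, 0.5]]
outputs_changed = [mlp(tok, W1, W2) for tok in changed]
```

Only the middle output differs between `outputs` and `outputs_changed`; mixing information across tokens is left to the attention sublayer.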
@phiphi3025 4 months ago
Thanks, you helped so much in explaining Transformers to my PhD advisors.
@AICoffeeBreak 4 months ago
This is really funny. In what field are you doing your PhD? 😅
@paprikar 4 months ago
Here we go! TY for the content.
@bartlomiejkubica1781 4 months ago
Thank you! Finally, I start to get it...
@HarishAkula-df8gs 2 months ago
Amazing explanation, thank you! Just discovered your channel and I really like how the difficult topics are demystified.
@AICoffeeBreak 2 months ago
Thanks a lot!
@Clammer999 29 days ago
Thanks so much for this video. I’ve gone through a number of videos on transformers and this is much easier to grasp and understand for a non-data scientist like myself.
@AICoffeeBreak 29 days ago
You're very welcome!
@meguellatiyounes8659 4 months ago
Well explained, as you promised.
@AICoffeeBreak 4 months ago
@rahulrajpvr7d 4 months ago
Tomorrow I have my thesis evaluation and I was thinking about watching that video again, but the YouTube algorithm suggested it to me without my searching for anything. Thank you, YouTube algo 😅❤🔥
@AICoffeeBreak 4 months ago
It read your mind.
@ArthasDKR 4 months ago
Excellent explanation. Thank you!
@AICoffeeBreak 4 months ago
@dannown 4 months ago
Really appreciate this video.
@AICoffeeBreak 4 months ago
So glad!
@ai-interview-questions 4 months ago
Thank you, Letitia!
@AICoffeeBreak 3 months ago
Our pleasure!
@Ben_D. 2 months ago
...ok. After binging some of your vids, I now need to go make coffee. 😆
@AICoffeeBreak 2 months ago
Please do!
@zahrashah6567 1 month ago
What a wonderful explanation 😍 Just discovered your channel and absolutely loving the explanations as well as the visuals 😘
@AICoffeeBreak 1 month ago
Thank you! Welcome!
@pfever 3 months ago
Just discovered your channel and this is great! Thank you! :D
@AICoffeeBreak 3 months ago
Thank you! Hope to see you again soon in the comments.
@MuruganR-tg9yt 3 months ago
Thank you. Nice explanation 😊
@AICoffeeBreak 3 months ago
Thank you for your visit!
@zbynekba 4 months ago
❤ Letitia, thank you for the great visualization and intuition. For inspiration: In the original paper, the decoder utilizes the output of the encoder by running a cross-attention process. Why does GPT not use an encoder? As you've mentioned, the encoder is typically used for classification, while the decoder is for text generation. They are never used in combination. Why is this the case? Missing intuition: Why does the cross-attention layer inside the decoder take the values from the ENCODER’s output to create the enhanced embeddings (as a weighted mix)? Intuitively, I would use the values from the DECODER.
@AICoffeeBreak 4 months ago
Thanks for your thoughts! Encoders are sometimes used in combination with decoders, right? The most famous example is the T5 architecture.
@zbynekba 4 months ago
Thanks for your prompt reply. Hence, understanding the concept and intuition behind feeding the encoder output into the decoder is essential. I found only this one video on encoder-decoder cross-attention: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Dqjq4Gxdhng.htmlsi=gtLzNxAU0pUGyLvk In it, Lennart emphasizes the observation that, based on the original equations, the enhanced embeddings are calculated as a weighted sum of ENCODER values. Inside a DECODER, I would rather expect the DECODER values to pass through. Letitia, I am sure you will resolve this mystery. 🍀
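The data flow being discussed can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name `cross_attention` and the state vectors are made up, and the learned query/key/value projections are replaced by the identity to keep it short. Queries come from the decoder positions, while keys and values both come from the encoder output, so each decoder position receives a weighted mix of encoder (source) information.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder; keys and values from the encoder (identity projections)."""
    d = len(decoder_states[0])
    outputs = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states]
        w = softmax(scores)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, encoder_states)) for j in range(d)])
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hypothetical source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]              # 2 hypothetical target tokens

mixed = cross_attention(decoder_states, encoder_states)
```

One common way to read this design: the sublayer exists to read from the source sequence, so the content being mixed has to be encoder values. In the original architecture the decoder's own representation is not lost, because the residual connection adds the cross-attention output back onto the decoder stream.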
@l3nn13 4 months ago
Great video.
@AICoffeeBreak 4 months ago
Thanks for the visit and for leaving the comment!
@tildarusso 4 months ago
As far as I am aware, word embedding has changed from legacy static embeddings like Word2Vec/GloVe (like the famous queen=woman+king-man metaphor) to BPE & unigram. This change gave me quite a headache, as most papers do not mention any details of their "word embedding". Perhaps, Letitia, you can make a video to clarify this a bit for us.
@AICoffeeBreak 4 months ago
Great suggestion, thanks!
@LEQN 2 months ago
Awesome video :) thanks!
@AICoffeeBreak 2 months ago
Thank you for watching and for your wonderful comment!
@kallamamran 4 months ago
Phew 😳
@M4ciekP 4 months ago
How about a video explaining SSMs?
@AICoffeeBreak 4 months ago
✍️
@AICoffeeBreak 3 months ago
Psst: This will be the video coming up in a few days. It's in editing right now.
@M4ciekP 3 months ago
Yaay! @AICoffeeBreak
@ehudamitai 4 months ago
At 11:14, the weighted sum is the sum of 3 vectors of 3 elements each, but the result is a vector of 4 elements. Which, conveniently, is the same size as the input vector. Could there be a missing step there?
@AICoffeeBreak 4 months ago
Yes, there is a missing back transformation to 4 dimensions I skipped. :) Well spotted!
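The skipped step can be sketched as one more matrix multiplication. This is a hedged illustration: the name `Wo` for the output projection and all numbers are assumptions, chosen only so that a 3-element attention result is mapped back to the 4-element input size mentioned in the question.

```python
def vec_mat(x, W):
    """Multiply a row vector x by a matrix W given as a list of rows."""
    assert len(x) == len(W), "vector length must equal the number of matrix rows"
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical result of the weighted sum over value vectors (1x3).
attention_result = [1.0, 2.0, 3.0]

# Hypothetical 3x4 output projection that maps back to the model dimension.
Wo = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

projected = vec_mat(attention_result, Wo)  # (1x3) * (3x4) -> 1x4
print(projected)  # [1.0, 2.0, 3.0, 6.0]
```

Getting back to the input size matters because the residual connection adds the sublayer output to the sublayer input, which requires both to have the same number of elements.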
@nmfhlbj 2 months ago
Hi! Can I ask how you got the dimension (d)? All I know is that a dimension can be found for square matrices, and the attention formula uses the dot product Q•K^T. If we're using 1x3 matrices, we'll get a 1x1 matrix, or 1 dimension, so how do you get 3? Unless it's a 3x1 matrix beforehand, so we'll get a 3x3, or 3-dimensional, matrix. Thank you!
@AICoffeeBreak 2 months ago
Hi, if you mean the mistake at 10:00, then the problem is that I have written matrix times vector when I should have written vector times matrix! (Or I could have used column vectors instead of row vectors.) Is this what you mean?
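The shapes in this exchange can be written out explicitly. This is a sketch under the assumption (taken from the pinned comment) that queries and keys are row vectors of length d = 3; the symbol n for the number of tokens is introduced here only for illustration.

```latex
\[
q \in \mathbb{R}^{1 \times d},\quad k \in \mathbb{R}^{1 \times d}
\;\Longrightarrow\;
q\,k^{\top} \in \mathbb{R}^{1 \times 1}
\quad \text{(one score per query-key pair)}
\]
\[
Q, K \in \mathbb{R}^{n \times d}
\;\Longrightarrow\;
Q K^{\top} \in \mathbb{R}^{n \times n},
\qquad
\text{scores} = \frac{Q K^{\top}}{\sqrt{d}}
\]
```

Read this way, d is simply the length of each query and key vector (the number of columns of the query and key projection matrices). It is not read off the shape of the score matrix, and no square matrix is needed to define it.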
@benjamindilorenzo 3 months ago
What a great video. It still could expand more and really sum up every sub-part and connect it to a certain clear visualization or clear step of what happens with the information at each time step and how its "transformation" progresses over time. So I think you could redo this video and really make it monkey-proof for folks like me. But beware: if you look, for example, at the StatQuest version, it's too slow and too repetitive, and also does not really capture what really goes on inside the Transformer once all steps are stacked together. Great work!
@DaeOh 4 months ago
Everything makes sense except multiple attention heads. Each layer has only one set of Q, K, V, O matrices. But 8 attention heads per layer? I want to understand that.
@AICoffeeBreak 4 months ago
Think about it this way: In one layer, instead of having one head telling you how to pay attention to things, you have 8. In other words, instead of having one person shout at you the things they want you to pay attention to, you have 8 people simultaneously shouting at you. This is beneficial because it has an ensembling effect (the effect of a voting parliament; think of Random Forests, which are an ensemble of Decision Trees). I do not know if this helps, but I thought I'd give explaining this another shot.
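One way to reconcile "one set of matrices" with "8 heads" is that, in many implementations, each large projection matrix is treated as several smaller per-head matrices placed side by side. The sketch below shows this for the query projection only, with made-up sizes (model dimension 4, 2 heads) and a hypothetical `column_slice` helper; the keys and values would be split the same way, and the final output projection is left out to keep it short.

```python
def vec_mat(x, W):
    """Multiply a row vector x by a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def column_slice(W, start, stop):
    """Take columns start..stop-1 of W: the part of the matrix owned by one head."""
    return [row[start:stop] for row in W]

d_model, n_heads = 4, 2
d_head = d_model // n_heads

# Hypothetical 4x4 query projection (illustration values only).
Wq = [[float(i * d_model + j) for j in range(d_model)] for i in range(d_model)]
x = [1.0, 0.5, -1.0, 2.0]

# One big projection ...
q_full = vec_mat(x, Wq)

# ... equals the per-head projections concatenated.
q_heads = [vec_mat(x, column_slice(Wq, h * d_head, (h + 1) * d_head))
           for h in range(n_heads)]
q_concat = [c for head in q_heads for c in head]
```

Each head then computes its own attention weights from its own slice of the queries and keys, which is where the "several people shouting at once" picture comes from.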
@josephvanname3377 4 months ago
I want to train a transformer that eats a row of matrices instead of just a row of vectors.