Transformer Neural Networks Derived from Scratch

#transformers #chatgpt #SoME3 #deeplearning
Join me on a deep dive to understand the most successful neural network ever invented: the transformer. Transformers, originally invented for natural language translation, are now everywhere. They have quickly taken over the world of machine learning (and the world more generally) and are now used for almost every application, not the least of which is ChatGPT.
In this video I take a more constructive approach to explaining the transformer: starting from a simple convolutional neural network, I will step through all of the changes that need to be made, along with the motivations for why these changes need to be made.
*By "from scratch" I mean "from a comprehensive mastery of the intricacies of convolutional neural network training dynamics". Here is a refresher on CNNs: • Why do Convolutional N...
Chapters:
00:00 Intro
01:13 CNNs for text
05:28 Pairwise Convolutions
07:54 Self-Attention
13:39 Optimizations

Published: 1 Jun 2024

Comments: 212

@ullibowyer 13 days ago

I now realise that the key to understanding transformers is to ask why they work, not how. Thanks!

@algorithmicsimplicity 13 days ago

Thank you so much!

@algorithmicsimplicity 9 months ago

Video about Diffusion/Generative models coming next, stay tuned!

@mahmirr 9 months ago

Was coming to comment this, thanks

@arslanjutt4282 7 months ago

Please make video

@micmac8171 3 months ago

Please!

@abdullahbaig7517 14 days ago

This gem is underrated. This is the only video that after watching, I feel like I know how transformers work. Thanks!

@Alpha_GameDev-wq5cc 9 days ago

I still remember when all the cool acronyms I had to deal with were just FNNs, CNNs, ADAM, RNNs, LSTMs and the newest kid on the block, GANs.

@newbie8051 8 days ago

Damn, FNNs and CNNs are basic stuff we were taught in the 4th semester of our undergrad. Adam and RNNs were in the "additional resources" section of an introductory Deep Learning course I took in the same semester. Encountered LSTMs through personal projects lol. Still haven't used GANs and autoencoders, but they were the talk of the town back then due to the diffusion models.

@Alpha_GameDev-wq5cc 7 days ago

@@newbie8051 yeah, I did an FNN from scratch in high school. I was really hopeful about getting into AI research, and then the transformers arrived in my college year…

@RoboticusMusic 9 months ago

Thank you for not using slides filled with math equations. If someone understands the math they're probably not watching these videos, if they're watching these videos they're not understanding the math. It's incredible that so many YouTube teachers decide to add math and just point at it for an hour without explaining anything their audience can grasp, and then in the comments you can tell everybody golf clapped and understood nothing except for the people who already grasp the topic. Thank you again for thinking of a smart way to teach simple concepts.

@xt3708 9 months ago

amen. the power of out of the box teachers is infinite.

@rah-66comanche94 9 months ago

Amazing video! I really appreciate that you explained the Transformer model *from scratch*, and didn't just give a simplistic overview of it 👍 I can definitely see that *a lot* of work was put into this video, keep it up!

@korigamik 2 months ago

Would you share the source code for the animations?

@StratosFair 3 months ago

I am currently doing my PhD in machine learning (well, on its theoretical aspects), and this video is the best explanation of transformers I've seen on YouTube. Congratulations and thank you for your work.

@IllIl 9 months ago

Dude, your explanations are truly next level. This really opened my eyes to understanding transformers like never before. Thank you so much for making these videos. Really amazing resource that you have created.

@halflearned2190 5 months ago

Hey man, I watched your video months ago, and found it excellent. Then I forgot the title, and could not find it again for a long time. It doesn't show up when I search for "transformers deep learning", "transformers neural network", etc. Consider changing the title to include that keyword? This is such a good video, it should have millions of views.

@algorithmicsimplicity 5 months ago

Thanks for the tip.

@tdv8686 9 months ago

Thanks for your explanation. This is probably the best video on YouTube about the core of the transformer architecture so far; other videos are more about the actual implementation but lack the fundamental explanation. I 100% recommend it to everyone in the field.

@MalTramp 12 days ago

This was an excellent video on the global design structure of transformers. Love all your videos!

@asier6734 9 months ago

I love the algorithmic way of explaining what mathematics does. Not too deep, not too shallow, just the right level of abstraction and detail. Please please explain RNNs and LSTMs, I'm unable to find a proper explanation. Thanks !

@Magnetic-Milk 6 months ago

Not so long ago I was searching for hours trying to understand transformers. In this 18 min video I learned more than I learned in 3 hours of researching. This is the best computer science video I have ever watched in my entire life.

@anatolyr3589 2 months ago

yeah! this "functional" approach to the explanation rather than "mechanical" is truly amazing 👍👍👍👏👏👏

@benjamindilorenzo 3 months ago

This is the best video on Transformers I have seen on the whole of YouTube.

@terjeoseberg990 9 months ago

I wasn’t aware that they were using a convolutional neural network in the transformer, so I was extremely confused about why the positional vectors were needed. Nobody else in any of the other videos describing transformers pointed this out. Thanks.

@Hexanitrobenzene 9 months ago

"they were using a convolutional neural network in the transformer" No no, Transformers do not have any convolutional layers, the author of the video just chose CNN as a starting point in the process "Let's start with the solution that doesn't work well, understand why it doesn't work well and try to improve it, changing the solution completely along the way". The main architecture in natural language processing before transformers was the RNN, recurrent neural network. Then in 2014 researchers improved it with an attention mechanism. However, RNNs do not scale well, because they are inherently sequential, and scale is very important for accuracy. So, researchers tried to get rid of RNNs and succeeded in 2017. CNNs were also tried, but, to my not-very-deep knowledge, were less successful. Interesting that the author of the video chose CNN as a starting point.

@terjeoseberg990 9 months ago

@@Hexanitrobenzene, I suppose I’ll have to watch this video again. I’ll look for what you mentioned.

@Hexanitrobenzene 9 months ago

@@terjeoseberg990 A little off topic, but... Not long ago I noticed that YouTube deletes comments with links. Ok, automatic spam protection. (Still, the fact that it does this silently is very frustrating...) But does it also delete comments where links are separated into words with "dot" between them? I tried to give you a resource I learned this from, but my comment got dropped two times...

@Hexanitrobenzene 9 months ago

...Silly me, I figured I could just give you the title you can search for: "Dive into deep learning". It's an open textbook with code included.

@terjeoseberg990 9 months ago

@@Hexanitrobenzene, The best thing to do when YouTube deletes comments is to provide a title or something so I can find it. A lot of words are banned too.

@jackkim5869 2 months ago

Truly this is the best explanation of transformers I have seen so far. Especially great logical flow makes it easier to understand difficult concepts. Appreciate your hard work!

@TropicalCoder 9 months ago

Very nicely done. Your graphics had a calming, almost hypnotic effect.

@chrisvinciguerra4128 8 months ago

It seems like whenever I want to dive deeper into the workings of a subject, I always only find videos that simply define the parts to how something works, like it is from a textbook. You not only explained the ideas behind why the inner workings exist the way they do and how they work, but acknowledged that it was an intentional effort to take an improved approach to learning.

@TTTrouble 8 months ago

I’ve watched so many video explainers on transformers and this is the first one that really helped show the intuition in a unique and educational way. Thank you, I will need to rewatch this a few times but I can tell it has unlocked another level of understanding with regard to the attention mechanism that has evaded me for quite some time. (Darned KQV vectors…) Thanks for your work!

@xt3708 9 months ago

Absolutely love how you explain the process of discovery, in other words figure out one part which then causes a new problem, which then can be solved with this method, etc. The insight into this process for me was even more valuable than understanding this architecture itself.

@ryhime3084 9 months ago

This was so helpful. I was reading through how other models like ELMo work, and it makes sense how they came up with ideas for those, but the transformer just seemed like it popped out of nowhere with random logic. This video really helps to understand their thought process.

@Muhammed.Abd. 9 months ago

That is possibly the best explanation of Attention I have ever seen!

@RalphDratman 9 months ago

This is by far the best explanation of the transformer architecture. Well done, and thank you very much.

@ChrisCowherd 8 months ago

This video is by far the clearest and best explained I've seen! I've watched so many videos on how transformers work and still came away lost. After watching this video (and the previous background videos) I feel like I finally get it. Thank you so much!

@diegobellani 9 months ago

Wow, just wow. This video really makes you understand the reasoning behind the architecture, something you don't really get even from reading the original paper.

@briancase6180 9 months ago

This is a truly great introduction. I've watched other also excellent introductions, but yours is superior in a few ways. Congrats and thanks! 🤙

@CharlieZYG 8 months ago

Wonderful video. Easily the best video I've seen on explaining transformer networks. This "incremental problem-solving" approach to explaining concepts personally helps me understand and retain the information more efficiently.

@ItsRyanStudios 9 months ago

This is AMAZING. I've been working on coding a transformer network from scratch, and although the code is intuitive, the underlying reasoning can be mind bending. Thank you for this fantastic content.

@jcorey333 3 months ago

This is one of the genuinely best and most innovative explanations of transformers/attention I've ever seen! Thank you.

@igNights77 8 months ago

Explained thoroughly and clearly from basic principles and practical motivations. Basically the perfect explanation video.

@rishikakade6351 29 days ago

Insane that this website is free. Thanks!

@user-eu2li6vf3z 8 months ago

Can't wait for more content from your channel. Brilliantly explained.

@giphe 9 months ago

Wow! I knew about attention mechanisms but this really brought my understanding to a new level. Thank you!!

@corydkiser 9 months ago

This was top notch. Please do one for RetNets and Liquid Neural Nets.

@c1tywi 13 days ago

This video is gold! Subscribed.

@TeamDman 1 month ago

I keep coming back to this because it's the best explanation!!

@Muuip 8 months ago

Great concise visual presentation! Thank you, much appreciated! 👍👍

@JunYamog 4 months ago

Your visualization and explanation are very good. Helped me understand a lot. I hope you can put out more videos; it must not be easy, otherwise you would have done it already. Keep it up.

@dmlqdk 3 months ago

Thank you for answering my questions!!

@algorithmicsimplicity 3 months ago

Thanks for the tip! I'm always happy to answer questions.

@TeamDman 8 months ago

I've had to watch this a few times, great explanation!

@SahinKupusoglu 9 months ago

This video was all I needed for LLMs/transformers!

@rogerzen8696 4 months ago

Good job! There was a lot of intuition in this explanation.

@antonkot6250 11 days ago

The best explanation I found so far!

@ronakbhatt4880 5 months ago

What a simple but perfect explanation!! You deserve hundreds of times more subscribers.

@shantanuojha3578 1 month ago

Awesome video bro. I always like an intuitive explanation.

@algorithmicsimplicity 1 month ago

Thanks so much!

@AdhyyanSekhsaria 9 months ago

Great explanation. Haven't found this perspective before.

@kul6420 12 days ago

I may be too late to the party but glad I found this channel.

@clray123 9 months ago

Great video, maybe you could cover retentive network (from the RetNet paper) in the same fashion next - as it aims to be a replacement for the quadratic/linear attention in transformer (I'm curious as to how much of the "blurry vector" problem their approach suffers from).

@mvlad7402 20 days ago

Excellent explanation! All kudos to the author!

@yonnn7523 7 months ago

best explainer of transformers I saw so far, thnx!

@quocanhad 3 months ago

you deserve my like bro, really awesome video

@pravinkool 6 months ago

Fantastic! Loved it! Exactly what I needed.

@adityachoudhary151 4 months ago

really made me appreciate NN even more. Thanks for the video

@nara260 5 months ago

Thanks a lot! This visual lecture cleared the dense fog over my cognitive picture of the transformer.

@iustinraznic5811 8 months ago

Amazing explanations and video!

@hadadvitor 9 months ago

fantastic video, congratulations on and thank you for making it

@_MrKekovich 9 months ago

FINALLY I have some basic understanding. Thank you so much!

@ArtOfTheProblem 8 months ago

Really well done, I haven't seen your channel before and this is a breath of fresh air. I've been working on my GPT + transformer video for months and this is the only video online which is trying to simplify things through an independent realization approach. Before I watched this video my 1 sentence summary of why Transformers matter was: "They contain layers that have weights which adapt based on context" (vs. using deeper networks with static layers), and this video helped solidify that further, would you agree? I also wanted to boil down the attention heads as "mini networks" (or linear functions) connected to each token which are trained to do this adaptation. One network pulls out what's important in each word given the context around it, the other network combines these values to decide the importance of those two words in that context, and this is how the 'weights adapt'. I still wonder how important the distinction of linear layer vs. just a single layer is; I like how you pulled that into the optimization section. I know how hard this stuff is to make clear and you did well here.

@maxkho00 8 months ago

My one-sentence summary of why transformers matter would be "they are standard CNNs, except the words are first re-ordered in a way that makes the CNN's job easier before being fed in". Also, a single NN layer IS a linear layer; I'm not sure what you mean by saying you don't know how important the distinction between the two is.

@ArtOfTheProblem 8 months ago

@@maxkho00 thanks

@IzUrBoiKK 9 months ago

As both a math enthusiast and a programmer (who obviously also works on AI), I really liked this video. I can confirm that this is one of the best and most genuine explanations of transformers...

@ArtOfTheProblem 8 months ago

the first so far this year

@yash1152 9 months ago

2:36 wow, just 50k words... that sounds pretty easy for computers. amazing.

@marcfruchtman9473 9 months ago

Very interesting. Thank you for the video.

@christrifinopoulos8639 4 months ago

The visualisation was amazing.

@minhsphuc12 7 months ago

Thank you so much for this video.

@TaranovskiAlex 8 months ago

thank you for the explanation!

@user-js7ym3pt6e 4 months ago

Amazing, continue like this.

@lakshay510 3 months ago

Halfway through the video and I pressed the subscribe button. Very intuitive and easy to understand. Keep up the good work man :) 1 suggestion: Change the title of the video and you'll get more traction.

@algorithmicsimplicity 3 months ago

Thanks, any title in particular you'd recommend?

@palyndrom2 9 months ago

Great video

@rafa_br34 15 days ago

I'd love to see you explain how KANs work.

@vedantkhade4395 3 months ago

This video is damn impressive mann

@sairaj6875 8 months ago

Thank you!!

@anilaxsus6376 9 months ago

Best explanation I have seen so far. Basically the transformer is a CNN with a lot of extra upgrades. Good to know.

@arongil 9 months ago

Great, thank you!

@cem_kaya 9 months ago

Thank you so much

@albertmashy8590 9 months ago

This was amazing

@TheSonBAYBURTLU 8 months ago

Thank you 🙂

@christianjohnson961 9 months ago

Can you do a video on tricks like layer normalization, residual connections, byte pair encoding, etc.?

@domasvaitmonas8814 2 months ago

Thanks. Amazing video. One question though - how do you train the network to output the "importance score"? I get the other part of the self-attention mechanism, but the score seems a bit out of the blue.

@algorithmicsimplicity 2 months ago

The entire model is trained end-to-end to solve the training task. What this means is you have some training dataset consisting of a bunch of input/label pairs. For each input, you run the model on that input, then you change the parameters in the model a bit, evaluate it again and check if the new output is closer to the training label, if it is you keep the changes. You do this process for every parameter in all layers and in all value and score networks, at the same time. By doing this process, the importance score generating networks will change over time so that they produce scores which cause the model's outputs to be closer to the training dataset labels. For standard training tasks, such as predicting the next word in a piece of text, it turns out that the best way for the score generating networks to influence the model's output is by generating 'correct' scores which roughly correspond to how related 2 words are, so this is what they end up learning to do.
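A minimal Python sketch of the "nudge a parameter, keep the change if the output gets closer to the label" procedure described in the reply above. The one-unit `model`, the `loss` function, the step size, and the toy dataset are all assumptions made for illustration, and trying a downward nudge as well as an upward one is an addition to the description. Real transformers are trained with backpropagation rather than this loop.

```python
import random

def model(params, x):
    # Toy stand-in for a network: one linear unit, y = w * x + b.
    w, b = params
    return w * x + b

def loss(params, data):
    # Mean squared distance between the model's outputs and the labels.
    return sum((model(params, x) - y) ** 2 for x, y in data) / len(data)

def train(data, steps=2000, step_size=0.01, seed=0):
    rng = random.Random(seed)
    # Start from random parameters drawn from a normal distribution.
    params = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    for _ in range(steps):
        for i in range(len(params)):
            # Nudge one parameter up, then down; keep the first nudge that helps.
            for delta in (step_size, -step_size):
                candidate = list(params)
                candidate[i] += delta
                if loss(candidate, data) < loss(params, data):
                    params = candidate
                    break
    return params

# Input/label pairs generated by the rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
params = train(data)
final_loss = loss(params, data)
```

Nothing in the loop knows what the "right" parameters are; they drift toward w = 2, b = 1 only because those values bring the outputs closer to the labels, which is the same reason the score networks end up producing useful scores.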

@AerialWaviator 9 months ago

Very fascinating topic with an excellent dive and insights into how neural networks derive results. One thing I was left wondering is why is there no scoring vector describing the probability a word is a noun, verb, or adjective? Encoding a word's context (regardless of language) should provide a great deal of context, thus eliminating many convolutional pairings and reducing computational effort. Thanks for a newfound appreciation of transformers.

@ArtOfTheProblem 8 months ago

This is a good question, and it's also a GOFAI type approach where we make the mistake of thinking we can inject some human semantic idea to improve a network. But the reality is it will do this automatically without our help. For example, papers back in 1986 show tiny networks automatically grouping words into nouns or verbs, it's amazing. Let me know if you want more details.

@komalsinghgurjar 6 months ago

Sir I like your videos very much. Love from India ♥️♥️.

@Tigerfour4 9 months ago

Great video, but it left me with a question. I tried to compare what you arrived at (16:25) to the original transformer equations, and if I understand it correctly, in the original we don't add the red W2X matrix, but we have a residual connection instead, so it is as if we would add X without passing it through an additional linear layer. Am I correct in this observation, and do you have an explanation for this difference?

@algorithmicsimplicity 9 months ago

Yes that's correct, the transformer just adds x without passing it through an additional linear layer. Including the additional linear layer doesn't actually change the model at all, because when the result of self attention is run through the MLP in the next layer, the first thing the MLP does is apply a linear transform to the input. Composition of 2 linear transforms is a linear transform, so we may as well save computation and just let the MLP's linear transform handle it.
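The fact the reply above relies on, that two linear transforms applied in sequence equal one linear transform, can be checked with a small Python sketch. The matrices `W2` and `W_mlp` and the vector `x` are made-up numbers, and the helper functions are hypothetical; this illustrates only the composition step, not the full transformer layer.

```python
def matvec(m, v):
    # Multiply matrix m by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    # Multiply matrix a by matrix b.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W2 = [[1.0, 2.0], [0.0, -1.0]]     # hypothetical extra linear layer applied to x
W_mlp = [[0.5, 0.0], [3.0, 1.0]]   # hypothetical first linear layer of the next MLP
x = [4.0, -2.0]

# Apply the two layers one after the other...
two_steps = matvec(W_mlp, matvec(W2, x))
# ...or merge them into a single matrix first and apply that once.
one_step = matvec(matmul(W_mlp, W2), x)
```

Both routes give the same vector, so a learned `W_mlp` can absorb whatever `W2` would have contributed.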

@laithalshiekh3792 9 months ago

Your video is amazing

@Supreme_Lobster 8 months ago

Thanks. I had read the original Transformer paper and I barely understood the underlying ideas.

@AN-ch3ly 2 months ago

Great video, but I was wondering how one aspect of the transformer is handled in the real world. How are importance scores assigned to pairs in order to determine their importance? Basically, on a massive scale, how can importance scores be automatically assigned in order to get the correct importance for a pair for a given sentence?

@algorithmicsimplicity 2 months ago

The entire model is trained end-to-end to solve the training task. What this means is you have some training dataset consisting of a bunch of input/label pairs. For each input, you run the model on that input, then you change the parameters in the model a bit, evaluate it again and check if the new output is closer to the training label, if it is you keep the changes. By doing this process, the score generating networks will change over time so that they produce scores which cause the model's outputs to be closer to the training dataset labels. It turns out that the best way for the score generating networks to influence the model's output is by generating 'correct' scores which roughly correspond to how related 2 words are, so this is what they end up learning.

@cloudysh 6 months ago

This is perfect

@Baigle1 6 months ago

I think they were actually used publicly as far back as 2006, or earlier, in compression algorithm competitions.

@user-km3kq8gz5g 4 months ago

You are amazing

@DanOneOne 8 days ago

So what does it really classify? The image recognition network needed to output a label for the image. What does this transformer output after processing the text?

@algorithmicsimplicity 8 days ago

Whatever you train it to. People have trained transformers to categorize text, predict the sentiment of sentences, all sorts of things. ChatGPT is specifically trained to predict the next word that comes after a partial piece of text. It turns out that you can use this to generate new text from scratch by repeatedly applying it to its own output. This technique is known as 'auto-regression' and I explain it in more detail in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-zc5NTeJbk-k.html
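The "repeatedly apply it to its own output" loop from the reply above can be sketched in a few lines of Python. Here `toy_next_word` and its lookup `TABLE` are hypothetical stand-ins for a trained next-word model, which would look at the whole context rather than only the last word.

```python
def generate(next_word, prompt, max_new_words):
    # Auto-regression: feed the text so far back in to get each new word.
    words = list(prompt)
    for _ in range(max_new_words):
        word = next_word(words)
        if word is None:   # the stand-in model has nothing to predict
            break
        words.append(word)
    return words

# Hypothetical stand-in for a trained model: a lookup on the last word only.
TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def toy_next_word(words):
    return TABLE.get(words[-1])

result = generate(toy_next_word, ["the"], 5)
# result == ["the", "cat", "sat", "on", "the", "cat"]
```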

@AurL_69 3 months ago

Holy pepperoni you're great !

@seraine22 7 months ago

Thanks!

@algorithmicsimplicity 7 months ago

Thank you for your support!

@4.0.4 8 months ago

Dunno if you asked to like and subscribe, but if you did, it wasn't necessary. I really feel like I have a remote grasp on it now 😅

@The_DorkLord 10 days ago

13:00 This may be a silly question, but would it be possible for the transformer to encounter a sentence where all words would have a score of 0.0, creating an issue with simply using an exponential function? I imagine it would be vanishingly rare, but something along the lines of Chomsky's "Colorless green ideas sleep furiously" would seem like the type of sentence that would create such an issue. I assume that this is not a real problem, but I am curious as to why it isn't one.

@algorithmicsimplicity 10 days ago

It's almost impossible for that to happen in practice because we compare words against themselves. So if one word has no relationship with any other word in the sentence, it will still have a large score for itself: so the normalized weight will be 1 for itself and 0 for all other words. Which means that its vector won't include information from any other words, but that's kind of what you want if it really doesn't have any relationship to any other words.

@The_DorkLord 10 days ago

@@algorithmicsimplicity Right, of course, that makes sense. I hadn't thought about words having weights for themselves. Thanks! Your channel is really great, I love the level of depth you go into while still keeping the material approachable.
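A small Python sketch of the exponential normalization discussed in the exchange above, using made-up scores. It illustrates two points: a row of all-zero scores still normalizes cleanly, because exp(0) = 1 and the divisor is never zero, and a word with a large score for itself ends up with almost all of the weight. The `softmax` helper and the score values are assumptions for illustration.

```python
import math

def softmax(scores):
    # Subtracting the maximum only improves numerical stability;
    # it does not change the resulting weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Every score is 0.0: each exponential is 1, so the weights are uniform.
all_zero = softmax([0.0, 0.0, 0.0, 0.0])     # [0.25, 0.25, 0.25, 0.25]

# First entry plays the role of a word's large score for itself.
self_heavy = softmax([10.0, 0.0, 0.0, 0.0])  # first weight is close to 1
```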

@frederik7054 8 months ago

The video is of great quality! With which tool did you create this? Manim?

@algorithmicsimplicity 8 months ago

Yep all my videos so far have been done in Manim.

@introstatic 9 months ago

This is brilliant. Could you give a hint where to look for details of the idea of the pairwise convolution layer? Can't find anything with this exact wording.

@algorithmicsimplicity 9 months ago

Yeah it's a term I made up so you won't find it in any sources, sorry about that. Usually sources will just talk about self attention in terms of key, query and value lookups, so you can look at those to get a more detailed understanding of the transformer. The value transform is equivalent to the linear representation function I use in the pairwise convolution, the key and query attention scores are equivalent to the bi-linear form scoring function I use (with the bi-linear form weight matrix given by Q^TK). I chose to use this unusual terminology because, personally, I feel the key, query and value terminology comes out of nowhere, and I wanted to connect the transformer more directly to its predecessor (the CNN).

@introstatic 9 months ago

@algorithmicsimplicity, this is a surprising connection. Thanks a lot for the explanation.

@dsagman 9 months ago

@@algorithmicsimplicity it would be great if you could make this connection between terminology in video form. maybe next time?
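The equivalence mentioned in this thread, between key/query scoring and a single bi-linear form, can be checked numerically. The sketch below assumes a column-vector convention (query = W_q x_i, key = W_k x_j), under which the combined matrix is W_q^T W_k; other conventions transpose or swap the factors. All matrices and vectors are made-up numbers and the helper functions are hypothetical.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

W_q = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # hypothetical query projection (2 x 3)
W_k = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]]    # hypothetical key projection (2 x 3)
x_i = [1.0, 2.0, 3.0]
x_j = [-1.0, 0.0, 4.0]

# Key/query view: project both words, then take a dot product.
query_key_score = dot(matvec(W_q, x_i), matvec(W_k, x_j))

# Bi-linear view: one combined matrix B sits between the two raw word vectors.
B = matmul(transpose(W_q), W_k)             # 3 x 3
bilinear_score = dot(x_i, matvec(B, x_j))
```

Both expressions evaluate to the same number (here -5.5), which is why the two names, key and query, describe a single scoring matrix.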

@iandanforth 9 months ago

I wish this had tied in specifically to the nomenclature of the transformer such as where these operations appear in a block, if they are part of both encoder and decoder paths, how they relate to "KQV" and if there's any difference between these basic operations and "cross attention".

@ArtOfTheProblem 8 months ago

I'll be doing this, but in short, the little networks he showed connected to each pair are KQ (word pair representation) and the V is the value network. All of this can be done in the decoder-only model as well, and cross attention is the same thing but you are using two separate sequences looking at each other (such as two sentences in a translation network). It's nice to know that GPT, for example, is decoder only, and so doesn't even need this.

@cezarydziemian6734 8 months ago

Wow, great video, but I have some problem understanding one thing. I'm trying to understand it by watching all 3 videos, and what I have trouble understanding is how these pairs of words (vectors) from the first layer are matched together into new vectors. For example, for the "catsat" pair, we have two vectors: [0001] and [0100]. How are they transformed to the vector [1.3, -0.9...]? If this is just the result of some internal neural net, where did the data (weights) for this net come from? Or if they started from random numbers, how were they trained?

@algorithmicsimplicity 8 months ago

The pair vectors are first concatenated together into one vector e.g. [00010100], and this vector is then run through the neural network which produces the output vector. The output is the result of the weights in the neural network. Initially, those weights are completely random (usually sampled from a normal distribution centred at 0), and then they are updated during training. The neural network is trained on a labelled training dataset of input and output pairs. For example, ChatGPT was trained to do next word prediction on billions of passages of text scraped from the internet. In this case, each training example is a random part of a text passage (e.g. "the cat sat on the") and the output is the next word that occurs in the text (e.g. "mat"). For every training example an update step is performed on the neural network to update all of the weights of all of the layers. The update step works as follows: 1) Evaluate the neural network on the input. 2) For every weight in every layer, increase the value of that weight by a small amount (e.g. 0.001) and then re-evaluate the entire neural network on the input. If the new output is closer to the target (e.g. the vector output is closer to the one-hot encoding of "mat") then it was good to change that weights value, so it keeps the new value. If the new output is further away from the target, then it was a bad change, so reverse it. And that's it. Just keep repeating that update step for billions of different inputs and all of the weights in all layers will eventually be set to values which allow the transformer as a whole to map inputs to outputs correctly. Also I should point out that in practice there is a faster way to do the update step which is called backprop. Backprop computes exactly the same result as the update process I described, it is just faster computationally (you only need to evaluate the model twice instead of once for every weight), but it is also more difficult to understand.
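A Python sketch of the first part of the reply above: two one-hot word vectors are concatenated and passed through a small network whose weights start as random draws from a normal distribution centred at 0. The layer sizes (8 to 6 to 3), the ReLU non-linearity, and all helper names are assumptions for illustration; the printed numbers are untrained and carry no meaning.

```python
import random

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def random_matrix(rows, cols, rng):
    # Initial weights: samples from a normal distribution centred at 0.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def linear(weights, v):
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def relu(v):
    return [max(0.0, x) for x in v]

rng = random.Random(0)
cat = one_hot(3, 4)    # [0, 0, 0, 1]
sat = one_hot(1, 4)    # [0, 1, 0, 0]
pair = cat + sat       # concatenation: [0, 0, 0, 1, 0, 1, 0, 0]

W1 = random_matrix(6, 8, rng)   # hypothetical hidden layer
W2 = random_matrix(3, 6, rng)   # hypothetical output layer
pair_vector = linear(W2, relu(linear(W1, pair)))
print(pair_vector)
```

Training then adjusts `W1` and `W2` until the pair vectors become useful for the task.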

@GaryBernstein 8 months ago

Can you explain how the NN produces the important-word-pair information-scores method described after 12:15 from the sentence problem raised at 10:17? Well it’s just another trained set of values. I suppose it scores pairs' importance over the pairs’ uses in ~billions of sentences.

@algorithmicsimplicity 8 months ago

The importance-scoring neural network is trained in exactly the same way that the representation neural network is. Roughly speaking, for every weight in the importance-scoring neural network you increase the value of that weight slightly and then re-evaluate the entire transformer on a training example. If the new output is closer to the training label, then that was a good change so the weight stays at its new value. If the new output is further away, then you reverse the change to that weight. Repeat this over and over again on billions of training examples and the importance-scoring neural network weights will end up set to values so that the produced scores are useful.

@nightchicken3517 8 months ago

I really love SoME

@korigamik 2 months ago

Man can you tell us what you used to create the animations and how you edit the videos?

@algorithmicsimplicity 2 months ago

The animations were made with the Manim Python library (www.manim.community/) and edited with KDenLive.

@dmlqdk 4 months ago

How does the explanation in this video relate to Query, Key and Values (as defined in the Attention is all you need paper)? This is really a great video - thank you!!

@algorithmicsimplicity 4 months ago

The "key-query" attention scoring is equivalent to the bi-linear scoring function in my explanation, where the bi-linear form matrix is given by K^TQ. The value transformation V is exactly the linear representation function in my explanation. I still have no idea why they decided to give the scoring function matrix two different names (key and query), it just confuses everyone.

@dmlqdk 3 months ago

@@algorithmicsimplicity Let's assume X is our input, a sentence containing N words. Each word has an embedding of size P. Thus X is an NxP matrix. Then according to the "Attention is All You Need" paper, we have:
K = X * W_k
Q = X * W_q
V = X * W_v
A = softmax(QK^T)
Output = AV
Where:
W_k, W_q and W_v are PxP matrices
K, Q and V are NxP matrices
A is an NxN matrix
Output is an NxP matrix
I am confused about how the V matrix connects to the "pair-wise representations". In the video, you show operations being done on pairs of words (such as 13:40). However, there don't seem to be any pair-wise operations occurring when computing V? If there was a pair-wise operation, wouldn't the dimension of W_v need to be NxN instead of PxP? I agree that we are computing a single "attention" scalar value for each word pair. This is why A has dimension NxN. However, it seems like V contains individual representations of the words that are "smooshed" together when we multiply by A, rather than V containing (or operating on) pair-wise representations? Again, great video! And I greatly appreciate your help!!
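The equations in the comment above can be written out as a short Python sketch to make the shapes concrete. It follows the comment's formulas as written; the paper additionally scales QK^T by 1/sqrt(d_k) before the softmax, which is omitted here. The toy values of `X`, `W_q`, `W_k` and `W_v` are made up, and the helper functions are hypothetical.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of a (n x k) and b (k x m).
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    # Softmax over each row, with the row maximum subtracted for stability.
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, W_q, W_k, W_v):
    Q = matmul(X, W_q)                          # N x P
    K = matmul(X, W_k)                          # N x P
    V = matmul(X, W_v)                          # N x P
    A = softmax_rows(matmul(Q, transpose(K)))   # N x N attention weights
    return A, matmul(A, V)                      # output is N x P

# Toy input: N = 3 words, embedding size P = 2 (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, -1.0], [2.0, 0.0]]
W_v = [[0.0, 1.0], [1.0, 0.0]]
A, output = self_attention(X, W_q, W_k, W_v)
```

Note that `V` has one row per word, not one row per pair; the only N x N object is `A`.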

@algorithmicsimplicity 3 months ago

@@dmlqdk When you apply the linear transform V to the pair [x1, x2] the result is V1x1 + V2x2. Basically we are applying a linear transform to each input and summing them. Because, in a given column, x2 is the same for every pair, you are effectively just adding a constant value to each V1x_i. You can factor this constant value outside of the attention weights, at which point it just becomes part of the residual term. I explained this process in more detail here: www.reddit.com/r/MachineLearning/comments/17cmzcz/comment/k5t7g70/?context=3 At this point, you no longer have 'pair' representations, since each value vector is now just a linear transform applied to one word. Each column of the [NxN] grid of value vectors contains V1x_i for i in {1,...n}, i.e. all of the columns are identical. Since all of the columns are identical, instead of elementwise multiplying the matrix of attention values by the matrix of value vectors and then summing, you can instead rewrite this operation as a single matrix-vector product, which is what the AV operation is in the standard self attention. V is that column of value vectors, where each entry is just V1x_i.

@dmlqdk 3 months ago

@@algorithmicsimplicity This makes so much more sense now. Thank you!!
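The factoring step from this exchange can be checked numerically with a small Python sketch. It assumes each pair (word i, word j) is represented as V1 x_i + V2 x_j, and that the importance weights for a fixed word j are normalized to sum to 1; under those assumptions the V2 x_j term can be pulled out of the weighted sum and added once. The index convention, the matrices `V1` and `V2`, the word vectors, and the scores are all made up for illustration.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    return [c * a for a in v]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy word vectors
V1 = [[1.0, 2.0], [0.0, 1.0]]               # acts on the first word of a pair
V2 = [[0.5, 0.0], [1.0, -1.0]]              # acts on the second word of a pair
# scores[j][i]: made-up importance score of the pair (word i, word j).
scores = [[0.0, 1.0, 2.0], [1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

def pairwise_output(j):
    # Weighted sum of full pair representations V1 x_i + V2 x_j.
    weights = softmax(scores[j])
    out = [0.0, 0.0]
    for i, w in enumerate(weights):
        pair_value = add(matvec(V1, xs[i]), matvec(V2, xs[j]))
        out = add(out, scale(w, pair_value))
    return out

def factored_output(j):
    # Weighted sum of single-word values V1 x_i, plus V2 x_j added once.
    weights = softmax(scores[j])
    out = [0.0, 0.0]
    for i, w in enumerate(weights):
        out = add(out, scale(w, matvec(V1, xs[i])))
    return add(out, matvec(V2, xs[j]))

# Largest difference between the two formulations over all words.
max_gap = max(
    abs(a - b)
    for j in range(len(xs))
    for a, b in zip(pairwise_output(j), factored_output(j))
)
```

The gap is zero up to floating-point rounding, which is the sense in which the pair-wise picture and the standard attention-plus-residual form compute the same thing.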