My name is Damien, former ML Tech Lead at Meta with more than 10 years in the field of AI/ML! I share my knowledge of the field to help prepare the next generation of ML Engineers.
Amazing content! I almost didn't click on the video because of the title "Intro to ml sd", but I'm glad I watched it to learn about the complexities of the Facebook-friends recommender design. I came for an intro but got the real content I was seeking. Thanks!
Great video! But I would have loved it if you had also spent a minute on why float32 vs. bfloat16 is used in backpropagation. Still, the video is brilliant as always!
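For anyone else wondering, here is my current understanding as a minimal PyTorch-style sketch (my own illustration, not necessarily what the video's setup does): the forward and backward passes run in bfloat16 for speed, while the weights stay in float32 so that small gradient updates are not rounded away when accumulated.

```python
import torch
import torch.nn as nn

# Toy model; parameters are created in float32 by default.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))

# Under autocast, matmuls in the forward pass (and the matching
# backward ops) run in bfloat16, while the float32 master weights
# keep accumulating full-precision updates.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), target)

loss.backward()    # parameter gradients come out in float32
optimizer.step()
```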
I have a question: for pos=0 and "horizontal_index"=2, shouldn't it be PE(pos, 2) = sin(pos/10000^(2/d_model))? I believe you used the same symbol "i" for two different ways of indexing, right? 7:56
Thank you Damien, and math_in_cantonese. I'm in the middle of writing a short article discussing positional encoding, and Damien, feel proud that you are the first reference I quote in it!

I was going crazy trying to nail down the exact meaning of "i". In Damien's video it is clear he means "i" as the dimension index, and the values shown with sin/cos match. But then I could not square that understanding with the formulation below:

PE(pos, 2i) = sin(pos/10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))

If we read PE(pos, 0) as referring to the first column (column zero) and, say, PE(pos, 5) as referring to the sixth column (column 5), then 5 = 2i+1 => i = (5-1)/2 = 2. So "i" is really the index of a (sin, cos) pair of dimensions, and its range is d_model/2.

The original sin (😄, pun intended) is in "Attention Is All You Need", where they simply state: "where pos is the position and i is the dimension". That seems wrong: 2i and 2i+1 are the dimensions.

In any case, a big thank you Damien. I have watched many of your videos and they have been quite useful in ramping me up on LLMs and the rest. Merci beaucoup, Alain
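To double-check this reading, here is a quick NumPy sketch I wrote (my own code, not from the video; positional_encoding is just my helper name): with i indexing (sin, cos) pairs, columns 2i and 2i+1 share the same frequency, and i runs over d_model/2 values.

```python
import numpy as np

def positional_encoding(max_pos: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding as in 'Attention Is All You Need'."""
    pe = np.zeros((max_pos, d_model))
    # i indexes (sin, cos) PAIRS, so it ranges over d_model // 2 values.
    for pos in range(max_pos):
        for i in range(d_model // 2):
            angle = pos / 10000 ** (2 * i / d_model)
            pe[pos, 2 * i] = np.sin(angle)      # even column 2i
            pe[pos, 2 * i + 1] = np.cos(angle)  # odd column 2i + 1
    return pe

pe = positional_encoding(max_pos=4, d_model=8)
# Column 2 (the "horizontal_index" 2 from the earlier question) is the
# sin component of pair i = 1: PE(pos, 2) = sin(pos / 10000^(2/8)).
print(pe[:, 2])
print(np.sin(np.arange(4) / 10000 ** (2 / 8)))  # identical values
```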
Nice, but I have one doubt: how does adding sine and cosine values ensure that we are encoding the positions? How did the authors arrive at this choice, and why not other values?
The sine and cosine functions provide smooth and continuous representations, which help in learning the relative positions effectively. For example, the encodings for positions k and k+1 will be similar, reflecting their proximity in the sequence.

The frequency-based sinusoidal functions also allow the encoding to generalize to sequences of arbitrary length without needing to re-learn positional information for different sequence lengths. The model can understand relative positions beyond the lengths of sequences seen during training.

The combination of sine and cosine functions ensures that each position has a unique encoding, and the orthogonality property of these functions helps in distinguishing between different positions effectively, even for long sequences.

The different frequencies used in the positional encodings allow the model to capture both short-term and long-term dependencies within the sequence: higher-frequency components help in understanding local relationships, while lower-frequency components help in capturing global structure.

Also, sinusoidal functions are differentiable, which is crucial for backpropagation during training. This ensures that the model can learn to use the positional encodings effectively through gradient-based optimization methods.
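If it helps, here is a small self-contained demo of the proximity claim (my own NumPy sketch, not from the paper; the dimensions and positions are arbitrary): encodings of neighboring positions are nearly parallel, and their similarity decays as the distance grows.

```python
import numpy as np

d_model, max_pos = 64, 512
pos = np.arange(max_pos)[:, None]          # shape (max_pos, 1)
i = np.arange(d_model // 2)[None, :]       # shape (1, d_model // 2)
angles = pos / 10000 ** (2 * i / d_model)
pe = np.empty((max_pos, d_model))
pe[:, 0::2] = np.sin(angles)               # even dimensions: sin
pe[:, 1::2] = np.cos(angles)               # odd dimensions: cos

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Neighboring positions are nearly parallel; similarity drops
# (with some oscillation) as the distance between positions grows.
print(cosine_sim(pe[100], pe[101]))   # close to 1: k and k+1 are similar
print(cosine_sim(pe[100], pe[110]))   # noticeably smaller
print(cosine_sim(pe[100], pe[300]))   # smaller still
```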
Thanks for the clear explanation. I've watched a few of your videos and follow you on LinkedIn, and I can say that you're killing it, brother. I also love the simplicity of the infographics in your videos. Do you get them from somewhere else, or do you make them yourself?
I've got it now. I wonder why we can't calculate the gradient with respect to x by starting the backward pass closer to x instead of going through all the activations.
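Answering my own question after writing it out (for a simple stack of layers; the notation is mine, not from the video). With h_1 = f_1(x), h_2 = f_2(h_1), ..., L = f_n(h_{n-1}), the chain rule gives

\partial L/\partial x = (\partial L/\partial h_{n-1}) \cdot (\partial h_{n-1}/\partial h_{n-2}) \cdots (\partial h_1/\partial x)

Every factor except the last one depends on activations downstream of x, so those gradients have to be computed first. Starting the backward pass "closer to x" would leave \partial L/\partial h_1 unknown, which is why there is no shortcut.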
Thank you. Can you explain the entire self-attention flow (from positional encoding to the final next-word prediction)? I think it will be an entire series 😅
@TheMLTechLead Thanks for your reply, but absolutely no apology necessary! I think it is an excellent video with helpful information. Much appreciation for posting. I am a professor in a business school and am always looking for insights into how to teach the technical side of technology in the context of business. Your explanation has been very helpful.
So we have an ensemble of trees F that predicts y, such that F(x) = \hat{y}. The residual is e = y - F(x). We want to add a tree that predicts this residual: T(x) = \hat{e} = e + \epsilon = y - F(x) + \epsilon, where \epsilon is the new tree's own (hopefully smaller) error. Therefore F(x) + T(x) = y + \epsilon, so each added tree shrinks the ensemble's error from e to \epsilon.
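To make that concrete, here is a minimal sketch of the residual-fitting loop (my own illustration using scikit-learn's DecisionTreeRegressor; the data, the shrinkage factor, and all hyperparameters are made up, and the shrinkage is something the derivation above omits):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

# Start the ensemble F with a constant prediction (the mean of y).
F = np.full_like(y, y.mean())

# Each round fits a small tree T to the current residual e = y - F(x),
# then adds a shrunken version of it to the ensemble: F <- F + lr * T.
for _ in range(50):
    e = y - F                                        # current residuals
    T = DecisionTreeRegressor(max_depth=2).fit(X, e)
    F += 0.1 * T.predict(X)                          # shrinkage (lr) = 0.1

print(np.mean((y - F) ** 2))  # the remaining error keeps shrinking
```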