This is a great video. It addresses many of the topics I've come across while working on my tensor compiler, a project your awesome Minitorch course inspired me to start.
Thanks! Upvoted. Some of us need the ELI5 version. For example, I didn't hear an explanation of D. I know it's the dimension, but saying a few words about it would help.
D is the number of "neurons" and corresponds to how powerful a standard neural network is. This website is a classic for getting intuition: playground.tensorflow.org/ . The bigger D is, the more curvy the line can be.
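If it helps, here's a toy NumPy sketch of what D means (the sizes are made up): it's just the width of the hidden layer, i.e. how many neurons get to bend the function.

```python
import numpy as np

# Toy one-hidden-layer network: D is the number of hidden "neurons".
# The bigger D is, the more bends the fitted function can have.
D = 16                      # hidden width (the "D" being asked about)
B = 8                       # a small batch of inputs
x = np.random.randn(B, 2)   # 2-d inputs, like the playground demo

W1 = np.random.randn(2, D)  # input -> D hidden units
W2 = np.random.randn(D, 1)  # D hidden units -> 1 output

h = np.maximum(0, x @ W1)   # ReLU activations of the D neurons, shape (B, D)
y = h @ W2                  # prediction, shape (B, 1)
```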
Hi, thanks a lot for this amazing video, really appreciate it! Just had a simple question: at 11:57, for the compute of LoRA, why is the term we add for the fine-tuned network `2xBxDxR`? Isn't that just for the forward pass? I thought we would add this value multiplied by two for the backward pass as well, i.e. a final expression of `BxDxD + 2xBxDxR + 2x2xBxDxR`. Would appreciate your help :)
Yes, that's correct: it's the compute for the forward pass. Backward is a bit more complex. What you wrote is correct for the diagram. In larger multilayer networks, though, it would be what you wrote plus at least an additional BxDxD. That's because you need to do (part of) the backward pass through the DxD part to propagate derivatives to earlier layers.
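To make the counting concrete, here's a rough sketch in the same units (one LoRA'd layer inside a deeper network; B, D, R are placeholders, and constants are ignored):

```python
# Back-of-the-envelope matmul counts for one LoRA'd linear layer,
# in the same BxDxR-style units as the video (not exact FLOPs).
B, D, R = 32, 1024, 8       # batch, hidden dim, LoRA rank (made-up numbers)

forward  = B*D*D + 2*B*D*R  # frozen DxD weight + LoRA down/up projections
backward = 2 * (2*B*D*R)    # gradients for the two LoRA matrices
backward += B*D*D           # pushing gradients through the frozen weight to earlier layers

print(forward, backward, forward + backward)
```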
In the 17th minute, when you are calculating the attention computation complexity, the value aggregation part is missing. There should be two NxTxBxTxD terms: one for the attention matrix and another for the value-weighted sum.
Yeah, I should put a pointer to a more in-depth formula for attention. Mainly wanted to give intuition and not dive into the details of how that is calculated in practice (e.g. things like checkpointing).
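Roughly, the counting being discussed, as a sketch (ignoring the Q/K/V/output projections, heads, and softmax; N, T, B, D are placeholders):

```python
# Rough attention matmul count in the thread's units:
# N layers, T sequence length, B batch, D hidden dim (placeholder values).
N, T, B, D = 12, 512, 8, 768

scores   = N * T * B * T * D   # Q @ K^T -> the TxT attention matrix
weighted = N * T * B * T * D   # attention weights @ V -> value-weighted sum
print(scores + weighted)
```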
qq: in the summary it was mentioned that in fine-tuning we can ignore optimizer states. But we still compute optimizer states at 10:07 and 12:08. Can you help explain why that is?
@srush_nlp Thank you! I understand that 12:08 only accounts for LoRA adapters. My question was about 23:16, when you say "for finetuning, we can remove the optimizer states entirely".
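To make my confusion concrete, here's my rough counting of the optimizer states in question (assuming Adam keeps two extra buffers per trainable parameter; the sizes are placeholders):

```python
# Parameter counts of Adam optimizer states for one DxD layer (placeholder sizes).
D, R = 1024, 8

full_ft_states = 2 * D * D        # full fine-tuning: Adam m and v for the whole DxD weight
lora_states    = 2 * (2 * D * R)  # LoRA: Adam m and v only for the two rank-R adapter matrices
print(full_ft_states, lora_states)
```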