This is a great video. It addresses many of the topics I've come across while working on my tensor compiler, a project your awesome Minitorch course inspired me to start.
Thanks! Upvoted. Some of us need the ELI5 version. For example, I didn't hear an explanation of D. I know it's the dimension, but saying a few words about it would help.
D is the number of "neurons" and corresponds to how powerful a standard neural network is. This website is a classic for getting intuition: playground.tensorflow.org/ . The bigger D is, the more curvy the line can be.
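If it helps, here's a toy NumPy sketch of what D means (the sizes are made up): it's just the width of the hidden layer, i.e. how many neurons get to bend the function.

```python
import numpy as np

# Toy one-hidden-layer network: D is the number of hidden "neurons".
# The bigger D is, the more bends the fitted function can have.
D = 16                      # hidden width (the "D" being asked about)
B = 8                       # a small batch of inputs
x = np.random.randn(B, 2)   # 2-d inputs, like the playground demo

W1 = np.random.randn(2, D)  # input -> D hidden units
W2 = np.random.randn(D, 1)  # D hidden units -> 1 output

h = np.maximum(0, x @ W1)   # ReLU activations of the D neurons, shape (B, D)
y = h @ W2                  # prediction, shape (B, 1)
```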
Hi, thanks a lot for this amazing video, really appreciate it! Just had a simple question: at 11:57, for the compute of LoRA, why is the term we add for the fine-tuned network `2xBxDxR`? Isn't that just for the forward pass? I thought we would add this value multiplied by two for the backward pass as well, i.e. a final expression of `BxDxD + 2xBxDxR + 2x2xBxDxR`. Would appreciate your help :)
Yes, that's correct: it's the compute for the forward pass. Backward is a bit more complex. What you wrote is correct for the diagram. In larger multilayer networks, though, it would be what you wrote plus at least an additional BxDxD. That's because you need to do (part of) the backward pass through the DxD part to propagate derivatives to earlier layers.
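To make the counting concrete, here's a rough sketch in the same units (one LoRA'd layer inside a deeper network; B, D, R are placeholders, and constants are ignored):

```python
# Back-of-the-envelope matmul counts for one LoRA'd linear layer,
# in the same BxDxR-style units as the video (not exact FLOPs).
B, D, R = 32, 1024, 8       # batch, hidden dim, LoRA rank (made-up numbers)

forward  = B*D*D + 2*B*D*R  # frozen DxD weight + LoRA down/up projections
backward = 2 * (2*B*D*R)    # gradients for the two LoRA matrices
backward += B*D*D           # pushing gradients through the frozen weight to earlier layers

print(forward, backward, forward + backward)
```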
In the 17th minute, when you are calculating the attention computation complexity, the value aggregation part is missing. There should be two NxTxBxTxD terms: one for the attention matrix and another for the value-weighted sum.
Yeah, I should put a pointer to a more in-depth formula for attention. Mainly wanted to give intuition and not dive into the details of how that is calculated in practice (e.g. things like checkpointing).
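Roughly, the counting being discussed, as a sketch (ignoring the Q/K/V/output projections, heads, and softmax; N, T, B, D are placeholders):

```python
# Rough attention matmul count in the thread's units:
# N layers, T sequence length, B batch, D hidden dim (placeholder values).
N, T, B, D = 12, 512, 8, 768

scores   = N * T * B * T * D   # Q @ K^T -> the TxT attention matrix
weighted = N * T * B * T * D   # attention weights @ V -> value-weighted sum
print(scores + weighted)
```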
qq: in the summary it was mentioned that in fine-tuning we can ignore optimizer states. But we still compute optimizer states at 10:07 and 12:08. Can you help explain why that is?
@srush_nlp Thank you! I understand that 12:08 only accounts for LoRA adapters. My question was about 23:16, when you say "for finetuning, we can remove the optimizer states entirely".
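To make my confusion concrete, here's my rough counting of the optimizer states in question (assuming Adam keeps two extra buffers per trainable parameter; the sizes are placeholders):

```python
# Parameter counts of Adam optimizer states for one DxD layer (placeholder sizes).
D, R = 1024, 8

full_ft_states = 2 * D * D        # full fine-tuning: Adam m and v for the whole DxD weight
lora_states    = 2 * (2 * D * R)  # LoRA: Adam m and v only for the two rank-R adapter matrices
print(full_ft_states, lora_states)
```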