
Street Fighting Transformers 

Sasha Rush 🤗
7K subscribers
6K views

Learn important back-of-the-envelope calculations for Transformers / LLM.
Useful primer for grad school, or maybe a job interview...

Published: 13 Oct 2024

Comments: 12
@samlaki4051 2 months ago
absolute gem of a review, thank you professor
@rahul38474 2 months ago
This is a great video. It addresses many of the topics I've come across while working on my tensor compiler, which I was inspired to work on by your awesome Minitorch course/project.
@DanielCardenas1 2 months ago
Thanks! Upvoted. Some of us need the ELI5 version. For example, I didn't hear an explanation of D. I know it is dimension, but saying a few words about it would help.
@srush_nlp 2 months ago
D is the number of "neurons" and corresponds to how powerful a standard neural network is. This website is a classic for getting intuition: playground.tensorflow.org/ . The bigger D is, the more curvy the line.
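
A minimal sketch of what D corresponds to in code (assuming a plain PyTorch feed-forward block; the names and sizes here are illustrative, not from the video):

```python
import torch.nn as nn

D = 512  # hidden width: the number of "neurons" per layer

# A standard feed-forward block: both parameter count and compute scale with D.
ffn = nn.Sequential(
    nn.Linear(D, D),  # D x D weight matrix, so ~D^2 parameters
    nn.ReLU(),
    nn.Linear(D, D),
)

n_params = sum(p.numel() for p in ffn.parameters())
print(n_params)  # 2 * D^2 weights plus 2 * D biases
```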
@erfanasgari21 2 months ago
Thanks for this great video. I was struggling to calculate exactly this!
@TheWayOfTheMango 2 months ago
Hi, thanks a lot for this amazing video, really appreciate it! Just had a simple question: at 11:57, for the compute of LoRA, why is the term we add for the fine-tuned network `2xBxDxR`? Isn't that just for the forward pass? I thought we would add this value multiplied by two for the backward pass as well, i.e. for a final expression of `BxDxD + 2xBxDxR + 2x2xBxDxR`. Would appreciate your help :)
@srush_nlp 2 months ago
Yes, that's correct: it's the compute for the forward pass. Backward is a bit more complex. What you wrote is correct for the diagram. In larger multilayer networks, though, it would be what you wrote plus at least an additional DxD, because you need to do (part of) the backward pass to get derivatives from the DxD part to earlier layers.
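
A small sketch of the back-of-the-envelope arithmetic in this exchange (the B, D, R values are made up, and the 2x backward rule of thumb is the questioner's assumption, not a statement from the video):

```python
# B = batch size, D = model width, R = LoRA rank (illustrative values only)
B, D, R = 32, 4096, 16

frozen_forward = B * D * D          # forward through the frozen D x D weight
lora_forward   = 2 * B * D * R      # LoRA down-projection (D -> R) plus up-projection (R -> D)
lora_backward  = 2 * 2 * B * D * R  # rough 2x rule of thumb for the adapters' backward pass

# Single-layer diagram, as written in the question above.
total = frozen_forward + lora_forward + lora_backward

# Per the reply, a deeper network adds at least another B * D * D to push
# gradients through the frozen D x D weight down to earlier layers.
print(f"{total:.3e} multiply-adds for this single layer")
```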
@wenhuchen4604 2 months ago
In the 17th minute, when you are calculating the attention computation complexity, the value aggregation part is missing. So there should be two NxTxBxTxD terms, one for the attention matrix and the other for the value-weighted sum.
@srush_nlp 2 months ago
Yeah, I should put a pointer to a more in-depth formula for attention. Mainly wanted to give intuition, and not dive into the details of how that is calculated in practice (i.e. things like checkpointing).
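
For readers who want the two terms written out, a rough sketch under the notation in this thread (I'm assuming N is the number of layers, T the sequence length, B the batch size, and D the width; values are illustrative):

```python
N, T, B, D = 12, 1024, 8, 768  # illustrative values only

scores   = N * T * B * T * D   # Q @ K^T: every pair of positions, per layer
weighted = N * T * B * T * D   # attention weights @ V: the value-weighted sum

print(f"~{scores + weighted:.2e} multiply-adds for attention alone")
```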
@silversnow111 25 days ago
qq: in the summary it was mentioned that in finetuning, we can ignore optimizer states. But we still compute optimizer states at 10:07 and 12:08. Can you help explain why that is?
@srush_nlp 21 days ago
You only need optimizer state for the parameters you change. So for example in LoRA you don't need it for the model weights, just the adapters.
@silversnow111 11 days ago
@srush_nlp Thank you! I understand that 12:08 only accounts for LoRA adapters. My question was about 23:16, when you say "for finetuning, we can remove the optimizer states entirely".
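
A quick sketch of the optimizer-state point above (assuming an Adam-style optimizer that keeps roughly two extra values, momentum and variance, per trainable parameter; sizes are illustrative):

```python
D, R, n_layers = 4096, 16, 32  # illustrative values only

full_params = n_layers * D * D      # full fine-tuning: every weight is trainable
lora_params = n_layers * 2 * D * R  # LoRA: only the two low-rank adapters per layer

# Frozen weights need no optimizer state, so LoRA's state is tiny by comparison.
print("optimizer state (full fine-tune):", 2 * full_params, "values")
print("optimizer state (LoRA adapters): ", 2 * lora_params, "values")
```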