
GaLore EXPLAINED: Memory-Efficient LLM Training by Gradient Low-Rank Projection 

AI Coffee Break with Letitia
49K subscribers · 9K views

Published: 28 Sep 2024

Comments: 45
@GeoffY2020 4 months ago
I always thought you had a Ph.D. until you mentioned your defense. Congrats for the 'high rank' decoration!
@AICoffeeBreak 4 months ago
Glad I could fulfill your prophecy 🥠. 😅
@MikeMm-n9n 4 months ago
Congratulations Frau Doktor.
@kaffii1238 4 months ago
Congratulations!
@MechanicumMinds 4 months ago
I never thought I'd say this, but I'm actually excited to learn about efficient training methods for deep learning models on consumer GPUs. Who knew running out of GPU memory could be so... enlightening? Thanks for explaining LoRA and GaLore in a way that doesn't make my brain hurt (too much). Now, if you'll excuse me, I have some large language models to train, or at least try not to run out of GPU memory.
@AICoffeeBreak 4 months ago
Cheers!
@HomunMage 3 months ago
I love this paper. The solution is elegant.
@anilrajshinde7062 4 months ago
Congratulations
@vardhan254 3 months ago
Congrats, Letitia!!
@AICoffeeBreak 3 months ago
Thank you!
@JosephCatrambone 4 months ago
Congrats on the PhD! :D
@AICoffeeBreak 4 months ago
Thank you!
@Ben_D. 4 months ago
It is now official. You have a big brain and have papers to prove it.
@cosmic_reef_17 4 months ago
Great video!
@AICoffeeBreak 4 months ago
Thank you!
@azertyQ 4 months ago
Congrats on your PhD! Could you use 'r' as a hyperparameter during pretraining as well? e.g. start pretraining with low r and gradually increase it as more precision is needed? I don't think it could do that much since gains are already very high at the start.
@AICoffeeBreak 4 months ago
Sure. Great idea.
@bhavulgauri7832 4 months ago
May I ask, how do you edit these videos? I'm inspired to do some YouTube videos myself and would love to follow some of the steps you've taken.
@AICoffeeBreak 4 months ago
My editor uses Adobe Premiere Pro for editing (that is where Ms. Coffee Bean is cut in). For the visuals, I use PowerPoint. The "morph" transition is really useful.
@amitgalor 4 months ago
I always liked Galor(e), though I might be biased.
@vladimirtchuiev2218 3 months ago
One of the selling points of LoRA is being able to mix and match the A and B matrices from different fine-tuning runs without having to keep the weights of the full model, if they are available elsewhere. Here it seems you have to save the entire model, so this is a big tradeoff compared to LoRA and derivatives.
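For readers unfamiliar with that workflow, here is a minimal sketch of what "mix and match" means in practice (plain PyTorch; the adapter names, rank, and random initialization are purely illustrative, since real adapters are trained):

import torch

d, r = 1024, 8
W_base = torch.randn(d, d)                    # shared pretrained weight, stored once

def make_adapter(d, r):
    # one fine-tuning run produces only this small (B, A) pair;
    # the base weights are never modified
    return torch.zeros(d, r), torch.randn(r, d) * 0.01

adapter_chat = make_adapter(d, r)             # hypothetical fine-tune #1
adapter_code = make_adapter(d, r)             # hypothetical fine-tune #2

def forward(x, adapter=None):
    y = x @ W_base.T                          # base model is untouched
    if adapter is not None:
        B, A = adapter
        y = y + x @ (B @ A).T                 # adapters can be swapped freely at load time
    return y

x = torch.randn(2, d)
print(forward(x, adapter_chat).shape)         # torch.Size([2, 1024])

With GaLore, by contrast, the artifact of a fine-tuning run is a new full-size W, so each run costs roughly a full model copy on disk rather than a small (B, A) pair.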
@vladimirnadvornik8254 4 months ago
Do I understand it correctly that it can't work with quantization, and the model must fit in memory in 16-bit?
@timothywcrane 4 months ago
Can info shed during the lossy compression process be set aside in a non-memory fashion for retrieval? Thinking out loud. Not always helpful. But state mapping would be interesting during the training process as well as post-training.
@IxCIHAoX 4 months ago
As far as I get it, I determine the gradients G and also a low-rank component P. That component allows me to "shrink" the gradient matrix G that I calculate at every step down to R before applying it to W. So I do not save compute while calculating the gradients, but while applying and saving them (as momentum or such)?
@elinetshaaf75 4 months ago
underwhelmed?
@IxCIHAoX 3 months ago
@elinetshaaf75 On the contrary, I am afraid that I don't quite get it 😅
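That reading is roughly right: the full gradient is still computed every step, and the savings come from the optimizer state, which lives in the projected r-dimensional space. A toy sketch of the mechanics (a quadratic loss, plain momentum instead of Adam, and invented shapes and refresh interval, so not the authors' implementation):

import torch

torch.manual_seed(0)
m, n, r, lr, beta = 256, 512, 8, 0.1, 0.9
W_star = torch.randn(m, n)            # toy target: "loss" is 0.5 * ||W - W_star||^2
W = torch.zeros(m, n)                 # the full weight matrix stays in memory at full size
M = torch.zeros(r, n)                 # momentum is kept only in the small r x n space

def fit_projector(G, r):
    # top-r left singular vectors of the current gradient span the update subspace
    U, _, _ = torch.linalg.svd(G, full_matrices=False)
    return U[:, :r]                   # m x r

for step in range(500):
    G = W - W_star                    # full m x n gradient, computed as in normal training
    if step % 100 == 0:
        P = fit_projector(G, r)       # projector is refreshed only occasionally
    R = P.T @ G                       # project the gradient down to r x n
    M = beta * M + (1 - beta) * R     # optimizer state never sees a full-size matrix
    W = W - lr * (P @ M)              # project the update back up and apply it to full W

print(torch.linalg.norm(W - W_star).item())   # shrinks even though each update is rank r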
@SU3D3 4 months ago
😘 LOL! 礼 Stay for the EPIC lipstick! M4D L0V3!
@DerPylz 4 months ago
@shahidjabbar5933 4 months ago
Congratulations
@AICoffeeBreak 4 months ago
Thank you!
@eruiluvatar236 4 months ago
I imagine that they would do the SVD on the CPU, as it would take way too much RAM to do it on the GPU, or maybe they do it layer by layer and discard the intermediate buffers. Anyway, it seems like a great idea. ReLoRA achieved something similar, but it required a small amount of pretraining, and I expect slower convergence for ReLoRA because each time a new LoRA is initialized the weights need to be moved around a lot, while with SVD they would be roughly where they need to be, minimizing the error in the main matrix.
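Whether it runs on CPU or GPU, a layer-by-layer refresh along these lines would indeed keep only one full-size SVD workspace alive at a time (a hypothetical helper, not taken from the official GaLore repo):

import torch

def refresh_projectors(model, rank=128):
    # Recompute a GaLore-style projector for every 2-D weight, one layer at a time,
    # so each layer's SVD buffers can be freed before the next layer is touched.
    projectors = {}
    for name, p in model.named_parameters():
        if p.grad is None or p.grad.dim() != 2:
            continue                                  # projection only applies to matrices
        U, _, _ = torch.linalg.svd(p.grad, full_matrices=False)
        projectors[name] = U[:, :rank].clone()        # keep only the small m x r slice
        del U                                         # the full SVD output can be dropped here
    return projectors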
@AaronALAI 4 months ago
Holy frick, what a perfectly concise video on GaLore! There is a GitHub implementation of the research from the paper; they are currently working on a multi-GPU implementation. I too am curious how well things scale up to modern and larger LLMs, and I have a multi-GPU rig I want to test it out on.
@ArnaldurBjarnason 3 months ago
The size of a LoRA (on disk) is a fraction of the size of the model it's applied to. GaLore loses this benefit by updating the weights. If you have use for many fine-tunes of a single model, two GaLore fine-tunes will take more space than ten LoRAs (depending on rank, to be fair). I assume they don't mention this very significant tradeoff, as you don't mention it in the video. That seems like a dishonest comparison, if that's the case.
@AICoffeeBreak 3 months ago
Fair. It's just that HDD/SSD storage is not considered a bottleneck, while GPU memory surely is. The first is cheap and abundant (terabytes), the second is very expensive and limited (tens of gigabytes).
@vinno97 3 months ago
@AICoffeeBreak Minor nitpick: it also allows you to fit multiple LoRAs in GPU memory at the same time and thus run inference on different fine-tunes at the same time. P.S. Congrats on the PhD!
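For scale (rough numbers, assuming 16-bit weights and rank-16 adapters on the attention projections only): a 7B-parameter model is about 7e9 × 2 bytes ≈ 14 GB per fine-tuned copy on disk, while a LoRA adapter for the same model is typically on the order of tens of megabytes. So the disk-space gap described above is real, it is just orders of magnitude cheaper than the GPU-memory gap GaLore targets.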
@TheRyulord 4 months ago
One important thing to note is that while this technique is more memory-efficient, it's also 17% slower in the setup the authors use. That's a pretty big deal, especially for pretraining.
@tornyu 4 months ago
But since it's more memory efficient, could you run more training in parallel, in a distributed or federated way?
@TheRyulord 4 months ago
@tornyu Sure, but it's still 17% more compute, and that means 17% more money. With LLMs costing millions to train from scratch, 17% more money is a big ask for pretraining.
@tornyu 4 months ago
But federated training could enable larger open source models, which until now could only be fine-tuned?
@proterotype 3 months ago
I ran into trouble trying to fine-tune a Swin model with LoRA. That type of model isn't supported yet for LoRA. I wonder if it'll be the same for GaLore.
@koiRitwikHai 2 months ago
The authors say LoRA is about low-rank weight updates, which is a bad idea since weight updates are not always low rank, and that low-rank gradients are a better alternative. My question: the only difference between weight update matrices and gradient matrices is the multiplication by the learning rate, i.e. weight update matrix = learning rate * gradients, isn't it? So how come weight update matrices are not always low rank, but gradient matrices are? PS: congratulations on your defense :)
@AICoffeeBreak 2 months ago
Thank you! The trick is not that gradient matrices ARE low rank, but that the training process *converges* with low-rank gradient matrices too, and this is what the authors also prove theoretically. Think of it this way: you want to move in Manhattan from A to B, but you can only do low-rank updates, meaning that you can only move up-down or left-right (in a subspace of the low-rank gradient update), but not diagonally. Eventually, you can still get to any B by moving once left and once up, instead of once diagonally.
@AICoffeeBreak 2 months ago
This is a great question and if you have a follow-up, let me know. It is not easy to explain in a comment without any drawing. :)
@koiRitwikHai 2 months ago
@AICoffeeBreak Thank you so much for replying :) Big fan of your channel. You said "the training process converges with low-rank gradient matrices"; does that also mean that the training process converges with low-rank weight updates?
@AICoffeeBreak 2 months ago
Great to see you're following up! No, it does not mean that the training convergence is guaranteed by low-rank *weight* updates. Convergence refers to finding the set of weights that minimize the loss. If the correct updates are restricted due to low-rank weights, the optimal weights for convergence might never be found. However, similar to how movement in different directions can collectively bring you to your desired location, low-rank *gradient* updates can still lead to convergence.
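A tiny numerical illustration of that point (random subspaces here stand in for the periodically refreshed projector; all values are arbitrary):

import torch

torch.manual_seed(0)
m, n, r = 64, 64, 4
delta_W = torch.zeros(m, n)
for _ in range(32):
    P = torch.linalg.qr(torch.randn(m, r)).Q      # a different rank-r subspace each "cycle"
    R = torch.randn(r, n)                         # some rank-r update inside that subspace
    delta_W += P @ R                              # every individual step is at most rank r

print(torch.linalg.matrix_rank(delta_W))          # 64: the accumulated weight change is full rank

Each individual update is rank r, but nothing forces their sum, the total change to W, to stay low rank. That is exactly why restricting the weights (LoRA) is a stronger constraint than restricting the per-step gradients (GaLore).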