If we knew what abstractions were handled layer by layer, we could make sure that the individual layers were trained to completely learn those abstractions. Let's hope Max Tegmark's work on introspection gets us there.
Thanks for the simple and educational video! If I'm not mistaken, prefix tuning is pretty much the same as the learned embedding vectors in diffusion models! How cool is that? 😀
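(In case the analogy isn't obvious to others: in both cases you freeze the big model and only learn a handful of continuous embedding vectors. A toy sketch of that idea — my own names and simplifications, and strictly speaking this is closer to prompt tuning, which only touches the input:)

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Toy sketch: prepend a few trainable 'virtual token' embeddings to the input
    and freeze the backbone, so only the prefix embeddings are learned."""
    def __init__(self, backbone, num_prefix_tokens=10, d_model=768):
        super().__init__()
        self.backbone = backbone                 # pretrained model, assumed to take embeddings
        for p in self.backbone.parameters():
            p.requires_grad = False              # freeze everything except the prefix
        self.prefix = nn.Parameter(torch.randn(num_prefix_tokens, d_model) * 0.02)

    def forward(self, token_embeddings):         # (batch, seq_len, d_model)
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prefix, token_embeddings], dim=1))
```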
As always, amazing content! 😌 It's perfect to refresh knowledge & learn something new. What I find interesting about LoRA is how strongly it actually regularizes fine-tuning: is it possible to overfit when using a very small matrix in LoRA? Can LoRA also harm optimization?
Still possible to overfit, but it's more resistant to overfitting than a full finetune. All the work I've seen on LoRAs says it's just as good as a full finetune in terms of task performance, as long as your rank is high enough for the task. What's interesting is that the necessary rank is usually quite low (around 2) even for relatively big models (Llama 7B) and reasonably complex tasks. At least that's the case for language modelling; it might be different for other domains.
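To put the rank-2 point in numbers, here's a rough back-of-envelope (my own toy assumption of a single 4096×4096 projection, like you'd find in a 7B-class model):

```python
d = k = 4096          # one attention projection in a 7B-class model (toy assumption)
r = 2                 # LoRA rank

full_params = d * k           # parameters you'd update in a full finetune
lora_params = r * (d + k)     # B is (d x r), A is (r x k)
print(full_params, lora_params, lora_params / full_params)
# 16777216 16384 0.0009765625  -> roughly 0.1% of the parameters
```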
Why use weight matrices to start with if you can use the LoRA representation? Assuming you gain space, the only downside I can think of is the additional compute to get back the weight matrix. But that should be smaller than the gain from speeding up backpropagation.
Thanks for this question. You do not actually start with the weight matrices; you learn A and B directly, and from them you reconstruct the delta W matrix. Sorry this was not clear enough in the video.
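If it helps, here's a minimal PyTorch-style sketch of what that looks like (my own toy code, not from the video): the pretrained W stays frozen, only A and B get gradients, and delta W = BA never needs to be materialized during training.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen pretrained weight plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # In practice this would be loaded from the pretrained checkpoint, then frozen.
        self.weight = nn.Parameter(torch.zeros(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_features, r))        # trainable, zero init so delta W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T  (equivalent to using W + scale * B @ A)
        return x @ self.weight.T + self.scale * ((x @ self.A.T) @ self.B.T)
```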
In LoRA, W_updated = W_0 + BA, where B and A are the low-rank decomposed matrices. I wanted to ask what the parameters of B and A represent: are they both parameters of the pre-trained model, are they both parameters learned on the target dataset, or does one (B) represent pre-trained model parameters while the other (A) represents target-dataset parameters? Please answer as soon as possible.
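For context, this is how I currently picture the shapes (toy numbers, just my own understanding, happy to be corrected):

```python
import torch

d, k, r = 1024, 1024, 8
W0 = torch.randn(d, k)         # pretrained weight, kept frozen
B  = torch.zeros(d, r)         # new matrix, trained during finetuning
A  = torch.randn(r, k) * 0.01  # new matrix, trained during finetuning
W_updated = W0 + B @ A         # delta W = B @ A has rank at most r
```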
Great explanation, thanks for the video! I have a lingering question about LoRA: is it necessary to approximate the low-rank matrices of the difference weights (the delta W in the video), or can we reduce the size of the original weight matrices? If I understood the video correctly, at the end of LoRA training I have the full parameters of the original model + the difference weights (in reduced size). My question is why can't I learn low-rank matrices for the original weights as well?
Hi, in principle you can, even though I would expect you could lose some model performance. The idea of finetuning with LoRA is that the small finetuning updates should have low rank. BUT there is work using LoRA for pretraining, called ReLoRA. Here is the paper 👉 arxiv.org/pdf/2307.05695.pdf There is also this discussion going on on Reddit: 👉 www.reddit.com/r/MachineLearning/comments/13upogz/d_lora_weight_merge_every_n_step_for_pretraining/
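Here's a rough sketch of the merge-every-N-steps idea from that Reddit thread (my own simplification, assuming a module with a frozen .weight, trainable .A / .B, and a .scale factor like the toy LoRALinear above; the actual ReLoRA recipe also involves optimizer-state resets and learning-rate restarts):

```python
import torch

@torch.no_grad()
def merge_and_reset(lora_layer):
    """Fold the current low-rank update into the frozen weight, then restart
    A and B so a fresh low-rank update can be learned on top of the merged weight."""
    lora_layer.weight += lora_layer.scale * (lora_layer.B @ lora_layer.A)
    lora_layer.A.normal_(std=0.01)   # re-initialize A
    lora_layer.B.zero_()             # back to zero so the new delta W starts at 0
```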
Absolutely awesome explanation. Would like to get your take on LoRA vs (IA)**3 as well. It seems that people still prefer LoRA over (IA)**3 even though the latter has slightly higher performance?
Aren't we effectively using the same kind of trick when we train the transformer encoder / self-attention block? Assuming row vectors, we can use the form W_v⋅v.T⋅k⋅W_k.T⋅W_q⋅q.T. Ignoring the *application* of attention and focusing on its calculation, we get the form k⋅W_k.T⋅W_q⋅q.T. Since W_k and W_q are projection matrices from the embedding length down to dimension D_k, we have the same sort of low-rank decomposition, where D_k corresponds to "r" in your video. Is that right?
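(Quick numerical check of what I mean — the product can never have rank above D_k:)

```python
import torch

d_model, d_k = 512, 64
W_k = torch.randn(d_k, d_model)      # projects embeddings down to D_k
W_q = torch.randn(d_k, d_model)

M = W_k.T @ W_q                      # (d_model, d_model) bilinear form between k and q
print(torch.linalg.matrix_rank(M))   # at most d_k (prints 64) -- D_k plays the role of r
```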
Thank you sooooo much for this video. I started reading the paper, was very terrified by it, then I thought I should watch some YouTube video. I watched one video and fell asleep halfway through. Woke up again and stumbled across your video, your coffee woke me up, and now I got LoRA. Thanks for your efforts.