
iMAML: Meta-Learning with Implicit Gradients (Paper Explained) 

Yannic Kilcher
259K subscribers
23K views

Gradient-based Meta-Learning requires full backpropagation through the inner optimization procedure, which is a computational nightmare. This paper is able to circumvent this and implicitly compute meta-gradients by the clever introduction of a quadratic regularizer.
OUTLINE:
0:00 - Intro
0:15 - What is Meta-Learning?
9:05 - MAML vs iMAML
16:35 - Problem Formulation
19:15 - Proximal Regularization
26:10 - Derivation of the Implicit Gradient
40:55 - Intuition why this works
43:20 - Full Algorithm
47:40 - Experiments
Paper: arxiv.org/abs/1909.04630
Blog Post: www.inference.vc/notes-on-ima...
Abstract:
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
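
For reference, a compact sketch of the bi-level problem and the implicit gradient discussed in the video, in the paper's notation (my summary of the paper's equations, with L-hat_i the training loss and L_i the test/validation loss of task i):

```latex
% inner problem: task adaptation with the proximal (quadratic) regularizer
\[
  \phi_i^{\star}(\theta) = \arg\min_{\phi}\;
    \hat{\mathcal{L}}_i(\phi) + \tfrac{\lambda}{2}\lVert \phi - \theta \rVert^2
\]
% outer (meta) problem over the shared initialization \theta
\[
  \min_{\theta}\; \frac{1}{M}\sum_{i=1}^{M}
    \mathcal{L}_i\bigl(\phi_i^{\star}(\theta)\bigr)
\]
% implicitly differentiating the inner stationarity condition
% \nabla\hat{\mathcal{L}}_i(\phi) + \lambda(\phi - \theta) = 0 yields
\[
  \frac{d\phi_i^{\star}}{d\theta}
    = \Bigl(I + \tfrac{1}{\lambda}\nabla^{2}\hat{\mathcal{L}}_i(\phi_i^{\star})\Bigr)^{-1},
  \qquad
  \nabla_{\theta}\,\mathcal{L}_i\bigl(\phi_i^{\star}(\theta)\bigr)
    = \Bigl(I + \tfrac{1}{\lambda}\nabla^{2}\hat{\mathcal{L}}_i(\phi_i^{\star})\Bigr)^{-1}
      \nabla_{\phi}\mathcal{L}_i(\phi_i^{\star})
\]
```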
Authors: Aravind Rajeswaran, Chelsea Finn, Sham Kakade, Sergey Levine
Links:
RU-vid: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Published: 5 Aug 2024

Comments: 51
@waxwingvain 4 years ago
You have no idea how relevant this is for me now, I'm currently working on an NLP problem using MAML, thanks!
@amitkumarsingh406 2 years ago
interesting. what is it about?
@nDrizza 4 years ago
Awesome explanation! I really like that you took enough time to explain the idea clearly instead of trying to shrink the explanation down to something of 30 mins, which might not have been understandable.
@aasimbaig01 4 years ago
I learn new things every day from your videos!!
@JackSPk 4 years ago
Really enjoyed this one. Pretty good companion and intuitions for reading the paper (especially the "shacka da bomb" part).
@arkasaha4412 4 years ago
This is one of your best videos! :)
@SCIISano 3 years ago
Ty for explaining the implicit Jacobian. This was exactly what I was looking for.
@leondawn3593 3 years ago
very clearly explained! great thanks!
@anthonyrepetto3474 4 years ago
thank you for the details!
@zikunchen6303 4 years ago
Daily uploads are amazing, I watch your videos instead of random memes now
@AshishMittal61 4 years ago
Great Video! Really helped with the intuition.
@herp_derpingson 4 years ago
34:50 Mind blown. Great paper. Keep it coming!
39:40 What happens if the matrix is not invertible? Do we just discard that and try again?
41:50 This is kinda like the N-bodies problem but with SGD instead of gravity.
@YannicKilcher 4 years ago
I don't think that matrix is ever non-invertible in practice, because of the identity add. But if so, just take a pseudo inverse or something.
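(Editorial note, not from the video: the identity add is exactly what makes this safe. If the Hessian of the inner loss has eigenvalues sigma_j, then)

```latex
\[
  \operatorname{eig}\Bigl(I + \tfrac{1}{\lambda}\nabla^{2}\hat{\mathcal{L}}(\phi)\Bigr)
    = \Bigl\{\, 1 + \tfrac{\sigma_j}{\lambda} \,\Bigr\}_{j},
\]
% singular only if some \sigma_j = -\lambda; positive definite whenever
% \lambda exceeds the magnitude of the most negative curvature.
```

so with lambda chosen large enough the matrix is positive definite, and the paper solves the corresponding linear system approximately with conjugate gradient rather than forming any inverse.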
@tianyuez 4 years ago
Great video!
@marouanemaachou7875 4 years ago
Keep up the good work!!
@ekjotnanda6832 3 years ago
Really good explanation 👍🏻
@JTMoustache 3 years ago
I've missed this one before, this just highlights how useful it is to really master (convex) optimization when you want to be original in ML. Too bad I did not go to nerd school.
@ernestkirstein6233 4 years ago
The last step that he wasn't explicit about at 39:13: dphi/dtheta + (1/lambda) * Hessian * dphi/dtheta = I, so (I + (1/lambda) * Hessian) * dphi/dtheta = I, and therefore (I + (1/lambda) * Hessian) is the inverse of dphi/dtheta.
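To make that concrete, here is a minimal sketch (my own illustration, not the authors' code) of how the meta-gradient g = (I + (1/lambda) H)^{-1} v is typically computed without ever forming or inverting the matrix: solve the linear system with conjugate gradient, using only Hessian-vector products. The `hvp` callable is a placeholder for whatever double-backprop routine your framework provides, and `outer_grad` stands for the gradient of the test loss at the adapted parameters.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=20, tol=1e-10):
    """Approximately solve A x = b, given only the matrix-vector product A @ v."""
    x = np.zeros_like(b)
    r = b - matvec(x)              # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def implicit_meta_gradient(outer_grad, hvp, lam, cg_iters=20):
    """Solve (I + (1/lam) * H) g = outer_grad for the iMAML-style meta-gradient g.
    `hvp(v)` is assumed to return H @ v, where H is the Hessian of the inner
    training loss at the adapted parameters phi (a placeholder here)."""
    matvec = lambda v: v + hvp(v) / lam
    return conjugate_gradient(matvec, outer_grad.astype(float), iters=cg_iters)
```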
@ernestkirstein6233 4 years ago
Another great video Yannic!
@YIsTheEarthRound 4 years ago
I'm new to MAML so maybe this is a naive question but I'm not sure I understand the motivation for MAML (versus standard multi-task learning). Why is it a good idea? More specifically, it seems that MAML is doing a multi-scale optimisation (one at the level of training data with \phi and one at the level of validation data with \theta), but why does this help with generalisation? Is there any intuition/theoretical work?
@YannicKilcher 4 years ago
The generalization would be across tasks. I.e. if a new (but similar) task comes along, you have good initial starting weights for fine-tuning that task.
@YIsTheEarthRound 4 years ago
@@YannicKilcher But why does it do better than 'standard' multi-task ML in which you keep the task-agnostic part of the network (from training these other tasks) and retrain the task-specific part for the new task? It seems like there's 2 parts to why MAML does so well -- (1) having learned representations from previous tasks (which the standard multi-task setting also leverages), and (2) using a validation set to learn this task-agnostic part. I was just wondering what role the second played and whether there was some intuition for why it makes sense.
@user-xy7tg7xc1d 4 years ago
Sanket Shah You can check out the new meta learning course by Chelsea Finn ru-vid.com/group/PLoROMvodv4rMC6zfYmnD7UG3LVvwaITY5
@S0ULTrinker 2 years ago
How do you backpropagate the gradient through previous gradient steps, when you need multiple forward passes to get Theta for each of the K steps? 13:11
@alexanderchebykin6448 4 years ago
You've mentioned that first-order MAML doesn't work well - AFAIK that's not true: in the original MAML paper they achieve the same (or better) results with it in comparison to normal MAML (see Table 1, bottom). This also holds for all the independent reproductions on GitHub (or at least the ones I looked at)
@shijizhou5334 4 years ago
Thanks for correcting that, I was also confused about this question.
@jonathanballoch 3 years ago
If anything, the plots show that FOMAML *does* work well, but much slower
@brojo9152 3 years ago
Which software did you use to write things along with the paper?
@tusharprakash6235 2 months ago
In the inner loop, for more than one step, the gradients should be computed w.r.t. the initial parameters, right?
@nbrpwng 4 years ago
Nice video, it reminds me of the e-maml paper I think you reviewed some time ago. Have you by chance considered making something like a channel discord server? Maybe it would be a nice thing for viewers to discuss papers or other topics in ML, although these comments sections are good too from what I’ve seen.
@YannicKilcher 4 years ago
Yes my worry is that there's not enough people to sustain that sort of thing.
@nbrpwng 4 years ago
Yannic Kilcher I’m not entirely sure about how many others would join, but I think maybe enough to keep it fairly active, at least enough to be a nice place to talk about papers or whatever sometimes. I’m in a few servers with just a few dozen active members and that seems to be enough for good daily interaction.
@andreasv9472 4 years ago
Hi, interesting video! What is this parameter theta? Is it the weights of the neural nets? Or how many neurons there are? Or is it something like the learning rate, step size, or something like that?
@YannicKilcher 4 years ago
yes, theta are the weights of the neural nets in this case
@arindamsikdar5961 3 years ago
At 36:15 in your video, differentiate the whole equation (both sides) w.r.t. \phi and not \theta to get equation 6 in the paper
@hiyamghannam1939 3 years ago
Hello, thank you so much!! Have you explained the original MAML paper?
@YannicKilcher 3 years ago
Not yet, unfortunately
@freemind.d2714 3 years ago
Regularization is like turning Maximum Likelihood Estimation (MLE) into Maximum A Posteriori (MAP) estimation
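Spelling that analogy out for the proximal term used here (an editorial sketch of the standard MLE/MAP correspondence, not something stated in the video): an isotropic Gaussian prior on phi, centered at the meta-parameters theta with precision lambda, turns maximum likelihood into exactly the regularized inner problem.

```latex
\[
  \phi^{\star}_{\mathrm{MAP}}
    = \arg\max_{\phi}\; \log p(\mathcal{D}\mid\phi)
      + \log \mathcal{N}\!\bigl(\phi \,\bigm|\, \theta,\ \tfrac{1}{\lambda} I\bigr)
    = \arg\min_{\phi}\; \hat{\mathcal{L}}(\phi)
      + \tfrac{\lambda}{2}\,\lVert \phi - \theta \rVert^{2},
\]
% where \hat{\mathcal{L}}(\phi) = -\log p(\mathcal{D}\mid\phi) up to additive constants.
```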
@go00o87 4 years ago
hm... isn't grad_phi(phi)=dim(phi)? provided phi is a multidimensional vector, it shouldn't be 1. Granted it doesn't matter as it just rescales Lambda and that parameter is arbitrary anyways.
@herp_derpingson 4 years ago
I think you are confusing grad with the Hessian. The grad operation on a tensor doesn't change its dimensions. For example, if we take phi = [f(x) = x], then grad_x [x] is equal to [1], i.e. the identity matrix
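For completeness (editorial note): the Jacobian of the identity map is the d x d identity matrix, and its trace is dim(phi), which is probably where the dim(phi) intuition in the question comes from.

```latex
\[
  \nabla_{\phi}\,\phi = \frac{\partial \phi}{\partial \phi}
    = I_{d} \in \mathbb{R}^{d \times d},
  \qquad
  \operatorname{tr}(I_{d}) = d = \dim(\phi).
\]
```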
@benwaful 3 years ago
Isn't this just Reptile, but instead of using the minimum of ||phi' - theta|| as the update, you use it as a regularizer?
@YannicKilcher 3 years ago
Sounds plausible, but I have never heard of Reptile
@benwaful 3 years ago
@@YannicKilcher arxiv.org/abs/1803.02999
@spaceisawesome1 4 years ago
You're working so hard. Please get some sleep or rest! haha
@wizardOfRobots 3 years ago
please also upload 1080p
@UncoveredTruths 4 years ago
From experience, transfer learning doesn't work nearly as well as people make it out to for medical imagery
@marat61 4 years ago
38:36 it seems that there are mistakes in the expression you derived
@YannicKilcher 4 years ago
share them with us :)