
PonderNet: Learning to Ponder (Machine Learning Research Paper Explained) 

Yannic Kilcher
262K subscribers
22K views

Published: Sep 18, 2024

Comments: 57
@YannicKilcher · 3 years ago
OUTLINE:
0:00 - Intro & Overview
2:30 - Problem Statement
8:00 - Probabilistic formulation of dynamic halting
14:40 - Training via unrolling
22:30 - Loss function and regularization of the halting distribution
27:35 - Experimental Results
37:10 - Sensitivity to hyperparameter choice
41:15 - Discussion, Conclusion, Broader Impact
@4knahs · 3 years ago
Yasss! paper explained is back! :D
@freemind.d2714 · 3 years ago
About time...
@IoannisNousias · 3 years ago
Thank you sir. An international treasure.
@redemptivedialectic6787 · 3 years ago
Thanks for the video explaining this article. I'm an auditory learner so it helps me understand things better.
@mgostIH · 3 years ago
Thanks for reviewing this! I love papers that push for different approaches, I think another interesting field coming up is making more things differentiable like rendering (I am sure you saw that recent painting transformer paper) or optimization. A benchmark I wish they did for PonderNet was learning how to sum and do other operations on integers, since it seems to be something quite hard even for the largest transformers.
@WatchAndGame · 3 years ago
Could you tell me what this "painting paper" is called? I am interested :)
@mgostIH · 3 years ago
@@WatchAndGame Paint Transformer: Feed Forward Neural Painting with Stroke Prediction. What they do is very similar to DETR (a paper Yannic reviewed). The architecture is quite simple, but the core thing they need is a neural renderer: something that takes as input the strokes to draw and actually displays them on an image, all while being differentiable, in order to backpropagate to the rest of the architecture. This lets them avoid Reinforcement Learning, which is usually much less stable.
@WatchAndGame · 3 years ago
@@mgostIH Cool thanks!
@Supreme_Lobster · 3 years ago
@@mgostIH Is RL not differentiable? I'm quite new to ML and NNs and I'm not entirely sure what "differentiable" means, other than "you can backpropagate".
@mgostIH · 3 years ago
@@Supreme_Lobster The main issue of RL is that, while you can make *part of it* differentiable (Deep Q-Learning, Policy Gradient), you usually don't have a differentiable model of the game state, and no information about what causes a good reward (so you can't backpropagate a loss like "Hey, I want the end-game screen to look like this"). Think of Chess, for example: you get a reward only at the end of the game (win/lose), but you don't have information about which specific action was good and which was bad. This is called the "Credit Assignment Problem", and a lot of algorithms try tackling it, but it's still largely unsolved. This isn't to say that RL is impossible, but it's one of the areas where ML still struggles a lot: all the methods we use are still very specific, are unstable (some runs may converge to a good game-playing agent, some don't, out of pure chance), and require **TONS** of compute power for anything non-trivial. Meanwhile, if you check the Paint Transformer paper, their differentiable renderer allowed them to just optimize everything based on the desired image loss; compared to other approaches that solve the same problem, they trained it much faster and are able to run it faster too (check their benchmarks).
@lemurpotatoes7988 · 3 years ago
I believe that the recurrent structure is the reason they're able to maintain stability despite attempting to solve two problems at once. My feeling is that the reason it's typically bad to solve two problems at once is that you will be inconsistent about credit assignment in ways that are determined by incidental noise. The incidental noise washes out in a within-sample sense (as opposed to an across-sample one, which wouldn't be sufficient) due to the recurrent structure of the model. Learning how to do credit assignment correctly, in the sense needed for the particular sample under consideration, is encouraged by the architecture. Across-sample washing out of incidental noise doesn't work because each sample has a different credit assignment problem associated with it. But for a given sample, at different time steps of the network's operation, the underlying credit assignment problem to be solved remains the same.
@sergiomanuel2206 · 3 years ago
Hello Yannic, you confused p with λ in the loss function: p_n = λ_n · ∏_{i<n} (1 − λ_i). This is why the trivial solution is not making all lambdas equal to zero.
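A minimal sketch (my own illustration, not from the video or paper) of that halting distribution in code; names are illustrative. Note that if every λ were driven to zero, every p_n would be zero too, so the expected loss Σ_n p_n · L_n would collapse:

```python
def halting_distribution(lambdas):
    """PonderNet halting distribution: p_n = lambda_n * prod_{i<n}(1 - lambda_i)."""
    probs = []
    not_halted_yet = 1.0  # probability mass of not having halted before step n
    for lam in lambdas:
        probs.append(lam * not_halted_yet)
        not_halted_yet *= 1.0 - lam
    return probs

# With a final lambda of 1.0 the distribution sums to 1
ps = halting_distribution([0.2, 0.5, 1.0])
```

With all lambdas set to 0, `halting_distribution` returns all zeros, which is why the expected-loss term alone would have a degenerate optimum.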
@kshitizmalhotra1394 · 3 years ago
He acknowledged that later
@nocomments_s · 3 years ago
Amazing! So happy to see paper explained series back!
@dr.mikeybee · 3 years ago
As always, you've made another fascinating video. Thank you. What I wonder is what kinds of models can be trained and used for inference using this architecture on small GPUs? Does this open up possibilities given resource constraints? Can I get GPT3-like performance on a K80 using PonderNet because my network isn't so deep? Or is this just a way to speed up inference? I suppose that with each pass through the model, the combinations of parameters multiply to a Cartesian product, but it's not intuitive to me how this works with a backward pass. After all, this doesn't seem to give new functionality over a feed forward model other than the ability to halt early. In other words, only the same kinds of things can be learned, but perhaps they can be learned more quickly.
@bdennyw1 · 3 years ago
Welcome back Yannic! I've missed your videos.
@Idiomatick · 3 years ago
Nice! I normally take notes while watching these and often leave side notes to myself about stuff I didn't understand, to look into further in the paper. But this time I paused and wrote a note that I was confused about the loss function, because I don't get how they handle the risk of λ going to 0 and the two-variable problem being unstable… unpause, and you say basically the exact same concerns. I feel like I actually must have understood an ML paper at first glance for once! It was very gratifying, haha. I think the regularization term does a lot of work in forcing the loss to push towards a sane output, though. But that creates an assumption about computation that might not hold in the real world. I mean, if I'm given a math problem, I don't gradually improve my understanding past some threshold: some math problems are instant, some I can't solve. At least at first glance, as I type this, I don't think this algorithm will be as useful on types of problems that need highly variable amounts of computation, but I'd probably have to implement it to be certain.
@drdca8263 · 3 years ago
21:20 My impression is that the (?)regularization(?), or, err, the term they add to make it prefer to halt earlier if it can while still having good results, should somewhat counteract that? But maybe it wouldn't be enough, I wouldn't know. Edit: nvm, you were about to get to that part. Oh good, I remembered the word "regularization" correctly.
@Mikey-lj2kq · 3 years ago
I'm no expert, but... it seems like DreamCoder punishing Kolmogorov complexity works better for parity, and fits the general idea of 'aligning model & task complexity'?
@fiNitEarth · 3 years ago
Omg a new papers explAIned video 😍 my brain is about to explode.
@priancho · 3 years ago
So glad to watch your paper introduction video again :-)
@colinjacobs176 · 3 years ago
Love your work. Very clear explanation. Indeed an interesting innovation.
@srh80 · 3 years ago
Love such papers! So much better than 'all you need' hype
@denissergienko2001 · 3 years ago
Welcome Back!!!
@norik1616 · 3 years ago
What an interesting idea!
@patf9770 · 3 years ago
Consider doing a video on PerceiverIO. It's a major upgrade to the vanilla Perceiver, and I can easily see its descendants taking over many areas.
@herp_derpingson · 2 years ago
I was kinda hoping for ablation for the KL divergence. Good stuff though.
@brll5733 · 3 years ago
I don't see how the training works with that added output at every timestep. By adding all possible outputs and their probabilities, you get an overall, statistical error but no feedback signal for individual outputs?
@nurkleblurker2482 · 3 years ago
Interesting. Good explanation
@borisyangel · 3 years ago
I wonder if one can just use the expectation of the distribution induced by p_i as a regularizer. Such regularizer would not force a geometric shape on p_i, just ask it to make fewer steps. And the network would be able to model things like sudden changes in p_i more easily.
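A quick sketch of what the commenter's suggested expectation regularizer could look like (my own illustration, not from the paper, which instead uses a KL term against a geometric prior; `beta` is a hypothetical weight):

```python
def expected_step(probs):
    """Expected halting step E[n] = sum_n n * p_n (steps counted from 1)."""
    return sum((n + 1) * p for n, p in enumerate(probs))

def regularized_loss(task_loss, probs, beta=0.01):
    # beta is a hypothetical hyperparameter trading accuracy for compute;
    # this penalizes average depth without forcing a geometric shape on p_n.
    return task_loss + beta * expected_step(probs)
```

Unlike the KL-to-geometric term, this only pushes the mean of the halting distribution down, so sharp, multi-modal p_n shapes remain unpenalized.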
@Mikey-lj2kq · 3 years ago
The recurrent part seems somewhat like a GAN? The ACT is like AdaBoost, while PonderNet is like a boosting tree.
@choipetercsj7256 · 2 years ago
Hi, thanks for your video! I plan to do a project on the complexity of tasks on image datasets like ImageNet and CIFAR-100. If I use a vision transformer, can I implement my project? And is it meaningful?
@paxdriver · 3 years ago
Maybe I'm just a noob and I'm missing something... but why not just train a feed-forward network to act as a halting mechanism for another simple CNN, like an NN manager? That seems way simpler than integrating the halting procedure into a single network.
@YannicKilcher · 3 years ago
That's entirely possible in this framework. The step function can be two different NNs, or a combined one.
@bernardoramos9409 · 3 years ago
Yannic, please do a video on the new Fastformer
@ziquaftynny9285 · 3 years ago
41:00 "it is completely thinkable" lol I think the word you're looking for is plausible?
@ChaiTimeDataScience · 3 years ago
It's Monday, folks!!!
@andres_pq · 3 years ago
Hello Yannic! Can you teach us to matrix multiply without multiplying?
@Rizhiy13 · 3 years ago
22:18 Why can't you just add a small loss for low probability, so that it tries to increase it?
@vishalmathur6545 · 3 years ago
Can you do a Tesla AI Day review?
@siyn007 · 3 years ago
Did anyone catch how they normalized the probabilities (lambdas) across time?
@SirSpinach · 3 years ago
There's a hyperparameter determining the minimum cumulative halt probability before ending network rollouts. I'm guessing that when calculating the expected loss, they normalize by the actual cumulative halt probability of the rollouts during training?
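One common way to implement that (an assumption on my part, not confirmed in the video) is to fold whatever probability mass is left after the final unrolled step into that last step, so the truncated distribution still sums to 1:

```python
def truncated_halting_distribution(lambdas):
    """Halting distribution for a rollout truncated after len(lambdas) steps.

    The leftover mass (probability of never halting within the rollout)
    is assigned to the final step so the result is a valid distribution.
    """
    probs, not_halted_yet = [], 1.0
    for lam in lambdas:
        probs.append(lam * not_halted_yet)
        not_halted_yet *= 1.0 - lam
    probs[-1] += not_halted_yet  # leftover mass goes to the final step
    return probs
```

This is only one plausible normalization; dividing each p_n by the cumulative halt probability, as the comment above guesses, would be another.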
@konghong3885 · 3 years ago
Does the paper reference Universal Transformers?
@mgostIH · 3 years ago
Yes, it does! On bAbI they compare them against transformers + PonderNet, and the latter seem to do better; but imo the big deal of the paper is that the architecture is very general and can be applied to anything you might think of.
@aspergale9836 · 3 years ago
@@mgostIH So there isn't really an "architecture" in the sense of, say, Transformers vs LSTMs. The contribution is more: (1) The clearer formulation (?), and (2) The corrected term for the stopping probability. Yes?
@mgostIH · 3 years ago
@@aspergale9836 Indeed, you can apply this method to pretty much any DL model you can think of: instead of adding more layers, you use this procedure so that the network learns how deep it needs to be for each input. In this sense it's similar to Deep Equilibrium Models, without the need to redefine backpropagation.
@swordwaker7749 · 3 years ago
QUICK YANNIC! THE TESLA AI DAY IS OUT!
@nocturnomedieval · 3 years ago
No hurry. It can be stressful. Some are so eager that they do not love slow-paced videos. But yeah, we would love you to present those Tesla snippets.
@walterlw1078 · 3 years ago
Lex Fridman did a review of that; you can check it out.
@petrusboniatus · 3 years ago
General Kenobi
@GeneralKenobi69420 · 3 years ago
Hello there
@NextFuckingLevel · 3 years ago
Holla Todos
@paxdriver · 3 years ago
Be honest, Yan, you downvote your own vids, right? Lol, you've got a loyal hater out there if not.
@YannicKilcher · 3 years ago
All things in the universe must have balance :D
@dontaskme1625 · 3 years ago
I dislike the wishful mnemonics in the paper's title