Deep Networks Are Kernel Machines (Paper Explained) 

Yannic Kilcher
258K subscribers
59K views

#deeplearning #kernels #neuralnetworks
Full Title: Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
Deep Neural Networks are often said to discover useful representations of the data. However, this paper challenges this prevailing view and suggests that rather than representing the data, deep neural networks store superpositions of the training data in their weights and act as kernel machines at inference time. This is a theoretical paper with a main theorem and an understandable proof, and the result leads to many interesting implications for the field.
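For readers who just want the formula: a kernel machine predicts by comparing the query x to the stored training points x_i with a similarity function K (this is the standard form, nothing specific to this paper):

```latex
y(x) = \sum_{i=1}^{m} a_i \, K(x, x_i) + b
% a_i: learned coefficients (one per training example), b: offset,
% K(x, x_i): the kernel, i.e. a similarity score between the query x and training point x_i.
```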
OUTLINE:
0:00 - Intro & Outline
4:50 - What is a Kernel Machine?
10:25 - Kernel Machines vs Gradient Descent
12:40 - Tangent Kernels
22:45 - Path Kernels
25:00 - Main Theorem
28:50 - Proof of the Main Theorem
39:10 - Implications & My Comments
Paper: arxiv.org/abs/2012.00152
Street Talk about Kernels: • Kernels!
ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can even learn kernel machines using GD, they're not mutually exclusive. And it's also not true that you "don't need a model" in kernel machines, as it usually still contains learned parameters.
Abstract:
Deep learning's successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods. We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel. This improved understanding should lead to better learning algorithms.
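For reference, this is roughly how the objects discussed in the video fit together (a paraphrase of the paper's definitions and main result, so treat the exact form as approximate):

```latex
% Tangent kernel at weights w: similarity of two inputs through the model's weight-gradients.
K^{g}_{f,w}(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x')

% Path kernel: the tangent kernel integrated along the gradient-descent path c(t).
K^{p}_{f,c}(x, x') = \int_{c(t)} K^{g}_{f,w(t)}(x, x') \, dt

% Main theorem (informally): a model trained by continuous-time gradient descent satisfies
y(x) = \sum_{i=1}^{m} a_i \, K^{p}_{f,c}(x, x_i) + b
% where each a_i is a path-averaged loss derivative -L'(y_i^*, y_i)
% and b is the initial model's prediction.
```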
Author: Pedro Domingos
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
RU-vid: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Science

Published: 16 Jul 2024

Comments: 172
@YannicKilcher 3 years ago
ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can even learn kernel machines using GD, they're not mutually exclusive. And it's also not true that you "don't need a model" in kernel machines, as it usually still contains learned parameters.
@chriszhou4283 3 years ago
Another arXiv forever.
@florianhonicke5448 3 years ago
Thank you so much for all of your videos. I just found some time to finish the last one, and here is the next video already in the pipeline. The impact you have on the AI community is immense! Just think about how many people started in this field just because of your videos, not even talking about the multiplier effect of them educating their friends.
@YannicKilcher 3 years ago
Thanks a lot :)
@23kl104 3 years ago
yes, would fully appreciate more theoretical papers. Keep up the videos man, they are gold
@ish9862 3 years ago
Your way of explaining these difficult concepts in a simple manner is amazing. Thank you so much for your content.
@tpflowspecialist 3 years ago
Amazing generalization of the concept of a kernel in learning algorithms to neural networks! Thanks for breaking it down for us.
@andreassyren329 3 years ago
I think this paper is wonderful in terms of explaining the Tangent Kernel, and I'm delighted to see them showing that there is a kernel _for the complete model_, such that the model can be interpreted as a kernel machine with some kernel (the path kernel). It ties the whole Neural Tangent Kernel stuff together rather neatly. I particularly liked your explanation of training in relationship to the Tangent Kernel, Yannic. Nice 👍.

I do think their conclusion, that this suggests ANNs don't learn by feature discovery, is not supported enough. What I'm seeing here is that, while the path kernel _given_ the trajectory can describe the full model as a kernel machine, the trajectory it took to get there _depends on the evolution_ of the Tangent Kernel. So the Tangent Kernel changing along the trajectory essentially captures the idea of ANNs learning features that they then use to train in future steps. The outcome of K_t+1 depends on K_t, which represents some similarity between data points. But the outcomes of the similarities given by K_t were informed by K_t-1. To me that looks a lot like learning features that drive future learning, with a kind of _prior_ imposed by the architecture through the initial Tangent Kernel K_0.

In short: feature discovery may not be necessary to _represent_ a trained neural network, but it might very well be needed to _find_ that representation (or find the trajectory that got you there). In line with the fact that representability != learnability.
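To make the evolution described above concrete, the gradient-flow equation at the heart of the proof says that the prediction on any input x moves according to the tangent kernel at the current weights, so each step's kernel is shaped by all the previous ones (a rough restatement, not a verbatim quote from the paper):

```latex
\frac{d\,y(x)}{dt}
= \nabla_w f_{w(t)}(x) \cdot \frac{dw}{dt}
= -\sum_{i=1}^{m} L'\big(y_i^{*}, y_i\big) \, K^{g}_{f,w(t)}(x, x_i)
```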
@nasrulislam1968 3 years ago
Oh man! You did a great service to all of us! Thank you! Hope to see more coming!
@Mws-uu6kc 3 years ago
I love how simply you explained such a complicated paper. Thanks
@al8-.W 3 years ago
Thanks for the great content, delivered on a regular basis. I am just getting started as a machine learning engineer and I usually find your curated papers interesting. I am indeed interested in the more theoretical papers like this one so you are welcome to share. It would be a shame if the greatest mysteries of deep learning remained concealed just because the fundamental papers are not shared enough !
@dermitdembrot3091 3 years ago
Agree, good that Yannic isn't too shy to look into theory
@dalia.rodriguez 3 years ago
"A different way of looking at a problem can give rise to new and better algorithms because we understand the problem better" ❤
@andrewm4894 3 years ago
Love this, Yannic does all the heavy lifting for me, but I still learn stuff. Great channel.
@minhlong1920 3 years ago
I'm working on NTK and I came across this video. Truly amazing explanation, it really clears things up for me!
@wojciechkulma7748 3 years ago
Great overview, many thanks!
@IoannisNousias 3 years ago
Thank you for your service sir!
@user-xs9ey2rd5h 3 years ago
I really liked this paper, puts neural networks in a completely different perspective.
@111dimka111 3 years ago
Thanks Yannic again for a very interesting review. I'll also give my five cents on this paper, starting with some criticism.

The specific point of the paper's proof is to divide and multiply by the path kernel (almost at the end of the proof). This makes the coefficients a_i a function of the input, a_i(x), which, as noted in Remark 1, is very different from a typical kernel formulation. This difference is not something minor, and I'll explain why. When you say that some model is a kernel machine and that it belongs to some corresponding RKHS defined via a kernel k(x_1, x_2), we can start exploring that RKHS, see what its properties are (mainly its eigendecomposition), and from them deduce various model behaviours (its expressiveness and tendency to overfit). Yet the above division/multiplication step allows us to express the NN as a kernel machine of any kernel. Take some other irrelevant kernel (not the path kernel) and use it similarly - you will obtain the result that the NN is a kernel machine of this irrelevant kernel. Hence, if we allow a_i to be x-dependent, then any sum of training terms is a kernel machine of an arbitrary kernel. Not a very strong statement, in my opinion.

Now for the good parts: the paper's idea is very clear and simple, pushing the overall research domain in the right direction of understanding the theory behind DL. Also, the form the a_i take (the loss derivative weighted by the kernel and then normalized by the same kernel) may provide some benefits in future works (not sure). But mainly, as someone who worked on these ideas a lot during my PhD, I think papers like this one, which explain DL via tangent/path kernels and their evolution during the learning process, will eventually give us the entire picture of why and how NNs perform so well. Please review more papers like this :)
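For concreteness, the divide-and-multiply step referred to above gives coefficients of roughly the following form (a paraphrase of the paper's expression, not a verbatim quote): the loss derivative averaged along the path, weighted and then normalized by the tangent kernel, which is exactly why they depend on the query x.

```latex
a_i(x) = -\,\frac{\int_{c(t)} L'\big(y_i^{*}, y_i\big) \, K^{g}_{f,w(t)}(x, x_i) \, dt}
                 {\int_{c(t)} K^{g}_{f,w(t)}(x, x_i) \, dt},
\qquad
y(x) = \sum_{i=1}^{m} a_i(x) \, K^{p}_{f,c}(x, x_i) + y_0(x)
```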
@kaikyscot6968 3 years ago
Thank you so much for your efforts
@YannicKilcher 3 years ago
It's my pleasure
@NethraSambamoorthi 3 years ago
@yannic kilcher - Your brilliance and simplicity are amazing.
@morkovija 3 years ago
"I hope you can see the connection.." - bold of you to hope for that
@Sal-imm 3 years ago
Very good, pretty much a straightforward linear deduction.
@mlearnxyz 3 years ago
Great news. We are back to learning Gram matrices.
@kazz811 3 years ago
Unlikely that this perspective is used for anything.
@sinaasadiyan 10 months ago
Great Explanation👍
@LouisChiaki 3 years ago
Very excited to see some real math and machine learning theory here!
@shawkielamir9935 2 years ago
Thanks a lot, this is a great explanation and I find it very useful. Great job!
@OmanshuThapliyal 3 years ago
Very well explained. The paper itself is written so well that I could read it as a researcher from outside CS.
@pranitak 3 years ago
Hello. 👋😂
@OmanshuThapliyal 3 years ago
@@pranitak 😅
@JTMoustache 3 years ago
Ooh baby ! That was a good one ☝🏼
@master1588 3 years ago
This follows the author's hypothesis in "The Master Algorithm" that all machine learning algorithms (e.g. NN, Bayes, SVM, rule-based, genetic, ...) approximate a deeper, hidden algo. A Grand Unified Algorithm for Machine Learning.
@master1588 3 years ago
For example: lamp.cse.fau.edu/~lkoester2015/Master-Algorithm/
@herp_derpingson 3 years ago
@@master1588 The plot thickens :O
@adamrak7560 3 years ago
Or they are similar in a way because they are all universal. Similar to universal Turing machines, they can each simulate each other. The underlying algorithms may be the original proof that NNs are universal approximators.
@michelealessandrobucci614 3 years ago
Check this paper: Input similarity from the neural network perspective. It's exactly the same idea (but older)
@damienhenaux8359 3 years ago
I very much like this kind of video on mathematical papers. I would very much like a video like this one on Stéphane Mallat's paper Group Invariant Scattering (2012). And thank you very much for everything
@abdessamad31649 3 years ago
I love your content, from Morocco!!! Keep it going
@pranavsreedhar1402 3 years ago
Thank you!
@scottmiller2591 3 years ago
I think what the paper is saying is "neural networks are equivalent to kernel machines, if you confine yourself to using the path kernel." No connection to Mercer or RKHS, so even the theoretical applicability is only to the path kernel - no other kernels allowed, unless they prove that path kernels are universal kernels, which sounds complicated. I'm also not sanguine about their statement about breaking the bottleneck of kernel machines - I'd like to see a big O for their inference method and compare it to modern low O kernel machines. Big picture, however, I think this agrees with what most kernel carpenters have always felt intuitively.
@twobob 2 years ago
Thanks. That was helpful
@arthdh5222 3 years ago
Hey, great video! What do you use for annotating the PDF - which software do you use for it? Thanks!
@YannicKilcher 3 years ago
OneNote
@marvlnurban5102 2 years ago
The paper reminds me of a paper by Maria Schuld comparing quantum ML models with kernels. Instead of dubbing quantum ML models quantum neural networks, she demonstrates that quantum models are mathematically closer to kernels. Her argument is that the dot product of the Hilbert space in which you embed your (quantum) data implies the construction of a kernel method. As far as I understand, the method you use to encode your classical bits into your qubits is effectively your kernel function. Now it seems like kernels connect deep neural networks to "quantum models" by encoding the superposition of the training data points..?
- 2021 Schuld, Quantum machine learning models are kernel methods
- 2020 Schuld, Quantum embedding for machine learning
@paulcarra8275 3 years ago
About your comment in the video that the theorem applies only to the full GD case: in fact it can be extended to SGD as well, you only need to add an indicator (in the sum of gradients over the training data) at each step to spot the points that are sampled at that step (this is explained by the author in the video below). Regards
@mrpocock 3 years ago
I can't help thinking of attention mechanisms as neural networks that rate timepoints as support vectors, with enforced sparsity through the unity constraint.
@syedhasany1809 3 years ago
One day I will understand this comment and what a glorious day it will be!
@kimchi_taco 3 years ago
* A neural net trained with gradient descent is a special version of a kernel machine, which is sum_i(...).
* It means a neural net works well in the way an SVM works well. The neural net is even better because it doesn't need to compute the kernel (O(data*data)) explicitly.
* The kernel term is the similarity score between the new prediction y of x and the training prediction y_i of x_i.
* The math is cool. I feel this derivation will be useful later.
@MrDREAMSTRING 3 years ago
So basically an NN trained with gradient descent is equivalent to a function that computes the kernel operations across all the training data (and across the entire training path!), and obviously the NN runs much more efficiently. That's pretty good, and a very interesting insight!
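A toy sketch of that mechanism, written from scratch for illustration rather than taken from the paper or the video: it trains a tiny model f(x, w) = tanh(w·x) with full-batch gradient descent, records the weight trajectory, and then evaluates a discretized path kernel (the tangent kernel summed over the recorded steps). All names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                 # 8 training inputs, 3 features
y_star = np.tanh(X @ rng.normal(size=3))    # synthetic targets
w = rng.normal(size=3) * 0.1                # initial weights w_0
lr, steps = 0.1, 300
path = [w.copy()]                           # the gradient-descent path c(t)

def f(x, w):
    return np.tanh(x @ w)

def grad_f(x, w):
    # d/dw tanh(w . x) = (1 - tanh(w . x)^2) * x
    return (1.0 - np.tanh(x @ w) ** 2) * x

for _ in range(steps):                      # full-batch GD on squared loss
    residual = f(X, w) - y_star             # L'(y*, y) for squared loss
    grad = (residual[:, None] * np.array([grad_f(x_i, w) for x_i in X])).mean(axis=0)
    w = w - lr * grad
    path.append(w.copy())

def tangent_kernel(x_a, x_b, w):
    # Similarity of two inputs through the model's weight-gradients at fixed w
    return grad_f(x_a, w) @ grad_f(x_b, w)

def path_kernel(x_a, x_b):
    # Discrete stand-in for the integral of the tangent kernel along the path
    return sum(tangent_kernel(x_a, x_b, w_t) for w_t in path)

x_new = rng.normal(size=3)
print([round(path_kernel(x_new, x_i), 3) for x_i in X])  # similarity of x_new to each training point
```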
@herp_derpingson 3 years ago
24:30 I wonder if this path tracing thingy works not only for neural networks but also for t-SNE. Imagine a bunch of points at the beginning of t-SNE. We have labels for all points except for one. During the t-SNE optimization, all points move. The class of the unknown point is equal to the class of the point to which its average distance was the least during the optimization process.

41:57 I think it means we can retire kernel machines because deep networks are already doing that.

No broader impact statement? ;)

Regardless, perhaps one day we can have a proof like: "Since kernel machines cannot solve this problem and neural networks are kernel machines, it implies that there cannot exist any deep neural network that can solve this problem." Which might be useful.
@YannicKilcher 3 years ago
very nice ideas. Yes, I believe the statement is valid for anything trained with gradient descent, and probably with a bit of modification you could even extend that to any sort of ODE-driven optimization algorithm.
@hoaxuan7074 3 years ago
The dot product is an associative memory if you meet certain mathematical requirements it has, especially relating to the variance equation for linear combinations of random variables. The more things it learns, the greater the angles between the input vectors and the weight vector. If it only learns 1 association the angle should actually be zero and the dot product will provide strong error correction.
@fmdj 2 years ago
Damn that was inspiring. I almost got the full demonstration :)
@rnoro 2 years ago
I think this paper is a translation of the NN formalism into a functional-analysis formalism. Simply speaking, the gradient-descent-on-a-loss-function framework is equivalent to a linear matching problem on a Hilbert space determined by the NN structure. The linearization process is characterized by the gradient scheme. In other words, the loss function on the sample space becomes a linear functional on the Hilbert space. That's all the paper is about, nothing more.
@ashishvishwakarma5362 3 years ago
Thanks for the explanation. Can you please also attach the annotated paper link in the description of every video? It would be a great help.
@YannicKilcher 3 years ago
The paper itself is already linked. If you want the annotations that I draw, you'll have to become a supporter on Patreon or SubscribeStar :)
@jinlaizhang312 3 years ago
Can you explain the AAAI best paper 'Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting' ?
@YannicKilcher 3 years ago
thanks for the reference
@anubhabghosh2028 3 years ago
At around the 19-minute mark of the video, the way you describe the similarity between kernels and gradient descent leads me to believe that this paper is claiming that neural networks don't really "generalize" to the test data but rather compare similarities with samples they have already seen in the training data. This is perhaps a very bold claim by the authors, or probably I am misunderstanding the problem. What do you think? P.S. Really appreciate this kind of slightly theoretical paper review on deep learning in addition to your other content as well.
@YannicKilcher 3 years ago
Isn't that exactly what generalization is? You compare a new sample to things you've seen during training?
@anubhabghosh2028 3 years ago
@@YannicKilcher Yes, intuitively I think so too. Like in the case of neural networks, we train a network, learn some weights and then use the trained weights to get predictions for unseen data. I think my confusion is about the explicit way they define this in the case of these kernels, where they compare the sensitivity of the new data vector with that of every single data point in the training set 😅.
@Sal-imm 3 years ago
Now changing the weights in hypothetical sense means impaction reaction.
@faidon-stelioskoutsourelak464 2 years ago
1) Despite the title, the paper never makes use, in any derivation, of the particulars of an NN and its functional form. Hence the result is not just applicable to NNs but to any differentiable model, e.g. a linear regression.

2) What is most puzzling to me is the path dependence. I.e., if you run your loss-gradient descent twice from two different starting points which nevertheless converge to the same optimum, the path kernels, i.e. the impact of each training data point on the predictions, would in general be different. The integrand in the expression of the path kernel involves dot products of gradients. I suspect that in the initial phases of training, i.e. when the model has not yet fit the data, these gradients would change quite rapidly (and quite randomly) and most probably these dot products would be zero or cancel out. Probably only close to the optimum will these gradients and dot products stabilize and contribute the most to the path integral. This behavior should be even more pronounced in stochastic gradient descent (intuitively). The higher the dimension of the unknown parameters, the more probable it'd be that these dot products are zero, even close to the optimum, unless there is some actual underlying structure that is discoverable by the model.
@moudar981 3 years ago
thank you for the very nice explanation. What I did not get is the dL/dy. So, L could be 0.5 (y_i - y_i*)^2. Its derivative is (y_i - y_i*). Does that mean that if a training sample x_i is correctly classified (the aforementioned term is zero), then it has no contribution to the formula? Isn't that counter intuitive? Thank you so much.
@YannicKilcher 3 years ago
a bit. but keep in mind that you integrate out these dL/dy over training. so even if it's correct at the end, it will pull the new sample (if it's similar) into the direction of the same label.
@WellPotential 3 years ago
Great explanation of this paper! Any thoughts why they didn't include dropout? Seems like if they added a dropout term in the dynamical equation that the result wouldn't reduce down to a kernel machine anymore.
@YannicKilcher 3 years ago
maybe to keep it really simple without stochasticity
@Jack-sy6di 3 days ago
Unfortunately this doesn't actually show that gradient descent is equivalent to kernel learning, or anything like that. That is, if we think of gradient descent as a mapping from training sets onto hypothesis functions, this paper doesn't show that this mapping is the same as some other mapping called "kernel learning". It's not the case that there is some fixed kernel (which depends on the NN's architecture and on the loss function), and running GD with that NN is equivalent to running kernel learning with *that* kernel. Instead, the paper just shows that after training, we can retroactively construct a kernel that matches gradient descent's final hypothesis. This almost seems trivial to me. If you take *any* function, you could probably construct some kind of kernel so that that function would belong to the RKHS. To me the interest of a result like "deep networks are kernel machines" would be to reduce gradient learning to kernel learning, showing that the former (which is mysterious and hard to reason about) is equivalent to the latter (which is simple and easy to reason about, because of how it just finds a minimum-norm fit for the data). This paper definitely does not do that. Instead it shows that GD is equivalent to kernel learning *provided* we run a different algorithm first, one which finds a good kernel (the path kernel) based on the training data. But that "kernel finding" module seems just as hard to reason about as gradient descent itself, so really all we've done is push back the question of how GD works to the question of how that process works.
@LouisChiaki 3 years ago
Why is the gradient vector of y w.r.t. w not the tangent vector of the training history of w in the plot? Shouldn't the update of w by gradient descent always be proportional to the gradient vector?
@YannicKilcher 3 years ago
true, but the two data points predicted are not the only ones. the curve follows the average gradient
@dhanushka5 8 months ago
Thanks
@woowooNeedsFaith 3 years ago
3:10 - How about giving a link to this conversation in the description box?
@Pheenoh 3 years ago
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-y_RjsDHl5Y4.html
@lirothen 3 years ago
I work with the Linux kernel, so I too was VERY lost when he was referring to a kernel function in this different context. I was just about to ask too.
@YannicKilcher 3 years ago
I've added it to the description. Thanks for the suggestion.
@dawidlaszuk 3 years ago
It isn't surprising that all function approximators, including NNs and kernel methods, are equivalent since they all... can approximate functions. However, the nice thing here is showing explicitly the connection between kernel methods and NNs, which allows easier knowledge transfer between the methods' domains.
@joelwillis2043 3 years ago
All numbers are equivalent since they all... are numbers.
@Guztav1337 3 years ago
@@joelwillis2043 No. 4 ⋦ 5
@joelwillis2043 3 years ago
@@Guztav1337 Now if only we had a concept of an equivalence relation we could formally use it on other objects instead of saying "equivalent" without much thought.
@andres_pq 3 years ago
Is there any code demonstration?
@socratic-programmer 3 years ago
Makes you wonder what effect something like skip connections or different network topologies has on this interpretation, or even something like the Transformer with its attention layer. Maybe that attention allows the network to more easily match against similar things and rapidly delegate tokens to functions that have already learnt to 'solve' that type of token?
@YannicKilcher 3 years ago
entirely possible
@kyuucampanello8446 2 years ago
Dot similarity with softmax is kind of 'equivalent' to distance. So I guess it's kind of similar to when we calculate the velocity gradient of a particle with the SPH method, by applying a kernel function to the distances between neighbouring particles and multiplying by their velocities to build a velocity function of the distances. In the attention mechanism, it might be a function of the tokens' values corresponding to similarities.
@joshuaclancy8093 3 years ago
So representations are constructed via aggregating input data... *gasp!* But still, it's an interesting way of getting there. Correct me if I am wrong, but here is my overly simplified summary: inputs that have similar output change with respect to weight change are grouped together as training progresses, and so that means we can approximate neural networks with kernel machines.
@G12GilbertProduction 3 years ago
12:57 A hypersurface with an extensive tensor line? That looks so Fresnelian.
@YannicKilcher 3 years ago
sorry that's too high for me :D
@veloenoir1507 3 years ago
If you can make a connection with the Kernel Path and a resource-efficient, general architecture this could be quite meaningful, no?
@hecao634 3 years ago
Hey Yannic, could you plz gently zoom in or zoom out in the following videos? I really felt dizzy sometimes especially when you derive formulas
@YannicKilcher 3 years ago
sorry, will try
@Sal-imm 3 years ago
Mathematically limit definition of a function (for e.g.) and comes out of a new conclusion, that might be heuristic.
@JI77469 3 years ago
At 39:30 "... The ai's and b depend on x..." So how do we know that our y actually lives in the RKHS, since it's not a linear combination of kernel functions!? If we don't, then don't you lose the entire theory of RKHS?
@YannicKilcher 3 years ago
true, I guess that's for the theoreticians to figure out :D
@drdca8263 3 years ago
I didn't know what RKHS stood for. For any other readers of this comment section who also don't : It stands for Reproducing kernel Hilbert space. Also, thanks, I didn't know about these and it seems interesting.
@mathematicalninja2756 3 years ago
Next paper: Extracting MNIST data from its trained model
@Daniel-ih4zh 3 years ago
Link?
@salimmiloudi4472 3 years ago
Isn't that what visualizing activation maps does?
@vaseline.555 3 years ago
Possibly Inversion attack? Deep leakage from gradients?
@DamianReloaded 3 years ago
Learning the most general features and learning to generalize well ought to be the same thing.
@albertwang5974 3 years ago
A paper: one plus one is a merge of one and one!
@guillaumewenzek4210 3 years ago
I'm not fond of the conclusion. The NN at inference doesn't have access to all the historical weights, and runs very differently from their kernel. For me, 'NN is a kernel' would imply that K only depends on the final weights. OTOH I've no issue if a_i is computed from all historical weights.
@YannicKilcher 3 years ago
You're correct, of course. This is not practical, but merely a theoretical connection.
@guillaumewenzek4210 3 years ago
Stupid metaphor: "Oil is a dinosaur". Yes, there is a process that converts dinosaurs into oil, yet they have very different properties. Can you transfer the properties/intuitions of a Gaussian kernel to this path kernel?
@YannicKilcher 3 years ago
Sure, both are a way to measure distances between data points
@veedrac 3 years ago
Are you aware of, or going to cover, Feature Learning in Infinite-Width Neural Networks?
@YannicKilcher 3 years ago
I'm aware of it, but I'm not an expert. Let's see if there's anything interesting there.
@THEMithrandir09 3 years ago
I get that the new formalism is nice for future work, but isn't it intuitive that 'trainedmodel = initialmodel + gradients x learningrates'?
@YannicKilcher 3 years ago
true, but making the formal connection is sometimes pretty hard
@THEMithrandir09 3 years ago
@@YannicKilcher oh yes sure, this work is awesome, it reminds me of the REINFORCE paper. I just wondered why that intuition wasn't brought up explicitly. Maybe I missed it though, I didn't fully read the paper yet. Great video btw!
@hoaxuan7074 3 years ago
With really small nets you can hope to more or less fully explore the solution space, say using evolutionary algorithms. There are many examples on YT. In small animals with a few hundred neurons you do see many specialized neurons with specific functions. In larger nets I don't think there is any training algorithm that can actually search the space of solutions to find any good solution. Just not possible ----- except in a small sub-space of statistical solutions where each neuron responds to the general statistics of the neurons in the prior layer, each neuron being a filter of sorts. I'm not sure why I feel that sub-space is easier to search through? An advantage would be good generalization and avoiding many brittle over-fitted solutions that presumably exist in the full solution space. A disadvantage would be the failure to find short, compact, logical solutions that generalize well, should they exist.
@paulcurry8383 3 years ago
I don’t understand the claim that models don’t learn “new representations”. Do they mean that the model must use features of the training data (which I think is trivially true), or that the models store the training data without using any “features” and just the result of the points on the loss over GD? In the latter it seems that models can be seen as doing this, but it’s not well understood how they actually store a potentially infinitely sized Gram Matrix. I’m also tangentially interested in how SGD fits into this.
@YannicKilcher 3 years ago
yea I also don't agree with the paper's claim on this point. I think it's just a dual view, i.e. extracting useful representations is the same as storing an appropriate superposition of the training data.
@fast_harmonic_psychedelic 3 years ago
The same thing can be said about the brain itself: neurons just store a superposition of the training data (the input from the senses when we were infants), weighed against the evolutionary "weights" stored in the DNA. And in everyday experience, whenever we see any object or motion, our neurons immediately compare that input to the weights stored in a complex language of dozens of neurotransmitters and calcium ions, and try to find out what it best matches up with. The brain is a kernel machine. That doesn't depreciate its power - neither the brain's nor the neural networks'. They memorize; that doesn't mean they're not intelligent. Intelligence is not magic, it IS essentially just memorizing input.
@kazz811 3 years ago
Nice review. Probably not a useful perspective though. SGD is critical obviously (ignoring variants like momentum, Adam which incorporate path history) but you could potentially extend this using the path integral formulation (popularized in Quantum mechanics though applies in many other places) by constructing it as an ensemble over paths for each mini-batch procedure, the loss function replacing the Lagrangian in Physics. The math won't be easy and it likely needs someone with higher level of skill than Pedro to figure that out.
@diegofcm6201 3 years ago
Thought the exact same thing. Even more so when it talks about the "superposition of train data weighted by kernel path". Reminds me a lot of wave functions
@diegofcm6201 3 years ago
It also looks like something from calculus of variations: 2 points (w0 and wf) connected by a curve that’s trying to optimize something
@willwombell3045 3 years ago
Wow someone found a way to rewrite "Neural Networks are Universal Function Approximators" again.
@neoli8289 3 years ago
Exactly!!!
@gamerx1133 3 years ago
@chris k Yes
@Lee-vs5ez 3 years ago
😂
@olivierphilip1612 3 years ago
@chris k Any continuous function to be precise
@drdca8263 3 years ago
@chris k tl;dr: If by "all functions" you mean "all L^p functions" (or "all locally L^p functions"?) for some p in [1, infinity), then yes. (But technically, this isn't *all* functions from (the domain) to the real numbers, for which the question seems not fully defined, because in that case, what do we mean by "approximated by"?)

Needlessly long version: I was going to say "no, because what about the indicator function for a set which isn't measurable", thinking we would use the supremum norm for that (the actual supremum, not the supremum-except-neglecting-measure-zero-sets), in order to make talking about the convergence well-defined. But then I realized/remembered that under that criterion you can't even approximate a simple step function using continuous functions (the space of bounded continuous functions is complete under the supremum norm), and therefore using the supremum norm can't be what you meant. Therefore, the actual issue here is that "all functions" is either too vague, or, if you really mean *all* functions from the domain to the real numbers, then this isn't compatible with the norm we are presumably using when talking about the approximation. If we mean "all L^p functions" for some p in [1, infinity), then yes, because the continuous functions (or the continuous functions with compact support) are dense in L^p (at least assuming some conditions on the domain of these functions which will basically always be satisfied in the context we are talking about).

Original version of this comment: deleted without ever sending, because I realized it had things wrong with it and was even longer than the "needlessly long version", which is a rewrite of it that takes into account from the beginning things I realized only at the end of the original. I'm slightly trying to get better about taking the time to make my comments shorter, rather than leaving all the broken thought process on the way to the conclusion in the comment.
@vertonical 3 years ago
Yannic Kilcher is a kernel machine.
@sergiomanuel2206 3 years ago
What happens if we do a training of just one step? It could be started from the last step of an existing training (this theorem doesn't require random weights at the start). In this case we don't need to store the whole training path 😎
@YannicKilcher 3 years ago
yes, but the kernel must still be constructed from the whole gradient descent path, and that includes the path used to obtain the initialization in your case
@sergiomanuel2206 3 years ago
@@YannicKilcher First: wonderful videos, thank you!!! Second: correct me if I am wrong. The theorem doesn't tell us anything about the initialization weights. I am thinking about one-step training with w0 obtained from a previous training. If we do one step of gradient descent using the whole dataset, there is just one optimal path in the direction dw = -lr*dL/dw, and this training will lead us to w1. Using w0 and w1 we can build the kernel and evaluate it. I think it is correct because all information about the training is in the last step (using the NN we can make predictions using just the last weights, w1).
@YannicKilcher 3 years ago
@@sergiomanuel2206 sounds reasonable. W0 also appears in the Kernel
@eliasallegaert2582 3 years ago
This paper is from the author of "The Master Algorithm", where the big picture of multiple machine learning techniques is explored. Very interesting! Thanks Yannic for the great explanation!
@drdca8263 3 years ago
The derivation of this seems nice, but (maybe this is just because I don't have any intuition for kernel machines) I don't get the interpretation of this? I should emphasize that I haven't studied machine learning stuff in any actual depth, have only taken 1 class on it, so I don't know what I'm talking about. If the point of kernel machines is to have a sum over i of (something that doesn't depend on x, only on i) * (something that depends on x and x_i), then why should a sum over i of (something that depends on x and i) * (something that depends on x and x_i) be considered all that similar? The way it is phrased in Remark 2 seems to fit it better, and I don't know why they didn't just give that as the main way of expressing it? Maybe I'm being too literal.

edit: Ok, upon thinking about it more, and reading more, I think I see more of the connection, maybe. The "kernel trick" involves mapping some set to some vector space and then doing inner products there, and for that reason, taking the inner product that naturally appears here, and wanting to relate it to kernel stuff, seems reasonable. And so defining the K^p_{f,c}(x,x_i) seems also probably reasonable. (Uh, does this still behave like an inner product? I think it does? Yeah, uh, if we send x to the function that takes in t and returns the gradient of f(x,w(t)) with respect to w (where f is the thing where y = f(x,w)), that is sending x to a vector space, specifically, vector-valued functions on the interval of numbers t comes from, and if we define the inner product of two such vectors as the integral over t of the inner product of the values at t, this will be bilinear and symmetric, and anything with itself should have positive values whenever the vector isn't 0, so ok, yes, it should be an inner product.) So it does seem (to me, who, remember, doesn't know what I'm talking about when it comes to machine learning) like this gives a sensible thing to potentially use as a kernel.

How unfortunate then, that it doesn't satisfy the definition they gave at the start! Maybe there's some condition under which the K^g_{f,w(t)}(x,x_i) and the L'(y^*_i, y_i) can be shown to be approximately uncorrelated over time, so that the first term in this would approximately not depend on x, and so it would approximately conform to the definition they gave?
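The feature-map argument sketched above can be written compactly (informal notation): map each input to the curve of its model gradients along the training path; the path kernel is then the natural inner product of two such curves, which is why it behaves like a kernel.

```latex
\Phi(x) = \Big(t \mapsto \nabla_w f\big(x, w(t)\big)\Big),
\qquad
\langle \Phi(x), \Phi(x') \rangle
= \int_{c(t)} \nabla_w f\big(x, w(t)\big) \cdot \nabla_w f\big(x', w(t)\big) \, dt
= K^{p}_{f,c}(x, x')
```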
@YannicKilcher 3 years ago
your thoughts are very good. maybe you want to check out some of the literature on the neural tangent kernel, because that's pretty much in this direction!
@frankd1156 3 years ago
wow...my head get hot a little bit lol
@jeremydy3340 3 years ago
Don't Kernel methods usually take the form of a weighted average of examples? sum( a_i * y_i * K ) . The method given here is quite different, and depends on the labels largely implicitly via how they change the path c(t) through weight space. It isn't clear to me that y( x' ) is at all similar to y_i, even if K(x', x_i) is large. And the implicit dependence through c(t) on all examples means (x_i, y_i) may be extremely important even if K(x', x_i) is small.
@YannicKilcher 3 years ago
it depends on the output, yes, but not on the actual labels. and yes, the kernel is complicated by construction, because it needs to connect gradient descent, so the learning path and the gradient are necessarily in there
@NeoKailthas 3 years ago
Now someone prove that humans are kernel machines
@nellatl 3 years ago
When you're broke with no money, the last thing you want to hear is "theoretically". Especially "theoretically" and "paper" in the same sentence.
@mrjamess5659 3 years ago
I'm having multiple enlightenments while watching this video.
@MachineLearningStreetTalk 3 years ago
Does this mean we need to cancel DNNs?
@herp_derpingson 3 years ago
...or Kernel Machines. Same thing.
@diegofcm6201 3 years ago
*are approximately
@YannicKilcher 3 years ago
Next up: Transformers are actually just linear regression.
@michaelwangCH 3 years ago
Deep learning is a further step in the evolution of ML. Kernel methods have been known since the early 60s. No surprise at all.
@michaelwangCH 3 years ago
Hi, Yannic. Thanks for the explanation, looking forward to your next video.
@djfl58mdlwqlf 3 years ago
1:31 you made me worse by letting me into these kind of topics... your explanation is brilliant..... plz keep me enlightened until I die....
@raunaquepatra3966 3 years ago
Kernel is all you need 🤨
@albertwang5974 3 years ago
I cannot understand how this kind of topic can be a paper.
@dontthinkastronaut 3 years ago
hi there
@shuangbiaogou437 3 years ago
I knew this 8 years ago and I mathematically proved that a perceptron is just a dual form of a linear kernel machine. An MLP is just a linear kernel machine with its input transformed.
@paulcarra8275 3 years ago
Hi, first, the author gives a presentation here: "ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-m3b0qEQHlUs.html&lc=UgwjZHYH9cRyuGmD6e14AaABAg". Second (repeating a comment made on that presentation): I was wondering why we should go through the whole learning procedure and not instead start at the penultimate step with the corresponding b and w's - wouldn't it save almost all the computational time? I mean, if the goal is not to learn the DNN but to get an additive representation of it (ignoring the nonlinear transform "g" of the kernel machine). Regards
@az8134 3 years ago
I thought we all knew that...
@willkrummeck 3 years ago
why do you have parler?
@YannicKilcher 3 years ago
yea it's kinda pointless now isn't it :D
@marouanemaachou7875 3 years ago
First
@erfantaghvaei3952 3 years ago
Pedro is a cool guy; sad to see the hate on him for opposing the distortions surrounding datasets
@IoannisNousias 3 years ago
What an unfortunate choice of variable names. Every time I heard “ai is the average...” it threw me off. Too meta.
@fast_harmonic_psychedelic 3 years ago
I'm just afraid this will lead us BACK into kernel machines and programming everything by hand, resulting in much more robotic, calculator-esque models, not the AI models that we have. It's better to keep it in the black box. If you look inside you'll jinx it and the magic will die, and we'll just have dumb calculator robots again
@conduit242 3 years ago
“You just feed in the training data” blah blah, the great lie of deep learning. The reality is ‘encodings’ hide a great deal of sophistication, just like compression ensemble models. Let’s see a transformer take a raw binary sequence and match zpaq-5 at least on the Hutter Prize 🤷🏻‍♂️ choosing periodic encodings, stride models, etc are all the same. All these methods, including compressors, are compromised theoretically
@getowtofheyah3161 3 years ago
So boring who freakin’ cares
@SerBallister 3 years ago
Are you proud of your ignorance?
@getowtofheyah3161 3 years ago
@@SerBallister yes.
@sphereron 3 years ago
Dude, if you're reviewing Pedro Domingos's paper despite his reputation as a racist and sexist, why not use your platform to give awareness to Timnit Gebru's work on "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Otherwise your bias is very obvious here.