That part on visualization via information theory blows my mind, and your explanation is really concise and instructive. The inspiring problems you raised in the latter part of the video are also well-observed contradictions backed by strong academic arguments, unlike some of the general nonprofessional objections raised by random non-tech people. It's fortunate for us to have YouTubers like you on the Internet. Thanks a lot.
Tishby is approaching the problem from classical information theory, while the idea that the shortest model is best comes from algorithmic information theory. Is the network synthesizing a program, or compressing based on statistical regularities? If Tishby is right, it is the latter, and I suspect he is. In that case neural networks are necessarily blind to certain regularities, just as Shannon entropy is. For example, the entropy of pseudorandom numbers, like the digits of pi, is high, while their Kolmogorov complexity is low. If networks are compressing, they need ever more parameters to encode the input; if they are learning optimal programs to represent the data, the rate of growth will be much lower (logarithmic rather than linear). Hector Zenil has recently done some interesting work in this field; check him out.
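On the entropy-vs-Kolmogorov point, here is a short, hedged sketch (using Python's seeded PRNG as the pseudorandom source, since producing digits of pi would need a spigot algorithm). The generating program is only a few lines long (low Kolmogorov complexity), yet the empirical Shannon entropy of its output is near the maximum log2(10), which is exactly the blindness described above:

```python
import math
import random

def empirical_entropy(symbols):
    """Shannon entropy in bits per symbol, from observed frequencies."""
    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A tiny seeded program (low Kolmogorov complexity) emits a digit stream
# whose empirical Shannon entropy is close to the maximum, log2(10) ~ 3.32.
rng = random.Random(0)
digits = [rng.randint(0, 9) for _ in range(100_000)]
print(round(empirical_entropy(digits), 3))
```

An entropy-based compressor sees this stream as nearly incompressible, even though a program of a few dozen bytes reproduces it exactly.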
Brilliant comment, thx for adding this! I would tend to agree that current deep learning models are basically doing representational compression rather than program search. However, I feel like fundamentally solving the generalization problem might require new ways of leveraging neural nets (not necessarily trained via SGD) to allow for model-based, algorithmic reasoning, much like the scientific process, where hypotheses are posited and subsequently rejected or refined through observation.
Concerning neural networks and the shortest model, there is Schmidhuber's 1997 paper "Discovering neural nets with low Kolmogorov complexity and high generalization capability", where they directly searched for the least complex neural net and showed it has low generalization loss. I also want to mention Max Tegmark's speculative idea that neural networks have an intrinsic bias towards learning physically meaningful functions. See for example "AI for physics & physics for AI": ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-pkJkHB_c3nA.html&ab_channel=MITCBMM
It is extremely low. I've had to turn my amplifier well beyond a safe volume to hear the embedded video. If RU-vid decides to insert an advert, it runs a real risk of breaking my speakers.
I worked with the article from the video, setting up my own experiments. The set of input data in the article is completely unrelated to reality; it is constructed so that the mutual information is easier to calculate. When conducting experiments with MNIST data, the result is strikingly different: there is no compression stage on any ReLU layer. You can also see this in the article below: arxiv.org/pdf/2004.14941.pdf
Beautiful summary. Tishby's lectures often get bogged down in the details. You give a nice clear overview. I recommend this version of Tishby's talk on the Information Bottleneck: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-EQTtBRM0sIs.html He's frequently interrupted by the faculty, but I think he makes his points better as a result.
One nice thing about your videos is that I can watch them 50 times and learn new things every time. Best ML YouTube channel! Please consider doing some collabs to get more viewers.
Thanks a lot for your videos! I love your detailed explanations, as they have always been very useful when I wanted to dig deeper. As you know, there are others who create videos about neural networks, but yours are the only ones that go to the point where I end up with a decent understanding of the important concepts. That's extremely valuable to me, thank you so much!
Very nice video, although the audio volume is a bit low. I was easily able to listen to and understand multiple videos I watched prior to this one at just 20% volume, and I struggle to understand what he or the professor is saying at 100% volume.
I love your channel! All your videos are clear and beautifully explained. This was a great video to watch after I watched Tishby's full talk: great summary of his results and presentation, and that's a fascinating line of research. Cheers :-)
Link to Part 1 of this series: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-McgxRxi2Jqo.html (Feature visualisation) Part 2: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-4rFOkpI0Lcg.html (Adversarial examples)
Hey, thanks for your helpful videos! => I think you misspelled Ilya Sutskever in your video description! In this video from MIT, ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-9EN_HoEk3KY.html, his name is written Sut-s-keve-r. Cheers!
I remember that a couple of years ago he had a problem publishing this. Has it been published since? Thanks. Otherwise, extremely interesting research, indeed.
@20.53 About the shortest program that best explains the data: I think he is referring to entropy, i.e. the fewer elements of surprise a program has, the more it has learnt about the patterns and regularities of the data (refer to entropy). A similar concept is used in autoencoders, I believe, where we compress the information into fewer nodes than there are in the input layer (so if compression is achieved, it means the learning has been done very well; kind of a measure of learnedness, or memorisation). Very useful in reinforcement learning. *I am a novice learner trying to understand the mechanics and gain insight, so please refute my points if you know they are not exactly correct or off track...
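The bottleneck intuition in the comment above can be illustrated with the simplest possible case, a purely linear "autoencoder" (an assumption for illustration: real autoencoders are nonlinear and trained with SGD, whereas here the optimal linear compression is read off directly from an SVD). When the data is redundant, fewer numbers than input dimensions suffice to reconstruct it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Redundant data: 4-D points that actually live on a 2-D plane.
basis = rng.normal(size=(2, 4))
data = rng.normal(size=(200, 2)) @ basis

# Linear "autoencoder": the top-2 right singular vectors give the
# optimal 2-D bottleneck (encoder) and its reconstruction (decoder).
_, _, vt = np.linalg.svd(data, full_matrices=False)
encoder = vt[:2].T   # 4 -> 2: compress to two numbers per point
decoder = vt[:2]     # 2 -> 4: reconstruct the original coordinates
recon = data @ encoder @ decoder

print(np.allclose(recon, data))  # True: compression lost nothing here
```

Because the data is exactly rank 2, the two-number code is lossless; with noisier, higher-rank data the same bottleneck would force the model to keep only the dominant regularities.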
You not doing this more often is a crime against humanity. But I understand, you probably do other important things. Very interesting videos, thank you!
I think of it like this: two random values have a much larger Hamming distance between them than two non-random values. So the randomness makes the distinction space between two values bigger and much clearer, and the network will learn much better on random data. The question then is how you get the useful data out of an almost-random, noisy network output. The answer is to use something known from information theory for recovering a signal from near-noise: Barker codes. So you use Barker-code neurons in the network and can thereby tolerate much more noise, resulting in better learning and better distinction. OK, now I have to charge you something for this.
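For what it's worth, the Barker-code property mentioned above is easy to check: a Barker sequence's aperiodic autocorrelation has a sharp peak at zero shift and sidelobes of magnitude at most 1, which is what lets a known pattern be picked out of heavy noise. A quick sketch with the length-13 code:

```python
# Length-13 Barker code: sharp autocorrelation peak, tiny sidelobes.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

def autocorrelation(code, shift):
    """Aperiodic autocorrelation of the code with a shifted copy of itself."""
    return sum(a * b for a, b in zip(code, code[shift:]))

peaks = [autocorrelation(barker13, k) for k in range(len(barker13))]
print(peaks)  # peak of 13 at zero shift; every sidelobe has magnitude <= 1
```

Whether this buys anything inside a neural network is speculation on the commenter's part; the code only demonstrates the signal-processing property itself.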
The graphic used to show that more layers reduce the duration of the compression phase is confusing to me: how do we know that the 1-layer MLP has the representational capacity to overfit on MNIST?
@@ArxivInsights I am still glad you do videos though :) Talking about Zhang: I can still remember a paper from Krueger ("Deep nets don't learn via memorization"), where the researchers pointed out that learning random data and learning real data are completely different tasks. For obvious reasons, algorithms can't learn irrational data. This is at best a trivial exception to existing theories (like Goodfellow's). The main contribution of Zhang's paper is probably showing the overwhelming capacity of neural networks. Tishby's theory got far less interesting with Saxe ("On the information bottleneck theory of deep learning"). They showed that there is no necessary connection between compression and generalization: they found cases with no compression phase where the networks were still able to generalize, meaning the findings do not hold in the general case. I think this field is so freakishly interesting because a lot of researchers don't even dare to touch it. Nobody can really tell why and how DNNs make decisions. I am curious to find an answer :)
Hi, great video! I just don't see how Inception "somehow manages to get a much better test accuracy on the true test set when trained on partially corrupted labels" @ 3:47. Could someone explain?
The standard of your videos is something out of this world. We are not only learning about programming; we are also learning to behave professionally and to love what we do. This mindset leads to better outcomes in our communities today. If the world keeps producing this kind of behavior, everything will be good.
At a first guess, I'd put forward that it's randomness in time, rather than randomness in the learning function (SGD), that is the difference between an artificial network and a biological one. I'd put this down to signals being processed over biological connections (axons) of random length, and therefore with random timing, since the signals take a finite time, however small, to traverse the connections. It would just be a difference in signal encoding between an artificial network (continuous values, with no need for a concept of time) and a biological one (pulsed signals, where the actual 'value' could be determined by pulse density, using a time-averaging activation function).
Came for autoencoders, stayed for your great explanations! (especially the GAN video, finally understood the meaning behind competition between neural nets). By the way, do you intend to focus on one field (e.g. GAN or reinforcement learning), or to cover all the main topics (e.g. Feed Forward NN, RNN, and CNN)? Also, will you do some Q&A related to neural networks? Thanks Xander!
@4:57 To a human there might be no structure left after "random" corruption, but to the machine there might still be some statistical structure in the underlying manifold of the newly created distribution.
Can you also do videos with specific examples from recurrent neural network models and LSTMs? Most of your existing examples involve images, feedforward networks and CNNs.
Thank you for the great video. Two questions fundamentally trouble me behind all these interpretations. 1. The benefit of gradient randomness should also apply to conventional learning algorithms, not just DNNs. But this implies our learning problem is not uniquely defined by the objective function and the learning model. If the numerical method conditions our solution, then mathematically this becomes an ill-posed problem. 2. Is the DNN really not overfitting? Practical datasets often exhibit continuity and regularity. A network memorizing millions of training points will not necessarily show overfitting behavior on a test set, since we are essentially interpolating within the memorized points. Only when we test the model on data outside its usual manifold can we really see the extrapolation issue, and many papers have shown that NNs can be easily fooled there. My gut feeling is that big data lets us perform interpolation almost all the time, but we are actually overfitting.
Can someone point me to a more formal statement of Ilya's claim that 'short programs generalize best'? I haven't had any luck on google yet. I feel like one has to make rather strong assumptions to show its validity.
Your green screen background looks sort of familiar. Was it generated by a neural network?

I've been playing with a very small image-storing neural network, 48x6 (based on Andrej Karpathy's ConvNetJS). It turns out it can reliably store as much information as can be encoded by all of its weights (no real surprise there). A 100x100-pixel RGB image requires 30,000 bytes to store without loss; the 48x6 network has roughly 12,000 weights, each taking 4 bytes to encode, giving a total of around 48,000 bytes of information (more than the uncompressed image itself would take up). It also seems able to fit more than that, somehow discarding irrelevant image information (like using a smaller colour space, or encoding large same-coloured areas somehow). The complexity of the image(s) seems to determine the network's capability in part.

I still find it fascinating that it can store a colour for random (x, y) coordinates being fed in, and for multiple images too (even ones that are randomly rotated, and you can associate the random position with another network input!). Shame I didn't have the patience to get a more decent resolution; it takes a long time for the smaller image details to start appearing. It also accepts binary input (you could probably use a base other than 2 as well) rather than the more common one-hot encoding method; the 'distance' between the codes has to be enough to avoid overlap, in the way I was experimenting.

The simple "x, y in, remember colour" network only seems to work if you feed it the (x, y) points randomly; trying to scan over the image doesn't seem to work at all. It reminds me of the Sierpinski triangle, where if you try to draw one without picking the direction of travel at random, it fails to work well, or at all.
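The byte-count arithmetic in the comment above can be sketched as follows (the layer sizes are my assumption: six fully-connected layers of 48 units each, float32 parameters):

```python
# Rough capacity estimate for a 48-wide, 6-layer fully-connected net.
width, layers = 48, 6
n_weights = (layers - 1) * width * width   # 11,520 weights between layers
n_biases = layers * width                  # 288 biases
total_bytes = (n_weights + n_biases) * 4   # 4 bytes per float32 parameter

image_bytes = 100 * 100 * 3                # uncompressed 100x100 RGB image
print(n_weights, total_bytes, image_bytes) # 11520 47232 30000
```

This matches the comment's figures: roughly 12,000 weights and about 48 KB of parameters, versus 30 KB for the raw image, so pure memorization is not implausible at this size.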
I generated these moving backgrounds using a simple CPPN network I wrote in PyTorch. Simply initialize a small, random fully-connected network and feed it (x, y) coordinates plus a moving latent vector (to create motion) as input. The output of the network is the color of each pixel on the screen. Try a few random seeds and architectures until you get something that looks good! Then you simply run the network at whatever resolution you want :)
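For anyone curious, here is a hedged NumPy sketch of that recipe (the original was in PyTorch; the layer sizes, latent dimension and activations here are my own guesses, not the channel's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny random MLP: input = (x, y) + 4-D latent, output = RGB in [0, 1].
layer_sizes = [2 + 4, 16, 16, 3]
weights = [rng.normal(0.0, 1.0, size=(a, b))
           for a, b in zip(layer_sizes, layer_sizes[1:])]

def cppn(coords, latent):
    """Map pixel coordinates (+ a latent vector) to RGB colors."""
    z = np.broadcast_to(latent, (coords.shape[0], latent.size))
    h = np.concatenate([coords, z], axis=1)
    for w in weights[:-1]:
        h = np.tanh(h @ w)                               # smooth hidden layers
    return 1.0 / (1.0 + np.exp(-(h @ weights[-1])))      # sigmoid -> colors

# Evaluate on a pixel grid at any resolution; animate by moving `latent`.
res = 64
ys, xs = np.mgrid[-1:1:res * 1j, -1:1:res * 1j]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
image = cppn(coords, np.zeros(4)).reshape(res, res, 3)
print(image.shape)  # (64, 64, 3)
```

Because the network is a smooth function of (x, y), re-running it with `res = 1024` gives the same pattern at higher resolution for free, which is the trick mentioned above.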
@@ArxivInsights If you have a look at a couple of videos I've done, you'll see exactly what I've been playing with: storing images and then warping them by adjusting other inputs that were left static while storing the images. An example of extreme overfitting, but it makes some nice, if rather low-resolution, videos - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-JFMaIxDXehU.html
Great video! Please tell me: what do I need to know to create my own Python deep learning framework? Which books and courses would give me the knowledge for this?
Arxiv Insights, thanks for the reply. Indeed, people found a long time ago that if you choose to solve an underdetermined linear system with SGD, where, as you know, there are infinitely many solutions, SGD will only give you the minimum-norm one.
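That minimum-norm behavior is easy to demonstrate: gradient descent on an underdetermined least-squares problem, started from zero, converges to the pseudoinverse (minimum-norm) solution, because the iterates never leave the row space of the system matrix. A small sketch, assuming full-batch gradient descent rather than SGD for determinism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined system: 3 equations, 10 unknowns -> infinitely many solutions.
A = rng.normal(size=(3, 10))
b = rng.normal(size=3)

# Plain gradient descent on ||Ax - b||^2, starting from x = 0.
x = np.zeros(10)
for _ in range(50_000):
    x -= 0.01 * A.T @ (A @ x - b)

# The minimum-norm solution, computed directly via the pseudoinverse.
x_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(x, x_min_norm, atol=1e-6))
```

Starting from a nonzero initialization would instead converge to the solution closest to that starting point, which is part of why initialization matters for implicit regularization.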
No, it's not surprising that they managed to fit perfectly. It's surprising they manage to fit perfectly *while also* generalizing well to unseen data. How many possible functions are there over the domain of all n by n images that classify a certain handful as planes and another handful as trucks? Now how many of those will also classify unseen images of planes as planes and unseen images of trucks as trucks?
Regarding the apparent clash between "simpler rules generalize better" and "deeper networks train better": it's important to note that the accuracy of a deep neural network rests on what it learns during training, and most of our decisions when designing deep learning models are motivated by making the network learn better and faster. Mathematically, the simplest set of rules will generalize best; but mathematically, we have no way to find that simplest set of rules. So we turn to deep learning, which instead gives us a very complicated set of rules, but in a short amount of time. I imagine that for the general problem of finding the simplest model that solves the kinds of problems we currently approach with deep learning, it will take humanity decades of research to make significant progress. Even then, I'm not convinced we'll have much more than a bag of heuristics and practical tricks reinforced by massive computational capacity.
Humans are in a supervised environment, though. In the early days, if we did something wrong, we simply died. Now it is more forgiving: if we do something wrong, we might lose some time or some money.
Your videos are filled with lots of knowledge, great work. But the number of videos and topics covered on your channel is so limited. Kindly upload at least one video every two weeks. BTW, thanks.