That part on visualization via information theory blows my mind, and your explanation is really concise and instructive. The inspiring problems you raised in the latter part of the video are also well-observed contradictions backed by strong academic arguments, unlike some of the general nonprofessional objections raised by random non-tech people. It's fortunate for us to have YouTubers like you on the Internet. Thanks a lot.
Tishby is approaching the problem from classical information theory, while the idea that the shortest model is best comes from algorithmic information theory. Is the network synthesizing a program, or compressing based on statistical regularities? If Tishby is right, it is the latter, and I suspect he is. In that case neural networks are necessarily blind to certain regularities, just as Shannon entropy is. For example, the entropy of pseudorandom numbers, like the digits of pi, is high, while their Kolmogorov complexity is low. If networks are compressing, they need ever more parameters to encode the input; if they are learning optimal programs to represent the data, the rate of growth will be much lower (logarithmic rather than linear). Hector Zenil has recently done some interesting work in this field; check him out.
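On the entropy-vs-Kolmogorov point, here is a short, hedged sketch (using Python's seeded PRNG as the pseudorandom source, since producing digits of pi would need a spigot algorithm). The generating program is only a few lines long (low Kolmogorov complexity), yet the empirical Shannon entropy of its output is near the maximum log2(10), which is exactly the blindness described above:

```python
import math
import random

def empirical_entropy(symbols):
    """Shannon entropy in bits per symbol, from observed frequencies."""
    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A tiny seeded program (low Kolmogorov complexity) emits a digit stream
# whose empirical Shannon entropy is close to the maximum, log2(10) ~ 3.32.
rng = random.Random(0)
digits = [rng.randint(0, 9) for _ in range(100_000)]
print(round(empirical_entropy(digits), 3))
```

An entropy-based compressor sees this stream as nearly incompressible, even though a program of a few dozen bytes reproduces it exactly.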
Brilliant comment, thx for adding this! I would tend to agree that current deep learning models are basically doing representational compression rather than program search. However, I feel like fundamentally solving the generalization problem might require new ways of leveraging neural nets (not necessarily trained via SGD) to allow for model-based, algorithmic reasoning, much like the scientific process, where hypotheses are posited and subsequently rejected or refined through observation.
Concerning neural networks and the shortest model, there is Schmidhuber's 1997 paper "Discovering neural nets with low Kolmogorov complexity and high generalization capability", where they directly searched for the least complex neural net and showed it has low generalization loss. I also want to mention Max Tegmark's speculative idea that neural networks have an intrinsic bias towards learning physically meaningful functions. See for example "AI for physics & physics for AI": ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-pkJkHB_c3nA.html&ab_channel=MITCBMM
It is extremely low. I've had to turn my amplifier well beyond a safe volume to hear the embedded video. If RU-vid decides to insert an advert, it runs a real risk of breaking my speakers.
I worked with the article from the video, setting up my own experiments. The set of input data in the article is completely unrelated to reality; it is constructed so that the mutual information is easier to calculate. When conducting experiments with MNIST data, the result is strikingly different: there is no compression stage on any ReLU layer. You can also see this in the article below: arxiv.org/pdf/2004.14941.pdf
Beautiful summary. Tishby's lectures often get bogged down in the details. You give a nice clear overview. I recommend this version of Tishby's talk on the Information Bottleneck: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-EQTtBRM0sIs.html He's frequently interrupted by the faculty, but I think he makes his points better as a result.
One nice thing about your videos is that I can watch them 50 times and learn new things every time. Best ML YouTube channel! Please consider doing some collabs to get more viewers.
Thanks a lot for your videos! I love your detailed explanations, as they have always been very useful when I wanted to dig deeper. As you know, there are others who create videos about neural networks, but yours are the only ones that go to the point where I end up with a decent understanding of the important concepts. That's extremely valuable to me, thank you so much!
Very nice video, although the audio volume is a bit low. I was easily able to listen to and understand multiple videos I watched prior to this one at just 20% volume, and I struggle to understand what he or the professor is saying at 100% volume.
I love your channel! All your videos are clear and beautifully explained. This was a great video to watch after I watched Tishby's full talk: great summary of his results and presentation, and that's a fascinating line of research. Cheers :-)
Link to Part 1 of this series: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-McgxRxi2Jqo.html (Feature visualisation) Part 2: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-4rFOkpI0Lcg.html (Adversarial examples)
Hey, thanks for your helpful videos! => I think you misspelled Ilya Sutskever in your video description! In this video from MIT, ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-9EN_HoEk3KY.html, his name is written Sut-s-keve-r. Cheers!
I remember that a couple of years ago he had a problem publishing this. Has it been published since? Thanks. Otherwise, extremely interesting research, indeed.
@20.53 About the shortest program that best explains the data: I think he is referring to entropy, i.e. the fewer elements of surprise a program has, the more it has learnt about the patterns and regularities of the data (refer to entropy). A similar concept is used in autoencoders, I believe, where we compress the information into fewer nodes than there are in the input layer (so if compression is achieved, it means the learning has been done very well; kind of a measure of learnedness, or memorisation). Very useful in reinforcement learning. *I am a novice learner trying to understand the mechanics and gain insight, so please refute my points if you know they are not exactly correct or off track...
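The bottleneck intuition in the comment above can be illustrated with the simplest possible case, a purely linear "autoencoder" (an assumption for illustration: real autoencoders are nonlinear and trained with SGD, whereas here the optimal linear compression is read off directly from an SVD). When the data is redundant, fewer numbers than input dimensions suffice to reconstruct it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Redundant data: 4-D points that actually live on a 2-D plane.
basis = rng.normal(size=(2, 4))
data = rng.normal(size=(200, 2)) @ basis

# Linear "autoencoder": the top-2 right singular vectors give the
# optimal 2-D bottleneck (encoder) and its reconstruction (decoder).
_, _, vt = np.linalg.svd(data, full_matrices=False)
encoder = vt[:2].T   # 4 -> 2: compress to two numbers per point
decoder = vt[:2]     # 2 -> 4: reconstruct the original coordinates
recon = data @ encoder @ decoder

print(np.allclose(recon, data))  # True: compression lost nothing here
```

Because the data is exactly rank 2, the two-number code is lossless; with noisier, higher-rank data the same bottleneck would force the model to keep only the dominant regularities.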
You not doing this more often is a crime against humanity. But I understand, you probably do other important things. Very interesting videos, thank you!
I think of it like this: two random values have a much larger Hamming distance between them than two non-random values. So the randomness makes the distinction space between two values bigger and much clearer, and the network will learn much better on random data. The question then is how you get the useful data out of an almost-random, noisy network output. The answer is to use something known from information theory for recovering a signal from near-noise: Barker codes. So you use Barker-code neurons in the network and can thereby tolerate much more noise, resulting in better learning and better distinction. OK, now I have to charge you something for this.
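For what it's worth, the Barker-code property mentioned above is easy to check: a Barker sequence's aperiodic autocorrelation has a sharp peak at zero shift and sidelobes of magnitude at most 1, which is what lets a known pattern be picked out of heavy noise. A quick sketch with the length-13 code:

```python
# Length-13 Barker code: sharp autocorrelation peak, tiny sidelobes.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

def autocorrelation(code, shift):
    """Aperiodic autocorrelation of the code with a shifted copy of itself."""
    return sum(a * b for a, b in zip(code, code[shift:]))

peaks = [autocorrelation(barker13, k) for k in range(len(barker13))]
print(peaks)  # peak of 13 at zero shift; every sidelobe has magnitude <= 1
```

Whether this buys anything inside a neural network is speculation on the commenter's part; the code only demonstrates the signal-processing property itself.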
The graphic used to show that more layers reduce the duration of the compression phase is confusing to me: how do we know that the 1-layer MLP has the representational capacity to overfit on MNIST?
@@ArxivInsights I am still glad you do videos though :) Talking about Zhang: I can still remember a paper from Krueger ("Deep nets don't learn via memorization"), where the researchers pointed out that learning random data and learning real data are completely different tasks. For obvious reasons, algorithms can't learn irrational data. This is at best a trivial exception to existing theories (like Goodfellow's). The main contribution of Zhang's paper is probably showing the overwhelming capacity of neural networks. Tishby's theory got far less interesting with Saxe ("On the information bottleneck theory of deep learning"). They showed that there is no necessary connection between compression and generalization: they found cases with no compression phase where the networks were still able to generalize, meaning the findings do not hold in the general case. I think this field is so freakishly interesting because a lot of researchers don't even dare to touch it. Nobody can really tell why and how DNNs make decisions. I am curious to find an answer :)
Hi, great video! I just don't see how Inception "somehow manages to get a much better test accuracy on the true test set when trained on partially corrupted labels" @ 3:47. Could someone explain?
The standard of your videos is something out of this world. We are not only learning about programming; we are also learning to behave professionally and to love what we do. This mindset leads to better outcomes in our communities today. If the world keeps producing this kind of behavior, everything will be good.
At a first guess, I'd put forward that it's randomness in time, rather than randomness in the learning function (SGD), that is the difference between an artificial network and a biological one. I'd put this down to signals being processed over biological connections (axons) of random length, and therefore with random timing, since the signals take a finite time, however small, to traverse the connections. It would just be a difference in signal encoding between an artificial network (continuous values, with no need for a concept of time) and a biological one (pulsed signals, where the actual 'value' could be determined by pulse density, using a time-averaging activation function).
Came for autoencoders, stayed for your great explanations! (especially the GAN video, finally understood the meaning behind competition between neural nets). By the way, do you intend to focus on one field (e.g. GAN or reinforcement learning), or to cover all the main topics (e.g. Feed Forward NN, RNN, and CNN)? Also, will you do some Q&A related to neural networks? Thanks Xander!
@4:57 To a human there might be no structure left after "random" corruption, but to the machine there might still be some statistical structure in the underlying manifold of the newly created distribution.
Can you also do videos with specific examples from recurrent neural network models and LSTMs? Most of your existing examples involve images, feedforward networks and CNNs.
Thank you for the great video. Two questions fundamentally trouble me behind all these interpretations. 1. The benefit of gradient randomness should also apply to conventional learning algorithms, not just DNNs. But this implies our learning problem is not uniquely defined by the objective function and the learning model. If the numerical method conditions our solution, then mathematically this becomes an ill-posed problem. 2. Is the DNN really not overfitting? Practical datasets often exhibit continuity and regularity. A network memorizing millions of training points will not necessarily show overfitting behavior on a test set, since we are essentially interpolating within the memorized points. Only when we test the model on data outside its usual manifold can we really see the extrapolation issue, and many papers have shown that NNs can be easily fooled there. My gut feeling is that big data lets us perform interpolation almost all the time, but we are actually overfitting.
Can someone point me to a more formal statement of Ilya's claim that 'short programs generalize best'? I haven't had any luck on google yet. I feel like one has to make rather strong assumptions to show its validity.
Your green screen background looks sort of familiar. Was it generated by a neural network?

I've been playing with a very small image-storing neural network, 48x6 (based on Andrej Karpathy's ConvNetJS). It turns out it can reliably store as much information as can be encoded by all of its weights (no real surprise there). A 100x100-pixel RGB image requires 30,000 bytes to store without loss; the 48x6 network has roughly 12,000 weights, each taking 4 bytes to encode, giving a total of around 48,000 bytes of information (more than the uncompressed image itself would take up). It also seems able to fit more than that, somehow discarding irrelevant image information (like using a smaller colour space, or encoding large same-coloured areas somehow). The complexity of the image(s) seems to determine the network's capability in part.

I still find it fascinating that it can store a colour for random (x, y) coordinates being fed in, and for multiple images too (even ones that are randomly rotated, and you can associate the random position with another network input!). Shame I didn't have the patience to get a more decent resolution; it takes a long time for the smaller image details to start appearing. It also accepts binary input (you could probably use a base other than 2 as well) rather than the more common one-hot encoding method; the 'distance' between the codes has to be enough to avoid overlap, in the way I was experimenting.

The simple "x, y in, remember colour" network only seems to work if you feed it the (x, y) points randomly; trying to scan over the image doesn't seem to work at all. It reminds me of the Sierpinski triangle, where if you try to draw one without picking the direction of travel at random, it fails to work well, or at all.
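The byte-count arithmetic in the comment above can be sketched as follows (the layer sizes are my assumption: six fully-connected layers of 48 units each, float32 parameters):

```python
# Rough capacity estimate for a 48-wide, 6-layer fully-connected net.
width, layers = 48, 6
n_weights = (layers - 1) * width * width   # 11,520 weights between layers
n_biases = layers * width                  # 288 biases
total_bytes = (n_weights + n_biases) * 4   # 4 bytes per float32 parameter

image_bytes = 100 * 100 * 3                # uncompressed 100x100 RGB image
print(n_weights, total_bytes, image_bytes) # 11520 47232 30000
```

This matches the comment's figures: roughly 12,000 weights and about 48 KB of parameters, versus 30 KB for the raw image, so pure memorization is not implausible at this size.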
I generated these moving backgrounds using a simple CPPN network I wrote in PyTorch. Simply initialize a small, random fully-connected network and feed it (x, y) coordinates plus a moving latent vector (to create motion) as input. The output of the network is the color of each pixel on the screen. Try a few random seeds and architectures until you get something that looks good! Then you simply run the network at whatever resolution you want :)
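For anyone curious, here is a hedged NumPy sketch of that recipe (the original was in PyTorch; the layer sizes, latent dimension and activations here are my own guesses, not the channel's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny random MLP: input = (x, y) + 4-D latent, output = RGB in [0, 1].
layer_sizes = [2 + 4, 16, 16, 3]
weights = [rng.normal(0.0, 1.0, size=(a, b))
           for a, b in zip(layer_sizes, layer_sizes[1:])]

def cppn(coords, latent):
    """Map pixel coordinates (+ a latent vector) to RGB colors."""
    z = np.broadcast_to(latent, (coords.shape[0], latent.size))
    h = np.concatenate([coords, z], axis=1)
    for w in weights[:-1]:
        h = np.tanh(h @ w)                               # smooth hidden layers
    return 1.0 / (1.0 + np.exp(-(h @ weights[-1])))      # sigmoid -> colors

# Evaluate on a pixel grid at any resolution; animate by moving `latent`.
res = 64
ys, xs = np.mgrid[-1:1:res * 1j, -1:1:res * 1j]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
image = cppn(coords, np.zeros(4)).reshape(res, res, 3)
print(image.shape)  # (64, 64, 3)
```

Because the network is a smooth function of (x, y), re-running it with `res = 1024` gives the same pattern at higher resolution for free, which is the trick mentioned above.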
@@ArxivInsights If you have a look at a couple of videos I've done, you'll see exactly what I've been playing with: storing images and then warping them by adjusting other inputs that were left static while storing the images. An example of extreme overfitting, but it makes some nice, if rather low-resolution, videos - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-JFMaIxDXehU.html
Great video! Please tell me: what do I need to know to create my own Python deep learning framework? Which books and courses would give me the knowledge for this?
Arxiv Insights, thanks for the reply. Indeed, people found a long time ago that if you choose to solve an underdetermined linear system with SGD, where, as you know, there are infinitely many solutions, SGD will only give you the minimum-norm one.
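That minimum-norm behavior is easy to demonstrate: gradient descent on an underdetermined least-squares problem, started from zero, converges to the pseudoinverse (minimum-norm) solution, because the iterates never leave the row space of the system matrix. A small sketch, assuming full-batch gradient descent rather than SGD for determinism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined system: 3 equations, 10 unknowns -> infinitely many solutions.
A = rng.normal(size=(3, 10))
b = rng.normal(size=3)

# Plain gradient descent on ||Ax - b||^2, starting from x = 0.
x = np.zeros(10)
for _ in range(50_000):
    x -= 0.01 * A.T @ (A @ x - b)

# The minimum-norm solution, computed directly via the pseudoinverse.
x_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(x, x_min_norm, atol=1e-6))
```

Starting from a nonzero initialization would instead converge to the solution closest to that starting point, which is part of why initialization matters for implicit regularization.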
No, it's not surprising that they managed to fit perfectly. It's surprising they manage to fit perfectly *while also* generalizing well to unseen data. How many possible functions are there over the domain of all n by n images that classify a certain handful as planes and another handful as trucks? Now how many of those will also classify unseen images of planes as planes and unseen images of trucks as trucks?
Regarding the apparent clash between "simpler rules generalize better" and "deeper networks train better": it's important to note that the accuracy of a deep neural network rests on what it learns during training, and most of our decisions when designing deep learning models are motivated by making the network learn better and faster. Mathematically, the simplest set of rules will generalize best; but mathematically, we have no way to find that simplest set of rules. So we turn to deep learning, which instead gives us a very complicated set of rules, but in a short amount of time. I imagine that for the general problem of finding the simplest model that solves the kinds of problems we currently approach with deep learning, it will take humanity decades of research to make significant progress. Even then, I'm not convinced we'll have much more than a bag of heuristics and practical tricks reinforced by massive computational capacity.
Humans are in a supervised environment, though. In the early days, if we did something wrong, we simply died. Now it is more forgiving: if we do something wrong, we might lose some time or some money.
Your videos are filled with lots of knowledge, great work. But the number of videos and topics covered on your channel is so limited. Kindly upload at least one video every two weeks. BTW, thanks.