The lecture was so much fun. I used to think that deep learning is somehow very different from other ML algorithms, but I realize now that this is not the case. Still, the shocking thing was how accurate the neural net's predictions were at the end, based on nothing but the photo.
Beautiful and intuitive explanation of why SGD works for NNs even though it is such a naive optimization algorithm. Really illuminating. I've been struggling with this for months. Thank you! This course is by far the most insightful course on ML I've ever seen.
The intuition here was so satisfying. It all comes together at the end, when he points out that the sigmoidal activation functions people used to use (chosen to emulate neuronal activation) have flat parts that shrink the gradient. Not only is the slowed learning bad in itself, but it also dampens the ability of noisy SGD to escape the thin, deep wells that represent parameters ideal only for a SPECIFIC data set. In other words: the thin, deep wells = overfitting; the noise of SGD, together with a large learning rate alpha, escapes them; and a gradient slowed by the sigmoid's flat parts acts like an effective reduction in learning rate, which leads to getting trapped in those wells even with SGD, i.e. to overfitting. Just awesome.
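The "flat parts slow down the gradient" point is easy to see numerically. This is just my own toy sketch, not anything from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative s * (1 - s): peaks at 0.25 and vanishes in the flat tails
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # derivative is a constant 1 everywhere the unit is active
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(z))  # tiny at |z| = 10: learning nearly stalls
print(relu_grad(z))
```

At |z| = 10 the sigmoid's gradient is around 5e-5, so any weight feeding a saturated unit barely moves, which is the effective learning-rate reduction described above.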
2:00 Begin
2:25 Neural networks are just a simple extension of linear classifiers
5:00 Chain rule
15:30 Gradient descent
16:00 No longer working with convex functions, because of the transition function. Where we start matters; not the all-zero vector. Initialization is a big deal.
20:25 SGD is really important
28:45 We end up in one of the large holes (not necessarily deep) that we can't escape from. Throwing away training data and test data gives us a different function. Wider minima are less likely to change a lot. SGD can only find these!
33:40 Two tricks: mini-batches, and an initially large learning rate later lowered by a factor of 10.
37:40 If you wanted to do bagging with neural networks: ensemble several networks, and you don't need to resample.
38:50 Why they're called neural networks.
43:00 Discusses why ReLU, which is non-differentiable, is good at not getting trapped in local minima.
44:50 Demo
46:00 ReLU is better at complex problems, but not at smaller problems like the demo.
47:00 playground.tensorflow.org demo
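The two tricks at 33:40 (mini-batch updates plus dropping the learning rate by a factor of 10) can be sketched on a toy least-squares problem. This is my own minimal illustration; the data, schedule, and variable names are assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(3)     # fine for this convex toy; initialization matters more for NNs
lr = 0.1            # trick 2: start with a large learning rate...
batch_size = 20
for epoch in range(30):
    if epoch in (10, 20):
        lr /= 10.0  # ...then lower it by a factor of 10
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # trick 1: gradient estimated from a mini-batch, not the full data set
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

print(w)  # ends up close to w_true
```

The large initial rate makes fast, noisy progress; each factor-of-10 drop shrinks the noise so the iterate settles into the minimum it has found.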
Thank you so much for your teaching, this is the best content I found in four years in this field. I am close to applying for a PhD in your laboratory haha!
Great lecture! You made learning fun. The difficulty of understanding these concepts through Bishop's book versus through these lectures differs by a huge factor, even though there is no doubt that the book may be one of the best.
Hey, I'm planning to start Bishop's book after finishing this. Is that the right thing to do, or are you saying the book is just a more complexly written version of Prof. Kilian's lectures?
Am I the only one who is sad that I have only one more lecture to go? :( (Of course, I will probably come back to some of the classes but there is nothing like discovering it for the first time)
Thanks for the lecture! As you said, if you have enough data, the loss function on the training set is close to the one on the test set. But SGD tends to go to a wider local minimum anyway. In that case, wouldn't it be better to use another optimizer that puts you in a much narrower local minimum, since, as you said, the training function would be close to the test function?
The danger is that the loss surface changes as you switch to different (test) data. So a narrow minimum in the training set may actually not be very deep for the test data. Wider minima are often considered more stable.
First, I would like to thank you for these amazing lectures on ML! I just have a question about SGD. Can we say that SGD can escape local minima because the landscape of the single-sample (or mini-batch) loss function is different from the loss function of the whole dataset?
To some degree, but it is important that you are changing the mini-batch for every gradient update. Probably a better way to think about it is that it is not that easy to get stuck in local minima / saddle points. You need the precise gradient information to hit them exactly (a little like hitting the moon with a rocket: only if you aim very carefully will you be successful). If you estimate your gradient with a mini-batch, your gradients will be way too noisy to hit a local minimum, and you will shoot past it, eventually converging near a global minimum from which it is very hard to escape. Hope this helps.
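The "too noisy to hit the local minimum" picture can be reproduced in one dimension. This is my own construction (the loss shape and noise scale are assumptions): exact gradient descent started inside a narrow well stays trapped there, while the same descent with a noisy gradient estimate, standing in for mini-batch noise, gets kicked out and finds the wide basin.

```python
import numpy as np

def loss(x):
    # wide basin at x = 0 plus a thin, deep well near x = 2
    return x**2 - 2.0 * np.exp(-((x - 2.0) / 0.1) ** 2)

def grad(x):
    return 2.0 * x + 400.0 * (x - 2.0) * np.exp(-((x - 2.0) / 0.1) ** 2)

rng = np.random.default_rng(0)
lr = 0.002

# exact gradient descent, started inside the narrow well: stays trapped
x_gd = 1.99
for _ in range(5000):
    x_gd -= lr * grad(x_gd)

# same descent with a noisy gradient estimate (stand-in for mini-batch noise)
x = 1.99
visited = []
for _ in range(5000):
    x -= lr * (grad(x) + rng.normal(0.0, 100.0))
    visited.append(x)
best = min(visited, key=loss)

print(loss(x_gd), loss(best))  # the noisy run reaches the much lower wide basin
```

The thin well is simply too small a target for a 0.2-sized noisy step to stay inside, exactly the rocket-aiming point above.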
Yeah, there's a link for the homework, but not for the projects. Apparently Kilian does not share them publicly because the solutions might become available to future Cornell students.