The least sum of squared errors is possible only when the lambda value is zero (which is what I think is important, having a low error), but then the slope is high. For lambda = 40 the slope is low compared to lambda = 0, but the sum of squared errors is high compared to lambda = 0. Why are we looking to minimize the slope when the sum of squared errors is what matters?
We're not necessarily trying to minimize the slope. However, reducing the slope a little bit compared to the least squares slope might help the model perform better in the long run, on data it wasn't fit to. For details, see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Q81RR3yKn30.html
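To make the tradeoff concrete, here's a minimal sketch in plain NumPy (the data points are made up, purely for illustration), scanning candidate slopes for a no-intercept line and picking the one that minimizes SSR + lambda * slope^2:

```python
import numpy as np

# Hypothetical toy data (weight vs. height style)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.8])

slopes = np.linspace(0.0, 1.5, 301)  # candidate slopes for a no-intercept line

for lam in (0.0, 10.0, 40.0):
    # Ridge loss for every candidate slope: SSR + lambda * slope^2
    ssr = ((y[None, :] - slopes[:, None] * x[None, :]) ** 2).sum(axis=1)
    loss = ssr + lam * slopes ** 2
    best = np.argmin(loss)
    print(f"lambda={lam:5.1f}  best slope={slopes[best]:.2f}  SSR there={ssr[best]:.3f}")
```

The winning slope shrinks as lambda grows while the training SSR rises; the bet is that the smaller slope does better on new data the line wasn't fit to.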
Was about to start this topic. Thanks @Statquest! I just have a quick question: when should we expect your video on Neural Networks? And I have a request: could you add your upcoming videos to your website in a separate section, listing each topic and the date or month it will be uploaded? It would be very helpful for students deciding whether to buy a subscription plan for your channel in order to get early access to your videos.
Disagree ... the overall Lasso/Ridge loss is basically the fixed base SSE plus a very curved quadratic as a function of slope; if the curved quadratic overtakes the fixed SSE curve, then the vertex of the combined curve would surely lie at the minimum of the curved quadratic.
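For what it's worth, the one-parameter, no-intercept case can be worked out exactly, and the combined ridge curve is itself a parabola:

$$\sum_i (y_i - m x_i)^2 + \lambda m^2 \quad\Longrightarrow\quad m^* = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \lambda}$$

So as $\lambda$ grows, the vertex of the sum slides toward the penalty's minimum at $m = 0$, but for any finite $\lambda$ it never actually reaches it (assuming $\sum_i x_i y_i \neq 0$).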
Hello Josh, Ridge and Lasso clearly visualized :) I must say that the one thing that makes your videos clearly explained to curious minds like me is the visual illustrations you provide in your stats videos. Glad. Thank you very much for your efforts.
You are truly an angel. Your videos on Ridge, Lasso and Elastic Net really helped with my understanding. They're way better than the lectures at my university.
I was wondering why I missed out on this video while going through the ones on Ridge and Lasso Regression from Sept-Oct 2018. Then I noticed this is a video you put out only a few days ago. Awesome. Much gratitude from Malaysia. 🙇
This channel is saving my ass when it comes to my applied ML class. It's so frustrating when a dude who has been researching Lasso for 10 years just breaks out some linear algebra derivation and then acts like you're supposed to instantly understand it... Thanks for taking the time to come up with an exposition that makes sense.
Thanks, Josh, for this amazing video. I promise to support this channel once I land a job offer as a data scientist. This is the only video on YouTube that practically shows all the algorithms.
Hi Josh, please accept my heartfelt thanks for such a wonderful video. I guess your videos are an academy in themselves. Just follow along with your videos and BAM!! You are a master of Data Science and Machine Learning. 👏
Thank you for your work as always. It's AWESOME. I just have some questions. Why is there a kink in the SSR curve for Lasso Regression? Is it because we are adding lambda * |slope|, which is a linear component? And does the curve for Ridge Regression stay a parabola because we are adding lambda * slope^2, which is a parabolic component?
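That intuition can be checked numerically. Here's a small sketch (with made-up data) that estimates the derivative of each loss curve just to the left and right of slope = 0:

```python
import numpy as np

# Hypothetical toy data, purely for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.1, 1.4])
lam = 10.0

def ssr(m):
    return np.sum((y - m * x) ** 2)

eps = 1e-4
for name, penalty in [("lasso", lambda m: lam * abs(m)),
                      ("ridge", lambda m: lam * m ** 2)]:
    left = (ssr(0.0) + penalty(0.0) - ssr(-eps) - penalty(-eps)) / eps
    right = (ssr(eps) + penalty(eps) - ssr(0.0) - penalty(0.0)) / eps
    print(f"{name}: derivative goes from {left:.1f} to {right:.1f} at slope = 0")
```

For lasso the derivative jumps by 2 * lambda at zero, and that jump is the kink; for ridge the left and right values match, so the curve stays a smooth parabola.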
Thank you!!!!! I have a question: do you have a video on time series models or time series forecasting? Please, please make those videos with your amazing explanations!!!! :):)
Many people on the Internet explain regularization of regression using polynomial features, as if ridge and lasso were for reducing the curvature of the line, but in that case you really just need to find the right degree of the polynomial. You are one of the few who have shown the real essence of regularization in linear regression: the bottom line is that we simply penalize the model, trading a little bias for lower variance through changes in the slope. By the way, real overfitting in regression shows up clearly in data with a large number of features, some of which are strongly correlated with each other, combined with a relatively small number of samples, and that is where L1/L2 regularization is useful. Thank you so much for a very good explanation.
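That overfitting setup is easy to simulate. Here's a rough sketch with NumPy and scikit-learn (the data is synthetic and the parameter values are just illustrative, so exact numbers will vary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n, p=20):
    # Columns share a common factor, so they are strongly correlated
    base = rng.normal(size=(n, 1))
    X = base + 0.1 * rng.normal(size=(n, p))
    y = X[:, 0] + rng.normal(scale=0.5, size=n)  # only one feature truly matters
    return X, y

X_train, y_train = make_data(n=30)    # few samples relative to the features
X_test, y_test = make_data(n=1000)

for model in (LinearRegression(), Lasso(alpha=0.1)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: test MSE = {mse:.3f}")
```

Plain least squares tends to overfit in this regime, and the penalized fit usually has a noticeably lower test error.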
You simply amaze me with each of your videos. The best part is that the way you explain stuff is so original and simple. I would really love it if you could also pen a book on AI/ML. It would be a bestseller, I reckon. Keep up the good work and keep enlightening us :)
Really excellent video, Josh. You consistently do a great job, and I appreciate it. Could you make a video showing the use of Ridge regression, and especially Lasso regression, for parameter selection? I had to do that once, and it is complicated. From your example it seems that using neither penalty gives you the best result. So, in what circumstances do you want to use the penalty to improve your result? If you are using lasso regression to find the top 3 predictive parameters, how does this work? What are the dangers? How do you use it optimally? A complicated subject for sure! I'm sorry if this is covered in your videos on Lasso and Ridge regression individually; I am watching them next. I agree with your naming convention, btw: squared and absolute-value penalty is MUCH more intuitive!
Watch the other regularization videos first. I cover some of what you would like to know about parameter selection in my video on Elastic-Net in R: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ctmNq7FgbvI.html
@@statquest I will check out those videos, thanks. I actually did use elastic net regularization. The whole issue is complex (for somebody without a decent stats background) because I couldn't find the framework of how everything works covered both well AND simply anywhere, without going down several pretty deep rabbit holes. Some of the parameter selection methods that were suggested depended on the assumption that the parameters were independent, which was NOT the case in my situation. I'm still not sure what the best approach would have been.
@@statquest As an additional note, I've always found that examples and exercises are even more important than theory, while theory is essential at times too. In many math classes concepts were laid out in formal and generalized glory, but I couldn't get the concept at all until I put hard numbers or examples to it. It's probably not the subject of your channel or in your interest, but I think some really hand-holding examples of using these concepts in some kaggle projects, or going through what some interesting papers did, would be a great way of bringing the theory and the real world together.
@@omnesomnibus2845 I do webinars that focus on the applied side of all these concepts. So we can learn the theory, and then practice it with real data.
Another example of why lasso and ridge just make no sense. The best fit was plain linear regression. Lasso and ridge both made the line fit worse. So I conclude that lasso and ridge are pointless and best avoided.
Again, the datasets are small in order to show the concepts behind how these methods work. In practice, we have more data, but not so much that we can be super confident that whatever line we fit will accurately make predictions on new data that the original line was not fit to. Ridge and Lasso Regression compensate for not having a sufficient amount of data to begin with.
I agree with you; these simple examples don't show any benefit. And to all the people saying "Thank you, you saved me", I want to know what they did with this knowledge. Repeat it like a parrot on a test?
Excellent. I have just one question. In the case of the L1 penalty, isn't the line with lambda = 40 (i.e., slope 0) a bad line? I mean, with the blue line we were getting a better fit, since it didn't completely ignore weight when predicting height and its sum of squared residuals was the smallest.
And that's the reason why lasso does a kind of feature selection and sets many weights to 0, compared to ridge regression. Now I know the reason behind it. Thanks a lot ❤
Fortunately, I asked you :) I agree, squared and absolute penalty are better word choices for these regularization methods. Thanks again for making my ML at Scale course a tad bit easier.
Thanks a lot for these awesome videos; you deserve millions of followers and a lot of credit :) I just love them, and they are KISS: so simple and understandable. I owe you a lot of thanks and credit :D
You are a super professor and I'll give you an infinity BAM!!!!!!!!!! I really like the way you repeat the earlier discussed topics to refresh the student's memory; that is really helpful, and you have a lot of patience. Once again you proved that a picture is worth a thousand words.
I've never been good with this kind of math/statistics, because when I encounter the formulas in books I tend to forget or not understand the symbols. Your videos make it possible to go beyond the notation, learn the ideas behind these concepts, and apply them in machine learning. Thank you!
I believe it is because when the parameters start to get smaller, the squared penalty becomes much smaller relative to the parameter itself. For example, when the parameter is 0.01, the squared penalty is 0.01^2 = 0.0001. In contrast, the absolute value penalty remains as large as the parameter: when the parameter is 0.01, the absolute value penalty is 0.01, which is much larger than the squared penalty for the same parameter value.
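The same point in terms of derivatives, which is what actually pulls a coefficient around during fitting:

$$\frac{d}{dm}\left(\lambda m^2\right) = 2\lambda m \to 0 \text{ as } m \to 0, \qquad \frac{d}{dm}\left(\lambda \lvert m \rvert\right) = \pm\lambda \text{ for } m \neq 0$$

Ridge's pull toward zero fades away as the coefficient shrinks, while lasso keeps pulling with constant strength $\lambda$, which is what lets it land exactly on zero.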
Amazing video as always, Josh! Just to be sure I got it correctly: the plot of the RSS error against the slope is a parabola in 2D. So when we do the same thing in 3D, i.e., with 2 parameters, does it represent the same bowl-shaped cost function that we try to minimize?
It looks linear, but it's just that the sum of the squared residuals really starts to dominate the equation in a huge way, just like how squaring residuals makes outliers dominate an ordinary regression.
Ridge Regression (the L2 norm) never shrinks coefficients all the way to zero, but Lasso Regression (the L1 norm) may shrink coefficients to exactly zero, and that's the reason Lasso can perform feature selection while Ridge can't.
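You can see this directly with scikit-learn. A quick sketch on synthetic data (all the numbers here are just illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
# Only the first two features actually influence y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefs:", np.round(ridge.coef_, 3))  # all small, but none exactly zero
print("lasso coefs:", np.round(lasso.coef_, 3))  # the irrelevant ones come out exactly 0.0
```

With a lasso alpha of this size, the six useless coefficients typically come out as exact zeros, while ridge leaves them small but nonzero.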
I have a doubt about this video. At timestamp 3:43 you say, "The residuals are smaller than before, so the Sum of Squared Residuals is smaller than before...". This line is not clear to me. For the same slope value and a given lambda, how can [Sum of Squared Residuals (SSR) + penalty] be less than [SSR alone]? For example, say at slope = 0.45 the SSR is 0 when we have not applied Ridge. After Ridge, the loss function is SSR + 10 * 0.45^2 = 0 + 2.025 = 2.025. So when we apply Ridge, the loss is slightly increased, which is also visible in the orange parabola.

The second point I have not understood is at timestamp 5:06, where the note says, "We can also see that when lambda = 10, the lowest point in the parabola is closer to 0 than when lambda = 0." But the lowest point of the orange parabola is slightly further from zero than that of the blue parabola. As per my understanding, that is what Ridge does: it increases the loss a little bit by moving the lowest point of the parabola away from zero, to make the model less prone to overfitting. And it does so by reducing the slope, so that the model is not oversensitive to small changes in a parameter's value. I am sure I am missing some piece of the puzzle here. Thanks and regards.
At 3:43 we are just comparing the residuals for slope = 0 to slope = 0.2. When the slope is increased from 0 to 0.2, the residuals are smaller. As a result, the sum of the squared residuals is also smaller for slope = 0.2 than for slope = 0. Now, when we add the regularization penalty to the equation, things are a little more interesting; however, the decrease in the residuals more than compensates for the increase in the regularization penalty.
I guess I'm still not following why ridge regression can't reduce the slope to zero... For the ridge penalty, 0^2 = 0, and for lasso, abs(0) = 0; they both can equal zero...?
I'm not sure I understand what you are saying with the math. If the slope is 0, then what you say is correct, 0^2 = 0. But the slope doesn't start out that way. How does it get there if the bottom of the curve is always > 0?
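One way to see "how it gets there" is to watch the optimization itself. Here's a rough single-coefficient sketch (toy numbers; the lasso update uses the standard soft-threshold step, which is how coordinate-descent solvers handle |m|):

```python
# Pretend the data part of the loss is (m - 1)^2, i.e. least squares wants m = 1.
lam, lr = 4.0, 0.1

m_ridge = m_lasso = 1.0
for _ in range(100):
    # Ridge: full gradient of (m - 1)^2 + lam * m^2 is 2*(m - 1) + 2*lam*m
    m_ridge -= lr * (2 * (m_ridge - 1) + 2 * lam * m_ridge)
    # Lasso: gradient step on (m - 1)^2, then soft-threshold by lr * lam
    m_lasso -= lr * 2 * (m_lasso - 1)
    m_lasso = max(abs(m_lasso) - lr * lam, 0.0) * (1 if m_lasso >= 0 else -1)

print(f"ridge ends at {m_ridge:.6f}")  # settles at 1 / (1 + lam) = 0.2, never exactly 0
print(f"lasso ends at {m_lasso:.6f}")  # exactly 0.0 once the threshold bites
```

The ridge gradient shrinks in proportion to m, so each step only moves a fraction of the remaining distance and the coefficient settles at a small nonzero value; the lasso threshold subtracts a fixed amount each step, so the coefficient can be pushed all the way to exactly zero.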
OMG!!! I always thought that Ridge was the better method for dealing with overfitting, because it adds a squared penalty to the cost function, which I assumed shrinks weights more heavily and gets them close to 0 faster. Now you've changed my mind.
Thank you very much for the video!!! All your videos are really easy to understand... thanks a lot. Could you please upload a video on the SCAD (Smoothly Clipped Absolute Deviation) regularization method?
Amazing series on regularization (as usual). I just didn't quite understand why in ridge regression the weights/parameters never reach zero. I didn't give it much thought, but it didn't pop right out at me like it usually does in your videos, lol. But again, great series!
"Why can't it be zero?" I struggled with this idea. But I think I finally found it! Could that be the reason for this? Slopes = (x ^ T.X + aI) ^ - 1 * (X ^ T.y) equation ... Even if we increase a (Reg Term) infinitely, it cannot be zero!
All your videos are great, but the regularization ones have been a fantastic help. Was wondering if you were planning any on selective inference from lasso models? That would complete the set for me haha
Hi, thanks for this explanation, it really helped! At my previous workplace almost everyone said that lasso could be used for feature selection, and it was treated as a given, no matter what the lambda value was. But it solely depends on the lambda, right? It may not remove any features at all? And increasing the lambda value to the maximum isn't always the most beneficial?
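Right: whether any coefficients actually hit zero depends entirely on lambda. A quick way to check on your own data is to sweep lambda and count the zeroed coefficients; here's a sketch with scikit-learn (synthetic data, values illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
# Only the first three features truly matter
y = X @ np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0, 0, 0]) + rng.normal(size=200)

for alpha in (0.001, 0.01, 0.1, 1.0, 10.0):
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    print(f"alpha={alpha:6.3f}  features kept: {np.sum(coef != 0)} of {len(coef)}")
```

Tiny alphas keep everything (no selection at all), and huge alphas delete everything, including the truly useful features; the sweet spot is usually chosen by cross-validation (e.g., LassoCV).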