@Emm -- not sure how / if I can reply to your comment. An iso-surface is the set of points where a function f(x) takes a constant value, i.e. all x such that f(x) = c. For a Gaussian distribution, for example, this is an ellipse, shaped according to the eigenvectors and eigenvalues of the covariance matrix. So the iso-surfaces of theta1^2 + theta2^2 are circles, while the iso-surfaces of |theta1| + |theta2| look like diamonds. The iso-surface of the squared error on the data is also ellipsoidal, with a shape that depends on the data. Alpha scales the importance of the regularization term in the loss function, so higher alpha means more regularization. I didn't prove the sparsity assertion in the recording, but effectively, the "sharpness" of the diamond shape on the axes (specifically, the discontinuous derivative at, e.g., theta1 = 0) means that the optimum of the combined loss (data + regularization) can land at a point where some of the parameters are exactly zero. If the penalty is differentiable at those points, this will essentially never happen -- the optimum will almost always be at some (possibly small, but) non-zero value.
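To make the "exact zeros from the kink at theta = 0" point concrete, here is a small NumPy sketch of the lasso solved by iterative soft-thresholding (ISTA). Everything in it -- the function name, the synthetic data, the alpha value -- is made up for illustration, not taken from the video; the soft-threshold step is precisely where the non-differentiability of |.| maps small coefficients to exactly zero.

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=2000):
    """Minimize mean((y - X@w)**2) + alpha*sum(|w|) by iterative soft-thresholding."""
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = w - step * X.T @ (X @ w - y) / n  # gradient step on the data term
        # soft-thresholding: the kink of |.| at 0 maps small values to EXACTLY 0
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only feature 0 matters

w = lasso_ista(X, y, alpha=0.5)
print(w)  # features 1..4 come out exactly zero
```

With a smooth penalty like w**2 in place of |w|, the irrelevant coefficients would instead settle at small non-zero values, matching the point above.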
I have taken courses and put a lot of effort into reading material online, but your explanation is by far the one that will remain indelible in my mind. Thank you
Whoa, I wasn't ready for the superellipse -- that's a nice surprise. It helps me understand the limit case of p -> inf. It's also exciting to think about rational values of p, such as the 0.5 case. Major thanks for the picture at 7 minutes in. I learned about the concept of compressed sensing the other day, but didn't understand how optimization under a regularized L1 norm leads to sparsity. This video made it click for me. :)
Thank you for the great explanation. Some questions: 1. At 2:09 the slide says that the regularization term alpha x theta x thetaTranspose is known as the L2 penalty. However, going by the formula for Lp norm, isn't your term missing the square root? Shouldn't the L2 regularization be: alpha x squareroot(theta x thetaTranspose)? 2. At 3:27 you say "the decrease in the mean squared error would be offset by the increase in the norm of theta". Judging from the tone of your voice, I would guess that statement should be self-apparent from this slide. However, am I correct in understanding that this concept is not explained here; rather, it is explained two slides later?
+RandomUser20130101 1. "L2 regularization" is used loosely in the literature to mean either the Euclidean distance or the squared Euclidean distance. Certainly the L2 norm has a square root, and in some cases it matters (L2,1 regularization, for example; see en.wikipedia.org/wiki/Matrix_norm), but often it does not: it does not change the iso-surface shape, for instance. So there should exist values of alpha (the regularization strength) that make the two penalties equivalent; alternatively, the path of solutions as alpha is changed should be the same. 2. On "offset by the increase": these slides explain regularization generally; the (squared) norm of theta was introduced as a notion of "simplicity" in the previous slides, and I think it is not hard to see (certainly if you actually solve for the values) that the regression curve in the upper right of the slide at 3:27 requires large coefficients, causing a trade-off between the two terms. Two slides later comes the geometric picture in parameter space, which certainly also illustrates this trade-off.
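The "there should exist values of alpha that make them equivalent" claim can be checked numerically: the closed-form ridge solution for penalty alpha * ||w||^2 is also a stationary point of the un-squared penalty alpha' * ||w|| when alpha' = 2 * alpha * ||w||. The data and constants below are invented for the check, not from the video.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
n = len(y)

alpha = 0.3
# ridge (squared-norm penalty) solution, closed form: (X'X/n + alpha*I) w = X'y/n
w = np.linalg.solve(X.T @ X / n + alpha * np.eye(3), X.T @ y / n)

# gradient of MSE(w) = mean((X@w - y)**2) at the ridge solution
grad_mse = 2 * X.T @ (X @ w - y) / n
# stationarity of MSE + alpha' * ||w|| (un-squared) holds for alpha' = 2*alpha*||w||
alpha_prime = 2 * alpha * np.linalg.norm(w)
grad_unsquared = grad_mse + alpha_prime * w / np.linalg.norm(w)
print(np.max(np.abs(grad_unsquared)))  # ~0: same optimum, rescaled alpha
```

This is the one-point version of the path statement: as alpha sweeps, each squared-penalty solution reappears on the un-squared-penalty path at a rescaled regularization strength.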
Very few videos online give some key concepts here, like what we're truly trying to minimize with the penalty expression. Most just give the equation but never explain the intuition behind L1 and L2. Kudos man
Sometimes I wish some profs would present a YouTube playlist of good videos instead of giving their lectures themselves. This is explained so much better. There are so many good resources on the net -- why are there still so many bad lectures?
Just replace the "regularizing" cost term that is the sum of squared parameter values (the L2 penalty) with one that is the sum of the absolute parameter values (the L1 penalty).
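That one-term swap can be written out directly. A minimal sketch (the function names `ridge_loss` / `lasso_loss` and the toy inputs are mine, for illustration):

```python
import numpy as np

def ridge_loss(w, X, y, alpha):
    """Mean squared error + L2 penalty (sum of squared parameters)."""
    return np.mean((y - X @ w) ** 2) + alpha * np.sum(w ** 2)

def lasso_loss(w, X, y, alpha):
    """Same data term; only the penalty changes to the sum of absolute values."""
    return np.mean((y - X @ w) ** 2) + alpha * np.sum(np.abs(w))

w = np.array([1.0, -2.0])
X = np.eye(2)
y = np.zeros(2)
print(ridge_loss(w, X, y, alpha=1.0))  # 2.5 + 5.0 = 7.5
print(lasso_loss(w, X, y, alpha=1.0))  # 2.5 + 3.0 = 5.5
```

The data term is untouched; only the penalty changes, which is what moves the iso-surfaces from circles to diamonds.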