This lecture series is such a great service to the community! The pace is perfect and the practical examples are on point, not lost in the theory. Many many thanks.
Hi Kilian. I had a doubt. I've often heard people talk about making your dataset balanced, which basically means (taking the spam classification problem as an example) that the number of spam instances should be close to the number of non-spam instances. But when you're applying algorithms like naive Bayes, which take the prior into consideration, shouldn't we just leave the data unbalanced, since that actually captures how emails appear in a real inbox? Also, thank you for the amazing lectures!
The issue with highly imbalanced data is that it is just tricky to optimize the classifier. E.g. if 99% of your data is class +1 and 1% class -1, then just predicting +1 for all samples gives you 99% accuracy, but you haven't actually learned anything. My typical approach is to re-balance the data set while still keeping all samples, by assigning smaller weights to the more common classes, so that the weights add up to be the same for all classes.
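A minimal sketch of that re-weighting scheme (using NumPy and made-up labels; the exact inverse-frequency formula is my assumption of what "weights add up to be the same for all classes" looks like in code, not something stated in the reply):

```python
import numpy as np

# Hypothetical imbalanced labels: 99% class +1, 1% class -1.
y = np.array([+1] * 99 + [-1] * 1)

# Give each sample a weight inversely proportional to its class frequency,
# so the total weight per class comes out the same.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weight = np.array([class_weight[label] for label in y])

# Total weight is now equal across classes.
for c in classes:
    print(c, sample_weight[y == c].sum())  # both classes sum to 50.0
```

Most libraries accept such per-sample weights directly (e.g. a `sample_weight` argument to a fit routine), so no samples need to be thrown away.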
This lecture series is great. However it is usually impossible to hear the students' questions. If Dr. Weinberger could repeat the question before answering it or annotate the video with the student's question that would really help.
Hi Prof. Kilian, thanks for the great lecture! At 6:56, P(Y|X=x, D) = ∫ P(y|θ) P(θ|D) dθ: could you explain why `X=x` is omitted inside the integral in the true Bayesian approach? Shouldn't it be P(Y|X=x, D) = ∫ P(y|θ, X=x, D) P(θ|D, X=x) dθ?
Hi Kilian. I had a doubt; I don't know if this makes sense. How do we write P(D;theta) as P(y|x;theta) when we go on to derive the cost function for linear regression and logistic regression? Shouldn't P(D;theta) = P([x1,...,xN],[y1,...,yN]) or P({x1,y1},...,{xN,yN})? Also, after we find the theta for P(X,Y), we use the same theta for P(y|x;theta). Does this mean that P(X,Y) and P(Y|X) are parameterized by the same theta, or, since they are related by Bayes' theorem, can we just write it this way? This could be a very dumb doubt, but I am confused! Also, thank you for the amazing lectures!
Yes, good question! Maybe take a look at these notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html Basically, the idea is that you want to model P(y,x), but you really only model the conditional distribution P(y|x;theta), and you assume P(x) is some distribution that is given (but that you don't model) and, _importantly_, is independent of theta. E.g. P(x) is the distribution over all emails that are sent, and P(y|x) is the probability that a _given_ email x is spam or not spam (y). Then P(y,x;theta) = P(y|x;theta) P(x;theta) = P(y|x;theta) P(x), because x does not depend on theta. Now if you take the log and optimize the log-likelihood with respect to theta, you will realize that log P(x) is just an additive constant that has no influence on your choice of theta, so you can drop it. Hope this helps.
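To make the "P(x) is just a constant" point concrete, here is a small numerical sketch. The conditional model and all numbers are made up for illustration (not from the lecture): maximizing the conditional log-likelihood and the joint log-likelihood over a grid of theta values picks exactly the same theta, because the log P(x) term only shifts every value by the same constant.

```python
import numpy as np

# Toy data: x in {0,1}, y in {0,1}. Hypothetical model (my assumption):
# P(y=1|x;theta) = theta if x == 1 else 0.5, and P(x=1) = 0.3 is fixed
# by "mother nature", i.e. it does not depend on theta.
x = np.array([1, 1, 1, 0, 1, 0, 1])
y = np.array([1, 1, 0, 1, 1, 0, 1])
p_x1 = 0.3

def cond_loglik(theta):
    p = np.where(x == 1, theta, 0.5)        # P(y=1|x;theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def joint_loglik(theta):
    px = np.where(x == 1, p_x1, 1 - p_x1)   # P(x): no theta anywhere
    return cond_loglik(theta) + np.sum(np.log(px))

thetas = np.linspace(0.01, 0.99, 99)
best_cond = thetas[np.argmax([cond_loglik(t) for t in thetas])]
best_joint = thetas[np.argmax([joint_loglik(t) for t in thetas])]
print(best_cond, best_joint)  # identical: log P(x) is a constant shift
```

Among the x == 1 samples, 4 out of 5 have y = 1, so both searches land on theta = 0.8.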
@@kilianweinberger698 Thank you so much for the reply. I understand, but I would still like to clarify some things to bolster my understanding further. Can we look at it this way:

1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), is not dependent on theta, we can write it as p(x). So basically arg-maxing over the product of conditional probabilities will do the job. Is this right? Moreover, this raises another question in my mind: if the above understanding is correct, and the joint is dependent on theta, won't the marginal also be dependent on theta? I.e., the marginal P(X) derived by integrating P(X,Y;theta) over all values of Y, shouldn't that also be affected by theta? If we have P(X), we can simply find P(X=xi).

2. Could it be that since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?

3. Or am I looking at it in a very different and WRONG way? P(X) will be independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta) p(y)?

I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!
In generative learning, we model P(Y,X), so P(X) would also depend on theta. In discriminative learning, we only model P(y|X), and P(X) does *not* depend on theta. However, P(Y,X) still depends on theta, because P(Y,X)=P(Y|X;theta)P(X)
@@kilianweinberger698 Hi, I see theta as a parameter of a distribution, for example the mean and variance of a Gaussian. If you have time, could you explain what theta is in this case (the spam email classifier)?
At 15:00, when you provided a prior (1H 1T), the MLE started out very randomly (all over the place) and then slowly converged to 0.7, which was not the case when there was no prior. Why would having a prior impact the MLE the way it did?
Any plans to teach this course again? I might just wait to take it at Cornell if you plan to in the next year or so... Until then, thank you so much! This video series is a gem :)
Yes, but that also holds for words. Typically raw pixels are a bad representation of images, if you don't extract features (e.g. as in a convolutional neural network). If you were to extract SIFT or HOG features from the image and use them as input, the NB assumption could be more valid.
How did we get this equation P(y|x) = ∫ p(y|theta) p(theta|D) dtheta? Shouldn't it be P(y|x, D) = ∫ p(y|x, D, theta) p(theta|D) dtheta?
I don't know why, but my comment posted as a reply under my original comment is not visible when watching this video without signing in, so I am reposting it in the main comment section as well. Thank you so much for the reply, Kilian. I understand, but I would still like to clarify some things to bolster my understanding further. Can we look at it this way:

1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), is not dependent on theta, we can write it as p(x). So basically arg-maxing over the product of conditional probabilities will do the job. Is this right? Moreover, this raises another question in my mind: if the above understanding is correct, and the joint is dependent on theta, won't the marginal also be dependent on theta? I.e., the marginal P(X) derived by integrating P(X,Y;theta) over all values of Y, shouldn't that also be affected by theta? If we have P(X), we can simply find P(X=xi).

2. Could it be that since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?

3. Or am I looking at it in a very different and WRONG way? P(X) will be independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta) p(y)?

I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!
Looks like YouTube flagged your comment as spam :-/ I think your confusion comes from what exactly the modeling assumptions are. Basically, your (2) is right: in discriminative learning, we assume that P(X) is given by mother nature, but we model P(Y|X;theta) with some function (which depends on theta). So when we try to estimate theta, we maximize the likelihood P(y_1,...,y_n | x_1,...,x_n; theta). This can then be factorized, because the different (x_i, y_i) pairs are independently sampled. In generative learning, things are different: there we model P(Y,X;theta').
P(Data) is not all of my data, right? It is the data I am interested in, e.g. getting heads (not heads and tails). Does this apply to real datasets, like a house price prediction dataset?
I have a doubt: I thought that if you had P(X=x | Y=y), and you said that X and Y were independent variables, then P(X=x | Y=y) = P(X=x). (I found this on Wikipedia too: en.wikipedia.org/wiki/Conditional_probability_distribution , search for "Relation to independence".) So my question is: why did you write P(X=x | Y=y) = prod from alpha=1 to d of P(X_alpha = x_alpha | Y=y)? Maybe it's because inside P(X=x | Y=y), X and x are vectors, so if you factored it out you would write P(X_1 = x_1, X_2 = x_2, ..., X_d = x_d | Y=y). But then why can't I write prod from alpha=1 to d of P(X_alpha = x_alpha) instead of prod from alpha=1 to d of P(X_alpha = x_alpha | Y=y)?
Oh, the assumption is not that X and Y are independent, but that the different dimensions of X are independent *given* Y. So call the first dimension of X X1 and the second X2. Then we get P(X|Y) = P(X1,X2|Y) = P(X1|Y) P(X2|Y).