This lecture series is such a great service to the community! The pace is perfect and the practical examples are on point, not lost in the theory. Many many thanks.
Hi Kilian. I had a doubt. I've often heard people talk about making your dataset balanced, which basically means (taking the spam classification problem as an example) that the number of spam instances should be close to the number of non-spam instances. But when you're applying algorithms like naive Bayes, which take the prior into consideration, shouldn't we just leave the data unbalanced, since that actually captures how emails appear in a real inbox? Also, thank you for the amazing lectures!
The issue with highly imbalanced data is that it is just tricky to optimize the classifier. E.g. if 99% of your data is class +1 and 1% class -1, then just predicting +1 for all samples gives you 99% accuracy, but you haven't actually learned anything. My typical approach is to re-balance the data set while still keeping all samples, by assigning smaller weights to the more common classes, so that the weights add up to be the same for all classes.
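A minimal sketch of that re-weighting scheme (using NumPy and made-up labels; the exact inverse-frequency formula is my assumption of what "weights add up to be the same for all classes" looks like in code, not something stated in the reply):

```python
import numpy as np

# Hypothetical imbalanced labels: 99% class +1, 1% class -1.
y = np.array([+1] * 99 + [-1] * 1)

# Give each sample a weight inversely proportional to its class frequency,
# so the total weight per class comes out the same.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weight = np.array([class_weight[label] for label in y])

# Total weight is now equal across classes.
for c in classes:
    print(c, sample_weight[y == c].sum())  # both classes sum to 50.0
```

Most libraries accept such per-sample weights directly (e.g. a `sample_weight` argument to a fit routine), so no samples need to be thrown away.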
This lecture series is great. However it is usually impossible to hear the students' questions. If Dr. Weinberger could repeat the question before answering it or annotate the video with the student's question that would really help.
Hi Prof. Kilian, thanks for the great lecture! At 6:56, P(Y|X=x, D) = ∫ P(y|θ) P(θ|D) dθ: could you explain why `X=x` is omitted inside the integral in the true Bayesian approach? Shouldn't it be P(Y|X=x, D) = ∫ P(y|θ, X=x, D) P(θ|D, X=x) dθ?
Hi Kilian. I had a doubt; I don't know if this makes sense. How do we write P(D;theta) as P(y|x;theta) when we go on to derive the cost function for linear regression and logistic regression? Shouldn't P(D;theta) = P([x1,...,xN],[y1,...,yN]) or P({x1,y1},...,{xN,yN})? Also, after we find the theta for P(X,Y), we use the same theta for P(y|x;theta). Does this mean that P(X,Y) and P(Y|X) are parameterized by the same theta, or, since they are related by Bayes' theorem, can we just write it this way? This could be a very dumb doubt, but I am confused! Also, thank you for the amazing lectures!
Yes, good question! Maybe take a look at these notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html Basically, the idea is that you want to model P(y,x), but you really only model the conditional distribution P(y|x;theta), and you assume P(x) is some distribution that is given (but that you don't model) and, _importantly_, is independent of theta. E.g. P(x) is the distribution over all emails that are sent, and P(y|x) is the probability that a _given_ email x is spam or not spam (y). Then P(y,x;theta) = P(y|x;theta) P(x;theta) = P(y|x;theta) P(x), because x does not depend on theta. Now if you take the log and optimize the log-likelihood with respect to theta, you will realize that log P(x) is just an additive constant that has no influence on your choice of theta, so you can drop it. Hope this helps.
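To make the "P(x) is just a constant" point concrete, here is a small numerical sketch. The conditional model and all numbers are made up for illustration (not from the lecture): maximizing the conditional log-likelihood and the joint log-likelihood over a grid of theta values picks exactly the same theta, because the log P(x) term only shifts every value by the same constant.

```python
import numpy as np

# Toy data: x in {0,1}, y in {0,1}. Hypothetical model (my assumption):
# P(y=1|x;theta) = theta if x == 1 else 0.5, and P(x=1) = 0.3 is fixed
# by "mother nature", i.e. it does not depend on theta.
x = np.array([1, 1, 1, 0, 1, 0, 1])
y = np.array([1, 1, 0, 1, 1, 0, 1])
p_x1 = 0.3

def cond_loglik(theta):
    p = np.where(x == 1, theta, 0.5)        # P(y=1|x;theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def joint_loglik(theta):
    px = np.where(x == 1, p_x1, 1 - p_x1)   # P(x): no theta anywhere
    return cond_loglik(theta) + np.sum(np.log(px))

thetas = np.linspace(0.01, 0.99, 99)
best_cond = thetas[np.argmax([cond_loglik(t) for t in thetas])]
best_joint = thetas[np.argmax([joint_loglik(t) for t in thetas])]
print(best_cond, best_joint)  # identical: log P(x) is a constant shift
```

Among the x == 1 samples, 4 out of 5 have y = 1, so both searches land on theta = 0.8.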
@@kilianweinberger698 Thank you so much for the reply. I understand, but I would still like to clarify some things to bolster my understanding further. Can we look at it this way:

1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), is not dependent on theta, we can write it as p(x). So basically arg-maxing over the product of conditional probabilities will do the job. Is this right? Moreover, this raises another question in my mind: if the above understanding is correct, and the joint is dependent on theta, won't the marginal also be dependent on theta? I.e., the marginal P(X) derived by integrating P(X,Y;theta) over all values of Y, shouldn't that also be affected by theta? If we have P(X), we can simply find P(X=xi).

2. Could it be that since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?

3. Or am I looking at it in a very different and WRONG way? P(X) will be independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta) p(y)?

I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!
In generative learning, we model P(Y,X), so P(X) would also depend on theta. In discriminative learning, we only model P(y|X), and P(X) does *not* depend on theta. However, P(Y,X) still depends on theta, because P(Y,X)=P(Y|X;theta)P(X)
@@kilianweinberger698 Hi, I see theta as a parameter of a distribution, for example the mean and variance of a Gaussian. If you have time, could you explain what theta is in this case (the spam email classifier)?
At 15:00, when you provided a prior (1H 1T), the MLE started out very randomly (all over the place) and then slowly converged to 0.7, which was not the case when there was no prior. Why would having a prior impact the MLE the way it did?
Any plans to teach this course again? I might just wait to take it at Cornell if you plan to in the next year or so... Until then, thank you so much! This video series is a gem :)
Yes, but that also holds for words. Typically raw pixels are a bad representation of images, if you don't extract features (e.g. as in a convolutional neural network). If you were to extract SIFT or HOG features from the image and use them as input, the NB assumption could be more valid.
How did we get this equation P(y|x) = ∫ p(y|theta) p(theta|D) dtheta? Shouldn't it be P(y|x, D) = ∫ p(y|x, D, theta) p(theta|D) dtheta?
I don't know why, but my comment posted as a reply under my original comment is not visible when watching this video without signing in, so I am reposting it in the main comment section as well. Thank you so much for the reply, Kilian. I understand, but I would still like to clarify some things to bolster my understanding further. Can we look at it this way:

1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), is not dependent on theta, we can write it as p(x). So basically arg-maxing over the product of conditional probabilities will do the job. Is this right? Moreover, this raises another question in my mind: if the above understanding is correct, and the joint is dependent on theta, won't the marginal also be dependent on theta? I.e., the marginal P(X) derived by integrating P(X,Y;theta) over all values of Y, shouldn't that also be affected by theta? If we have P(X), we can simply find P(X=xi).

2. Could it be that since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?

3. Or am I looking at it in a very different and WRONG way? P(X) will be independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta) p(y)?

I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!
Looks like YouTube flagged your comment as spam :-/ I think your confusion comes from what exactly the modeling assumptions are. Basically, your (2) is right: in discriminative learning, we assume that P(X) is given by mother nature, but we model P(Y|X;theta) with some function (which depends on theta). So when we try to estimate theta, we maximize the likelihood P(y_1,...,y_n | x_1,...,x_n; theta). This can then be factorized, because the different (x_i, y_i) pairs are independently sampled. In generative learning, things are different: there we model P(Y,X;theta').
P(Data) is not all of my data, right? It is the data I am interested in, e.g. getting heads (not heads and tails). Does this apply to real datasets, like a house price prediction dataset?
I have a doubt: I thought that if you had P(X=x | Y=y), and you said that X and Y were independent variables, then P(X=x | Y=y) = P(X=x). (I found this on Wikipedia too: en.wikipedia.org/wiki/Conditional_probability_distribution , search for "Relation to independence".) So my question is: why did you write P(X=x | Y=y) = prod from alpha=1 to d of P(X_alpha = x_alpha | Y=y)? Maybe it's because inside P(X=x | Y=y), X and x are vectors, so if you factored it out you would write P(X_1 = x_1, X_2 = x_2, ..., X_d = x_d | Y=y). But then why can't I write prod from alpha=1 to d of P(X_alpha = x_alpha) instead of prod from alpha=1 to d of P(X_alpha = x_alpha | Y=y)?
Oh, the assumption is not that X and Y are independent, but that the different dimensions of X are independent *given* Y. So call the first dimension of X X1 and the second X2. Then we get P(X|Y) = P(X1,X2|Y) = P(X1|Y) P(X2|Y).