NOTE: The StatQuest LDA Study Guide is available! statquest.gumroad.com Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Hi Josh, Love your content. It has helped me learn a lot and grow. You are doing awesome work. Please continue to do so. I wanted to support you, but unfortunately your PayPal link seems to be broken. Please update it.
The funny thing is, so many materials from this channel are for university students (like me), but he keeps treating us like kindergarten children. Haha, it feels like I'll never grow up by watching your videos, sir! QUADRO BAAM, SIR. THIS WORLD HAS GOTTEN TOO SERIOUS. THANK YOU FOR BRINGING BACK THE JOY
Just spent hours so confused, watching my lectures where the professor used only lin alg and not a single picture. Watched this video and understood it right away. Thank you so much for what you do!
Awesome! Even I get it and love it! I'm going to share one of your StatQuest posts as an example of why simple explanations in everyday language are far superior to using academic jargon in complex ways to argue a point. Also, it's a great example of how to develop an argument. You've created something here that's useful beyond statistics! Three cheers for the liberal arts education!!!! Three cheers for StatQuest!!
This was honestly helpful. I am an aspiring behavioral geneticist (aspiring because I am still a biotechnology undergraduate) with really shaky fundamentals in math, especially statistics. Discovering your YouTube channel has been a treasure for me!
You, sir, are a life saver. Now, for every complicated machine learning topic, I look for your explanation, or at least wonder how you would have approached it. Thank you, really.
10/10 intro song 10/10 explanation using PCA, I can reduce these two ratings to just one: 10/10 is enough to rate the whole video using LDA, the RU-vid chapters feature maximizes the separation between these 2 major components (intro and explanation) of the video
Hello Josh. As always, thank you for your super intuitive videos. I won't survive college without you. I do have an unanswered conundrum about this video, however. For Linear Discriminant Analysis, shouldn't there be at least as many predictors as the number of clusters? Here's why. Say p=1 and I have 2 clusters. In this case, there is nothing I can do to further optimize the class separations. The points as they are on the line already maximize the Fisher criterion (between-class scatter / within-class scatter). While I do not have a second predictor axis to begin with, even if I were to apply a linear transformation on the line to find a new line to re-project the data on, it would only make the means closer together. Extending this reasoning to the 2D case where you used gene X and gene Y as predictors and 3 classes: if the 3 classes exist on a 2D plane, there is nothing we can do to further optimize the separation of the means of the 3 classes, because re-projecting the points onto a new tilted 2D plane will most likely reduce the distances between the means. Now, if each scatter lay perfectly vertically, such that as gene Y goes up the classes are separated distinctly, then we could re-project the points onto a new line (one parallel to the invisible vertical class-separation line) to further minimize each class's scatter, but this kind of case is very rare. Given my reasoning, my intuition is that an implicit assumption for LDA is that there need to be at least as many predictors as the number of classes to separate. Is my intuition valid?
Hi Josh, Helpful to understand the differences between PCA and LDA and how LDA actually works internally. You're indeed making life easier with visual demonstrations for students like me :) God bless and Thank you!
Hi, Joshua. Thank you for your videos, they're really helpful. I have a question: when you use LDA to categorize n categories, does it mean that you need (n-1) axes to separate the points? In that case, how can I visualize them?
Hi, Josh is it possible to make your account accept gifts? I feel like I owe you a lot, I failed a couple of interviews in the past in ML theory but binge-watched videos on your channel within 2 months and have landed multiple offers.
"But what if we used data from 10k genes?" "Suddenly, being able to create 2 axes that maximize the separation of three categories is 'super cool'." Well played, StatQuest, well played!
The song at the beginning made my day, even though I ended up on the wrong linear discriminant analysis tutorial for my data science course. Just awesome. Love it a lot. We need more and more funny teachers like you.
What you say is true - however, to quote the Wikipedia article on LDA: "The terms Fisher's linear discriminant and LDA are often used interchangeably" en.wikipedia.org/wiki/Linear_discriminant_analysis
Another excellent video, just as great as the one on PCA. I read a professor's view on most of the models and algorithms in ML, in which he recommended understanding the concepts well, so that we know where to apply them, and not worrying too much about the actual computation at that stage. The great thing about your videos is that you explain the concepts very well.
@@Sachin-vr4ms I'm sorry that it is confusing, but let me try to explain: At 9:46, imagine rotating the black line a bunch of times, a few degrees at a time, and using the equation shown at 8:55 to calculate a value at each step. The rotation that gives us the largest value (i.e., there is a relatively large distance between the means and a relatively small amount of scatter in both clusters) is the rotation that we select. If we have 3 categories, then we rotate an "x/y-axis" a bunch of times, a few degrees each time, and, for each rotation, calculate the distances from the category means to the central point and the scatter within each category, and then calculate the ratio of the sum of the squared distances to the sum of the scatter. Again, the rotation with the largest value is the one that we use. Does that help?
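(For anyone who wants to see that rotate-and-score idea in code, here is a minimal numpy sketch. The two clusters, the angles, and the brute-force search are all made up for illustration; real LDA finds the best direction directly with linear algebra rather than by trying every rotation.)

```python
import numpy as np

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))  # made-up cluster
cluster_b = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(20, 2))  # made-up cluster

best_angle, best_score = 0.0, -np.inf
for angle in np.linspace(0.0, np.pi, 180, endpoint=False):
    direction = np.array([np.cos(angle), np.sin(angle)])  # the rotated 1-D axis
    proj_a = cluster_a @ direction                         # project each point onto it
    proj_b = cluster_b @ direction
    mean_diff_sq = (proj_a.mean() - proj_b.mean()) ** 2    # (distance between means)^2
    scatter = ((proj_a - proj_a.mean()) ** 2).sum() + ((proj_b - proj_b.mean()) ** 2).sum()
    score = mean_diff_sq / scatter                         # the ratio from 8:55
    if score > best_score:
        best_angle, best_score = angle, score

print(f"best rotation: {np.degrees(best_angle):.1f} degrees, score: {best_score:.3f}")
```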
What the heck is a gene transcript? I really hate it when these things are mentioned casually and the listener is assumed to already know them. NO, I don't know what a gene transcript is. Now I have to pause the video and google gene transcripts. Ugghhh
Why does LDA here seem totally different from how LDA is presented in the ISLR textbook? In ISLR, we simply assume P(X | Y = k) is Gaussian for all k classes. Then we literally just plug in estimates into Bayes' theorem. So now we have an estimate for P(Y = k | X), which is the desired probability for classifying a feature vector X into a class k. Then we take the log of Bayes' theorem, with our estimates, and we get a linear discriminant function that is used for classifying. The way it's presented in ISLR is similar to how it's presented in this lecture: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-_m7TMkzZzus.html I just don't see why the same topic of LDA seems vastly different in this video? Edit: Apparently there are different "discriminant rules". The one I'm referring to is called the Bayes discriminant rule. The type presented here is called Fisher's linear discriminant rule.
You seem to have figured it out. This video describes LDA as Fisher originally presented it way back in the day - as an attempt to maximize variance between groups and minimize variance within groups - not the Bayesian approach.
Yep, thanks. This also clears up why Wikipedia says that LDA is similar to ANOVA. At first I didn't see how the two were similar, but it makes sense now.
The video shows how LDA reduces dimensions, and we can clearly see a newly constructed axis (like with PCA) which, in LDA analysis, maximizes the separation. That was very clear! How does this line relate to the line that actually separates the two categories on the original XY plane, which you refer to at 2:48 in your video? After all, it is that line (do we call it a discriminant function?) which is usually used to show the separation. The latter is intuitively understood as a separation border; the former explains how we reduced the dimensions. What is the link between the two?
That's a good question. There are a few options for coming up with a threshold that allows you to classify new observations into one of the categories in your training dataset. The simplest is to transform the new observation using the transformation created from the training dataset, and then measure the Euclidean distance between the new observation and the center of each category. The category whose center is closest to the new observation is the one we assign to it.
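(A rough sketch of that "closest class center" idea, using scikit-learn's LDA for the transformation; the gene values and labels below are invented purely for illustration.)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Tiny made-up training set: two genes measured for two categories, "wt" and "ko"
X_train = np.array([[10.1, 3.2], [9.8, 2.9], [10.4, 3.5],
                    [4.1, 7.8], [3.9, 8.2], [4.4, 7.5]])
y_train = np.array(["wt", "wt", "wt", "ko", "ko", "ko"])

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
train_lda = lda.transform(X_train)  # training data in the new LDA space

# Center of each category in the LDA space
centers = {label: train_lda[y_train == label].mean(axis=0)
           for label in np.unique(y_train)}

# Transform a new observation with the SAME transformation, then pick the nearest center
new_obs = lda.transform(np.array([[9.5, 3.0]]))
predicted = min(centers, key=lambda label: np.linalg.norm(new_obs - centers[label]))
print(predicted)  # lands near the "wt" center, so it is classified as "wt"
```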
Regarding LDA for 3 categories, how do you maximize the distance between the central point and each category's central point? These points are always the same, aren't they? So how do you maximize something that does not change?
Thanks for the video! It really helps! May I check: are PCA and LDA similar in the sense that they both reduce dimensions, but PCA is unsupervised learning while LDA is supervised learning?
Hey, thanks for an amazing explanation! I have a question: in the video you mentioned the mean difference is squared in order to prevent negative values, but you could just as well use the absolute value; is there a reason to prefer squaring?
I know in any dimension this is an NP-complete problem, because it has the same cardinality of solution space as the subset sum problem: 2^N, where N = the number of data points; therefore 2^N is the number of all subsets one must check, i.e., project all possible subsets down to the new axis. Of course, once on a single ordered 1D axis, since the data points will now have a fixed order relative to each other, one has at most N places to make the separation.
Great video! I initially couldn't understand LDA looking at the math equations elsewhere, but when I came across this video, I was able to understand LDA very well. Thanks for the effort.
The number of dimensions is determined by the number of categories, not the amount of data. If you have 2 categories, you get 1 LD. If you have 3 categories, you get 2 LDs. If you have more categories, you get num. categories - 1 LDs.
Really great videos, saved me from my data science classes. I'm applying for graduate program at UNC, hope I can have the opportunity to meet the content creators sometime in the future.
Amazing. Thank you for this excellent video. Explained everything super clearly to me in a super concise manner without all the academic jargon getting in the way.
Can you provide an example where high variance in the data from PCA is more important than high ‘separability’ of the data from LDA for a classification problem?
I often use PCA to answer the question "are the data what I think they are, or were they mislabeled". PCA is useful when you want to use an "objective" method to see how your data clusters. For example, when I'm worried that I mislabeled my data, PCA would find clusters without using the labels. However, LDA requires knowing how I want to separate things - so I can't use it to determine if I labeled things correctly. Does that make sense?
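(One way to see the difference in code: with scikit-learn, PCA is fit on the measurements alone, while LDA cannot run without the labels. The data below is simulated just to make the point.)

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 5)),    # simulated measurements, group 1
               rng.normal(3, 1, (15, 5))])   # simulated measurements, group 2
y = np.array([0] * 15 + [1] * 15)            # labels we might not fully trust

pca_scores = PCA(n_components=2).fit_transform(X)              # unsupervised: uses X only
lda_scores = LinearDiscriminantAnalysis().fit_transform(X, y)  # supervised: needs y

# If the PCA scores cluster the way the labels say they should, the labeling looks sane;
# LDA can't play that role, because it was handed the labels up front.
```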
@@statquest Would you consider Cluster analysis as another method to check on your labeling? I do that when I want to validate the manual labeling of my project data. I would love to hear the views from a real life practitioner like yourself.
Thank you for the explanation, it's pretty clear. But there is something I don't understand. When you have 3 categories, LDA creates 2 axes to separate the data. But what if you have 4 categories or more? How many axes will LDA create to separate the data?
LDA always creates one axis fewer than the total number of categories. So if you have 30 categories, you'll get 29 axes. That said... the axes are given in order of importance. So the first axis is the most important one. The second axis is the second most important one. etc. etc. etc. So it is often the case that even though LDA gives you 29 axes (if you have 30 categories), you only need the first 2 or 3.
@@statquest So the first axis is the most important one because this axis separates the categories the best? And the second axis is the second most important one because this axis separates the categories the second best?
@@alialsaady5 Yes. Technically, the first axis accounts for the most variation among the categories. The second axis accounts for the second most variation among the categories. etc. etc. etc.
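(A small scikit-learn sketch of both points above, using simulated data: LDA returns at most one axis fewer than the number of categories, and explained_variance_ratio_ reports how much of the variation among the categories each axis accounts for, in decreasing order.)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
n_categories, n_genes = 4, 50
category_means = rng.normal(0, 3, size=(n_categories, n_genes))        # simulated means
X = np.vstack([rng.normal(category_means[k], 1.0, size=(25, n_genes))
               for k in range(n_categories)])
y = np.repeat(np.arange(n_categories), 25)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.transform(X).shape[1])       # 3 axes: at most min(n_categories - 1, n_genes)
print(lda.explained_variance_ratio_)   # sorted largest to smallest, so the first axis
                                       # (or two) usually does most of the work
```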
@@alialsaady5 You can cite the videos. I can't remember exactly where I got all of the information for this video - probably wikipedia and The Elements of Statistical Learning - but I also read the original manuscripts and I scour the internet for examples and derivations.
Great video! Just wanted to point out that LDA is a classifier, which involves a few more steps than the procedure described here, such as the assumption that the data is Gaussian. The procedure described here is only the feature extraction/dimensionality reduction phase of LDA.
You are correct! I made this video before I was aware that people had adapted LDA for classification. Technically, we are describing "Fisher's Linear Discriminant". That said, using LDA for classification is robust to violations of the Gaussian assumptions. For more details, see: sebastianraschka.com/Articles/2014_python_lda.html
StatQuest with Josh Starmer That said, I must admit I am having a really hard time understanding how the Fisherian and Bayesian approaches lead to the same conclusion via completely different routes. If you have any source on that, it would be of enormous help for my sanity haha
On the one hand, we try to maximize the distance between the two groups' means; on the other hand, we need to make the variance within each group as small as possible. This separates the two sets of data as much as possible. Do I understand it right?
Very nice explanation. The only issue I have is that the first and second axes for both PCA and LDA are not Gene 1 and Gene 2. They are instead some linear combination of Gene 1 and Gene 2. So in a 10,000-gene space, you will get some combination of the 10,000 genes that clearly separates the two groups. For example, LD1 could be one third of Gene 12, plus one third of Gene 45, plus one sixth of Gene 456, plus one sixth of Gene 1,234.
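(If you fit LDA with scikit-learn, you can inspect exactly that linear combination: the projection weights are stored in scalings_. The data below is invented just so there is something to fit, with column i standing in for "gene i".)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 6)),   # 6 "genes", group 1 (simulated)
               rng.normal(2, 1, (20, 6))])  # 6 "genes", group 2 (simulated)
y = np.array([0] * 20 + [1] * 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
weights = lda.scalings_[:, 0]   # one weight per gene; LD1 is this weighted sum
for gene, w in enumerate(weights):  # of the (centered) gene values
    print(f"gene {gene}: weight {w:+.3f}")
```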
For a two-class problem, we can go with a single axis. A three-class problem has to go with at least 2 axes (three mean points), and similarly a four-class problem has to have at least three axes. Is my understanding right?
@@statquest This video was amazing! I have a similar question, regarding the space in which the newly created axes lie: if we have 2 genes and, say, 4 categorical response levels (e.g. drug works, works a little, harms a little, only harms), then LDA would give us 3 axes, because the central points of the groups together define a tetrahedron (thus 3 dimensions). Does this not mean that we traded a 2D graph for a 3D graph, because we now have 3 new axes, all perpendicular to each other? And by extension, what if we have >4 levels and the new axes become impossible to display? How does that benefit us in separating the data?
@@vinceb8041 Just because you can't draw a picture of the result doesn't mean that it isn't useful. You can do LDA with 5 categories, get a 4-dimensional result (which you can't draw on paper), and still use it to classify new samples. Just apply the same transformation to the new samples and classify them using k-nearest neighbors. Even though you can't see it, you can still calculate the Euclidean distances and determine which previously classified samples are closest to the new samples. Does that make sense?
@@statquest Thanks for the reply, I think I understand what you mean. So the LDA doesn't necessarily produce an output you can visualize but the output still has the useful properties we need for further analysis?
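(A sketch of the workflow described in the reply above: project into the LDA space, even when it has too many dimensions to draw, and then let k-nearest neighbors classify new samples there. The number of categories and the data are made up.)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
n_categories, n_features = 5, 20
category_means = rng.normal(0, 4, size=(n_categories, n_features))     # simulated means
X = np.vstack([rng.normal(category_means[k], 1.0, size=(30, n_features))
               for k in range(n_categories)])
y = np.repeat(np.arange(n_categories), 30)

lda = LinearDiscriminantAnalysis().fit(X, y)
X_lda = lda.transform(X)                       # 4 dimensions: can't draw it, can still use it
knn = KNeighborsClassifier(n_neighbors=5).fit(X_lda, y)

new_sample = rng.normal(category_means[2], 1.0, size=(1, n_features))
print(knn.predict(lda.transform(new_sample)))  # expected: category 2
```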
Hi, thank you for your explanation. It seems like LDA as you have explained it is different from the LDA explanation I found in the following video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-IMfLXEOksGc.html. I understand the math behind the video, but I am wondering how and why your explanation of LDA is equivalent to that video's explanation of LDA? From my understanding, it seems like the math suggests LDA looks at pre-labeled data, calculates appropriate means and covariance(s) for each class label, and draws decision boundaries based on which class label for a given point would give the highest likelihood function value. How does that, as you said, maximize (\mu_1 - \mu_2)^2 / (s_1^2 + s_2^2)? Also, what would the function we are trying to maximize look like when we have more than two labels? Could you also refer me to a paper that I may find helpful? I am comfortable with reading rigorous math. Thanks so much
This video covers the dimension reduction method that is also called "Fisher's linear discriminant". The LDA in the other video is a generalization of this method. For more details see en.wikipedia.org/wiki/Linear_discriminant_analysis
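(For readers trying to connect the two presentations: the two-class ratio asked about above is usually generalized to more classes as a ratio of between-class to within-class scatter; this is the standard textbook form of Fisher's criterion, not anything specific to either video.)

```latex
\text{Two classes, projected onto a direction } w:\quad
  J(w) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}

\text{K classes, projected onto the columns of } W:\quad
  J(W) = \frac{\left| W^{T} S_B W \right|}{\left| W^{T} S_W W \right|},
\qquad
  S_B = \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{T},
\quad
  S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{T}
```

The columns of W (at most K - 1 of them) are the discriminant axes, which matches the "number of categories minus 1" rule discussed in the other replies.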