Using Linear Models for t-tests and ANOVA, Clearly Explained!!!

Views: 404K

This StatQuest shows how the methods used to determine if a linear regression is statistically significant (covered in part 1) can be applied to t-tests and ANOVA. It also introduces the concept of a "design matrix". Part 1 of this series on GLMs (general linear models) is here: • Linear Regression, Cle...
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
RU-vid Membership: / @statquest
...buy my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
Correction:
7:40 There should be parentheses around the SS differences in the F-statistics to have correct equations; (SS(mean)-SS(fit))/(p_fit-p_mean)
#statquest #regression

Published: Aug 6, 2017

Comments: 397

@statquest 4 years ago

Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

@falaksingla6242 2 years ago

Hi Josh, Love your content. Has helped me to learn a lot & grow. You are doing an awesome work. Please continue to do so. Wanted to support you but unfortunately your Paypal link seems to be dysfunctional. Please update it.

@elnurazhalieva1262 4 years ago

Rarely do I recommend a YouTube channel to someone, but this channel is a must-watch!

@statquest 4 years ago

Thank you! :)

@redaaitouahmed8250 5 years ago

You're making the life of a student so much easier and happier ... Thankkkkk youuuuuu !!!

@statquest 5 years ago

You're welcome!!! :)

@luig2121 1 year ago

I literally watch your videos as if I'm watching TV. I don't know how you've pulled this off but you are incredible

@statquest 1 year ago

Wow, thank you!

@hadihadiyar1185 4 years ago

Hi, I got my master's in Epidemiology, was trying to review statistics, and found your channel. You are awesome, you really make statistics easy to understand. TRIPLE BAM for you!

@statquest 4 years ago

Thank you very much! :)

@rwei2049 6 years ago

This is the clearest explanation of design matrices I've ever seen!! Thank you soooo much Joshua!

@Russet_Mantle 3 years ago

This is a really smooth transition from linear models to ANOVA, which is sadly not covered in many stats textbooks.

@statquest 3 years ago

Thanks!

@justind6931 5 years ago

It actually took me a while to realize that the F-statistic shown in this video is equivalent to the standard t-statistic. Great vid!

@statquest 5 years ago

Thanks!!! I know, it's a little weird to look at a t-test from this perspective, but it shows how the F-statistic is a generalization of T-statistics. (Here's a cool hint - just like the F-statistic is a generalization of T-statistics, Chi-square statistics are a generalization of normal statistics....)

@PunmasterSTP 3 months ago

I remember learning about t-tests well before linear regression, but it's cool seeing things applied in a different way, especially while going into the deeper concepts. This whole playlist is a stats and machine-learning goldmine!

@statquest 3 months ago

BAM! Yes, usually t-tests are taught before linear regression, but I like teaching them in this order (regression first) since the extension of a t-test into ANOVA is way more obvious.

@PunmasterSTP 3 months ago

@statquest That sounds like a good plan.

@clarasavary6265 5 years ago

Thank you very much for all your clear explanations. It's a real pleasure to listen to you and learn more about Statistics !

@statquest 5 years ago

You're welcome! I'm glad to hear you think the videos are helpful. :)

@amorrismusic 3 years ago

Never in my life has learning math been easier. Excellent work Josh!

@statquest 3 years ago

Thank you very much!!! :)

@charlotteiosson6235 6 years ago

These videos are brilliant! I'm completing my PhD and there really isn't enough statistics support available which is as accessible as these videos (and considering we're meant to be doing research, that's not really good enough!) - thanks!

@lilmoesk899 6 years ago

Thanks for the video. I'll have to watch this one a couple more times to fully digest it. It's the first time I've heard of a design matrix, so I'll have to spend some time looking into that.

@howardip7965 4 years ago

Your videos are very well-prepared and informative. Great teaching materials. You are so generous. Thanks a million.

@statquest 4 years ago

Thank you very much! :)

@zahrahadavand2290 2 years ago

Awesome, there's nothing that can't be understood when you explain it, thanks a millionnnn

@statquest 2 years ago

Thank you very much! :)

@DamianEQuijanoA 5 years ago

I speak very little English, but your teaching methodology (it is very professional) is magnificent. Even though it is in English, I manage to understand it better than all the statistics classes in Spanish. You make an enormous effort to make your classes intuitive and easy to understand for people who are not experts in statistics. I congratulate you.

@statquest 5 years ago

Thank you very much!!!!

@xxMissCaprIce 1 year ago

I think you might have just saved my life. This is so clearly explained, thank you!

@statquest 1 year ago

Glad it helped!

@justalittleguy733 1 year ago

i am seriously failing my beginner stats course because try as i might the lectures are quite literally incomprehensible. i owe you my life!!! thank you for these amazing videos -- i feel like this is the first time ALL semester I am understanding something!

@statquest 1 year ago

HOORAY! I'm glad the videos are helpful.

@alexanderkononov4068 5 years ago

Maaan! I found you, I found glm, finally! Thanks!

@statquest 5 years ago

Hooray! :)

@baharehbehrooziasl9517 2 months ago

The interesting thing about this video is that it taught me something that I hadn't noticed I didn't know!

@statquest 2 months ago

bam! :)

@user-bz7fj1fk2m 3 years ago

You are blessed and STAY BLESSED. You significantly changed my life with STAT!!!

@statquest 3 years ago

Thank you very much! :)

@SreenikethanI 26 days ago

StatBlessed :D

@Sn-nw6zb 5 years ago

Wow, this is a smart way to explain the ANOVA test. It looked so complicated at first, but now it looks straightforward after seeing how it resembles linear regression. Great video!!!

@statquest 5 years ago

Hooray!!! I'm so glad you like this video - it's one of my all time favorites. :)

@statquest 5 years ago

Hooray! :)

@_Chafia 5 years ago

I hope you will have the time to answer in just a few words, please! R-squared tells us how useful x is for predicting y, so how do we use it in the case of a t-test or ANOVA? We just talk about F and p. Can we say it explains some % of the variance between treatments, or is it useless? Thank you so much, Mr. Starmer

@statquest 5 years ago

This is a great question. The traditional way to teach and perform t-tests (and ANOVA) only results in 't' or 'F' statistics and a p-value - no R-squared. However, as you see in this video, it's easy to also report R-squared - you just have to want to do it. t-tests and ANOVA are just like regression, and R-squared tells you the same thing - it gives you an estimate of the magnitude of the difference. The p-value just tells you whether it is significant. If you did a t-test and got a small p-value, but also a small R-squared, then you could easily deduce that there's not a huge difference between the two groups (even if it is statistically significant). In contrast, if you did a t-test and got a small p-value and a large R-squared, then you would know that there's a big difference between the two groups. So we can see that R-squared is useful even for the t-test. I suspect that one reason presenting R-squared with t-test results is rare is that, with t-tests, it is easy and very common to plot the data - so people will show you their data and give you the p-value. Seeing the data is sort of like a "visual R-squared" - you can see if the data are very close to each other or far apart.

@_Chafia 5 years ago

THANK YOU SO MUCH.... YOU ARE VERY KIND SIR. If you allow, I'll summarize: "significant p-value + R-squared" = how big the difference is. Really GREAT! Thanks again & Good luck!
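
The R-squared described in the reply above can be computed for a two-group comparison directly from SS(mean) and SS(fit). This is only a minimal sketch: the data are made up, the function name `r_squared` is hypothetical, and it assumes the usual definition R^2 = (SS(mean) - SS(fit)) / SS(mean) from the regression video.

```python
from statistics import mean

def r_squared(group_a, group_b):
    """R^2 = (SS(mean) - SS(fit)) / SS(mean) for a two-group comparison."""
    everything = group_a + group_b
    overall = mean(everything)
    # SS(mean): squared distances to the single overall mean.
    ss_mean = sum((v - overall) ** 2 for v in everything)
    # SS(fit): squared distances to each measurement's own group mean.
    ss_fit = sum((v - mean(g)) ** 2 for g in (group_a, group_b) for v in g)
    return (ss_mean - ss_fit) / ss_mean

# Made-up data: groups far apart relative to their spread.
far_apart = r_squared([2.0, 2.2, 1.8, 2.1], [5.0, 5.3, 4.9, 5.1])
# Made-up data: heavily overlapping groups.
overlapping = r_squared([2.0, 3.5, 1.0, 2.9], [2.4, 3.9, 1.3, 3.2])

print(f"far apart:   R^2 = {far_apart:.3f}")
print(f"overlapping: R^2 = {overlapping:.3f}")
```

A large value means the group means account for most of the variation, which is the "visual R-squared" one gets from looking at a plot of the two groups.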

@emmafoley8987 5 years ago

I've really had trouble understanding what a t test *is* and this was super helpful.

@statquest 5 years ago

Hooray!!!! :)

@Hajar1992ful 1 month ago

Thank you for your amazing videos Josh. You make us smarter!

@statquest 1 month ago

Glad you like them!

@Kaaaaaaaam 6 years ago

These videos are great! Thanks!

@autumnp4077 4 years ago

Really appreciate the refresher of the regression on the side of the t-test! REINFORCEMENT FOR THE WIN!

@statquest 4 years ago

Yes! :)

@hsinyenwu 6 years ago

Thanks so much for this video!!! Never heard anyone explain those concepts so well. Do you have any plan to make videos about multiple comparisons adjustment?

@mariaaureliano8411 3 years ago

Thank you! Really great and helpful videos!

@statquest 3 years ago

Glad you like them!

@junmingzheng7456 5 years ago

OMG, now that's how ANOVA and linear regression are connected.

@mohammadalidastgheib2688 1 year ago

Thank you for your clear explanations.

@statquest 1 year ago

Bam! :)

@wisamtariq4412 5 years ago

Many thanks, great channel! I have a question, please: is the t-test approach here what's called "one-way ANOVA", and is the F-test used for "factorial ANOVA" since there are more levels for the categorical variable?

@ducvu2109 6 years ago

Hey Joshua, why should we sing, lolz. Love your lesson, man!

@369standrealfine 5 years ago

Thank you so much for your videos.

@statquest 5 years ago

Thanks!

@alvarorodriguez3552 3 years ago

Best statistics teacher on the internet!!!!

@statquest 3 years ago

Thank you very much!!!! :)

@hongdalin5953 6 years ago

Hi Joshua, thanks for sharing. These videos go step by step and make so much more sense to me than the hideous textbooks. I was wondering if you could make a video on repeated measures ANOVA, broken into small pieces. Thanks in advance.

@urdeathisnear885 4 years ago

Hi Josh, great work on these videos, very helpful! One question: is it safe to say that ANOVA is just a generalized t-test for >2 groups?

@statquest 4 years ago

Sure, I think that is a safe thing to say.

@Dekike2 4 years ago

First of all, Thank you so much, Josh, for the time you spend sharing your knowledge about statistics. Students need more people like you... I wanted to ask something likely silly, can you make an ANOVA with an unbalanced sample? What can I do if some categories have more data than others? Thanks again, Josh!! I am looking forward to hearing from you!!!

@statquest 4 years ago

ANOVA works fine with unbalanced samples. You just have more rows in your design matrix for one category than another.
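
As a rough illustration of that reply, here is a minimal sketch of a design matrix for unbalanced groups. The data, the group names, and the use of NumPy's least-squares solver are all assumptions made for this example, not something taken from the video.

```python
import numpy as np

# Made-up, unbalanced data: 3 control, 5 mutant, 2 treated measurements.
groups = {
    "control": [2.1, 2.4, 1.9],
    "mutant": [3.5, 3.9, 3.2, 3.6, 3.8],
    "treated": [5.0, 5.4],
}

y = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])

# One column per category; each row has a single 1 that "turns on" its group's mean.
X = np.zeros((len(y), len(groups)))
row = 0
for col, values in enumerate(groups.values()):
    X[row:row + len(values), col] = 1.0
    row += len(values)

# Least squares recovers one mean per group, regardless of group size.
params, *_ = np.linalg.lstsq(X, y, rcond=None)

print(X.sum(axis=0))   # number of rows contributed by each category
print(params)          # fitted group means
```

The only thing that changes with unbalanced data is the number of rows in which each column holds a 1.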

@seanpitcher8957 10 months ago

Bought the book. Nicely done and useful!

@statquest 10 months ago

Awesome, thank you!

@markobe08 4 years ago

I will just go on a liking spree on all of your videos

@statquest 4 years ago

Hooray! :)

@aickoyvesschumann3400 4 years ago

Great video! I think you should put parentheses around your SS differences in the F-statistics to have correct equations; (SS(mean)-SS(fit))/(p_fit-p_mean). Divisions have generally a higher priority than differences, but you want to first subtract and then divide.

@statquest 4 years ago

Great suggestion! I've added your correction to a pinned comment that will be easy for other people to find.
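
For reference, the F-statistic with the suggested parentheses can be typeset as below. The numerator is the expression quoted in this thread; the denominator, SS(fit) divided by (n - p_fit) with n the number of measurements, is the standard form and is an assumption here since it is not quoted in the comments.

```latex
F = \frac{\bigl(SS(\text{mean}) - SS(\text{fit})\bigr) \,/\, \bigl(p_{\text{fit}} - p_{\text{mean}}\bigr)}
         {SS(\text{fit}) \,/\, \bigl(n - p_{\text{fit}}\bigr)}
```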

@ronykroy 4 years ago

I keeep coming here to hear the Baaaaam !! :)

@statquest 4 years ago

Hooray! :)

@ashokmulchandani2841 5 years ago

I love your voice both while singing and explaining statistical concepts. Thank a ton for these videos. Do you mind if I can request you the videos on the following topics 1) 2 or more factor ANOVA (to be used as reducing the number of the independent variable) 2) Linear Multiple regression (to be used as reducing the number of the independent variable) 3) DOE and Taguchi

@statquest 5 years ago

Glad you like the videos! I've added Taguchi, DOE and 2 or more factor ANOVA to my to-do list. I believe that my video on Multiple Regression in R may already satisfy your second request: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-hokALdIst8k.html

@ashokmulchandani2841 5 years ago

StatQuest with Josh Starmer Thanks 😀

@markaitkin 6 years ago

Love your videos. I have 3 requests... 1. Degrees of freedom 2. Linear regression with regularisation 3. Log linear regression and why coefficient indicates % change Thanks so much!

@statquest 6 years ago

Thanks so much! The degrees of freedom StatQuest is high, high on the to-do list. It is never far from my mind. I have it about 1/2 done in my head, but the second half is tricky - some situations are easier to illustrate than others - but it's just a matter of setting aside time just for it and nothing else and it will get done. The good news is that I'm maybe 1 or 2 months away from doing StatQuests on ridge, lasso and elastic-net regression - all examples of linear regression (or, more generally, generalized linear regression since these ideas can be applied to logistic regression) with regularization. So that's sure to happen soon (just as soon as I can!) The last one, log-linear regression, is the logical follow up to logistic regression. I may do a "big picture/main ideas" StatQuest on that as soon as I can. It's on the list!

@markaitkin 6 years ago

StatQuest with Josh Starmer thanks for your reply. Can't wait for the next videos

@woodypham6474 4 years ago

What else can I say about this clip? You're the best

@statquest 4 years ago

Hooray!!! :)

@ai1888 6 years ago

Will the F-statistic calculated from this method be equal to the t-statistic? I understand that you are trying to standardize the way to calculate the t-test by using methods from linear regression, but does it produce the same values that a regular t-test does?

@benedettodiciaccio3024 5 years ago

According to this website [ onlinecourses.science.psu.edu/stat501/node/297/ ], the t-statistic and F-statistic produce equivalent p-values when the F-statistic's degrees of freedom in the numerator is 1. The relationship is t^2(n-p) = F(1,n-p), which apparently means the p-values for each will be identical. Don't know why that is, but videos on the relationship between those two distributions may help. Anyway, I assume the relationship applies here, since df = 1 for the F-statistic numerator when comparing two groups. As a side note, most p-values for slopes in multiple linear regression are calculated with t-tests. However, F-tests comparing the variance between models with and without the slope produce an identical p-value due to the above mentioned relationship. Thinking of slope significance in terms of how much more variance the model explains with vs without the slope seems much more intuitive to me, and I'm glad I found these videos.

@redcat7467 2 years ago

I just watched a video on Confidence Intervals from back in 2015 and the song was pretty much the same, yet what a difference!

@statquest 2 years ago

@usfbge 1 month ago

Hi Josh, your videos are amazing, easy to follow and understand. Just wondering if you could upload a video on GLMM and LMM models and when to use which model? This will help to clarify.

@statquest 1 month ago

I hope to do that one day, however, it will probably be a while since I'm writing a book on neural networks right now.

@shichengguo8064 3 years ago

Hi Josh, It's time to bring linear mixed models. Thankkkk Youuuuu!!!

@statquest 3 years ago

I'll keep that topic in mind.

@TheAugustinePark 4 years ago

At 3:15 of the video, on the t-test graph we fit a horizontal line to get the least-squares fit. Intuitively, wouldn't having a line with the same placement but any slope (meaning also a different y-intercept) result in the same value for the least-squares fit since all the data points have the same x-value? Thank you

@statquest 4 years ago

Any point at the mean of the data will have the same fit. I use a line to make it easier to see.

@kartikeyachaudhary4983 4 years ago

Bro, thank you so much man......

@statquest 4 years ago

Thanks! :)

@drzun 5 years ago

Thanks for the awesome video. I have a question about the p-value generated from the DE analysis by DESeq2. According to the description in DESeq2, the p-value seems to be calculated from "negative binomial GLM fitting for βi and Wald statistics". I wonder, is this the same concept as in the video? Is negative binomial regression also a kind of general linear model, and is the variance of the negative binomial (μ+α μ^2) the same as SS(mean) and SS(fit)? Also, is the Wald test the same as the t-test in the video, except that n is large in the Wald test? Sorry for asking so many questions, I'm so confused.

@statquest 5 years ago

GLM stands for two things "General Linear Models" and "Generalized Linear Models". Unfortunately, those two things are different - but when most people say "GLM", they most frequently mean "Generalized Linear Models". Generalized Linear Models are, in essence, a way to adapt the concept of a "design matrix" to a variety of problems and models. For example, in this video, we used design matrices to do t-tests and ANOVA. However, these same design matrices can be used with Logistic Regression (see those videos if you're interested) and they can also be used for DE analysis with DESeq2. However, the underlying math is different in all three cases. So the good news is that if you understand design matrices, you can do amazing things in a wide variety of contexts. The bad news is that SS(mean) and SS(fit) in these videos may or may not correspond to something in another system, like with DESeq2 or Logistic Regression. Logistic Regression, for example, doesn't use least squares at all, but instead relies on maximum likelihood to optimize the fit. Does this make sense?

@drzun 5 years ago

@statquest Thanks for the reply! I think I got your point. So the basic idea is to use the generalized linear model (GLM), which is more like a concept, to fit the data, and in the video the linear regression, which is more like a method, is used for the fitting. In programs like DESeq2, they use the negative binomial regression method to fit the RNA-Seq read counts, but the overall idea is still using GLM to describe how experimental factors (e.g. genotype and treatment) determine the expression of a gene (by a design matrix), and the p-value is kind of telling me how well the GLM fits (or how convincing the result is).

@statquest 5 years ago

@drzun You've got it!

@drzun 5 years ago

@statquest Hooray! Before watching your videos, I had a really hard time understanding the statistics behind the data analysis of RNA-seq, and I can't express how grateful I am to you & the videos.

@statquest 5 years ago

@drzun Hooray!!! That's great. I'm glad my videos were so helpful! :)

@leontxyee 4 years ago

I wish the channel existed when I was taking the statistics classes in college and I might be in a different profession now. Could you please do more quests to dive into the GLMs, concepts like the EDM, link functions, when to use what, etc.?

@statquest 4 years ago

I have a series of videos on Logistic Regression if you are interested in that: ru-vid.com/group/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe

@zzzluke8906 9 months ago

Your videos are extremely helpful! Can you go through things like the Kruskal-Wallis test and why it is not sensitive to normal distribution? If you can share some insights on the chi-squared test etc., it would be really helpful too!

@statquest 9 months ago

I'll keep those topics in mind.

@Dominus_Ryder 4 years ago

StatQuest, is there a version of a T-Test, or an ANOVA Test, that allows me to compare the Standard Deviation, Skewness, or Kurtosis of two or more samples to see if there is any statistical difference between them? If not, is there any particular reason why? To me, it seems as if knowing if these statistical quantities were different from each other would also provide useful information or features for machine learning algorithms.

@statquest 4 years ago

This is a great question. Unfortunately there are not many good or well known tests to compare standard deviations and other features (other than means). I'm not sure, but it could be that this is due to the lack of a central limit theorem like concept for standard deviations etc. (That's just a guess, so don't quote me on that).

@casualcasual1234 2 years ago

Thanks a lot and at 8:40, after obtaining F value, to obtain p value, is it the same as in the linear regression video? another sample of data (n=9) --> obtain SS(mean) & SS(fit) --> obtain F --> plug into F value histogram and repeat... --> obtain distribution and obtain F value of original data --> p value? Thanks again in advance :)

@statquest 2 years ago

The histogram that I used in the linear regression was intended to illustrate what an F-distribution represents, and it is the same here as well.
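
The procedure sketched in the question (generate data, compute F, add it to a histogram, repeat) can be written out roughly as follows. This is only an illustration under stated assumptions: the observed data are made up, the null data are drawn from a standard normal distribution, and in practice the p-value would usually be read from the F-distribution rather than simulated.

```python
import random
from statistics import mean

def f_statistic(group_a, group_b):
    """F for 'one mean per group' versus 'one overall mean'."""
    everything = group_a + group_b
    overall = mean(everything)
    ss_mean = sum((v - overall) ** 2 for v in everything)
    ss_fit = sum((v - mean(g)) ** 2 for g in (group_a, group_b) for v in g)
    n, p_mean, p_fit = len(everything), 1, 2
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Made-up observed data, 9 measurements in total.
control = [2.1, 2.5, 1.9, 2.3]
mutant = [3.4, 3.8, 3.3, 3.9, 3.6]
observed_f = f_statistic(control, mutant)

# Null world: both groups come from the same normal distribution.
rng = random.Random(0)
null_fs = []
for _ in range(10_000):
    fake_a = [rng.gauss(0, 1) for _ in control]
    fake_b = [rng.gauss(0, 1) for _ in mutant]
    null_fs.append(f_statistic(fake_a, fake_b))

# p-value: fraction of null F values at least as extreme as the observed one.
p_value = sum(f >= observed_f for f in null_fs) / len(null_fs)
print(f"observed F = {observed_f:.2f}, simulated p-value = {p_value:.4f}")
```

A histogram of `null_fs` approximates the F-distribution with 1 and 7 degrees of freedom for this design.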

@shamshersingh9680 4 months ago

Hi Josh, at time stamp 6:48, when you write the equation y = mean of control + mean of mutant, where have the residuals gone? How will we get the value of y using this equation without residuals? y = mx + c in linear regression helps get y values from a given x, and the same concept is being applied here. So why are we dropping the residuals?

@statquest 4 months ago

We drop the residuals because it doesn't make any sense to include them in the predictions we make with this equation. The residuals only make sense when we are evaluating how well the model fits the data. But with predictions based on new data, we don't know the actual values, so we don't know the residuals.

@oliseh2285 4 years ago

Amazing video Josh!!! Could you also do a video of two-way ANOVA with block design and calculating the significance of the factors, their interaction, block and the residuals? It would be great!

@statquest 4 years ago

I'll keep it in mind.

@oliseh2285 4 years ago

@statquest that will be awesome. Triple BAM!!!

@haydrick 4 years ago

Hi Josh - I am struggling to understand what the p-value means in this scenario. What would be the hypothesis statement that the p-value enables us to accept / reject?

@statquest 4 years ago

The null hypothesis is that there is no difference. Thus, the p-value tells us if having parameters (other than just the intercept) is useful for distinguishing between groups. If there is no difference, then we should fail to determine that the estimated parameter values are significantly different from 0.

@Tyokok 5 years ago

Hi Josh, quick Q. Isn't the test you explained here an F-test? Doesn't a t-test use t-score = (slope beta - 0)/standard error and then get the p-value from a t-table? Or are they the same thing? A little confused here. Thank you!

@statquest 5 years ago

This is a great question. The t-test is just a specific type of F-test. If you have statistics software, you can compare the results and see that the p-values are the same (however, the F-statistic itself will be the square of the t-statistic. Why the square? Because, as you saw in the first video in this series, the F-statistic can never be negative, but the t-statistic can). There are multiple ways to calculate a t-test; using an F-test is my favorite because it is much more flexible. Does that make sense?
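
The claim that the F-statistic is the square of the t-statistic can be checked numerically. The following is a minimal sketch, assuming made-up measurements and the pooled-variance (Student) form of the two-sample t-test; the variable names are illustrative only.

```python
from statistics import mean

control = [2.1, 2.5, 1.9, 2.3]   # made-up measurements
mutant = [3.4, 3.8, 3.3, 3.9]    # made-up measurements

def sum_sq(values, center):
    return sum((v - center) ** 2 for v in values)

n = len(control) + len(mutant)
everything = control + mutant

# Simple model: one overall mean (p_mean = 1 parameter).
ss_mean = sum_sq(everything, mean(everything))
# Fancier model: one mean per group (p_fit = 2 parameters).
ss_fit = sum_sq(control, mean(control)) + sum_sq(mutant, mean(mutant))

p_mean, p_fit = 1, 2
f_stat = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Classic pooled-variance t-statistic for the same data.
pooled_var = ss_fit / (n - 2)
t_stat = (mean(mutant) - mean(control)) / (
    pooled_var * (1 / len(control) + 1 / len(mutant))
) ** 0.5

print(f"F = {f_stat:.4f}, t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```

Swapping the two groups flips the sign of t but leaves F unchanged, which is why F carries no direction.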

@Tyokok 5 years ago

@statquest I knew you would take it to the next level. So basically the two tests are both significance tests of hypotheses about model parameters, just using different methods, so the p-value should refer to the same thing. BAM! Thank you so much!

@nr7507 2 years ago

Thank you, I had a few questions. At 6:37, is there a reason we did not include the residuals in the overall equation of y? Also, why do we need the y equation at 6:13 to create a design matrix? Is it not just a matrix where the number of ones corresponds to the number of data points for control and zero for mutant, and vice versa for the next set of entries? Also, does the sample size have to be the same per category to create a design matrix? Great Tutorial!

@statquest 2 years ago

1) This equation simply represents what goes into the design matrix. The residual is the difference between this equation and what is observed. 2) The equation just illustrates how we create the design matrix and what it represents. 3) You don't need to have equal numbers of samples for each category (they can be different).

@TheAugustinePark 4 years ago

In terms of when we should use linear regression vs. t-tests vs. ANOVA for testing our data, is linear regression for when our independent variable is continuous while t-tests and ANOVA for when our independent variable is discrete (e.g. categorical variables)? Thank you!

@statquest 4 years ago

Technically, it is all linear regression. However, they give it different names. t-tests are when you have two distinct groups and ANOVA is when you have more than 2 distinct groups.
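
To make "it is all linear regression" concrete, here is a minimal sketch in which one function computes F for two groups (the t-test case) and for three groups (the ANOVA case) by fitting two design matrices with least squares. The data, labels, and function names are made up for illustration.

```python
import numpy as np

def sum_sq_residuals(X, y):
    """Least-squares fit of y on X, returning the sum of squared residuals."""
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ params) ** 2))

def f_statistic(y, labels):
    """F for 'one mean per group' versus 'one overall mean'."""
    y = np.asarray(y, dtype=float)
    levels = sorted(set(labels))
    X_mean = np.ones((len(y), 1))  # simple model: a single column of 1s
    X_fit = np.array([[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels])
    ss_mean, ss_fit = sum_sq_residuals(X_mean, y), sum_sq_residuals(X_fit, y)
    p_mean, p_fit = X_mean.shape[1], X_fit.shape[1]
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (len(y) - p_fit))

# Made-up data. Two groups gives the t-test; three or more gives ANOVA.
y2 = [2.1, 2.5, 1.9, 3.4, 3.8, 3.3]
f_two = f_statistic(y2, ["a", "a", "a", "b", "b", "b"])
y3 = y2 + [5.1, 4.8, 5.3]
f_three = f_statistic(y3, ["a", "a", "a", "b", "b", "b", "c", "c", "c"])
print(f_two, f_three)
```

Nothing in `f_statistic` depends on the number of groups; only the number of columns in the design matrix changes.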

@yenhoeooi9 2 years ago

Hi Josh, great video here. Would really appreciate it if you had a StatQuest on the F-statistic/F-value and also on degrees of freedom. It's kinda hard for me to grasp the concept of these two topics.

@statquest 2 years ago

The first video in this series explains F-statistics and f-values: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@4wanys 3 years ago

Hi, thank you for the video. Is the t-test the same as machine learning regression with discrete inputs?

@statquest 3 years ago

I'm not sure what your question is. A t-test is a way to compare two categories of things (like "normal diet" vs "special diet") when you measure something continuous (like weight).

@siddharthkhattak8381 4 years ago

I have a question, while calculating F we use SS(mean) and SS(fit) but for t-test or ANOVA there will be 2 and 5 (respectively as per this video) SS(fit) then do we take an avg of all the SS(fit) or add them.....??

@statquest 4 years ago

SS(fit) is the sum of all of the squared residuals (the difference between the actual observation and the lines we fit to the data.)

@heisenbergren1556 1 year ago

Hi Josh, should the top of the formula be F = (SS(mean)-SS(fit))/(p_fit-p_mean)? (One more pair of brackets.)

@statquest 1 year ago

yep

@beautyisinmind2163 2 years ago

Hi Professor Josh, ANOVA (F-test) is often used in the filter method for feature selection. Theory says ANOVA should be used for feature selection when the target is binary, but in practice I have seen people also use ANOVA when the target is multi-class. So can ANOVA (F-test) also be applied if our target is not binary and has multiple classes? Another question: ANOVA assumes features to be normally distributed, but in practice most of the time we encounter data that are not fully normal. In such cases, does it matter much to apply it, or is a transformation compulsory?

@statquest 2 years ago

ANOVA is really only intended to be used when the dependent variable is continuous.

@brunog.campos3236 5 years ago

If the t-test indicates that mutant and control are different, but the anova indicates that there was no difference between the groups what should I do?

@akyanus7042 3 years ago

Hi, so how do you do a two-sample t-test with bootstrapping for RNA-seq data? There are hardly any examples in the literature. It is considered an alternative method to edgeR, but is it possible to get a bootstrapped t-test for each gene in a group comparison (like the model matrix in edgeR)? So how is the bootstrap t-test used for gene expression analysis (e.g. the boot package in R)? I don't understand how differentially expressed genes are identified with bootstrapping. Can you share information on the subject?

@statquest 3 years ago

I have a video that shows how bootstrapping can be used for a t-test here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-isEcgoCmlO0.html

@akyanus7042 3 years ago

@statquest Thank you very much, I checked it. I understood the hypothesis for the mean between two groups, but I still do not understand how it is used for genes. This is complicated, I think. I wanted to see a table of t and p values for genes. Am I thinking wrong?

@statquest 3 years ago

@akyanus7042 Replace the responses people had to the drugs (feeling better or worse) with the read counts for a gene in different samples. For example, you might have 3 samples that took drug A and 3 samples that took drug B. For Gene "X", bootstrap the read counts for the genes and calculate p-values as described.

@akyanus7042 3 years ago

@statquest Thank you.

@ghassencawabunga406 4 years ago

Thanks for the video, but I have a question: in the design matrix you didn't take the residual into account, and then when you calculated p_fit you also ignored it! I am having trouble understanding that. I thought it should be included as a parameter.

@statquest 4 years ago

The residual is the difference between our model's prediction and the actual value. Mathematically, it is "Observed value - model = residual", where model is the design matrix times the parameters. If we added the residual to our design matrix, we would get "Observed value - (model + residual) = Observed value - model - residual = (observed value - model) - residual = residual - residual = 0." And that wouldn't be very helpful.
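
The identity in that reply, "Observed value - model = residual" with the model being the design matrix times the parameters, can be sketched numerically. The measurements below are made up, and NumPy is used only for the matrix product.

```python
import numpy as np

# Made-up measurements: first three rows are control, last three are mutant.
y = np.array([2.0, 2.3, 2.3, 3.4, 3.6, 3.8])
X = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

# The parameters are the two group means.
params = np.array([y[:3].mean(), y[3:].mean()])

model = X @ params           # what the model predicts for each row
residuals = y - model        # observed value - model
ss_fit = float(np.sum(residuals ** 2))

# Adding the residuals back onto the model just reproduces the data,
# so "observed - (model + residual)" is zero for every row.
leftover = y - (model + residuals)
print(residuals, ss_fit, leftover)
```

SS(fit) is built from the residuals, so the residuals are an output of the fit rather than one of its parameters.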

@ghassencawabunga406 4 years ago

@statquest thank you for the clarification!

@danielsobczynski2107 2 years ago

Hi Josh, great video as always. Just wanted to ask, what happens to the residual in the equations earlier in the video that had "+ residual" in them? Thanks so much for your help, definitely learning a lot

@statquest 2 years ago

What time point, minutes and seconds, are you asking about? However, I'm guessing that you are asking about the difference between the equation that perfectly fits the data (because it includes the means + the residuals) and the equation that generates the residuals (because it only includes the means). The equation that does not include the residuals is the one we use to make predictions with future data.

@danielsobczynski2107 2 years ago

@statquest Thanks Josh, that is the point I was asking about, I will review the video again once more

@DanWhalen 4 years ago

still confused how do i interpret/operationalize "y=control1*2.2 + control2*3.6"? like at 6:54, are we saying "y=(4*2.2)+(4*3.6)"?

@statquest 4 years ago

If you go back to 6:11, you see that the "design matrix" is formed from the 1's and 0's that turn on/off the values for the control mean and the mutant mean. So when you have y = column1 * 2.2 + column2 * 3.6, to predict a value for a new control sample, you plug in 1 for column1 and 0 for column2 and thus, the prediction is y = 1 * 2.2 + 0 * 3.6 = 2.2.
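
The arithmetic in that reply can be written as a tiny function. This is only a sketch: the means 2.2 and 3.6 come from the reply, while the function and constant names are made up for illustration.

```python
# Means quoted in the reply: 2.2 for control, 3.6 for mutant.
CONTROL_MEAN = 2.2
MUTANT_MEAN = 3.6

def predict(is_control, is_mutant):
    """Each argument is a 0/1 entry from one row of the design matrix."""
    return is_control * CONTROL_MEAN + is_mutant * MUTANT_MEAN

print(predict(1, 0))  # new control sample
print(predict(0, 1))  # new mutant sample
```

Each row of the design matrix has exactly one 1, so a prediction is always one of the two means, never their sum.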

@albertrodrigo2432 2 years ago

It would be a triple BAM if you could do a quick Stat Quest about residual diagnosis in linear models!

@statquest 2 years ago

I'll keep that in mind.

@pg4234 2 years ago

At 10:42 if we get a small p value from the F-statistic, how do we know which of the categories is significant?

@statquest 2 years ago

We then have to test each one separately to identify which one is significantly different.

@BeefLoverMan 3 years ago

This channel is a gift from the math gods. Question: I'm having a hard time linking this to Design of Experiments methods. It seems like it should be an easy connection, but I somehow can't quite work it out in my head. How would one use this to calculate the explained variation by individual terms of a linear model? 1 term == 1 "category"? And how do degrees of freedom factor into it?

@statquest 3 years ago

The next video in this series may help you understand how to design experiments: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CqLGvwi-5Pc.html

@josephgan1262 3 years ago

Hi, thanks for the video! How does this t-test relate to the one-tailed and two-tailed t-tests used in hypothesis testing? Thanks.

@statquest 3 years ago

This t-test gives the exact same results as the two-tailed t-test used in hypothesis testing. If you want it to represent a one-tailed test, just divide the p-value by 2 (assuming the observed difference is in the direction you hypothesized).
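A small sketch of that conversion, assuming a symmetric test statistic such as the t-statistic. The function name `one_tailed_p` and its arguments are hypothetical; the branch for a difference in the "wrong" direction is an addition that goes beyond the reply above.

```python
def one_tailed_p(two_tailed_p, observed_difference, expected_sign):
    """Convert a two-tailed p-value from a symmetric test into a one-tailed one.

    expected_sign is +1 if the alternative hypothesis says the difference is
    positive, and -1 if it says the difference is negative.
    """
    in_expected_direction = observed_difference * expected_sign > 0
    if in_expected_direction:
        # Only one tail counts, so the p-value is halved.
        return two_tailed_p / 2
    # The data point the other way, so the one-tailed p-value is large.
    return 1 - two_tailed_p / 2


halved = one_tailed_p(0.04, observed_difference=1.4, expected_sign=+1)
wrong_direction = one_tailed_p(0.04, observed_difference=1.4, expected_sign=-1)
print(halved, wrong_direction)
```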

@josephgan1262 3 years ago

@@statquest Hi Josh! Thanks for the response! What I mean is that I am used to seeing the t-test compare sample means in hypothesis testing by computing a t-statistic from the sample mean and standard error to get the p-value. Hence I am a bit confused by the concept of computing an "F" in a "t-test" mentioned in this video.

@statquest 3 years ago

@@josephgan1262 Most people are taught about the t-test from the perspective of the t-distribution. This is fine, but the t-test using a t-distribution is not as flexible. In contrast, the F-distribution is a generalization of the t-distribution. The t-distribution only allows us to compare two means. The F-distribution lets us compare 2 or more means. When you are comparing two means, you can convert your F-statistic to a t-statistic by taking the square root of it (the sign follows the direction of the difference), and you can convert a t-statistic to an F-statistic by squaring it.
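To see the square/square-root relationship numerically, here is a sketch that computes both statistics by hand. The measurements are hypothetical, chosen only so that the group means are 2.2 and 3.6, and the t-statistic is the classic equal-variance (pooled) version.

```python
import math

# Hypothetical measurements; only the group means (2.2 and 3.6) echo the thread.
control = [1.8, 2.0, 2.4, 2.6]
mutant = [3.2, 3.4, 3.8, 4.0]


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values, center):
    return sum((v - center) ** 2 for v in values)


everything = control + mutant
n = len(everything)

# F-statistic from the linear-model point of view.
ss_mean = sum_of_squares(everything, mean(everything))  # model with 1 parameter
ss_fit = (sum_of_squares(control, mean(control))
          + sum_of_squares(mutant, mean(mutant)))        # model with 2 parameters
p_mean, p_fit = 1, 2
f_stat = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Classic two-sample t-statistic with a pooled variance estimate.
pooled_variance = ss_fit / (n - 2)
standard_error = math.sqrt(pooled_variance * (1 / len(control) + 1 / len(mutant)))
t_stat = (mean(mutant) - mean(control)) / standard_error

print(f_stat, t_stat ** 2)  # the two numbers agree up to rounding error
```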

@Doctor_CCC 3 years ago

Hi Josh, thank you for your great videos! Is it necessary to perform a post-hoc test to determine which of the groups performed better than the others (using multiple comparisons between groups with some adjustment for multiple tests, such as Bonferroni)?

@statquest 3 years ago

It depends on the goals of the experiment. Typically, people do post-hoc tests with a multiple testing correction. FDR is way better than Bonferroni, so use FDR if you can.

@Doctor_CCC 3 years ago

@@statquest Thanks for your reply. If we use multiple linear regression models to replace ANOVA, are the t-tests on the regression coefficients like the post-hoc tests in ANOVA without multiple testing correction?

@statquest 3 years ago

@@Doctor_CCC Pretty much

@Doctor_CCC 3 years ago

@@statquest Thank you very much for your explanations. On the premise that the t-tests on the regression coefficients are like the post-hoc tests in ANOVA *without* multiple testing correction, I am wondering how to appropriately interpret the p-values of the t-tests on the regression coefficients in multiple linear regression analysis (to mitigate multiple comparison problems). By the way, is there any learning resource about using post-hoc tests with FDR after multiple linear regression analysis?

@statquest 3 years ago

@@Doctor_CCC I should clarify. The t-tests compare the model with and without individual variables. This is different from Post-hoc tests in ANOVA, where we test all possible pair-wise combinations. Testing all possible pair-wise combinations can quickly add up to a lot of tests, necessitating adjusting p-values. In contrast, when we just test the model with and without individual variables, we only do as many tests as we have variables - and usually this means we've only done a few extra tests, which, typically, does not necessitate adjusting the p-values. However, if you have a ton of parameters (variables), then you should adjust them with FDR. In R, this is super easy: stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html
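For readers not using R, the FDR adjustment mentioned above can be sketched in a few lines. This is a hand-written Benjamini-Hochberg adjustment, which is the procedure that R's `p.adjust` applies for `method = "BH"` (or its alias `"fdr"`). The function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg (FDR) adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, ordered from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # rank 1 belongs to the smallest p-value
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values, one per model coefficient.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```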

@mook1481 3 years ago

Please do a MANOVA video!! This was so useful. I'm doing a 2x2x3 MANOVA for my research project and would really appreciate a video :)

@statquest 3 years ago

I'll keep that in mind.

@minederguy4932 4 years ago

How do you calculate the residuals for the equation + design matrix? Wouldn't that involve subtracting a matrix from a scalar?

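@statquest 4 years ago

The design matrix is just a general way to specify how each measurement fits into the equation. Each row of the design matrix corresponds to one measurement, so each residual is that measurement minus the value the equation gives for its row (a scalar minus a scalar).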
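Here is a rough sketch of how the residuals come out row by row. The measurements are hypothetical, and the names `design_matrix`, `parameters`, and `fitted_value` are illustrative rather than anything defined in the video.

```python
# Hypothetical measurements: four control mice followed by four mutant mice.
observed = [1.8, 2.0, 2.4, 2.6, 3.2, 3.4, 3.8, 4.0]

# One design-matrix row per measurement: (is_control, is_mutant).
design_matrix = [(1, 0)] * 4 + [(0, 1)] * 4

# One parameter per column: the control mean and the mutant mean.
parameters = (2.2, 3.6)


def fitted_value(row):
    """Value the equation gives for one row of the design matrix."""
    return sum(switch * parameter for switch, parameter in zip(row, parameters))


# Each residual is one observation minus the fitted value for its own row.
residuals = [y - fitted_value(row) for y, row in zip(observed, design_matrix)]
ss_fit = sum(r ** 2 for r in residuals)
print(residuals, ss_fit)
```

No matrix is ever subtracted from a scalar: the matrix only tells each measurement which mean to use.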

@alexandergarcia6479 4 years ago

Hi Joshua, what happens if I have several vectors, each of which is the mean of other vectors, and I want to tell them apart? E.g., Xmean=(X1+X2+X3)/3, Ymean=(Y1+Y2+Y3)/3... where each vector has n coordinates of means, and I want to test whether the mean vectors Xmean, Ymean... come from different populations or not. Thank you.

@statquest 4 years ago

Do you have multiple columns of means? If so, just use those as normal data. You might also want to watch my StatQuest on Design Matrices: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CqLGvwi-5Pc.html

@alexandergarcia6479 4 years ago

@@statquest Thank you, I'll do it. And thank you for all your videos.

@ravikiranrao05 3 years ago

Such a great video, Josh. I really enjoy your videos. Can you please recommend a textbook that reflects your way of teaching? Is there one that I'd be hooked on reading (just like your videos)? Thanks

@statquest 3 years ago

I'm writing my own book right now. I hope it is out next year.

@ravikiranrao05 3 years ago

@@statquest Woah! Looking forward to reading that.

@rookiedrummer6838 3 years ago

Thanks @Josh, I have some questions: 1) Suppose we have 5 independent variables and a label. How does ANOVA calculate a p-value for each feature in this case? 2) Does it fit a regression for each independentVariable~Label separately and then calculate the p-value?

@statquest 3 years ago

I describe how p-values are calculated for individual features in these videos: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-zITIFTsivN8.html ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-hokALdIst8k.html The concepts apply to ANOVA in the exact same way.

@eye_oph 1 year ago

Hi Josh, great video as always. Just wanted to ask: how do you do post-hoc tests in linear models, like the post-hoc tests in ANOVA, to explore differences between two groups? Thank you.

@statquest 1 year ago

Post-hoc tests with ANOVA are just a matter of defining your "design matrices", which I illustrate in the next video in this series: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CqLGvwi-5Pc.html

@eye_oph 1 year ago

@@statquest If there are three drugs (drug A, drug B, and drug C) and we use drug A as the reference level, we then use dummy coding to compare B vs. A and C vs. A in the linear model. We can determine the difference for B vs. A and C vs. A by calculating the p-value of each coefficient. However, it seems that we cannot determine the difference for B vs. C in this linear model? Thank you for your reply.
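One possible way to look at the B vs. C comparison, sketched here with hypothetical data and not taken from the video: with a single categorical predictor, the least-squares dummy coefficients are differences between group means, so B vs. C can be read off as the difference between the two coefficients, or by re-fitting with drug B as the reference level. This sketch only recovers the estimated difference; a p-value would also need the standard error of that difference, which is not computed here.

```python
# Hypothetical responses for the three drugs.
drug_a = [1.0, 2.0, 3.0]
drug_b = [3.0, 4.0, 5.0]
drug_c = [6.0, 7.0, 8.0]


def mean(values):
    return sum(values) / len(values)


# Dummy coding with drug A as the reference level.
intercept = mean(drug_a)              # mean of A
coef_b = mean(drug_b) - mean(drug_a)  # B vs. A
coef_c = mean(drug_c) - mean(drug_a)  # C vs. A

# C vs. B as a contrast (difference) of the two coefficients.
c_vs_b_contrast = coef_c - coef_b

# Using drug B as the reference level gives the same number directly.
coef_c_with_b_reference = mean(drug_c) - mean(drug_b)

print(c_vs_b_contrast, coef_c_with_b_reference)
```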

@visheshsharma2115 2 years ago

At 8:22, is the graph above for the t-test the fitted one or the mean one?

@statquest 2 years ago

I'm not sure I understand your question, however, the graph in the top right corner at 8:22 shows a horizontal solid black line at the average of the y-axis coordinates.

@krishnag5734 3 years ago

Hi Josh, thanks for the video. :) What about adding residuals to the equations at 6:27 and 6:57? Isn't it necessary?

@statquest 3 years ago

The residuals are squared and summed when we solve for the optimal parameters. For details, see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@krishnag5734 3 years ago

@@statquest Thank you, Josh :)

@yimingshao4240 2 years ago

Thanks a lot for your video, it's really helpful! But I have a question: why can the equation for y be written as y = mean(control) + mean(mutant)? Where are the residuals for each set of data?

@statquest 2 years ago

I'm not sure I understand your question. The residual for each measurement is paired with that measurement, so it is easy to keep track of.

@somasundar8030 2 years ago

You are the best

@statquest 2 years ago

Thanks!

@nikosterizakis 1 year ago

I might have missed that, but what is the value of 'n' in the formula?

@statquest 1 year ago

The number of observations or data points.
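For reference, this is the F-statistic the question refers to, written with the parentheses from the correction in the video description. The denominator follows Part 1 of the series; the symbol names are just a LaTeX rendering of the video's SS(mean), SS(fit), p_mean, and p_fit.

```latex
F = \frac{\bigl(SS(\text{mean}) - SS(\text{fit})\bigr) / (p_{\text{fit}} - p_{\text{mean}})}
         {SS(\text{fit}) / (n - p_{\text{fit}})}
```

Here n is the number of observations, p_mean is the number of parameters in the mean-only model (1, the overall mean), and p_fit is the number of parameters in the fitted model (2 for the t-test, one mean per group).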

@cristianleoni6852 4 years ago

Great job as usual, but this is still quite a confusing topic for me. Will Pmean always be one? Also, is there a nice explanation for the formula of the F value? And how does the F value relate to the p-value?

@statquest 4 years ago

Did you watch part 1 in this series? If not, it should answer all of your questions: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@esperanzazagal7241 3 years ago

Is the overall mean always on the y-axis because it is the outcome of interest? Are we never interested in means on the x-axis?

@statquest 3 years ago

We are predicting the y-axis value, and that is why we are interested in the y-axis more than the x-axis (the stuff on the x-axis is only being used to predict y-axis values.)

@danzz7583 2 years ago

Hi, I have to ask, what's the point of this t-test if we could just do a simple hypothesis test to prove that mutants and control are different? I.e. null hypothesis is that mutant mice and control mice are the same, then simply find the p-value for the mutant mice mean?

@statquest 2 years ago

You might not realize that this t-test is the same test you would use to do a simple hypothesis test. The equations might look different, but they are equivalent, and the p-values are the exact same.

@danzz7583 2 years ago

@@statquest That cleared things up, thanks a lot :)

@TheAugustinePark 4 years ago

At 4:20 of the video, you mention the reason we combine the two lines of best fit into a single equation is to make the steps for computing "F" identical for regression and the t-test meaning a computer can do it automatically. In terms of what this actually looks like, I think this means having a single equation means one value for SS(fit) (instead of 2) which means we can use the "F" equation for regression. Is my reasoning correct? Also, why does a single equation mean a computer can do it automatically? Why could a computer not do it automatically if we had 2 equations? Thanks I love your videos!

@statquest 4 years ago

Sure, a modern computer can handle more than one equation. But back in the day, memory was limited, and that limited the number of tests a computer could perform. So the original idea was to unify as much of linear models as possible into a single framework called "General Linear Models", with the idea that one equation could be used in a general setting on a computer without having to check a bunch of different conditions. In the early days, different conditions meant different look-up tables for figuring out the p-values, and since computers had very little memory, this limited what they could do.

@janakiramanbalachandran504 4 years ago

Hi, what does the variable 'n' mean in the formula? Is it the total number of samples? Also, can you provide some intuition about this statistic and how it's used to tell a good fit from a poor fit?

@statquest 4 years ago

'n' is the total number of observations (which I call "samples" in this video). Often, the more data we have (the larger 'n' is), the more confidence we can have in the predictions, because the p-value for a given difference becomes smaller.

@janakiramanbalachandran504 4 years ago

@@statquest Thank you for clarifying 'n'. My other question was about intuition for the F-statistic. For example, what is its range, and what do values at the lower end imply compared to values at the higher end?

@statquest 4 years ago

@@janakiramanbalachandran504 I explain the F-statistic, its range, and what lower-end values imply in Part 1 of this series: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@silentsuicide4544 2 years ago

I don't think I get it completely. When we have two features, like in the first example, "t-test" is written over the graph, but we are calculating an F-score, which, using the F-distribution, gives us the p-value. But the definition of a t-test is any hypothesis test in which the test statistic follows a t-distribution under the null hypothesis. My question is: why is it called a "t-test" if we are using the F-score and F-distribution to get the p-value?

@statquest 2 years ago

The F-distribution is a generalization of the t-distribution. In other words, the F-distribution can do everything we can do with a t-distribution and more.

@vanya.antonov 5 years ago

Hello, Joshua! I am a bit confused at 7:42. If I understand correctly, you estimate the t-test p-value by computing the F-value (and using the F-distribution?). However, according to Wikipedia, the test statistic in a t-test follows Student's t-distribution (and not the F-distribution). So, I was wondering if the t-test you describe here is the same as the standard t-test from Wikipedia?

@juliar5741 3 years ago

I have the same question here. @StatQuest

@thomasamet5853 3 years ago

Thank you so much Josh for all your amazing content and great silly songs. I can't wrap my head around why you say the fit equation is written out like y = mean_control + mean_mutant at 6:48 and 9:05. I would have written something like y = mean_control * x + mean_mutant * (1-x), with x taking the value 1 or 0. Any explanation on that from you or someone else is appreciated.

@statquest 3 years ago

Because my equation is being multiplied by the design matrix, it is essentially the exact same thing that you have.

@thomasamet5853 3 years ago

@@statquest Bam!! Thank you for the explanation

@user-ht7gw9ww1c 5 years ago

He explains very simple concepts.

@bernaridho 6 months ago

Where is part 1? I did not find it in your description.

@statquest 6 months ago

ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@glaswasser 3 years ago

Can you make a StatQuest about linear mixed models / random effects? I'm extremely confused about them, when to use them, and how to interpret the results...

@statquest 3 years ago

I'll keep that in mind.

@kautsarfadlyfirdaus1879 4 years ago

Thank you for the amazing video, as always. If you have time to spare, I want to ask about how to test the model with new data. If I understand correctly, we just need to plug the new data into the following equation: y = switch*mean_control + switch*mean_mutant. Edit: when I watched the video again, it seemed like the purpose is to find whether the difference between the means is significant or not. Am I correct?

@statquest 4 years ago

The purpose of the t-test is to determine if there is a significant difference between mice with the normal gene and mice with the mutant gene. However, we can also use the model to make predictions with new data. If my test tells me that there is a significant difference between normal and mutant mice, if you tell me you have a mutant mouse, I can tell you that the gene expression should be the mean of the mutant mice. If my test tells me that there is not a significant difference, then I will use the mean of all the mice, normal and mutant, as my prediction.
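The prediction rule described in this reply can be sketched as a small function. The function and argument names are hypothetical, and the numbers (2.2, 3.6, and an overall mean of 2.9) are illustrative values, not results from the video.

```python
def predict_expression(is_mutant, test_is_significant,
                       mean_control, mean_mutant, mean_overall):
    """Pick the prediction for a new mouse, following the rule in the reply above."""
    if not test_is_significant:
        # No significant difference: fall back to the mean of all the mice.
        return mean_overall
    # Significant difference: use the mean of the mouse's own group.
    return mean_mutant if is_mutant else mean_control


significant_case = predict_expression(True, True, 2.2, 3.6, 2.9)
not_significant_case = predict_expression(True, False, 2.2, 3.6, 2.9)
print(significant_case, not_significant_case)
```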

@kautsarfadlyfirdaus1879 4 years ago

@@statquest I see, now I understand better. Thank you, Mr. Josh.

@minakshimathpal8698 3 years ago

Hi Josh. I have a question. If I want to check the correlation between a categorical variable (with more than two classes) and a continuous variable, theory says to use ANOVA. But I fail to understand how testing whether the means are equal gives an idea about correlation. Pardon me if this question is out of the scope of this video. Please help.

@statquest 3 years ago

You can look at the R-squared value.

@minakshimathpal8698 3 years ago

@@statquest Hi Josh, I have a categorical target variable (three levels) and a continuous independent variable. Can I still use R-squared to check the association between the two? TIA

@statquest 3 years ago

@@minakshimathpal8698 yep.
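As a sketch of the R-squared suggestion, assuming the continuous variable is treated as the outcome and the categorical variable defines the groups (as in ANOVA): R-squared is the fraction of the variation around the overall mean that the group means explain. The data and group labels are hypothetical.

```python
# Hypothetical continuous measurements, grouped by a three-level category.
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [3.0, 4.0, 5.0],
    "c": [6.0, 7.0, 8.0],
}


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values, center):
    return sum((v - center) ** 2 for v in values)


everything = [v for values in groups.values() for v in values]

# Variation around the overall mean vs. variation around each group's own mean.
ss_mean = sum_of_squares(everything, mean(everything))
ss_fit = sum(sum_of_squares(values, mean(values)) for values in groups.values())

r_squared = (ss_mean - ss_fit) / ss_mean
print(r_squared)
```

An R-squared near 1 means that knowing the category tells you most of what there is to know about the continuous value; near 0 means the categories barely differ.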

@SergeySenigov 7 months ago

"Correlation" here means that you get significantly different mean outputs for different inputs. To show this, we state the null hypothesis "All means are equal" and try to reject it. If we fail to reject it, then our data lack enough evidence of correlation.