Design Matrices For Linear Models, Clearly Explained!!!

In order to use general linear models (GLMs) you need to create design matrices. At first, these can seem intimidating, but this StatQuest puts together a bunch of examples and illustrates them all so that they are clearly explained. The examples in this video are worked out in R in this video: • Design Matrix Examples...
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
YouTube Membership: / @statquest
...buy my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
#statquest #glm #statistics

Published:

Jan 6, 2019

Comments: 192

@statquest 2 years ago

Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

@macroxela 4 years ago

Statistics never appealed to me since it always seemed boring ... until I started watching your videos a few days ago. Now I'm hooked. Thanks for making statistics so fun and intuitive to learn!

@statquest 4 years ago

Awesome! I'm glad you are enjoying learning stats! It's a fun Quest! :)

@PunmasterSTP 3 months ago

Out of curiosity, what do you think of stats three years later?

@fanzhang3746 2 years ago

SQ is so addictive. A simple concept clarification youtube search led me down hours and hours of SQ contents. Thank you, thank you, thank you!

@statquest 2 years ago

Wow! Thank you!

@taotaotan5671 3 years ago

WOWW. I have watched this video at least 5 times and I always learn something new! I was confused by people saying "regress out the batch effect", but it's that simple!!! Thanks Josh.

@statquest 3 years ago

BAM!!!

@paulpaschert6215 5 years ago

"turning something on by letting it be" - some proper life advice there

@leylayim 3 years ago

thank you! I can't believe how clear you are explaining this, seriously thank you!

@statquest 3 years ago

Glad it was helpful!

@MsDontBlink 4 years ago

im so happy when i look for a topic and see that you've covered it.

@statquest 4 years ago

Bam! :)

@laurag.6122 4 years ago

I never get tired of watching your videos, I have learned a lot. This is my favorite channel :) Thanks!!!!!! Would you consider making a video on assessing the significance of mixed models? Please! This topic is complicated

@statquest 4 years ago

I hope to cover mixed models in the future.

@amoghbharadwaj9252 4 years ago

wow so helpful, this cleared my doubt of combining and interpreting categorical and continuous predictors. Thanks a ton:)

@statquest 4 years ago

Hooray!!! Thank you very much. :)

@summerxia7474 3 years ago

So clear!!! Thank you for clearing up my confusion in such a simple way!

@statquest 3 years ago

Glad it was helpful!

@jaychan3207 4 years ago

BAM!!! crystal clear explanation! Thanks!

@statquest 4 years ago

Glad it was helpful!

@rrrprogram8667 5 years ago

Awesome one josh.... Keep up the great work...

@statquest 5 years ago

Thank you!

@GregSteg 5 years ago

6:06 , having flashbacks to week one and two of Andrew Ng's ML Coursera course, but now it feels more intuitive!

@statquest 5 years ago

Wow! That's quite a compliment. Thank you. :)

@claradong4649 5 years ago

I love your video, really easy to understand

@statquest 5 years ago

Thank you! :)

@sudinroy7979 3 years ago

All the topics of statquest are well explained. Thank you sir for this nice statistics subject based channel Statquest.Good wishes and happy journey for this successful statquest youtube channel.

@statquest 3 years ago

Thank you!

@mesmaeili1 3 years ago

Great. Really clearly explained. Thanks.

@statquest 3 years ago

Glad you liked it!

@gabrielcournelle3055 4 years ago

Awesome video as usual. Thank you

@statquest 4 years ago

Thanks! :)

@kventinho 5 years ago

Statquest is getting bigger, watch out! hahaha i can't stop humming this

@statquest 5 years ago

Nice!!! :)

@dainegai 4 years ago

Great video (as usual)! You're definitely one of my favorite "thing-explainers" I've come across :D I was left with a question near the end though, with respect to "correcting for batch effects". After a quick online search, I see this is usually an issue and there are many packages that attempt to correct it. I could imagine two scenarios that lead to different explanations for the difference:
i) "We ran the exact same protocol in two different labs. However, the sensors were differently calibrated, so there is a bias in readouts." -> This suggests the batch-effect correction.
ii) "We ran the exact same protocol in two different labs. We ensured the sensors were equally calibrated, but there's *still* a bias in readouts." -> This could just be due to inherent variability in the sample, right? (It is probably not *too* likely for the data to be the same, just shifted down a bit. But it's possible!)
Questions:
1) This correction *assumes* that the difference in batches is *not* due to inherent variability in the features we're measuring (but is instead due to e.g. technician error), right? There would be no way to *prove* it one way or the other, would there?
2) If it's (ii), wouldn't "correcting for batch effects" throw out useful information about the response variable's distribution?
3) Ideally, hopefully both labs calibrated their sensors via e.g. blanks, so (i) shouldn't be immediately the reason. How would you suggest teasing out sensor bias (i) vs sample variability (ii)? Would we have to assume a model for the data and compare whether Lab A's two groups' parameters significantly differ from Lab B's? (Or maybe the "ideal" situation happens infrequently enough that going for (i) is usually not unreasonable?)
Thanks again! Will continue to Quest On :D

@statquest 4 years ago

If you are worried about whether or not "sensor bias" plays a big role in your measurements from two labs, you can always do technical replicates. In other words, have each lab do the experiment 3 different times. If Lab A is always higher than Lab B (or the other way around, or a t-test suggests that the results are significantly different), then you can be pretty confident you have a batch effect due to sensor bias or the technician or something like that.

@yulinliu850 5 years ago

Thanks a lot!

@statquest 5 years ago

You're welcome! :)

@dbarkan1 5 years ago

In the last part where you combine the linear regression and the t-test, you have a regression line for each category, but the slopes of the lines are identical. Isn't this rare? How would the equation change if you had two lines with different slopes?

@quanxu1 3 years ago

my guess is: use one parameter for each slope. For the first parameter, load control group's weights as is but keep mutant group's weights at 0. For the other parameter, do the opposite

@abdoualgerian5396 1 year ago

i think he didn't wanna make it more sophisticated than it seems to be, looks like there is more to it than his simple explanation, maybe in another quest he will be talking about it. ps: this reply is 4 years after your comment, and by now he might have done it, im watching the videos one by one

@Samurai_Jack__ 10 months ago

@@abdoualgerian5396 have u found something like this by now

@parthbhardwaj8435 5 years ago

do a statquest for wald's test , chi- squared test and fisher's exact test please!!

@WalyB01 3 years ago

First you would need random variables for the walds

@gren287 5 years ago

I love u, its out now, but i need a longer intro. :)

@statquest 5 years ago

This one is way too short! :)

@yaozhang8368 4 years ago

Hi Josh, I really like your videos that clearly explain lots of things! In the last example of this video, is it similar with using a mixed model? Intuitively, it seems the lab was treated as a random variable and mutant was treated as a fixed variable, and here we are interested in the difference between mutant and control after removing the impact of lab A/B.

@statquest 4 years ago

You are correct! You could definitely analyze this data with a mixed model.

@alonsomartinez9588 1 year ago

It would be nice to see a video on how matrices and matrix multiplication in neural networks transform data, and edit the dimensionality of the inputs. Transforming data makes a lot of sense spatially, but what does changing dimensions do? What are the different ways in which you can interpret matrix notation? Talk about the special relationships once you organize the data in that particular format

@statquest 1 year ago

I'll keep that in mind.

@PunmasterSTP 3 months ago

Design matrices? More like "Dang good videos are these!"

@statquest 3 months ago

Ha! BAM! :)

@CompBioQuest 2 years ago

it would be great one video about interaction terms! and how to use for deconvolution of cell types. :-)

@statquest 2 years ago

I'd like to do that one day.

@gabrielpadilha8638 2 years ago

Do a statsquest on the F distribution, please

@statquest 2 years ago

I talk about the F-distribution in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@saileshpatra2488 4 years ago

I tried a lot but was unable to digest all the concepts. Thanks for all the detailed explanation. Btw, can we expect another video on the design matrix and its real use with a slightly simpler explanation!!! if possible

@statquest 4 years ago

Did you start from the very start of this series (this video is part 3), with linear regression? Earlier videos in this series cover simpler design matrices. Here's the whole playlist, in correct order: ru-vid.com/group/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU

@karannchew2534 2 years ago

7:46 Compare mean model vs type-only model, p>0.05 12:17 Compare size-weight-type model vs type-only model: p=0.0025

@statquest 2 years ago

bam! :)

@kt9509 2 years ago

pls do some videos on recommendation systems (like collaborative filtering) and distributed learning (like mapreduce)!!

@statquest 2 years ago

I'll keep those topics in mind.

@BeefLoverMan 3 years ago

I'd love to see you do a central composite design matrix for response surface modelling at some point! I seem to understand the topic enough to implement it in Python, but I'm still struggling to put it into a more general context. The final ANOVA output people usually show is the most confusing. I get that you do an ANOVA for each single term in the model (so each term is a "group"), but then there's always an extra "residual" ANOVA and I can't figure out what that is calculated on.

@statquest 3 years ago

I'll keep that in mind.

@ismailel-shimy7431 1 year ago

Thank you for your wonderful videos which I became addicted to recently :D I found this one particularly useful for me to understand the concept of design matrices and how one can use them not only to turn on/off certain terms in equations for categorical variables, but also scale terms for continuous variables. Now, I have a question regarding the batch effect example you kindly provided. You assume that the difference between mutant and control mice in lab A is the same as in lab B and one can represent this as the average difference of the 2 labs. You also mentioned in the comments that if we had more measurements from one lab than the other, we can use a weighted average of the differences in the 2 labs. What if the mutant-control difference in lab A really differs from that in lab B. You can already see that in the dot plot. Would it make sense to add an offset term for the difference too as you did for the lab A control mean? In this case the equation should be Y = lab A control mean + lab B offset for the mean + lab A difference + lab B offset for the difference ?

@statquest 1 year ago

I believe so.

@dreama1375 2 years ago

Thank you very much for your video! Can you please tell me, what you explained (when combined t-test and regression) is also called mixed modelling or multilevel modelling?

@statquest 2 years ago

I believe mixed models are used when you do not have enough data to create a proper design matrix like this.

@dandyyu8561 4 years ago

Thank you very much! I really learned a lot from your channel. I have a question, at 13:25, the second term Lab B offset. Is the Lab B offset = lab B control mean - lab A control mean?

@statquest 4 years ago

Yes.

@dandyyu8561 4 years ago

@@statquest Thank you so much! By the way, can I treat the last example as a simple linear mixed model?

@statquest 4 years ago

No, mixed models are different. My understanding is that in this case, we are measuring distinct labs and not trying to generalize to other labs, and this constitutes a fixed effect. There is no random effect, so there is no "mixture" of effects.

@briankirk962 4 years ago

First of all props for your excellent series of videos. First rate introduction to some really hard stuff. An answer to your question at 2:46 is that the problem in general implies two distinct linear equations, but algorithms for linear models (eg the lm() function in R) only allow for one general linear equation. So yes, you can solve it by hand using two separate linear equations, but the algorithm won't let you enter the problem in that format. So how do you get around this problem of having only one linear equation to work with when you have two linear equations in reality? Ans: Break up the linear model by using dummy variables in a thoughtful manner, or let the algorithm do it for you but check that it's not messing with you. Either way you've got to know what's going on, and here's one explanation....
For the mutant/control example we have the following linear model. (Note: i should be read as a subscript for the ith term and e is the error term. So ei is NOT some madness in the complex plane but simply the ith error term. If ei throws you, just treat it as a symbol related to irreducible error (the noise that is always around). That's all it is.)
yi = B0 + B1 xi + ei (eq 1)
(pretty much y = b + mx with e as some reality thrown in) where B1 is the slope of the line with xi as its associated input values (eg the labels mutant and control, but as values in this example) and B0 is the y-intercept. As you can see, there are no input values associated with B0, so we can not directly associate input values to B0 through the first column of the design matrix. This explains why the first column of the design matrix is fixed to all ones. This is essentially saying that B0 exists for all i and it's up to the gods of regression to determine what B0 becomes. All is not lost though. Nothing says we can't monkey with the linear model (eq 1) through its variable xi in a creative way that ends up associating B0 with a label. And that's what we're going to do.
But first we need to deal with the issue that our labels are not numbers, and this creates an opening for some linear equation monkey business without defying the gods. Since our equation won't work on labels, we need to assign numerical values (dummy variables), and by selecting the appropriate dummy variables for our labels we can separate the general equation into two separate equations, each of which corresponds uniquely to one label. Word of caution though: how we select our dummy variables determines how our labels get assigned to the separate equations, so it's not an arbitrary choice. So let's try the following:
xi = 1 if i is a mutant
xi = 0 if i is a control
(0 and 1 are the dummy variables and here we are assigning actual numerical values to xi. These are the values assigned in the second column of the design matrix) in which case
yi = B0 + B1 xi + ei (eq 1)
becomes
yi = B0 + B1 + ei if i is a mutant (eq 2)
yi = B0 + ei if i is a control (eq 3)
(Notice that there is no longer any separate xi term in eqs 2 & 3 since xi has been assigned dummy variable values) and this allows us to interpret our controls relative to B0, whereas our mutants correspond to B0 + B1. Pretty slick and no lightning bolts from above. In this case B0 is the mean for the controls (the intercept in the summary report), whereas B0 + B1 is the mean for the mutants. It's important to note that B1 (what is returned second in the summary report) is the mean difference between mutants and controls (ie mean of mutants - mean of controls). If the p-value for B1 is significant, that means adding the difference of mutants - controls to our model is significant with respect to the controls alone (ie mutants are different relative to the controls, which is what we're interested in).
Now if we switched our dummy variables, xi=1 for controls and xi=0 for mutants, then B0 would be the mean for mutants; B0 + B1 is the mean for the controls; and B1 is the mean of controls - mean of mutants (ie got reversed). If the p-value for B1 is significant here, that means adding the difference of controls - mutants to our model is significant with respect to MUTANTS alone (ie controls are different relative to mutants, so equivalent to what we want but kinda upside down and weird). To get totally weird we could assign xi=1 for mutants and xi=-1 for controls; then B0 would be the midpoint of the mutant and control means (the overall average when the two groups are the same size).
Bottom line: How we set our dummy variables determines how we can interpret B0 (as well as B0+B1 and B1) and is a slick trick that allows us to separate out from our linear model two linear equations that uniquely correspond to our labels. An Introduction to Statistical Learning by Gareth James et al. gives some nice examples of this on pg 84 at this level of math. Available for free on-line and also provides details on how to assess the quality of your model, which is critical. And if you've gotten this far....Hey Josh, how about some banjo??? Some Ola Belle Reed would fit nicely here.....I've endured, I've endured, how long can one endure!!!!
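
The dummy-variable point in the comment above can be checked numerically. Below is a minimal sketch in plain Python, not anything from the video: the measurements are invented, and `fit_line` is a hypothetical helper that applies the usual closed-form least-squares formulas for one predictor. In practice something like R's lm() would do the fitting.

```python
from statistics import mean

# Hypothetical measurements, invented for illustration only.
control = [2.0, 3.0, 2.5, 3.5]
mutant = [4.0, 5.0, 4.5, 5.5]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


y = control + mutant

# Coding 1: x = 0 for controls, x = 1 for mutants.
b0_01, b1_01 = fit_line([0] * len(control) + [1] * len(mutant), y)

# Coding 2: x = -1 for controls, x = +1 for mutants.
b0_pm, b1_pm = fit_line([-1] * len(control) + [1] * len(mutant), y)

print(b0_01, b1_01)  # control mean, then mutant mean minus control mean
print(b0_pm, b1_pm)  # midpoint of the two group means, then half the difference
```

With the 0/1 coding the intercept lands on the control mean and the second coefficient on the mutant-minus-control difference; with the -1/+1 coding the intercept moves to the midpoint of the two group means. The fitted values are the same either way; only the interpretation of B0 and B1 changes.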

@statquest 4 years ago

Wow! You get a prize for longest comment ever. You even have equation numbers. Very nice! :)

@briankirk962 4 years ago

@@statquest Curious how the mathematics is short and concise whereas the exegesis on equations delves into the land of Proust. Perhaps there is something to this mathematics thing...

@Harshavardhan-bu2tp 3 years ago

in the last example for batch effect, Did you suppose that difference(mutant-control) is same for both the labs?

@statquest 3 years ago

Yes

@sharan9993 11 months ago

2:40 It might be because, the standard needs only one bit to represent both values, since only change is 2nd bit. We can just ignore the 1st bit while storing thus reducing the size. Just a speculation.

@statquest 11 months ago

Perhaps

@AnimeshSharma1977 5 years ago

Awesome, wondering how you deal with missing values in such cases?

@statquest 5 years ago

That's a great question. I think you have to use some method to impute the missing values.

@AnimeshSharma1977 5 years ago

@@statquest will the random forest imputation method you suggested work here :)

@statquest 5 years ago

@@AnimeshSharma1977 Yes it would. But, depending on what you're modeling, there might be some specialized method that may work better. I'd look around and if I didn't find anything, try out the Random Forest method. I love how flexible it is.

@alexandergarcia6479 4 years ago

maybe moving average?

@RadomName3457 2 years ago

Hi Josh, could I ask which distribution table u looked at or based on to get the pvalue for the F I calculated?

@statquest 2 years ago

See: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Hrr2anyK_5s.html

@angelamilton5134 2 years ago

Please do you have any video on how to build the A matrix from a stochastic model?

@statquest 2 years ago

Not yet!

@gianmarcolevantino1239 24 days ago

hi! can you explain briefly how to obtain p-values from F? i'm currently preparing "advanced statistics for business" in the management engineering course at the university of Palermo and i'm really enjoying your videos! thanks a lot!

@statquest 24 days ago

I give the concepts in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@munaalhammadi4237 3 years ago

Thank you for this clear explanation. I just have a question at 8:26: if the lines had different slopes, what would the design matrix look like in that case?

@statquest 3 years ago

If the lines have different slopes, then you have something called an "interaction". This means the mutation has a different effect on different weights. So we would add an "interaction term" to help compensate for this. Interactions are a whole StatQuest for themselves and one day I'll make it.

@channel_panel193 3 years ago

@@statquest oooo +1 for a StatQuest on interaction terms plz!

@statquest 3 years ago

@@channel_panel193 If all goes according to plan this month, I should have that video out soon.

@MR-yi9us 2 years ago

@@statquest yes! +1 for a StatQuest on interaction terms too!

@Russet_Mantle 3 years ago

13:32 About the term for difference(mutant - control), is that an average of Lab A's difference(mutant - control) and Lab B's difference(mutant - control)?

@statquest 3 years ago

In this case, yes. If the data are unbalanced (i.e. we have more measurements from lab A than lab B), it might be the weighted average.
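
A small numerical check of the "average of the two labs' differences" answer, as a sketch rather than the video's own calculation. Everything here is an assumption made for illustration: the measurements are invented and balanced (3 mice per lab and group), the design matrix columns are taken to be [1, lab B, mutant], and `least_squares` is a tiny hypothetical helper that solves the normal equations so the example needs no libraries.

```python
from statistics import mean


def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(p):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]


# (lab_B, mutant, measurement) per mouse; hypothetical, balanced data.
mice = [
    (0, 0, 10.0), (0, 0, 11.0), (0, 0, 12.0),  # lab A, control
    (0, 1, 14.0), (0, 1, 15.0), (0, 1, 16.0),  # lab A, mutant
    (1, 0, 20.0), (1, 0, 21.0), (1, 0, 22.0),  # lab B, control
    (1, 1, 26.0), (1, 1, 27.0), (1, 1, 28.0),  # lab B, mutant
]

# Design matrix: a column of 1s, a lab B on/off column, a mutant on/off column.
X = [[1, lab_b, mut] for lab_b, mut, _ in mice]
y = [value for _, _, value in mice]
lab_a_control, lab_b_offset, difference = least_squares(X, y)


def cell_mean(lab_b, mut):
    """Mean measurement for one lab and group combination."""
    return mean([v for lb, m, v in mice if lb == lab_b and m == mut])


diff_a = cell_mean(0, 1) - cell_mean(0, 0)  # mutant - control in lab A
diff_b = cell_mean(1, 1) - cell_mean(1, 0)  # mutant - control in lab B

print(difference, (diff_a + diff_b) / 2)
```

In this balanced toy data the per-lab differences are 4 and 6, and the fitted difference(mutant - control) comes out as their plain average. The lab B offset is pooled in the same way across both groups. With unequal group sizes the fitted values would no longer be simple averages.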

@Russet_Mantle 3 years ago

@@statquest Got it. Thanks a bunch!

@PeihuiBrandonYeo 5 years ago

I am here for the singing intro

@statquest 5 years ago

Hooray! :)

@nicholaskiulia9649 2 years ago

Could you also kindly explain dirichlet regression and also the segmented regression

@statquest 2 years ago

I'll keep that in mind.

@HunterDriguez 4 years ago

Awesome video! In the Control vs Mutant scatterplot @6:16, is each individual data point meant to represent the expression of a single gene within a control or mutant replicate (4 reps per group)? I'm just trying to make sense out of a design matrix that I got from my RNA-seq data with thousands of genes. I wonder if the SSmean for each sample is being calculated using the mean expression of all 18,000+ genes....I'm wondering the same for the SSfit...sigh.

@statquest 4 years ago

If you're doing RNA-seq, then you are probably using edgeR or DESeq2. Both of those methods use just the genes with similar expression to get a sense of SSmean and SSfit.

@HunterDriguez 4 years ago

@@statquest thanks! I will go read more about what edgeR does.

@deletedacc27834 7 days ago

Hey Josh! Thanks for this video. For the first example, I was curious about the equation y = control intercept + mutant offset + slope. Suppose our slopes for our two lines were different. Would the equation become: y = control intercept + mutant intercept offset + control slope + mutant slope offset? Thanks!

@statquest 6 days ago

If the slopes are different, then you have something called an "interaction" between the classes (control vs mutant) and the things we are measuring. Interactions have to be dealt with in a special way and would require an entire video to explain. In the meantime, check out this page: developer.nvidia.com/blog/a-comprehensive-guide-to-interaction-terms-in-linear-regression

@this-is-bioman 1 year ago

You actually need to watch these two parts backwards as the purpose of the t-test becomes clear at the end of this video. I wish you showed us first why we do that and then explained how.

@statquest 1 year ago

Noted

@janakiramanbalachandran504 4 years ago

Excellent video. However, I had a few questions. How to design a design matrix for regression+control group when the slopes are not the same for the two groups. Similarly for comparing Lab A and Lab B measurements, which difference in the mean values (mutant-control) should be taken, since there are two values from the two labs.

@statquest 4 years ago

When you have 2 different slopes, then you need to add something called an "interaction term". That's just another column in the design matrix. And when you want to compare Lab A and Lab B, you just pick one to be the "base" and the other one will be the difference from the base.

@janakiramanbalachandran504 4 years ago

@@statquest Thank you again. The interaction terms seems like a nice way to handle 2 different slopes

@krisc9211 4 years ago

@@statquest When the 2 slopes are different, you need this interaction term. But I'm wondering how the mutant offset is defined? When the 2 slopes are the same, the mutant offset is the same everywhere. When the 2 slopes are different, then the offset will vary. So where do you define the mutant offset?
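
One way to see where the mutant offset lives once the slopes differ is to write the interaction out. The sketch below is an illustration under stated assumptions, not the video's method: the data are invented, the interaction is assumed to be coded as one extra design matrix column equal to mutant x weight, and the coefficients are obtained from separate per-group line fits (which give the same fitted lines as the full interaction model, since each group gets its own intercept and slope). `fit_line`, `design_row`, `predict` and `gap` are hypothetical names.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: mouse weight (x) and mouse size (y) per group.
weight = [1.0, 2.0, 3.0, 4.0]
control_size = [1.0, 2.0, 2.5, 3.5]
mutant_size = [3.0, 5.0, 6.5, 8.5]

c_int, c_slope = fit_line(weight, control_size)
m_int, m_slope = fit_line(weight, mutant_size)

# Columns: [1, mutant, weight, mutant * weight].
# Coefficients: control intercept, mutant offset, control slope, slope offset.
coefficients = [c_int, m_int - c_int, c_slope, m_slope - c_slope]


def design_row(is_mutant, w):
    """One row of the design matrix for a mouse of the given group and weight."""
    return [1, is_mutant, w, is_mutant * w]


def predict(is_mutant, w):
    """Predicted size: the design matrix row times the coefficients."""
    return sum(b * v for b, v in zip(coefficients, design_row(is_mutant, w)))


def gap(w):
    """Mutant prediction minus control prediction at weight w."""
    return predict(1, w) - predict(0, w)


print(coefficients)
print(gap(0.0), gap(3.0))
```

Under this coding the mutant offset is the gap between the two lines at weight = 0, and the slope offset says how much that gap grows per unit of weight, so the gap at any other weight is offset + slope offset x weight. An equivalent coding uses two slope columns, weight for controls only and weight for mutants only, which estimates the two slopes directly instead of a slope and an offset.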

@paulap.8132 3 years ago

TRIPLE BAMMMMM!

@statquest 3 years ago

@manouheart4906 3 years ago

Hi, thx for all the hard work. Could you explain what the difference(mutant-control) at 13:36 is?

@statquest 3 years ago

The difference(mutant-control) is the difference between the mutant and control groups.

@manouheart4906 3 years ago

@@statquest So, is that equal to control_mean(LAB_A)-mutant_mean(LAB_A)+control_mean(LAB_B)-mutant_mean(LAB_B)?

@statquest 3 years ago

@@manouheart4906 Off the top of my head I can't remember exactly how it is calculated, but I suspect it is some sort of weighted average of the differences between control and mutant in labs A and B.

@patelprateekramesh2442 5 years ago

What if the slopes are different? Do we consider two separate terms for each slope? (and similarly for more terms when there are more variables)

@statquest 5 years ago

If the slopes are significantly different, then there is an "interaction". This means the drug, or whatever it is you are comparing, has a different effect on the different groups. In this case, you add an "interaction term"... and that's the subject of another video.

@jasperkirton6848 4 years ago

@@statquest Thanks for this! Do you have this other video or can point me in the right direction?

@pendantdrop3710 4 years ago

How do you interpret r-square if you have those 2 regression lines? about what line does r-square speak?

@statquest 4 years ago

By default, the r-squared value compares the residuals around the full model (in this case, that's the two lines) to the residuals around a single, horizontal line that is at the height of the average y-axis value.
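
That comparison can be written out directly as R^2 = (SS(mean) - SS(fit)) / SS(mean). The sketch below is only an illustration with invented data. It assumes the two-line model shares one slope, and it uses the closed-form fit for that case (pooled within-group slope, one intercept per group) instead of a general solver. `centered_sums` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical data, invented for illustration: mouse weight (x) and size (y).
weight = [1.0, 2.0, 3.0, 4.0]
control_size = [1.0, 2.0, 2.5, 3.5]
mutant_size = [2.5, 3.0, 4.5, 5.0]
groups = [(weight, control_size), (weight, mutant_size)]


def centered_sums(x, y):
    """Return (Sxy, Sxx) computed around the group's own means."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy, sxx


# Shared slope: pooled within-group Sxy divided by pooled within-group Sxx.
sums = [centered_sums(x, y) for x, y in groups]
slope = sum(sxy for sxy, _ in sums) / sum(sxx for _, sxx in sums)

# Each group gets its own intercept so its line passes through its own means.
intercepts = [mean(y) - slope * mean(x) for x, y in groups]

# SS(fit): squared residuals around the two parallel lines.
ss_fit = sum(
    (yi - (a + slope * xi)) ** 2
    for a, (x, y) in zip(intercepts, groups)
    for xi, yi in zip(x, y)
)

# SS(mean): squared residuals around one horizontal line at the overall mean.
all_sizes = control_size + mutant_size
grand_mean = mean(all_sizes)
ss_mean = sum((yi - grand_mean) ** 2 for yi in all_sizes)

r_squared = (ss_mean - ss_fit) / ss_mean
print(round(r_squared, 4))
```

So the r-squared does not belong to either line on its own; it describes how much of the variation around the overall mean the pair of lines explains together.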

@somalkant6452 4 years ago

Hi josh, I'm highly indebted to you for all of your awesome videos. I have one doubt which is making me restless. please if you could help. suppose i have only one independent variable which is categorical (control(0)/mutant(1)) and one dependent variable (DV), which is continuous, so i will get the graph as shown at 0:53 in this video. There is no other independent variable (IV) to make the graph look like the one at 8:29. How to check "linearity" between IV and DV, as it is an important assumption of linear regression? We cannot draw a line with some slope and intercept in this case (0:53), or does this "linearity" assumption not need to be checked? and other linear regression assumptions such as "normality of residuals" and "homoscedasticity", do they also not need to be ensured in my example? please help. Thanks

@statquest 4 years ago

When we talk about "linearity" with respect to "linear models" like these, the only thing that has to be linear is how the coefficients enter the model. In this case, the dependent variable is a linear function of the independent variable because we are just multiplying the independent variable by a coefficient (and adding an intercept). As for the other assumptions of linear regression, like "normality of the residuals" - we still have residuals in this case (see 1:57 ), so we can still check if the residuals themselves are normal.

@alecryan8733 2 years ago

Ahhhhh the relationship between design matrices and dummy encoding just clicked for me. The less common case where the the design matrix is all 0s in the first vector in the matrix is the same as one-hot encoding right?

@statquest 2 years ago

I'm not sure. I can't imagine why a design matrix would ever have a column with just 0s. That just sets a parameter to 0.

@yenhoeooi9 2 years ago

For the mouse weight/mouse size/mutant example, if the two slopes are different, does that mean my new equation can be: y= control intercept+ intercept offset+ control slope+ slope offset And now my new Pfancy= 4, is that true?

@statquest 2 years ago

In linear regression terminology, if the slopes are different we call it an "interaction" and the new equation has what we call an "interaction term", but the idea is the same - it compensates for the differences in slopes.

@RandomGuy-hi2jm 4 years ago

can we use the Pythagorean theorem to find coordinates on the line

@statquest 4 years ago

Probably, but it's easier to just plug in the x-axis values into the equation.

@raghavgaur8901 4 years ago

Hi Josh,Actually I wanted to ask you how to decide a null hypothesis for any case .As I understood the concept of p value but I didn't understand how to decide a null hypothesis for any given case.

@statquest 4 years ago

The typical null hypothesis is that there is no difference between two things. If we reject that hypothesis, then the data suggest that there is a difference.

@raghavgaur8901 4 years ago

@@statquest thanks for answering sir

@rizkykiky7721 2 years ago

how do you count the p-value from that F-value? I was a bit lost in there

@statquest 2 years ago

See: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@relatively_random4903 2 years ago

I'd like to know more about the number F he keeps calculating. Is there a Wikipedia article about it, as a starting point? Or a name commonly used for it?

@statquest 2 years ago

See: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@Han-ve8uh 3 years ago

1. The idea I get from this video is we can use any design matrix we want to create a test between any complex vs simpler model and interpret the significance of their difference in equation terms from the p-value right? 2. I have trouble relating the conclusion at 11:08 (p-value small--> fancy better than simple mean model) with linear regression and what the p-value here means. Why in linear regression there is a p-value for every coefficient (so a whole linear regression has multiple p-values) but here there is only a single p-value?

@statquest 3 years ago

I answer your question about what all of the p-values are for when doing a relatively fancy linear regression in the follow up video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Hrr2anyK_5s.html

@Han-ve8uh 3 years ago

@@statquest Thanks for the heads up, I didn't know that video existed, I watched and it clearly explained the different p-values. Something left unexplained was what does the p-value of the intercept mean? Is that a comparison of control group mice with a line that must pass through the origin, vs mutant mice with mutant offset amount above that? I think someone else here asked this too, why is the slope for both control/mutant the same? Could it have been modelled as 2 different slopes, something like 2 new columns in the design matrix, slopecontrol 11110000 and slopemutant 00001111, to replace the single slope column. Does this work? If these p-values make sense, would I be able to infer anything about the single slope used in this video from the results of those 2 type-dependent slopes?

@statquest 3 years ago

@@Han-ve8uh The answer to your questions about the design matrix are in this video starting at: 0:33 (if you want to see the worked out example, see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-NF5_btOaCig.html ). The answer to your question about the p-value for the intercept - this just tells us if the intercept value is significantly different from 0. Generally speaking, we are not interested in this (one way or the other) since we are more interested in comparing the two groups.

@shamshersingh9680 4 months ago

Hi Josh, the 2 lab example is a bit confusing. 1. First of all, when we say difference between control and mutant means, do we mean (lab A control mean - lab A mutant mean) and (lab B control mean - lab B mutant mean)? 2. Secondly, what is the lab B offset? Is it (lab A control mean - lab B control mean)? 3. How does lab B mutant equal (lab A control mean + lab B offset + difference)? As per the figure, lab B mutant should equal lab A control mean + the difference between lab A control mean and lab B mutant mean. Why do we have the lab B offset in this equation?

@statquest 4 months ago

1) In this case, we average the control-versus-mutant difference in lab A with the control-versus-mutant difference in lab B. Thus, we have a single difference that we use for both lab A and lab B. 2) Yes. 3) We do it this way because we have a single value that represents the difference between the mutants and the controls, regardless of the lab. NOTE: We use a single difference because we assume (hypothesize) that the effect of the mutation is the same regardless of the lab, and that the only differences come from the labs themselves, possibly due to different measurement techniques.
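
A minimal Python sketch of the single shared difference described in this reply. The measurements are hypothetical (two mice per lab and genotype), and `least_squares` is a small illustrative helper, not something from the video. With equal numbers of mice in every group, the fitted difference works out to the average of the two within-lab differences.

```python
def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical data: two mice per (lab, genotype) group, so the design is balanced.
data = [
    # (lab, genotype, measurement)
    ("A", "control", 9.0), ("A", "control", 11.0),
    ("A", "mutant", 12.0), ("A", "mutant", 14.0),
    ("B", "control", 11.0), ("B", "control", 13.0),
    ("B", "mutant", 16.0), ("B", "mutant", 18.0),
]

# Design matrix columns: lab A control mean, lab B offset, difference.
X = [[1.0, float(lab == "B"), float(genotype == "mutant")] for lab, genotype, _ in data]
y = [value for _, _, value in data]

lab_a_control, lab_b_offset, difference = least_squares(X, y)


def group_mean(lab, genotype):
    values = [v for l, g, v in data if l == lab and g == genotype]
    return sum(values) / len(values)


# Mutant-minus-control difference computed separately within each lab.
diff_in_a = group_mean("A", "mutant") - group_mean("A", "control")
diff_in_b = group_mean("B", "mutant") - group_mean("B", "control")

print("fitted difference:", difference)
print("average of within-lab differences:", (diff_in_a + diff_in_b) / 2)
```

In this made-up data the within-lab differences are not identical, yet the model still reports one value for both labs, which is the assumption the reply describes.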

@amirwagih4797 3 years ago

10:33 Why are the degrees of freedom of the fancy model n-3? Since we have two lines, shouldn't we have n-2*2 = n-4 degrees of freedom, because each line can pass through any two points?

@statquest 3 years ago

Since both lines share the exact same slope, we only need to estimate one parameter for it instead of two (one for each line).
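
A small Python sketch of the parameter counting in this reply, using hypothetical weights. It compares the shared-slope design (3 columns) with a separate-slope design like the one proposed earlier in the thread (4 columns). The helper `residual_df` is illustrative and assumes the columns are linearly independent.

```python
# Hypothetical data: 4 control mice followed by 4 mutant mice.
weights = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5]
is_mutant = [0, 0, 0, 0, 1, 1, 1, 1]
n = len(weights)

# Fancy model with one shared slope.
# Columns: control intercept, mutant offset, slope.
shared_slope = [[1, m, w] for m, w in zip(is_mutant, weights)]

# Alternative with a separate slope per group.
# Columns: control intercept, mutant offset, control slope, mutant slope.
separate_slopes = [[1, m, w * (1 - m), w * m] for m, w in zip(is_mutant, weights)]


def residual_df(design):
    # Rows minus columns; assumes the columns are linearly independent.
    return len(design) - len(design[0])


print(residual_df(shared_slope))     # n - 3
print(residual_df(separate_slopes))  # n - 4
```

Each estimated parameter is one column in the design matrix, so sharing the slope removes a column and leaves one more degree of freedom for the residuals.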

@amirwagih4797 3 years ago

@@statquest Thanks a lot Josh, I get it now. I always struggled with degrees of freedom and I would really love to see a StatQuest about that. Thanks again for the amazing content you produce!

@DeepakSah3.0 4 years ago

How did you calculate the p-value?

@statquest 4 years ago

With an F-distribution. The concepts are explained in Linear Models Part 1: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html
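
As a rough illustration of the F-distribution approach mentioned in this reply, here is a Python sketch of the standard F statistic for comparing a simple model with a fancier one. The sums of squares and sample size are hypothetical, not the values from the video, and `f_statistic` is an illustrative helper name.

```python
def f_statistic(ss_simple, ss_fancy, p_simple, p_fancy, n):
    """F = [(SS_simple - SS_fancy) / (p_fancy - p_simple)] / [SS_fancy / (n - p_fancy)]."""
    # Variation explained by the extra parameters, per extra parameter.
    explained_per_param = (ss_simple - ss_fancy) / (p_fancy - p_simple)
    # Variation left over after the fancy fit, per residual degree of freedom.
    unexplained_per_df = ss_fancy / (n - p_fancy)
    return explained_per_param / unexplained_per_df


# Hypothetical numbers, for illustration only.
n = 8
ss_simple = 30.0  # sum of squared residuals around the overall mean (1 parameter)
ss_fancy = 6.0    # sum of squared residuals around the fancy fit (3 parameters)

f_value = f_statistic(ss_simple, ss_fancy, p_simple=1, p_fancy=3, n=n)
df_numerator = 3 - 1
df_denominator = n - 3
print(f_value, df_numerator, df_denominator)
```

The p-value is then the upper-tail probability of an F-distribution with these two degrees of freedom. If SciPy is installed, `scipy.stats.f.sf(f_value, df_numerator, df_denominator)` is one way to compute it.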

@aelinsardothien8926 11 months ago

oh how i would love to be a mutant mice : tall and skinny

@statquest 11 months ago

@sudinroy7979 3 years ago

Are there any applications of design matrices other than regression analysis?

@statquest 3 years ago

Design matrices can be used for all general linear models (which include linear regression, but also ANOVA and many other more complicated models) as well as all generalized linear models (which include logistic regression and many other more complicated models).

@rohitrajora9832 3 years ago

Hi Josh, I was able to understand everything prior to the design matrices topic. Could you please suggest how I could improve my understanding? Also, if it's feasible, could you please make a "Design matrices in Python" video? That would really help.

@rohitrajora9832 3 years ago

PS: if you could also make "in Python" videos for the topics you have implemented in R before

@rohitrajora9832 3 years ago

that would be great

@statquest 3 years ago

What specific time point, minutes and seconds, is confusing?

@rohitrajora9832 3 years ago

I'm new to ML and started watching your machine learning playlist and got stuck on decision trees. Here are my overall doubts.
From the "GLMs part 2" video:
7:28 (and 10:36) - Why do we calculate the residuals using the 2 mean lines when we just fit a single line { y = mean(control) + mean(mutant) } to the data using the design matrix?
From the "design matrices" video:
8:31 - What do we do when the slopes are different?
8:23 (and 13:59) - How do we get these y equations? I'm not able to arrive at them intuitively on my own.
9:56 - Is this how we always compute the residuals, (data - line)^2? That is, to calculate the points on the line, do we always use the design matrix (and its corresponding equation)?
13:59 - What is the lab B offset? What is diff(mutant - control)? Which labs do the mutant mean and the control mean correspond to?

@statquest 3 years ago

@@rohitrajora9832 From GLM 2, we have two lines because of how the design matrix works. In this case, to estimate the mean of the control subjects, we multiply "mean_control" by 1 and "mean_mutant" by 0, giving us just the mean for the control. To estimate the mean of the mutants, we multiply mean_control by 0 and mean_mutant by 1. This is illustrated here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-NF5_btOaCig.html
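
The multiply-by-1-or-0 step in this reply can be sketched in a few lines of Python. The measurements are hypothetical, and the names `mean_control` and `mean_mutant` simply mirror the reply.

```python
# Hypothetical measurements: 4 control mice followed by 4 mutant mice.
control = [2.0, 3.0, 2.5, 2.5]
mutant = [5.0, 6.0, 5.5, 5.5]

# One row per mouse; columns are (control indicator, mutant indicator).
design = [[1, 0]] * len(control) + [[0, 1]] * len(mutant)

# For this design, the least-squares parameters are the two group means.
mean_control = sum(control) / len(control)
mean_mutant = sum(mutant) / len(mutant)
params = [mean_control, mean_mutant]

# Each prediction is (row . params), so the 0s and 1s switch a mean on or off.
predictions = [sum(x * p for x, p in zip(row, params)) for row in design]

# Residuals are measured against whichever mean the row selects.
observed = control + mutant
ss_fit = sum((obs - pred) ** 2 for obs, pred in zip(observed, predictions))

print(predictions)
print(ss_fit)
```

One equation with two terms therefore behaves like two horizontal lines: each mouse's row picks exactly one of the means.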

@diamagneetik 4 years ago

Sorry! But how did you get p-value = 0.003 (at 11:06)? From a table?

@statquest 4 years ago

I talk about that in Linear Models Part 1 (this is part 3!): ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nk2CQITm_eo.html

@karannchew2534 2 years ago

13:31 y = labA control mean + labB offset + difference. Shouldn't there be two "difference" terms, i.e. y = labA control mean + labB offset + difference.labA + difference.labB?

@statquest 2 years ago

@dansolpa 1 year ago

Hi guys! Hope you are having a beautiful day! I have a question: in the last example, where there are 2 different labs, is the difference between mutant and control calculated as (Lab A mutant mean - Lab A control mean), or as (Lab A mutant mean + Lab B mutant mean) - (Lab A control mean + Lab B control mean)? Hope anyone can help me. Thanks!!

@dansolpa 1 year ago

Maybe a 4th parameter is necessary, changing the 3rd parameter to be the difference (Lab A mutant mean) - (Lab A control mean) and making the 4th parameter the difference (Lab B mutant mean) - (Lab B control mean)?

@statquest 1 year ago

The trick is to see how the 0's and 1's affect the equation. For a control measured in lab A, we have:
1*lab A control mean + 0*lab B offset + 0*difference = lab A control mean
Now, for a mutant measured in lab A:
1*lab A control mean + 0*lab B offset + 1*difference = lab A control mean + difference
So the mutant value in lab A is the lab A control mean plus the difference between mutants and controls. Similarly, we can measure the difference between mutant and control in lab B by including the lab B offset.
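
The same bookkeeping for all four groups can be written as a short Python sketch. The parameter values are hypothetical and only there to make the arithmetic visible.

```python
# Hypothetical parameter values, for illustration only.
lab_a_control_mean = 10.0
lab_b_offset = 2.0  # how far lab B's measurements sit above lab A's
difference = 3.0    # mutant minus control, shared by both labs
params = [lab_a_control_mean, lab_b_offset, difference]

# Design matrix rows. Columns: lab A control mean, lab B offset, difference.
rows = {
    "lab A control": [1, 0, 0],
    "lab A mutant": [1, 0, 1],
    "lab B control": [1, 1, 0],
    "lab B mutant": [1, 1, 1],
}

# Multiply each row by the parameters and add up the terms.
predicted = {
    group: sum(x * p for x, p in zip(row, params))
    for group, row in rows.items()
}

for group, value in predicted.items():
    print(group, value)
```

Every row turns on the lab A control mean, and the other two columns add the lab B offset and the mutant difference only where they apply.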

@dansolpa 1 year ago

@@statquest thanks for your reply!!!!! I'm still confused. As I understand the lab B offset is calculated by taking the Lab B control mean, and the difference between mutant and control is calculated in the following way: (Lab A mutant mean - Lab A control mean). So, if the previous hypothesis is correct, when calculating the (Lab B mutant mean) with the design matrix we are going to get a different value than the one in the graph, because maybe the difference between control and mutant is different in lab B and in lab A. In other words: Lab A mutant mean = 1*lab A control mean + 0*lab B offset + 1*difference = lab A control + difference. Lab B mutant mean could be different from 1*lab A control mean + 1*lab B offset + 1*difference = lab A control + lab B offset + difference because the difference between control and mutant could be different in lab B and in lab A. And if we are only calculating the difference by (Lab A mutant mean - Lab A control mean) we are only taking into account the lab A difference

@statquest 1 year ago

@@dansolpa It is possible that the difference between control and mutant is different in different labs. If so, this is called an "interaction effect", and we would need to add an additional term to compensate for it.
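
One common way to add such a term, sketched in Python with hypothetical group means: a 4th column that is 1 only for mutants measured in lab B. The names `interaction` and `design_row` are illustrative and do not come from the video.

```python
# Hypothetical group means in which the mutant effect differs between labs.
group_means = {
    ("A", "control"): 10.0,
    ("A", "mutant"): 13.0,
    ("B", "control"): 12.0,
    ("B", "mutant"): 17.0,
}

lab_a_control_mean = group_means[("A", "control")]
lab_b_offset = group_means[("B", "control")] - group_means[("A", "control")]
difference = group_means[("A", "mutant")] - group_means[("A", "control")]
# Extra term: how much larger the mutant effect is in lab B than in lab A.
interaction = (
    (group_means[("B", "mutant")] - group_means[("B", "control")]) - difference
)
params = [lab_a_control_mean, lab_b_offset, difference, interaction]


def design_row(lab, genotype):
    is_lab_b = 1 if lab == "B" else 0
    is_mutant = 1 if genotype == "mutant" else 0
    # The 4th column is the product of the lab and genotype indicators.
    return [1, is_lab_b, is_mutant, is_lab_b * is_mutant]


predicted = {
    key: sum(x * p for x, p in zip(design_row(*key), params))
    for key in group_means
}
print(params)
print(predicted)
```

With four parameters and four groups, the model can reproduce each group mean exactly, so the lab B mutant mean no longer has to equal lab A control mean + lab B offset + difference.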

@ahorasimipaco 3 years ago

I love you

@statquest 3 years ago

Thank you! :)

@claradong4649 5 years ago

Your voice is different from before.

@statquest 5 years ago

This is actually an old video. I now use a better microphone.

@damienvalour5325 2 years ago

Hello Josh! I am a big fan with a colleague :-) Could you please do the same vid for models with interactions? Kr. BAmmm!! :-)

@statquest 2 years ago

I'll keep that in mind!

@mohammadalidastgheib2688 2 years ago

I didn't get the last example.

@statquest 2 years ago

What time point, minutes and seconds, was confusing?

@cicinindivin3689 2 years ago

Is this a kind of two-way ANOVA?

@statquest 2 years ago

Sure.

@cicinindivin3689 2 years ago

@@statquest it isn't clear to me why I should use F (whole regression) instead of T for the "slope" between the intercepts to calculate a p value... I know that for a simple single linear regression F is just T^2 (for the slope), but going "3 dimensional", like in the batch effect example you show at the end of the video, I no longer see the relation between F and the (now) 2 Ts... and I would expect to have to use the Ts since I still have only 2 groups (orthogonally, lab a and b, or WT and mutant), not 3 (which would imply ANOVA and F, at least so I've been taught)... Isn't T in this case better than F since it is "measuring" specifically the difference in intercepts instead of F that is "measuring" the quality of the fit overall (or maybe the last residual "dimension" of the fitting "surface")? I'm confused...

@statquest 2 years ago

@@cicinindivin3689 The difference between a t-test and an F-test is like the difference between a knife and a Swiss Army knife ( imageengine.victorinox.com/mediahub/39710/640Wx560H/SAK_1_3713__S1.jpg ). The t-test can only compare means between two groups and it can only take one variable into account when making that comparison. For example, a t-test can compare the height of two groups of people, where "height" is the only variable we measure. However, when we have more than one variable, for example when we measured both "height" and "weight", we would not be able to use the t-test to compare the two groups. In contrast, an F-test works when we have measured 1 or more variables. In other words, the F-test is a generalization of the t-test. In these examples, we have measured more than one variable per group (size and weight), so our only choice is to use the F-test (if we want to use all of the data we measured).

@cicinindivin3689 2 years ago

@@statquest Thanks for replying. I omitted that the t values mentioned in the previous message are the ones provided by Excel in the regression box, below the ANOVA box, belonging to the various coefficients (intercept, slope in one axis, slope in the other axis, etc.); maybe it makes more sense now. I would have thought that using the t value for the specific dimension under analysis (e.g. the difference in intercepts in the case of this video) would provide a p-value "purified" for that specific null hypothesis (e.g. the intercepts are equal)... I don't know, I'll try to understand this better.

@statquest 2 years ago

@@cicinindivin3689 Generally speaking, we want to use as much data as we can to make decisions. Using a t-test would force us to exclude some data from making a decision, and omitting data can result in worse decision making. To be honest, I think the best thing to do is just forget about using t-tests. Think of everything as a type of F-test, and you will be much better off and can always use all of the data.