See all my videos at www.zstatistics.com/videos/ 0:00 Introduction 1:20 Definition 6:40 Confidence Intervals 12:51 Proportions 17:16 Challenge Question Series music by Purdy. purdy.bandcamp.com/ Song: 3 Friends to the Stars
Seriously man, I can't believe how much of a difference creators like him make. Honestly, they are the reason people like us have the confidence to accomplish things in our field. You are the reason this kind of growth happens in society!
SE(x̄) = s/√n
s/√(n+x) = ½(s/√n)
√(n+x) = 2√n
n + x = 4n
x = 3n
x = 3 * 20 = 60
You will likely require 60 more measurements, provided the standard deviation remains roughly the same as the sample size increases.
Here's a much simpler solution and the key takeaway:
SE(x̄) = s/√n
½SE(x̄) = ½(s/√n)
Once you are here, you're done. Notice that 1/2 = 1/√4, therefore:
½SE(x̄) = s/(√n·√4)
½SE(x̄) = s/√(4n)
So, in general, we need 4 times as many observations to cut SE(x̄) in half. More generally, reducing SE by a factor of x requires x² times as many observations:
(1/x)·SE(x̄) = s/√(n·x²)
For the solution to the specific problem: 4n - n = 3n = 60.
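A quick numeric check of the "4x the observations halves the SE" rule, in plain Python (the values s = 5 and n = 20 are taken from the video's example; treat this as a sketch that assumes s stays fixed as n grows):

```python
import math

def standard_error(s, n):
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

s, n = 5, 20                               # sample SD and size from the example
se_original = standard_error(s, n)         # 5 / sqrt(20)  ~ 1.118
se_quadrupled = standard_error(s, 4 * n)   # 5 / sqrt(80)  ~ 0.559

# Quadrupling n cuts the standard error exactly in half:
print(se_original, se_quadrupled)
```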
Very, very helpful. I landed here on my Lean Six Sigma education journey. This series and your other videos helped me sharpen my statistics concepts.
Very informative. I recommend this channel to students who want to learn statistics for data science and machine learning, as it covers almost all the concepts essential for understanding machine learning models. Thank you, and I really appreciate your work.
Challenge: SE(sample mean)/2 = sample SD/√(4n), so it will be 80 measurements (4 * 20), meaning we need 60 more measurements 🙂 Thank you for sharing these wonderful lectures!
Hi Justin, I've made it, but I cannot find the videos on outliers, boutique measures, and Range + IQR, which, according to the diagram, should be on the list. I searched your channel but could not find them. Are they posted yet?
EDIT: this answer is wrong, but in an interesting way, so I'm going to leave it here, with a correction posted below. I have posted a separate answer with what I believe is the correct math for the challenge question.
_________
If you increase the sample size, two things will happen: (1) n will increase, and (2) you'll be adding random data points, so they will probably end up with a somewhat different mean and standard deviation.

About #1: n appears in the denominator of the formulas for standard deviation *and* SE, so an increase in n is sort of double-counted; combining the formulas, you get a denominator of [sqrt(n-1)*sqrt(n)], which is pretty close to just n, especially for larger n. So in short, doubling n should get you close to halving SE.

About #2: The change in mean doesn't affect SE, but if the variation about the mean changes as new data comes in, it would alter the numerator of the standard deviation, which in turn alters the numerator of SE. It's beyond my skills to know whether this is likely to change up or down as n grows, so I'll just assume that, on average, the actual variation about the mean does not change.

Thus, all that matters is the increase in n, where a doubling of the sample would approximately halve the SE.
In the interest of helping anyone who reads this in the future, I'll explain my error here and then post a new comment with the correct answer to the challenge question.

My insight was to look at where n appears in the standard deviation formula. My mistake was not realizing it appears (implicitly) *twice* in that formula. We see n in the denominator, but it is *also* implicitly in the numerator, because the more values you sample, the larger the summation gets. (If you double your sample size, this summation approximately doubles.) So the implicit n in the numerator of the standard deviation and the n in the denominator of the SE approximately cancel out. We therefore need only use the denominator of the standard deviation to answer this question: it is sqrt(n-1), so we need slightly less than 4x the original sample size.

@The Gao Thanks for pushing back. This led me to go and reevaluate the algebra again. You might consider that your response would have been more useful if you had engaged with the math I wrote, which introduced a new idea about the role of n in the _standard_ _deviation_ formula. Just telling me I'm wrong, or telling me the correct answer without engaging with my explanation, doesn't help me learn. I ended up going back and figuring it out for myself, so no worries there. It just didn't help me or future readers avoid the mistake I made.
Hi Justin, I didn't find your video on boutique measures. I would love to go through that one as well, just like the other videos in this series. Love your content and explanations.
I am new to statistics, so this answer could be wrong, BUT I think the confidence interval is calculated to estimate the population mean: the population mean should be in that interval, while the sample mean doesn't need to be inside it. (Correct me if I am wrong.)
@@codercoder7270 We actually add some value to the sample mean and subtract the same value from it. So the distance from the sample mean to both the ceiling and the floor value of the interval would be the same.
Actually, what was wrong is that he forgot to change the sample mean value: he kept 112 from n=5. If you look, you will find that both interval bounds are equidistant, as I mentioned, from 112, not from the mean for n=50 or n=500. So, technically, he forgot to update the sample mean value when finding the new intervals.
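That symmetry is easy to verify in code: both bounds sit exactly the same t·SE distance from whatever sample mean you plug in, so the midpoint of the interval always recovers the mean. A minimal sketch in plain Python (the numbers 112, s = 15, n = 5 are illustrative, and 1.96 stands in for the exact t critical value):

```python
import math

def confidence_interval(mean, s, n, crit=1.96):
    """95% CI: mean +/- crit * (s / sqrt(n)) -- symmetric by construction."""
    se = s / math.sqrt(n)
    return mean - crit * se, mean + crit * se

lo, hi = confidence_interval(112, 15, 5)   # illustrative values
# The midpoint of the interval is the sample mean itself:
print((lo + hi) / 2)   # 112.0 (up to floating-point rounding)
```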
Is it possible for the confidence intervals not to contain the sample mean, as shown at 12:39 for n=500? If I understand this correctly, it is saying that when you have a sample mean of 114.7 there is a high likelihood that the population mean is between 110.8 and 113.1. Unless we had other information about the population, this doesn't seem reasonable to me.
@zedstatistics Perhaps I've missed something, but if the confidence intervals are the sample mean +/- (SE * t score), how is the 95% confidence interval for n=500 *lower* than the sample mean? The sample mean is 114.7, and the 95% confidence upper threshold is 113.1 in the slide (12:46). Should it not be 0.55 (SE) * 1.964729 (t score) + 114.7 = 115.78? In the slide, with n = 5 we have an even +/- difference from the sample mean (15.8); with n = 50 we have 6.9 below and 0.3 above; with n = 500 we have 3.9 below and 1.6 below.
I find the probability distribution picture in this example confusing because it appears to be treating the true mean as a random variable, which it is not.
I don't think the underlying distribution needs to be a normal random variable... the sample mean distribution will always approach a normal distribution due to the central limit theorem. Also, the z statistic is used when the population variance is known; otherwise the t statistic is used.
SE(x̄) = S/√n
Equation for 20 samples: SE(x̄) = 5/√20
Equation for the required measurements: ½SE(x̄) = S/√n, i.e. SE(x̄) = 2S/√n
Therefore, set equation 1 equal to equation 2:
5/√20 = 2·5/√n
5·√n = 2·5·√20
√n = 2√20 (cancel the 5s)
Square both sides:
n = 4·20 = 80
Therefore, you need 60 more measurements if you wish to halve the standard error.
Can anyone please explain the new formula for the standard error of the sample proportion at 13:54? Justin didn't quite go into details about it, and I'm not able to extrapolate from the similar formula for the standard error of the sample mean shown just before. Why does p*(1-p) appear in the numerator of the SE of the sample proportion?
So basically this has to do with the binomial distribution. When we talk about proportions, or situations with two outcomes (yes or no, voting in favor or against), we use the binomial distribution. For a single trial, p(1-p) is the variance, so [p(1-p)]^(1/2) is the standard deviation. Since standard error = standard deviation/(n^(1/2)), we can substitute that standard deviation directly into the formula.
SE -> ½SE means n -> 4n, so 20 -> 80: we need 60 more points for our sample.
All else being equal, n and SE should be the only changing factors in the SE formula (the others stay fixed); that is how to find the relationship between SE and n. Say my SE corresponds to s = 24 and n = 20. To halve it, I set ½(24/sqrt(20)) = s/sqrt(n), which gives n = 80 * (s²/24²). All else being equal, s = 24, so n = 80 * (s²/s²) = 80.
Here is one thing I do not understand about the t-distribution: why do we need to assume the population is normally distributed? In effect, the t-distribution is the sampling distribution for a given sample size, correct? I have read, and it makes intuitive sense as well, that sampling distributions do not depend on the underlying population distribution. So if you have, say, a sample size of 30, you will have a t-distribution approaching a normal distribution regardless of whether the population distribution is non-normal, bi-modal, skewed, or what not. Where am I going wrong here?
Thanks for your explanation, Justin! Very helpful. Regarding the question at the end: SE = SD/sqrt(n). Since SD = sqrt(Sum((x̄ - xi)²)/(n-1)), that makes SE = (whatever)/sqrt(n(n-1)). Therefore, to halve SE, one needs to double the denominator, making sqrt(a(a-1)) = 2sqrt(n(n-1)). Thus:
sqrt(a(a-1)) = 2sqrt(20(20-1))
a(a-1) = 4(20(20-1))
a² - a - 1520 = 0
Solving for positive a, we get roughly 40. So in order to halve the SE, we need to increase n by about 20. My best guess.
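Checking that last bit of algebra numerically (a sketch that just solves the commenter's quadratic with the standard formula; it does not judge whether this model of SE is the right one):

```python
import math

# Positive root of a^2 - a - 1520 = 0 via the quadratic formula:
a = (1 + math.sqrt(1 + 4 * 1520)) / 2
print(a)   # ~39.5, i.e. roughly 40 as stated above
```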
Hi, I still didn't get why you took 97.5% for the t-distribution part if you just want 95% when calculating the confidence interval for the sample mean?
I know this is old, but I can help. When you look at the distribution, 5% lies outside, but it's outside in both of what are called the tails: the 5% is split between the positive side and the negative side, so there is 2.5% in each tail. That's why we need to look up 0.975. It seems a bit weird, I know, but you are actually looking for the overall piece, which means 0.025 in either direction. When you look it up for 0.025, you are getting one tail; take it in both the + and the - direction, and twice that is the 0.05 you were looking for. I hope that helped.
Did you make some mistakes when computing the confidence intervals for n=50 and n=500? For example, for n=50 with x̄ = 115.3, your confidence interval is [108.5, 115.6]; however, (108.5 + 115.6)/2 does not equal 115.3. Looking forward to your help, thanks.
Just spotted that as well. My guess is that he forgot to change the mean and stuck with 112 as the sample mean. The final mean (114.7) isn't even in the confidence interval
Love this explanation, you laid it out so well! However, it is still not clear to me why we use the SQRT of n (why SQRT, where does that come from?). Could anyone clarify? Cheers!
For the most part, the standard deviation of your sample (the numerator) doesn't change much regardless of sample size. So thanks to the "magic" of mathematics, using the square root of n makes it easy to systematically reduce your standard error: with the square root of n as the denominator, multiplying your n by 4 cuts your SE in half. So this is a quick and easy way to eyeball how large your sample needs to be to reduce your standard error and consequently improve your confidence intervals. That's at least one good reason for using the square root of n as opposed to something else. For a more detailed response: statisticsbyjim.com/hypothesis-testing/standard-error-mean/
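The 1/√n behavior also shows up empirically. A simulation sketch in plain Python (a standard normal population and the sizes 20 and 80 are assumptions for illustration): draw many samples, take each sample's mean, and the spread of those means tracks 1/√n, so quadrupling n roughly halves it.

```python
import math
import random

random.seed(42)

def sim_se(n, reps=4000):
    """Empirical SE: the SD of `reps` sample means of size n,
    drawn from a standard normal population (true SD = 1)."""
    means = [sum(random.gauss(0, 1) for _ in range(n)) / n for _ in range(reps)]
    mu = sum(means) / reps
    return math.sqrt(sum((m - mu) ** 2 for m in means) / (reps - 1))

se_20 = sim_se(20)    # should be near 1/sqrt(20) ~ 0.224
se_80 = sim_se(80)    # should be near 1/sqrt(80) ~ 0.112, i.e. about half
print(se_20, se_80)
```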
This is because, when using the inductive method, SE = std dev/sqrt(n) matches more closely the standard deviation obtained by the deductive method. Since we want our inductive analysis to be as close as possible to the deductive method, we use this expression for SE. Hope this is helpful.
√n appears because I think of it as the sqrt of the variance divided by the total number of observations (the mean of the total variance). Taking the sqrt of that mean total variance gives you standard deviation/√n. Hope this helps.
This is a t-test, which takes into account that the sample size is small. Go watch a video about it and it may become clearer, although it only really clicked with me today lol
Why does the IQ have to be normally distributed? Doesn't the central limit theorem state that the distribution of means will be normally distributed anyway? Great videos btw
I also got 80 at first, but that doesn't change the n in the sample variance denominator. So I inserted the formula for the standard deviation into the formula for the standard error and moved things around a bit so that the denominator is (n-1)*sqrt(n). Our goal is to halve the standard error, so we must double this denominator. Inserting n = 20 gives approx. 85, and doubled that is 170. The question now is: what n makes the denominator formula equal 170? That's some cumbersome algebra, so I put it in online and the result was approx. 31, so we LIKELY need 11 more samples. LIKELY because new samples may shift the numerator, which will shift the mean, which will shift the standard deviation, which will shift the standard error. Let me know what you think; I could be wrong, but I'm hoping to be on the right track haha.
I think you are on the right track! I like that you substituted in the formula for standard deviation, which allows you to consider all the places n appears. I think this is what most other commenters are missing. Nice approach! I also did this, and I worked through the math manually to come up with the number 57. (See my comments on this video for details and a simple exposition of the math.)
I like and understand your presentation. However, I can't seem to figure out what formula was used to calculate Std-Err-of-Percent in the following table?

Gender  Result    Frequency  Weighted-Frequency  Std-Err-of-Wgt-Freq  Percent   Std-Err-of-Percent  95%CI-for-Percent
Male    Negative  1          16.20746            16.20746             7.5605    7.5259              0.0000 - 22.6253
        Positive  4          64.82982            31.56546             30.2419   14.4387             1.3398 - 59.1441
        Total     5          81.03728            34.96896             37.8024   15.9078             5.9594 - 69.6454
Fem     Negative  3          100.00000           56.73086             46.6482   18.0779             10.4613 - 82.8350
        Positive  1          33.33333            33.33333             15.5494   14.1520             0.0000 - 43.8778
        Total     4          133.33333           64.91964             62.1976   15.9078             30.3546 - 94.0406
Total   Negative  4          116.20746           58.52508             54.2087   17.5366             19.1054 - 89.3120
        Positive  5          98.16316            45.08850             45.7913   17.5366             10.6880 - 80.8946
        Total     9          214.37061           71.16743             100.0000
We need about *57* _more_ measurements, for a total of about *77*.
_______________
Here's why: The sample size *n* affects the SE formula in 3 places:
(1) It appears in the denominator of the SE.
(2) It appears as "n-1" in the denominator of *s* (the standard deviation, which is the numerator of the SE formula).
(3) It _also_ appears implicitly in the numerator of *s*, because the summation there gets larger as the sample size increases; assuming the deviations are similar for the new measurements, the summation increases at the same rate as *n*.
(1) and (3) cancel: they are both just √n, so we can ignore them. Therefore we only need (2) to answer the question. Halving SE will require doubling the denominator of *s*. In this question, the denominator of *s* is √(n-1) = √19. Doubling this: 2√19 = √(4*19) = √76 = √(77-1) = √(n-1). So *n* = 77. This means we need 77 - 20 = 57 _more_ measurements to halve the SE.
(Suggestions for improvement to this answer are welcome! 🙂)
The standard formula for the standard error is $$SE = \frac{s}{\sqrt{n}}$$. As per the question, the desired objective is to reduce the SE by half, i.e. SE/2. Thus the equation becomes $$\frac{SE}{2} = \frac{s}{\sqrt{n+x}}$$, where n = 20 and x is the number of additional measurements we are looking for. Solving this (after substituting the values) gives a final result of 60. Thus 60 additional measurements are required to halve the standard error. Thanks. Note: the above notation is LaTeX; if you put it in an R Markdown document, you can see it rendered as normal mathematical notation (provided LaTeX or TinyTeX is installed).
I did not understand the proportions part. Why does p get to be 0.65? Why does 65 percent of votes for a specific party result in a p value of 0.65? Could you explain, please? I would be thankful. Your videos are superb, by the way!
At 12:20, shouldn't you calculate the 95% confidence intervals for n=50 and n=500 using their respective means, and not the mean for n=5? The whole point of these intervals looks wrong to me.

At 15:32, "If N gets pretty large, all distributions converge to a normal distribution" -- you are simply citing the Central Limit Theorem wrong. Try applying what you're saying to a uniform distribution... The Central Limit Theorem says that the SAMPLING DISTRIBUTION of the means (for example) of ANY distribution converges to a normal distribution, not that ANY distribution -> normal!!!
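The uniform-distribution point is easy to demonstrate in code. A simulation sketch in plain Python (sample size 30 and 5000 repetitions are arbitrary choices): the Uniform(0,1) population stays flat no matter what, but the sampling distribution of its mean comes out bell-shaped, centered at 1/2 with SD ≈ sqrt(1/12)/sqrt(n).

```python
import math
import random

random.seed(1)

# Uniform(0,1) is flat, not normal. But the sampling distribution of its
# mean (n=30 here) is approximately normal, with mean 1/2 and
# SD = sqrt(1/12)/sqrt(30) ~ 0.053 -- the CLT applies to the *means*.
n, reps = 30, 5000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

grand_mean = sum(means) / reps
sd = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / (reps - 1))

print(round(grand_mean, 3))   # ~0.5
print(round(sd, 3))           # ~0.053
```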