Correction: 10:18. The Amount of Say for Chest Pain = (1/2)*log((1-(3/8))/(3/8)) = (1/2)*log((5/8)/(3/8)) = (1/2)*log(5/3) = 0.25, not 0.42.
NOTE 0: The StatQuest Study Guide is available: app.gumroad.com/statquest
NOTE 2: In statistics, machine learning and most programming languages, the default log function is log base 'e', so that is the log that I'm using here. If you want to use a different log, like log base 10, that's fine, just be consistent.
NOTE 3: A lot of people ask if, once an observation is omitted from a bootstrap dataset, it is lost for good. The answer is "no". You just lose it for one stump. After that it goes back in the pool and can be selected for any of the other stumps.
NOTE 4: A lot of people ask "Why is 'Heart Disease = No' referred to as 'Incorrect'?" This question is answered in the StatQuest on decision trees: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-_L39rN6gz7Y.html However, here's the short version: The leaves make classifications based on the majority of the samples that end up in them. So if most of the samples in a leaf did not have heart disease, all of the samples in the leaf are classified as not having heart disease, regardless of whether or not that is true. Thus, some of the classifications that a leaf makes are correct, and some are not.
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
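For anyone who wants to double-check the corrected arithmetic, here is a quick sketch in plain Python using the natural log (the log the video uses):

```python
import math

def amount_of_say(total_error):
    # Amount of Say = (1/2) * log((1 - Total Error) / Total Error), natural log
    return 0.5 * math.log((1 - total_error) / total_error)

# Chest Pain: Total Error = 3/8, so (1/2)*log(5/3) ~= 0.2554, i.e. 0.25-ish, not 0.42
print(amount_of_say(3 / 8))
```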
@@parvezaiub That's what you get when you use log base 10. However, in statistics, machine learning and most programming languages, the default log function is log base 'e'.
Hi Josh - great videos, thank you! Question on your Note 3: How do omitted observations get "back into the pool"? It seems, in the video around 16:16, that the subsequent stumps are made based on the performance of the previous stump (re-weighting observations from the previous stump)... if that's the case, when do you put "lost observations" back into the pool? How would you update the weights if the "lost observations" were not used to assess the performance of the newest stump?
Everyday is a new stump in our life. We should give more weightage to our weakness and work on it. Eventually, we will become strong like Ada Boost. Thanks Josh!
Dude... I really appreciate that you make these videos and put so much effort into making them clear. I am buying a t-shirt to do my small part in supporting this amazing channel.
I can't believe how useful your channel has been these days man! I literally search up anything ML related on youtube and there's your great video explaining it! The intro songs and BAMS make everything so much clearer dude. The only bad thing I could say about these videos is that they lack a conclusion song lol
Looking forward to Gradient Boosting Model and implementation example. Somehow I find it difficult to understand it intuitively. Your way of explaining the things goes straight into my head without much ado.
Hello Sir, I really love the simple ways in which you explain such difficult concepts. It would be really helpful to me and probably a lot of others if you could make a series on Deep Learning, i.e., neural networks, gradient descent etc. Thanks!
Hi Josh, great video as always! Questions: 1. Given there are 3 attributes, and the reiterative process of picking 1 out of the 3 attributes EACH TIME, I assume an attribute could be reused for more than 1 stump? And if so, when do we stop iterating? 2. Given the resampling is by random selection (based on the new weights, of course), I would assume that means every time we re-do AdaBoost we may get a different forest of stumps? 3. Where can we find more info on using the Weighted Gini Index? Will it yield the same model, or can it be very different? Thank you!
1) The same attribute can be used as many times as needed. Keep in mind that, due to the bootstrapping procedure, each iteration gives us a different dataset to work with. 2) Yes (so, consider setting the seed for the random number function first). 3) I wish I could tell you. If I had found a good source on the weighted gini, I would have covered it. Unfortunately, I couldn't find one.
For my future reference. 11:36: If the prediction for a sample was wrong, then increase its weight for future correction. If the Amount of Say is high (the tree is good), increase the weight more. Wrong -> increase weight. Better tree -> more amount of say -> adjust weight more.
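The re-weighting rule from the video can be sketched in plain Python. This is an illustrative sketch, not the video's exact code; it assumes the standard AdaBoost update, where misclassified samples are scaled by e^(say), correct ones by e^(-say), and everything is then normalized:

```python
import math

def update_weights(weights, misclassified, say):
    # Misclassified samples get weight * e^(say),
    # correctly classified samples get weight * e^(-say), then normalize.
    new = [w * (math.exp(say) if m else math.exp(-say))
           for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]

# 8 samples, each starting at 1/8; only sample 4 was misclassified.
# With Total Error = 1/8, Amount of Say = (1/2)*log(7) ~= 0.97.
say = 0.5 * math.log(7)
flags = [False, False, False, True, False, False, False, False]
new_weights = update_weights([1 / 8] * 8, flags, say)
# The misclassified sample jumps to ~0.5 and the rest drop to ~0.07
# (the video shows 0.49 / 0.07 because it rounds 0.97 along the way).
```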
Hello. There is a little error in the arithmetic, but AdaBoost is clearly explained! Error at 10:18: Amount of Say for Chest Pain = (1/2)*log((1-(3/8))/(3/8)) = (1/2)*log((5/8)/(3/8)) = (1/2)*log(5/3) = 0.25, not 0.42. I also join others in asking you to talk about Gradient Boosting next time. Thank you.
Aaaaah. There's always one silly mistake. This was a copy/paste error. Oh well. Like you said, it's not a big deal and it doesn't interfere with the main ideas... but one day, I'll make a video without any silly errors. I can dream! And Gradient Boosting will be soon (in the next month or so).
@@statquest Don't worry about small errors like these, your time is GOLD and shouldn't be consumed by these little mistakes, use it to create more 'BAM'! The audience will check the errors for you! All you need to do is pin that comment when appropriate so that other people will notice. PS, how to PIN a comment (I paste it here to save your precious time ^_^): - Sign in to RU-vid. - In the comments below a video, select the comment you would like to pin. - Click the menu icon > Pin. If you've already pinned a comment, this will replace it. ... - Click the blue button to confirm. On the pinned comment, you'll see a "Pinned by" icon.
I swear this is the greatest channel about machine learning and statistics. Great job Josh! I just have a quick question: what if we have a stump where both children say yes (the left child gets 2 yes, 0 no, and the right child gets 2 yes and 1 no), and that is the best we could come up with? What should I do? I saw a different video where the left child was classified as yes and the right child as no, and he said the first stump makes 2 errors. But how?! We say that a leaf votes by majority, so both should say yes, and the right child gets 1 error, so the first stump made one error, right?
Thanks, Josh, for this great video! Just to highlight, at 10:21 your calculation should be (1/2)*log((1-(3/8))/(3/8)) = (1/2)*log(5/3). How did you conclude that the first stump will be on Weight? Because of the minimum total error or the minimum total impurity among the three features? It might happen that total error and impurity do not rank the same for all features, though they happen to rank the same here.
I've put a note about that error in the video's description. Unfortunately RU-vid will not let me edit videos once I post them. The stump was weighted using the formula given at 7:32
Hi Josh. Awesome video again. My doubt: we choose the 1st root, say Chest Pain, based on the Gini index. Will the Gini index for the 2nd root be calculated for all features, or will we exclude Chest Pain from subsequent stumps?
Really good video. What is weighted Gini, though? I know that Gini impurity is mentioned in the decision tree series, yet I cannot find the definition of weighted Gini. Thx!
Thanks for the great explanation! One question though: for how long will stumps be created? That is, what is the termination criterion for the training stage?
Thanks Josh! I have 3 questions: 1. @3:43 you say weak learners are "almost always stumps" — is there a case where it is not a stump? 1.a. Also, what is the advantage of using stumps over bigger trees? 2. Does the boosting algorithm only use decision trees?
1) I don't know. However, it's possible and easily done (i.e. in sklearn you just change the default value for "depth"). 2) It probably depends on your data. 3) In theory, no. In practice, yes.
Excellent tutorial, thanks! I would like to ask: 1) After recalculating and updating the weights for each sample, the process repeats. In the first step we found that 'Weight' has the lowest Gini; in the next round do we exclude 'Weight' and only consider the remaining criteria, or not? I see 'Weight' is reused when bootstrapping is used, but what about if we are still using Gini? 2) If the same set of criteria is reused, then we would get a bunch of stumps with the same criteria that differ only in their amount of say. Why is that more useful than simply using each criterion once? 3) If the same set of criteria is reused, how can we determine when to stop the stump-building process and settle down to calculate which classification group has the larger amount of say? Thx~
Every stump selects from the exact same features (in this case, the features are "chest pain", "blocked arteries" and "patient weight"), however, the sample weights are always changing and this results in bootstrap datasets that contain different samples for each stump to divide. That said, while "Patient weight" might work well on the first stump, it might not work well in the second stump. This is because every sample that "Patient weight" misclassified will have a larger weight and thus, a larger probability of being included in the next bootstrapped dataset.
Hello Josh, Thank you for the amazing videos. I had a couple of questions on stumps that are created after Sample weights are updated at time 16:00. We continue with sampling from the full set assuming new weights. This means the new set will be a subset of the original dataset as you explained in 18:32 . Going forward all subsequent stumps will only work on smaller and smaller subsets, making it a bit confusing for me on how to ensure good randomness. 1. Do we also restart from the updated sample weights at 16:00 and redo sampling, thereby creating multiple different datasets but using the same sample weight values? This will probably ensure we use all the data from the original data set in some instances. 2. As a follow-up to Q1, do we perform multiple sampling at different levels to get more variations in the dataset before creating stumps? In the video, you described only one instance of sampling using new weights but I assume it needs to be performed multiple times to get variations in datasets. 3. Do we not sample from distribution at the very beginning also? All sample weights will have the same value but random sampling would mean some might not make it to the first round of ( stump creation + Amount of say + New sample weights ) itself to get unequal sample weights at 16:00. In the video you take all the samples by default hence the query.
I have a quick question: why do we still keep Weight when determining which feature the second stump would use? I thought it should be excluded from the remaining stump pool once it's been used for classification.
Wow, that was great. I love how you make it sound so simple. Just a question about the construction of the third stump without weighted gini. Suppose sample number 5 was not picked to build my new dataset to feed to stump number 2. Can sample number 5 be picked to feed to stump number 3 or do I lose it for the rest of the stumps ?
@@statquest Hi Josh, but wouldn't there be no sample weight assigned to sample number 5 if it is not included in the 2nd stump? Without a sample weight as a result of stump 2, how would it be randomly selected in the selection for the 3rd stump? Does that mean the changes to the weights from stump 2 will only be applied to the samples in stump 2, while the samples not selected retain their weights from stump 1, and the normalization of weights is then done for all samples regardless of whether they were in stump 2 or not?
@@statquest But in that case, the elements that were not picked will be "more relevant" in the next stump? Seems like a weakness/inconsistency of the method.
Hi Josh. Thank you very much for this. I had a small doubt. Suppose we have a bad stump whose say is negative. While updating sample weights, we will decrease the weights of incorrectly classified samples. This in turn will make the say even more negative, I think. So my question is: why don't we increase the weights of misclassified samples for bad classifiers too, so that my stump goes from a bad classifier to a good classifier?
Awesome explanation! So, in a nutshell, would I be right to say AdaBoost filters out correctly classified samples and carries the misclassified ones forward in each new sampling for the new stump?
@@statquest For the next stump, after sampling, any variable can be a candidate based on its Gini? For example, the "Weight" feature had the lowest Gini, which is why it was the first stump; after doing the iterations (calculating say and sample weights), the next stump will be chosen based on whichever feature has the smallest Gini, right?
Hi Josh, great explanation as always. Can you please give guidance on how AdaBoost does the 'amount of say' calculation and final prediction for regression? Also, I found something very interesting: the sklearn AdaBoost implementation allows you to choose the base learners between trees and linear models. :)
In response to Note 3 in your corrections, how exactly does this work? Is the dataset returned back to the previous dataset exactly (before the bootstrap one), or do the sample weights need adjusting?
I couldn't figure out the weighted Gini index approach (as opposed to the bootstrapping method), especially where the weight should be applied. Do we need to recalculate the weights within the node? Could you explain more of this to me? Thanks!!
Hi Josh, I am really enjoying your videos. If you could help me answer one question, I would be very grateful. In the new dataset that you created, it seems like the probability of the wrongly classified sample appearing is higher. But how does picking a random number between 0 and 1 help us put the wrongly classified sample in the new dataset more times?
@@statquest At 16:24, we use the sample weights as a probability distribution, and the 4th sample has the highest probability. The sample weights that we assigned are artificial and don't reflect the true distribution. But still, in the new dataset (at 18:02), the sample with the highest sample weight (i.e., the 4th sample) occurs more times, even though those sample weights don't reflect a true probability distribution.
@@abrahamgk9707 I'm not sure I understand what you mean by "true distribution", but, based on the sample weights at 16:24 here is how we select samples. We pick a number between 0 and 1. Since the weight for the first sample is 0.07, if that number falls between 0 and 0.07, we select the first sample. Since the second weight is 0.07, and 0.07 + the previous threshold, 0.07 = 0.14, then if that number falls between 0.07 and 0.14, we select the second sample. Since the third weight is 0.07, and 0.07 + the previous threshold, 0.14 = 0.21, then, if that number falls between 0.14 and 0.21, we select the third sample. Since the fourth weight is 0.49, and 0.49 + the previous threshold, 0.21 = 0.70, then, if that number falls between 0.21 and 0.7, we select the fourth sample, etc. etc. The range of values that select the fourth sample, 0.21 to 0.7, is 7 times greater than the range of any other individual sample.
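The selection procedure described in the reply above can be sketched in a few lines of Python. This is an illustrative sketch (not StatQuest's code); the weights are the rounded values from the video, so they sum to slightly less than 1:

```python
def pick_sample(weights, r):
    # Walk the cumulative thresholds (0.07, 0.14, 0.21, 0.70, ...) and
    # return the index of the first threshold that exceeds the random number r.
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off

# Rounded sample weights from 16:24; the 4th sample (index 3) was misclassified.
weights = [0.07, 0.07, 0.07, 0.49, 0.07, 0.07, 0.07, 0.07]
# r = 0.35 falls between 0.21 and 0.70, so the 4th sample is selected.
```

Drawing many random numbers with this rule selects the 4th sample roughly half the time, which is why it shows up repeatedly in the bootstrapped dataset at 18:02.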
A different bootstrap dataset will have different samples in it, which will, hopefully, improve (or worsen) the Total Error. If not, you can stop after a few more trees since prediction will not improve, or you can stop when the maximum number of trees are created.
Awesome video! I am wondering how many stumps should be generated. Does it depend on the number of predictors in my dataset? Or we can choose the best one with cross-validation. Also, is it possible that the same predictor will be used multiple times in different stumps even with different input sample weights?
You can figure out the number of stumps by seeing when the classifications no longer improve (this usually means that some samples are never correctly classified). And the same predictor can be used multiple times.
if lets say your goal is to maximize recall (not accuracy) where would that change be applied to choosing the next tree? or would it be in the amount of say calculation?
Hi Josh, correct me if I'm wrong: the Gini index is a measure of inequality between two populations, so if one group is much larger than the other, then the Gini should be high. So how come Weight has the lowest Gini despite having the maximum inequality out of the three features? Thanks in advance.
Unfortunately there are two Gini Indices. One used for populations that measures inequality (for details see: en.wikipedia.org/wiki/Gini_coefficient ) , and another used in decision trees, like these, that measures impurity (for details see: en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity ). The leaves for "weight" have the least amount of impurity, so it has the lowest Gini. Unfortunately Gini Index is used for both terms, so I can understand why this is confusing.
Q1. 11:58 Since the amount of say can be negative as well, shouldn't the graph and x-axis extend towards the left? Q2. If a tree has a negative amount of say, then a correctly classified sample will be assigned a higher weight than an incorrectly classified sample. It looks confusing why you would assign a higher weight to a sample that has been correctly classified, even if the tree overall has a negative amount of say.
A1. Sure, you can extend the graph in the negative direction. You get values closer and closer to 0 the more negative the amount of say is. A2: If a tree has a negative amount of say, that means it said most of the patients in the training dataset with heart disease did not have heart disease, and most of the patients without heart disease had heart disease. Thus, if this tree "correctly" classifies a new sample, it grouped it with the observations with the opposite value, which means it did a bad job categorizing the samples. Thus, we need to spend more effort trying to group it with the same value.
The other thing I am not clear about is whether the subsequent stumps receive the output from the previous stumps or not. I mean, not the error values, but suppose a new observation needs to be classified: will that new observation go through stump 1, and then through stump 2, and so on, to be labelled at the end? Or will the new observation be classified as having heart disease by stump 1 (amount of say, let's assume, 0.2), then stump 2 classifies it as no heart disease with amount of say 0.4, and stump 3 classifies the new observation as having heart disease with amount of say 0.2 again? Then will the amounts of say of stumps 1 and 3 be added, or averaged, or what will happen? Also, you mentioned that stumps are created with two leaf nodes, so how do you deal with multinomial variables, both in the case of independent and dependent variables?
The amounts of say are just added at the end. In your case, you would get a tie, since the sum of stumps 1 and 3=0.4 for "Has heart disease" and stump 2=0.4 for "no heart disease". With multinomial situations, you create several adaboost "forests", one for each classification type and each forest tests one classification vs everything else lumped together. (so if you have 3 categories, you test 1 vs not 1, 2 vs not 2 and 3 vs not 3).
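The "just add the amounts of say" rule from the reply above is easy to sketch. A minimal illustration (hypothetical helper names, using the commenter's three-stump example):

```python
from collections import defaultdict

def forest_vote(stump_predictions):
    # stump_predictions: list of (classification, amount_of_say) pairs.
    # Sum the amounts of say per classification; the largest total wins.
    totals = defaultdict(float)
    for label, say in stump_predictions:
        totals[label] += say
    return dict(totals)

# The commenter's scenario: stumps 1 and 3 say "yes" (0.2 each),
# stump 2 says "no" (0.4) -- the totals tie at 0.4 each.
votes = forest_vote([("yes", 0.2), ("no", 0.4), ("yes", 0.2)])
```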
Question: When determining which features to split given updated sample weights, how did you determine the binning / ranges to assign given random draws on (0,1) ? Specifically, why bin sample weights into (0,0.07], (0.07, 0.14], (0.14, 0.21], etc? Are these based on standard deviations of a "normal" distribution of weights over (0,1)?
Ok, but I'm still confused about how the range (0,1) of possible normalized sample weights is naturally partitioned into the ones that you used? Are the bins always the same as the ranges you provided in this example? Why pick (0,0.07), (0.07,0.14) etc instead of say (0,0.25), (0.25,0.5) etc? Sorry if I am being dense
@@aipithicus The normalized weights add up to 1 ( 15:07 ), which makes them suitable for describing a discrete probability distribution. This means that each normalized weight can correspond directly to a probability. For example, the normalized weight for the first sample is 0.07 and that means that it should have a probability of 0.07 to be picked for the bootstrapped dataset. Thus, at 16:25 we pick a random number between 0 and 1. Because the probability that the random number will fall between 0 and 0.07 is 0.07, we use that range to decide if we should select the first sample. Because the next sample also has a normalized weight of 0.07, it should have a probability of 0.07 to be picked for the bootstrapped dataset. Because the probability that the random number will fall between 0.07 and 0.14 is 0.07, we use that as the range to decide if we should select the second sample. etc.
If we are already correctly classifying a sample, then we don't need to focus the next stump on correctly classifying as much as we need for the next stump to correctly classify the samples that were not correctly classified.
Hi Josh... Thank you very much, excellent teaching! I have a doubt though: once the new weights are created and the new dataset is selected @16.50, it is selected at random. So how does it ensure that samples with higher weights get more chances, given that the number chosen is random?
The weights are equivalent to the probability that an observation will be added to the new dataset. So if the weight for an observation is low, because it was correctly classified by the last tree, then it has a low probability of being added to the new dataset. If the weight for an observation is high, because it was incorrectly classified by the last tree, then it has a high probability of being added to the new dataset.
First of all, I love your videos! I think there is another minor mistake when calculating the amount of say for Weight: (1/2)*log(7) = 0.422549, not 0.97. I think.
If you use log base 10, you get 0.42. If you use log base 'e', you get 0.97. In statistics, machine learning and most programming languages, the default log function is log base 'e', so that is the log that I'm using here. If you want to use a different log, like log base 10, that's fine, just be consistent.
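The base difference in the reply above is easy to verify in Python, where `math.log` is the natural log (base 'e') by default:

```python
import math

total_error = 1 / 8  # the Weight stump's total error from the video
ratio = (1 - total_error) / total_error  # = 7

base_e = 0.5 * math.log(ratio)    # ~0.97, what the video uses
base_10 = 0.5 * math.log10(ratio)  # ~0.42, what the commenter computed
```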
Thank you very much for this video!! I have a question : at 19:13, How does a stump finally classify a patient as "Has heart disease" or "Does not have a heart disease" . I thought it depends on the value of the feature in the root node ...
The leaves of each stump make the classification based on the feature and threshold in the root. Does that make sense or are you asking about something else?
@@statquest Thank you for your answer. Yes, it does make sense, but you're saying that the leaves of each stump make the classification, and at 19:13 we're considering that the whole stump is doing the classification. I didn't understand the transition from the classification of each leaf to the classification of the whole stump!
@@ghofranezouaoui4269 I hope it's all clear now. We get some new data, and apply it to each stump. Each stump has a root that sends us to a leaf, and the leaf gives us the classification for that stump, given the data.
@@statquest Yes now I got it! I forgot that at that point we're considering new data so it's either this or that. Thank you so much for keeping things so simple and clear and for your quick responses :)
On what basis are we labeling the left child node "Yes heart disease" and the right child node "No heart disease"? Cuz we just split the data based on a condition, yeah... or is it like we train the stump and then feed the dataset into this stump to get the Correct and Incorrect values? If that's the case, how is it possible in the case of Blocked Arteries, where both child nodes are impure?
Hi, thank you for your clear explanation. I have one question: Is the learning rate set to 0.5 in the calculation of the "amount of say"? because you didn't illustrate it
The concept of a learning rate was never part of the original algorithm, which is why I didn't illustrate it. To be honest, I'm not really sure how it would work in this context. If we scaled every "amount of say" by 0.5, then we would still get the exact same results.
Thanks! What about gradient boosting? Is it used for genomics? I am aware that it has been successful used in Kaggle competitions, but don't find applications to genomics, in spite of the support of XGBoost and CatBoost for R.
You know what's really funny - I just wrote a genomics application that uses XGBoost, so I know it can work in that setting. I'm using it to predict cell type from single-cell RNA-seq data. It works better than AdaBoost or Random Forests. However, it turns out that Random Forests have some nice statistical properties that make me want to use them over gradient boost. I may pursue both methods.
Hi Josh, thanks for the awesome video. Just a query here: you are creating a new sample dataset — is it a kind of bagging (bootstrapped dataset), like in random forest? Thanks in advance :)
Einstein says "if you can't explain it simply you don't understand it well enough" and I found this AdaBoost explanation bloody simple. Thank you, Sir.
Josh, this is just awesome. The simple and yet effective ways you explain otherwise complicated Machine Learning topics is outstanding. You are a talented educator and such a bless for the entire ML / Data Science / Statistics learners all around the world.
AdaBoost: Forest of Stumps 1:30 stump: a tree with just 1 node and 2 leaves. 3:30 AdaBoost: Forest of Stumps; different stumps have different weight/say/voice; each stump takes the previous stumps' mistakes into account. (AdaBoost, short for Adaptive Boosting) 6:40 7:00 Total Error: sum of all sample weights associated with incorrectly classified samples. 7:15 Total Error ∈ [0,1] since all sample weights of the training data add up to 1. (0 means a perfect stump; 1 means a horrible stump) --[class notes]
Hi Josh, excellent video. But I am not able to understand how the weighted Gini index is calculated after I have adjusted the sample weights... Can you please help?
Take the example of Chest Pain:
Gini index = 1 - (3/5)^2 - (2/5)^2 = 0.48 for the Yes category
Gini index = 1 - (2/3)^2 - (1/3)^2 = 0.44 for the No category
Since each category has a different number of samples, we have to take the weighted average in order to get the overall (weighted) Gini index:
Yes category weight = (3 + 2) / (3 + 2 + 2 + 1) = 5/8
No category weight = (2 + 1) / (3 + 2 + 2 + 1) = 3/8
Total Weighted Gini index = 0.48 * (5/8) + 0.44 * (3/8) = 0.47
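The Chest Pain calculation above translates directly into a few lines of Python (an illustrative sketch of the same arithmetic, not StatQuest's code):

```python
def gini(counts):
    # Gini impurity of a leaf: 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Chest Pain: Yes leaf has 3 correct / 2 incorrect,
# No leaf has 2 correct / 1 incorrect.
yes_leaf, no_leaf = [3, 2], [2, 1]
n_yes, n_no = sum(yes_leaf), sum(no_leaf)
n = n_yes + n_no

# Weight each leaf's impurity by its share of the samples.
weighted_gini = gini(yes_leaf) * n_yes / n + gini(no_leaf) * n_no / n
# ~0.47, matching the calculation above
```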
Thank you for the study guides Josh! I did not know about them and I spent 5 HOURS making notes about your videos on decision trees and random forests. I think 3 USD is worth less than 5 hours of my time. I purchased the study guide for AdaBoost and cannot wait for the rest of them (especially neural networks!)
Could you elaborate on weighted gini function? Do you mean that for computing the probabilities we take weighted sums instead of just taking the ratio, or is it something else?
I understand he calculates the Gini for every leaf, then multiplies by the number of predictions in that leaf and divides by the total number of predictions in both leaves (8), so the index is weighted by the size of the leaf. Then he sums the weighted indices from both leaves. At least I'm getting the same results when applying this formula.
In this case, you need to use "log base 'e'", or the "natural log". In machine learning and statistics, people often use "log()" to mean the natural log. I know this is confusing, but, as long as you are consistent, you can use any log base and things will work.