I appreciate the effort you put into making this video. You read and summarized the XGBoost paper! I think you explain XGBoost better than anyone in the world!!!
It's funny how I always like StatQuest videos before watching them... and never regret it :D Same with this comment, I'm that sure it's true ;)
Great series on XGBoost! Thank you very much for making them. Now I have a much clearer understanding of XGBoost, especially its boosted trees and the computational advantages that make it fast. The animations are wonderful and make things much easier to follow. I love them so much. Hope that you will make something on LightGBM and CatBoost.
OMG !!!! What an explanation! An extremely detailed explanation of extreme gradient boosting. I think no one on this planet can explain this topic like you, Josh! You have literally done an autopsy of this algorithm to get into those details :) Thanks a ton for this amazing video !!!!
Please accept my virtual standing ovation :) I finished the entire XGBoost tutorial and you made it sound super simple. Please also add LightGBM to the list. Thanks again for all your work.
Finally... graduated in tree-based algos from the StatQuest academy. What a feeling :) :)... One absolute last hiccup: what does "XGBoost splits the data so that both drives get a unique set of data" mean? What is a unique set of data there? And why does it ensure that parallel reads can happen? Why can't parallel reads happen if the "unique set of data" isn't there on different drives?
The idea is that we start out with a dataset that is too large to fit onto a single disk drive, so we have to split it up. We could split it so that there is some overlap between what goes on drive A and what goes on drive B, but there's no speed advantage to that. Instead, if each drive has a unique subset of the data, we can read from both drives simultaneously (in parallel) to access records.
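A toy sketch of that idea (my own illustration, not XGBoost's actual code): each "drive" holds a disjoint block of rows, so both blocks can be read at the same time instead of one after the other.

```python
# Toy sketch: disjoint shards allow parallel reads with no wasted work.
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(rows, n_shards):
    """Partition rows into disjoint (non-overlapping) shards."""
    return [rows[i::n_shards] for i in range(n_shards)]

def read_shard(shard):
    """Stand-in for reading one drive's block of records."""
    return [row * 2 for row in shard]  # pretend "processing"

rows = list(range(10))
shards = split_into_shards(rows, 2)          # e.g. drive A and drive B
assert set(shards[0]).isdisjoint(shards[1])  # no overlap between drives

# Both shards are read simultaneously (in parallel).
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(read_shard, shards))
```

If the shards overlapped, some rows would be read twice for no benefit; with disjoint shards, every parallel read contributes new records.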
Thank you very much for the videos. You give the best explanations. I'm looking forward to the LightGBM video (hope there will be one, someday). That would be a HUGE BAAM!
So grateful for these videos. They've made XGBoost simple to understand even though the concepts are a little complicated. With my eyes closed, I can now say how the trees are built, how they are optimized, how they are made faster, etc.
Thanks for the great explanation! I have two questions: 1. How large does a dataset need to be before we should use parallel learning? 2. In parallel learning, do we build just one tree in order to find the weighted quantile sketch?
1. I'm not sure. However, it is probably mentioned in the documentation. 2. Parallel learning makes it so we can find the best feature split faster for a given node in a tree.
This is the most lucid, most comprehensive one-stop shop for the tree-based algos you are going to need in any job at all. The missing LightGBM video definitely leaves something to be desired. Can we expect a LightGBM video anytime soon, Josh?
I just wanted to say thanks. I really struggled in college because I always felt like the professors explained things as if we already knew everything they were teaching. The way you break everything down is super helpful. I hope you continue to make these sorts of videos even when you become a world-famous musician. Keep it up :)
@@statquest Would you mind explaining it to me? Does LightGBM use the same concept as XGBoost for finding the candidate splits (e.g. quantization)?
@@deana00 My understanding is that the first handful of trees (I believe 10 is the default) are built like XGBoost, and then it uses a subset of the elements in a feature based on the size of the gradients.
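A rough sketch of that gradient-based subsetting, as I understand it from the LightGBM paper's GOSS (Gradient-based One-Side Sampling); the numbers and function name here are my own illustration, not LightGBM's API: keep every row with a large |gradient|, randomly sample a fraction of the small-gradient rows, and up-weight the sampled ones to compensate.

```python
# Rough sketch of GOSS from the LightGBM paper (illustration only).
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (kept_indices, weights) for one boosting iteration.

    a: fraction of rows kept for having the largest |gradient|
    b: fraction of the remaining rows sampled at random
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    top = order[:top_k]                       # large-gradient rows: always kept
    rest = order[top_k:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, int(b * n))    # small-gradient rows: one-side sampled
    amplify = (1 - a) / b                     # re-weight to compensate for undersampling
    kept = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return kept, weights

grads = [0.9, -0.05, 0.8, 0.02, -0.7, 0.01, 0.6, -0.03, 0.04, -0.02]
kept, weights = goss_sample(grads, a=0.2, b=0.2)
```

The intuition: rows with small gradients are already well fit, so most of them can be skipped; the `(1 - a) / b` weight keeps the sampled remainder statistically representative.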
Thanks, Josh. One question about the greedy part (4th minute): in random forests (say, for regression), even though we use a subset of features for each tree and a bootstrapped subset of the data, we could still end up with many thresholds to examine, similar to the problem with XGBoost. Why not use quantiles when building random forest trees that deal with a large number of thresholds?
Great question! I don't think there is a reason why not. All you have to do is implement it. I think a lot of the "big data" optimizations that XGBoost has could be used with all kinds of other ML algorithms.
Hi Josh. I am not sure if you still check these comments, but I wanted to thank you for making these really amazing and informative videos. I am not sure I could pursue machine learning in my own time if I did not have these great resources to clearly explain the content to me. Thanks for making the maths fun and showing all the cool details, it's great :)
Thanks for the highly accessible explanations of how XGBoost performs its calculations. What about doing a few videos on tuning the hyperparameters for improved model fits? For boosted trees and random forests, this is not so complicated, but XGBoost has many, many parameters that can be tuned, making the optimization of the model quite challenging.
Hi Josh, thanks for the awesome video. While you are preparing the R and Python videos on XGBoost hyperparameter tuning, it would be great if you could point to some resources on XGBoost hyperparameters in the meantime.
Thank you for the detailed but easy-to-understand video. I'm also interested in LightGBM algorithm as well (seems like it was compared with xgboost a lot), so I would be happy if you made one for lgbm as well.
At 19:56, when choosing a leaf for missing values, you select the left branch as the default path. It makes sense because the missing-value residuals (-3.5 and -2.5) are negative, similar to the non-missing residuals (-5.5 and -7.5). I wonder if the right branch could be selected as the default path for missing values if my residuals were large and positive, e.g. 10.5 instead of -3.5 and -2.5.
Nice explanation! I want to ask: when we use a large dataset with sparsity-aware split finding, must we also use parallel learning and the weighted quantile sketch to find the threshold?
Hi Josh!! Thanks for this ❤️. But can you explain how you found the residual values for the missing dosages using the initial predictions in Sparsity-Aware Split Finding? Or if anyone else knows, please help me with this. Thanks in advance.
I'm not sure I understand your question however, In order to calculate the residuals, we only need to know the drug effectiveness. In other words, we don't need to know the dosages to calculate the residuals. So we calculate the residuals and then figure out the default direction to go in the tree for the missing values (which we never have to actually fill in).
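A small sketch of that point (my own toy numbers, using the similarity-score formula from the videos with the 0.5 initial prediction; this is an illustration, not XGBoost's actual code): the residuals come straight from the observed targets, no dosage needed, and the missing-dosage rows then get a default direction by trying them on each side of a candidate split and keeping the side with more gain.

```python
# Illustration: residuals without dosages, then a default direction
# for missing values chosen by gain (toy numbers, lambda = 0).

def similarity(residuals, lam=0.0):
    """Similarity score: (sum of residuals)^2 / (count + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

effectiveness = [-10.5, 6.5, 7.5, -7.5]          # observed targets
initial_prediction = 0.5
residuals = [y - initial_prediction for y in effectiveness]

left, right = residuals[:1], residuals[1:3]      # rows with known dosages
missing = [residuals[3]]                         # row whose dosage is unknown

root = similarity(left + right + missing)
gain_left = similarity(left + missing) + similarity(right) - root
gain_right = similarity(left) + similarity(right + missing) - root
default = "left" if gain_left > gain_right else "right"
```

Here the missing row's residual (-8.0) is negative like the left leaf's, so sending it left gives the larger gain and "left" becomes the default path, matching the intuition in the question above.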
Yes yes, I completely overlooked that!!. Maximum Bam!!!❤️ Also we all really appreciate how you still reply and clear doubts from the old videos. Thanks Josh!
Great video! But I don't understand why the *Hessian* is used as the *weight* for the quantile histogram. What is the underlying mathematical reason that the 2nd-order derivative plays the role of a weight?
The Hessian is p̂ × (1 − p̂), so you can think of it as a measure of how confident the current classification is. If p̂ = 0.9, then H = 0.9 × (1 − 0.9) = 0.09; H is very small, meaning the classification in that branch is already good enough and the classes are well separated. If p̂ = 0.5, then H = 0.5 × (1 − 0.5) = 0.25; H is at its maximum, meaning the classification in that branch is not yet good and the classes are not well separated. So building quantiles from these Hessian values separates the confidently classified observations from the uncertainly classified ones, via the sums shown exactly @14:30.
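A rough sketch of how those Hessian weights drive the split candidates (inspired by the rank-based description in the XGBoost paper; the function names and numbers here are my own illustration, not XGBoost's code): confident rows get tiny weights, uncertain rows get big weights, and candidate split points are placed so that roughly equal amounts of *total weight* fall between consecutive candidates.

```python
# Illustration of Hessian-weighted split candidates (not XGBoost's code).

def hessian(p):
    """Hessian / weight for classification: p * (1 - p)."""
    return p * (1 - p)

# Dosages (sorted) and the previous predicted probability for each row.
values = [10, 20, 30, 40, 50, 60]
probs  = [0.9, 0.9, 0.1, 0.1, 0.5, 0.5]
weights = [hessian(p) for p in probs]   # ~[0.09, 0.09, 0.09, 0.09, 0.25, 0.25]

def candidate_splits(values, weights, eps):
    """Emit a candidate each time ~eps of the total weight accumulates."""
    total = sum(weights)
    candidates, acc = [], 0.0
    for v, w in sorted(zip(values, weights)):
        acc += w
        if acc >= eps * total:
            candidates.append(v)
            acc = 0.0
    return candidates

candidates = candidate_splits(values, weights, eps=1 / 3)
```

Note how the four confident rows (weight 0.09 each) together carry about as much weight as a single uncertain row (0.25), so the quantile boundaries crowd around the low-confidence observations, which is exactly where a better split is most needed.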
I have yet another question on the histogram-building process. Let's say I had 1M rows and 1 feature:
1. Build the histogram. Now there are approx 33 split points.
2. Split point 10 gives the max gain.
3. Data from the first 10 bins go to the left child and data from the 10th bin onwards go to the right child.
*4. To further split the left and right children on the same feature, (a) are histograms with ~33 split points built again for the data that landed in the left and right children? Or (b) are only the 10 already-computed split points considered to compute gain for the left child, and the 23 already-computed split points for the right child?*
I think it is (b), since that is the only option that can result in a speedup IMO.
Based on the manuscript alone, I would say that it is (a), but it's possible that, in practice, they reuse the splits and use (b). Theory and practice aren't always the exact same.
@@statquest Thank you for the reply, Josh :). Since yesterday, I have read the comparison of histogram-based split finding vs GOSS-based split finding given in the LightGBM paper; the two algorithms are juxtaposed side by side. Based on that, I am reasonably confident that it is (b). There's another speed-up trick in which the histogram is computed for only one child, and the histogram for the other child is simply parent_node_histogram - computed_child_histogram. This would only be possible if the histogram is computed once at the beginning of the tree and then reused throughout. I got as much from the LightGBM manuscript. Looking forward to your thoughts on the same.
@@nitinsiwach1989 That makes sense because LightGBM builds trees "leaf-wise" - so it looks at the two leaves and selects the one that has less variation to add branches to.
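The histogram-subtraction trick mentioned above can be sketched in a few lines (my own toy example, following the LightGBM paper's description, not its actual code): every row lands in exactly one child, so the sibling's histogram is just the parent's histogram minus the computed child's, bin by bin, with no need to scan the sibling's rows at all.

```python
# Toy demo of the parent - child histogram subtraction trick.

def build_histogram(bins, gradients, n_bins):
    """Sum of gradients per feature bin."""
    hist = [0.0] * n_bins
    for b, g in zip(bins, gradients):
        hist[b] += g
    return hist

bins      = [0, 0, 1, 1, 2, 2]        # pre-computed bin index per row
gradients = [1.0, 2.0, -1.0, 3.0, 0.5, -0.5]
goes_left = [True, True, False, True, False, False]

parent = build_histogram(bins, gradients, 3)
left = build_histogram(
    [b for b, l in zip(bins, goes_left) if l],
    [g for g, l in zip(gradients, goes_left) if l],
    3,
)
right = [p - l for p, l in zip(parent, left)]   # the subtraction trick

# Same answer as scanning the right child's rows directly:
right_direct = build_histogram(
    [b for b, l in zip(bins, goes_left) if not l],
    [g for g, l in zip(gradients, goes_left) if not l],
    3,
)
```

Computing the smaller child directly and subtracting to get the larger one means each level of the tree only ever scans about half the rows per split.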
At 14:25, what if we had 5 samples at 0.1 probability instead of 2, as well as another group of 5 samples at 0.9 probability instead of 2, in addition to the last two low-confidence samples? Would the first and second groups end up in two separate quantiles with a total weight of 0.45 each? If so, then the third quantile would contain the last two samples with opposite residuals, since their sum of weights (0.48) is almost equal to that of the first two quantiles.
Thank you for the amazing video !! But I have some questions. Does it mean that all the missing values for a given feature (let's say feature A) need to be on the same side (all left or all right)? If so, does that mean XGBoost treats all the missing values in feature A the same? Thank you Josh : D
Thank you for your really helpful series about XGBoost. I've got a question: when you talk about a huge dataset, what do you mean? Also, can 67,000 rows × 9 columns be considered a huge dataset? Thank you in advance for your answer. BAM!
In this case, "huge" = so big that we cannot fit all of it in the available RAM at the same time. On my computer, 67,000 × 9 would not be huge because I can load that into memory all at once.
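A quick back-of-envelope check of that claim (assuming 8-byte floats, which is a typical in-memory representation):

```python
# 67,000 rows x 9 columns of 8-byte floats is only a few megabytes.
rows, cols, bytes_per_float = 67_000, 9, 8
total_bytes = rows * cols * bytes_per_float   # 4,824,000 bytes
megabytes = total_bytes / 1024 / 1024         # ~4.6 MB
```

A few megabytes is tiny next to the gigabytes of RAM on a typical machine, so that dataset comfortably fits in memory.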
If I may ask politely, can someone give me a general yet simple explanation of how XGBoost handles missing values in a dataset? It's for my thesis; my lecturer asked me to explain it more specifically. Thank you :)
@@rahul-qo3fi XGBoost converts all categorical variables to numeric via One-hot-encoding ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-589nCGeWG1w.html ). XGBoost can do this efficiently for variables with lots of options by using sparse matrices (which only keep track of the non-zero values).
At 15:06, we have a sum of weights of 0.18 in one quantile, 0.18 in another, and then 0.24 in the last two. But as far as I understood, you explained that in weighted quantiles the sums of weights in all the quantiles are equal, yet here they're not equal across all 4?
It's not equal, but it's as close to equal as it can be. Does that make sense? If equal isn't an option (and that is the case here), XGBoost gets as close to equal as possible.
Hello Josh. Many thanks for another super BAM video! I have a question about the missing value part. You explained how xgboost incorporates missing values in training and makes predictions for missing values in future data. What happens when there are no missing values in training, but there are in testing/future data?
We can make our training data have missing values, so that if any future test data has missing values, our model can easily handle it.
Hi Josh, thanks for making these awesome videos that teach XGBoost in depth. But I want to ask: is the *weight* you mention here the same as the *cover* you mentioned in previous videos? They both have the same formula.
At 9:19 I say that the weights are derived from the Cover metric, and since I later say they have the exact same formula, then we can assume that they are the same.
Thanks for the XGBoost series. If a feature like Dosage is in string format, how does XGBoost use it to build a tree? For example, time-series-like data such as Sunday, Monday, Tuesday, etc. Thanks.
Thanks Josh for this amazing video. Your explanations are really great and helpful. It would be great if you could share your approach to understanding complex algorithms like XGBoost. Currently I am digging into CatBoost and am just curious what resources or plan you follow when you want to understand a new algorithm: reading research papers, understanding the maths behind it, etc.
The great ML channel. Josh, are you planning to give lectures on convolutional neural networks and capsule networks for deep learning? I'm expecting those. Bam!
Your videos are of excellent quality, but please stop the bad music at the start of every video... It is a humble request, because it distracts the mind before the learning process begins...