
XGBoost Part 3 (of 4): Mathematical Details 

StatQuest with Josh Starmer
1.3M subscribers
127K views

In this video we dive into the nitty-gritty details of the math behind XGBoost trees. We derive the equations for the Output Values from the leaves as well as the Similarity Score. Then we show how these general equations are customized for Regression or Classification by their respective Loss Functions. If you make it to the end, you will be approximately 22% smarter than you are now! :)
NOTE: This StatQuest assumes that you are already familiar with...
XGBoost Part 1: XGBoost Trees for Regression: • XGBoost Part 1 (of 4):...
XGBoost Part 2: XGBoost Trees for Classification: • XGBoost Part 2 (of 4):...
Gradient Boost Part 1: Regression Main Ideas: • Gradient Boost Part 1 ...
Gradient Boost Part 2: Regression Details: • Gradient Boost Part 2 ...
Gradient Boost Part 3: Classification Main Ideas: • Gradient Boost Part 3 ...
Gradient Boost Part 4: Classification Details: • Gradient Boost Part 4 ...
...and Ridge Regression: • Regularization Part 1:...
Also note, this StatQuest is based on the following sources:
The original XGBoost manuscript: arxiv.org/pdf/...
The original XGBoost presentation: homes.cs.washi...
And the XGBoost Documentation: xgboost.readth...
Last but not least, I want to extend a special thanks to Giuseppe Fasanella and Samuel Judge for thoughtful discussions and helping me understand the math.
For a complete index of all the StatQuest videos, check out:
statquest.org/...
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - statquest.gumr...
Paperback - www.amazon.com...
Kindle eBook - www.amazon.com...
Patreon: / statquest
...or...
RU-vid Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshi...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer....
...or just donating to StatQuest!
www.paypal.me/...
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
Corrections:
1:16 The Lambda should be outside of the square brackets.
#statquest #xgboost

Published: 2 Oct 2024

Comments: 298
@statquest
@statquest 4 года назад
Corrections: 1:16 The Lambda should be outside of the square brackets. Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@SuperJ98
@SuperJ98 Год назад
Hey Josh, thank you very much for all of your videos. They have been very helpful for my master's thesis. I know this is a small detail, but I think the Similarity Score at 23:05 of this video is negative and should have a minus before the 1/2. At least that's what I see on page 3 of the original paper.
@statquest
@statquest Год назад
@@SuperJ98 Yes, you are referring to equation 6 in the manuscript - that equation definitely has a minus sign in it. However, I was referring to the individual terms in equation #7 and in Algorithm #1 on that same page. Those terms are what I ended up calling "similarity scores" because of what they represented and how they were used to find the optimal split. That said, I should have been clearer in the video about what, exactly, I was referring to.
@ricclx7290
@ricclx7290 Год назад
Hello Josh, great explanation. One question I have is that in neural nets we take derivatives for gradients during the epochs of the training process and do backpropagation, etc. From your explanation, I gather that there is, theoretically, a loss to be calculated and minimized, but in practice the derivatives (gradients) are always that one equation of adding the residuals and dividing by the total. So should I conclude that there is no derivative calculation during training, and instead we just use that one equation?
@statquest
@statquest Год назад
@@ricclx7290 Yes. Unlike neural networks, XGBoost uses the same overall design for every single model, so we only have to calculate the derivative once (on paper) and know that it will work for all of the models we create. In contrast, every neural network has a different design (different number of hidden layers, different loss functions, different number of weights, etc.), so we always have to calculate the gradient for each model.
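(Note for readers following this thread: a minimal sketch of the fixed, pre-derived formulas this reply refers to, assuming the video's squared-error loss for regression and log-loss for classification; the names are illustrative and this is not XGBoost's actual source code.)
import numpy as np

def leaf_output_regression(residuals, lam):
    # Derived once, on paper, from the 2nd-order Taylor expansion of 1/2*(y - p)^2:
    # (sum of residuals) / (number of residuals + lambda)
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() / (len(residuals) + lam)

def leaf_output_classification(residuals, prev_probs, lam):
    # Same derivation with the log-loss:
    # (sum of residuals) / (sum of p*(1 - p) + lambda)
    residuals = np.asarray(residuals, dtype=float)
    p = np.asarray(prev_probs, dtype=float)
    return residuals.sum() / ((p * (1.0 - p)).sum() + lam)
During training, only these closed forms are evaluated; no per-model differentiation is needed.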
@beautyisinmind2163
@beautyisinmind2163 3 года назад
I wish this channel lives 1000 years on YouTube.
@statquest
@statquest 3 года назад
Thank you!
@jiaqint961
@jiaqint961 7 месяцев назад
OMG... The way you break complicated concepts down into simple ones is amazing. Thank you for the content.
@statquest
@statquest 7 месяцев назад
Thank you!
@sxfjohn
@sxfjohn 4 года назад
The most valuable and most clearly explained coverage of the hard core of XGBoost. Thanks!
@statquest
@statquest 4 года назад
Thank you very much! :)
@pedroramon3942
@pedroramon3942 Год назад
Thank you very much for explaining all this very hard math from the original article. I did all the calculations and now I can say I understand XGBoost in depth.
@statquest
@statquest Год назад
BAM! :)
@sayantandutta8353
@sayantandutta8353 4 года назад
I just completed the Ada Boost, Gradient Boost and XGBoost series, it was awesome. Thanks Josh for the awesome contents!
@statquest
@statquest 4 года назад
Thank you very much!! You deserve a prize for getting through all those videos. :)
@sayantandutta8353
@sayantandutta8353 4 года назад
:-) :-)
@junbinlin6764
@junbinlin6764 3 года назад
Your YouTube channel is amazing. Once I find a job related to data science after uni, I will donate fat stacks to this channel.
@statquest
@statquest 3 года назад
Triple bam! :)
@nielshenrikkrogh5195
@nielshenrikkrogh5195 8 месяцев назад
as always a very structured and easy to understand explanation......many thanks!!
@statquest
@statquest 8 месяцев назад
Glad you liked it!
@3Jkkk2
@3Jkkk2 2 года назад
Josh you are the best! I love your songs at the beginning
@statquest
@statquest 2 года назад
Thanks!
@auzaluis
@auzaluis Год назад
gosh!!! such a clean explanation!!!
@statquest
@statquest Год назад
Thanks!
@alex_zetsu
@alex_zetsu 4 года назад
I knew enough calculus to know what the second derivative with respect to Pi would be, but even though you spoke normally and I could see it coming, "the number one" seemed so funny after doing all that.
@statquest
@statquest 4 года назад
Ha! Yeah, isn't that funny? It all just boils down to the number 1. :)
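(For anyone who wants the one-line calculation behind the joke, assuming the unregularized squared-error loss used in the video: $L(y_i, p_i) = \tfrac{1}{2}(y_i - p_i)^2$, so $\frac{\partial L}{\partial p_i} = -(y_i - p_i)$ and $\frac{\partial^2 L}{\partial p_i^2} = 1$; every $h_i$ in the regression case really is just the number one.)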
@user-fy4mu7tp6h
@user-fy4mu7tp6h 10 месяцев назад
Very nice explanation on the math. love it !
@statquest
@statquest 10 месяцев назад
Glad you liked it!
@iraklimachabeli6659
@iraklimachabeli6659 3 года назад
This is a brilliant and very detailed explanation of the math behind XGBoost. I love that the notation uses minimal subscripts. I was scratching my head for a day after looking at the original paper by Chen and Guestrin. This video clearly laid out all the steps: the Taylor expansion of the loss function and then the gradient of the second order approximation with respect to the tree's current prediction. Now it's so obvious that the gradient is with respect to the current prediction, but somehow it was not clear before.
@statquest
@statquest 3 года назад
Glad it was helpful!
@jingzhouzhao8609
@jingzhouzhao8609 9 месяцев назад
Merry Christmas Josh, 😊 Just a quick observation: at 11:00, I noticed that p_i represents the previous predicted value, therefore, p_i-1 might be a better notation to denote this.
@statquest
@statquest 9 месяцев назад
I'm just trying to be consistent with the notation used in the original manuscript.
@henkhbit5748
@henkhbit5748 3 года назад
The math was, as always, elegantly explained. Analogous to your support vector machine math explanation using the Taylor series for the radial kernel.
@statquest
@statquest 3 года назад
Yes, the Taylor series shows up in a lot of places in machine learning. It's one of the "main ideas" behind how ML really works.
@CrazyProgrammer16
@CrazyProgrammer16 Год назад
Very well explained. Thank you.
@statquest
@statquest Год назад
Glad you liked it!
@damianos17xyz99
@damianos17xyz99 4 года назад
Oh, after finishing my project on XGBoost classification I got the max score! I have watched the first two parts and they were really helpful, thanks! Now the third part is here, yes! :-) What a helpful man!
@statquest
@statquest 4 года назад
Congratulations on your project! That is awesome! There is one more video after this one: Part 4: XGBoost Optimizations.
@damianos17xyz99
@damianos17xyz99 4 года назад
:-) :-) :-) ! :D 😝👍👍
@knightedpanther
@knightedpanther Год назад
Thanks Josh. You are awesome. Please let me know if I got this right: For Gradient Boosting, we are fitting a regression tree so the loss function is just sum of squared residuals. When deciding a split we just try to minimize the sum of squared residuals. For XGboosting they modified the loss function by adding the regularization term. So when deciding a split, we can just try minimizing this new loss function. However they decided to flip it for clarity (or other purposes like maximization instead of minimization which we don't know) and called it similarity and we try to maximize it when deciding a split.
@statquest
@statquest Год назад
Both methods have two different loss functions, depending on whether they are performing regression or classification. Since you are interested in these details, I would strongly recommend that you watch all 4 gradient boost videos and all 4 of the xgboost videos very carefully. You can find them here: statquest.org/video-index/
@knightedpanther
@knightedpanther Год назад
@@statquest Hi Josh, thank you. I have already watched the videos. After your comment, I looked up my notes which I made while watching them. For Gradient Boosting, even though the loss functions are different (Sum of Squared Residuals for regression and log loss for classification), when we are fitting an individual tree for both cases, we try to minimize the sum of squared residuals when deciding a split. But the output value for both cases are different. For regression case, it is just the mean of the residuals in that leaf but for classification, it is sum of residuals divided by sum of pi(1-pi) for all observations in that leaf. For Extreme Boosting tree, the split condition is also different for regression and classification. The definition of similarity score changes. For regression it is sum of residuals squared divided by number of residuals + lambda. For classification, it is sum of residuals squared divided by sum of pi(1-pi) for all terms + lambda. The output values are also different just like Gradient Boosting. Now My question is why don't we change the split condition in gradient boosting for classification like it is done in Extreme Gradient Boosting?
@knightedpanther
@knightedpanther Год назад
Referring to this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html&ab_channel=StatQuestwithJoshStarmer... for gradient boosting..If we put the calculated value of gamma back in the loss function equation, we will get something like sum of squared residuals for all observations divided by sum of p(1-p) for all observations. Why don't we use this as the split criteria for gradient boosting classification like we do in XGBoost?
@statquest
@statquest Год назад
@@knightedpanther Gradient boosting came first. XGBoost improved on it. If you want what XGBoost offers, just use it instead.
@knightedpanther
@knightedpanther Год назад
@@statquest Thanks Josh. I was just trying to understand if there was a mathematical or logical reasoning behind what these two algorithms were doing that I missed.
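(Note for readers following this thread: a small sketch contrasting the two similarity scores being discussed, as presented in Parts 1 and 2; the factor of 1/2 from the derivation is dropped, as in the videos, and the names are illustrative.)
def similarity_regression(residuals, lam):
    # (sum of residuals)^2 / (number of residuals + lambda)
    s = sum(residuals)
    return s * s / (len(residuals) + lam)

def similarity_classification(residuals, prev_probs, lam):
    # (sum of residuals)^2 / (sum of p*(1 - p) + lambda)
    s = sum(residuals)
    return s * s / (sum(p * (1.0 - p) for p in prev_probs) + lam)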
@Cathy55Ms
@Cathy55Ms 2 года назад
Great tutorial material for anyone who needs the fundamental ideas behind these methods! Do you plan to publish videos on LightGBM and CatBoost too?
@statquest
@statquest 2 года назад
I hope so!
@chrischu2476
@chrischu2476 2 года назад
This is the best educational channel that I've ever seen. There seems to be a little problem at 18:02, when you convert L(yi, pi) to L(yi, log(odds)i). I thought pi is equal to e^log(odds) / (1 + e^log(odds)). Please tell me if I am wrong or misunderstand something. Thanks a lot.
@statquest
@statquest 2 года назад
This is explained in the video Gradient Boost Part 4 here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html NOTE: There is a slight typo in that explanation: log(p) - log(1-p) is not equal to log(p)/log(1-p) but equal to log(p/(1-p)). In other words, the result log(p) - log(1-p) = log(odds) is correct, and thus the error does not propagate beyond its short, but embarrassing, moment.
@chrischu2476
@chrischu2476 2 года назад
Thanks a lot. You've explained it very well in Gradient Boost Part 4. I can understand how -[yi log(pi) + (1-yi) log(1-pi)] is converted to -yi log(odds) + log(1+e^log(odds)) (right of the equal sign) at 18:02, but why is L(yi, pi) equal to L(yi, log(odds)i) (left of the equal sign)? Thanks for your patience in replying to me.
@statquest
@statquest 2 года назад
@@chrischu2476 If we look at the right sides, of both equations, we have a function of 'p' and we have a function of 'log(odds)'. As we saw in the other video, the right hand sides are equal to each other. So, the left hand sides just show how those functions are parameterized. One is a function of 'p' and the other is a function of 'log(odds)'.
@chrischu2476
@chrischu2476 2 года назад
@@statquest Oh...I got it. Thank you again for everything you've done.
@bktsys
@bktsys 4 года назад
Keep going the Quest!!!
@statquest
@statquest 4 года назад
Hooray! :)
@anastasia_wang17
@anastasia_wang17 3 года назад
Hey, Josh! I really enjoy your videos and I could not express my gratitude enough!
@statquest
@statquest 3 года назад
Glad you like them!
@hoomankashfi1282
@hoomankashfi1282 Год назад
You did a great job with this quest. Could you please make another quest and describe how XGBoost handles multi-class classification tasks? There are several strategies in scikit-learn, but understanding them is another issue. Good luck
@statquest
@statquest Год назад
Thanks! I'll keep that in mind.
@anunaysanganal
@anunaysanganal Год назад
Thank you for this great tutorial! I had a question regarding the similarity score; why do we need a similarity score in the first place? Why can't we just use a normal decision tree with MSE as a splitting criterion like in GBT?
@statquest
@statquest Год назад
I think the main reason is that the similarity score can easily incorporate regularization penalties.
@anunaysanganal
@anunaysanganal Год назад
@@statquest Got it! Thank you so much!
@knightedpanther
@knightedpanther Год назад
I had similar doubt. Please correct me if I am wrong. This is what I gathered from the video: For Gradient Boosting, we are fitting a regression tree so the loss function is just sum of squared residuals. When deciding a split we just try to minimize the sum of squared residuals. For XGboosting they modified the loss function by adding the regularization term. So when deciding a split, we can just try minimizing this new loss function. However they decided to flip it for clarity (or other purposes like maximization instead of minimization which we don't know) and called it similarity and we try to maximize it when deciding a split.
@ayenewyihune
@ayenewyihune Год назад
Super clear
@statquest
@statquest Год назад
Thanks!
@suzyzang1659
@suzyzang1659 4 года назад
I was waiting for this for a very long time, cannot wait to learn!! May I please know when Part 4 will come out? Can you also show how to use XGBoost in R or Python? Thank you!!
@statquest
@statquest 4 года назад
Part 4 should be out soon - earlier for you if you support StatQuest and get early access. I'll also do a video for getting XGBoost running in R or Python.
@suzyzang1659
@suzyzang1659 4 года назад
@@statquest Thank you! Hurray!
@rodriguechidiac8648
@rodriguechidiac8648 4 года назад
@@statquest Can you add to that part a grid search as well once you do the video? Thanks a lot, awesome videos.
@suzyzang1659
@suzyzang1659 4 года назад
@@statquest Can you please help explain how XGBoost deals with missing values in R or Python? I was running an XGBoost model but the program cannot continue if there is a missing value in my data set. Thank you!
@statquest
@statquest 4 года назад
@@suzyzang1659 Wow, that is strange. In theory XGBoost should work with missing data just fine. Hmmm....
@SophiaSLi
@SophiaSLi 3 года назад
Thank you so much for the excellent explanation and illustration Josh!!! This is the best (clearest, best-organized, most comprehensible, most detailed) XGBoost lecture I've ever seen... I don't find my self having the need to ask follow-up questions as everything is explained so well!
@statquest
@statquest 3 года назад
Awesome, thank you!
@Erosis
@Erosis 4 года назад
It's crazy to think a graduate student (Tianqi Chen) came up with this... Very impressive.
@statquest
@statquest 4 года назад
Agreed. It's super impressive.
@benjaminlu9886
@benjaminlu9886 4 года назад
Hi Josh, what is Ovalue? Is that the result that the tree outputs? Wouldn't then the output value be 0.5/predicted drug effectiveness in the first example? Or is the output value a hyper parameter that is used in regularization? Also, BIG thanks for all these videos!
@statquest
@statquest 4 года назад
At 5:54 I say that Ovalue is the "output value for a leaf". Each leaf in a tree has its own "output value". For more details about what that means, check out XGBoost Parts 1 ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-OtD8wVaFm6E.html ) and 2 ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-8b1JEDvenQU.html ), as well as the series of videos on Gradient Boost ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-3CC4N4z3GJc.html )
@mengzhou193
@mengzhou193 2 года назад
Hi Josh! Amazing videos! I have one question at 6:39, you replace p_i with (initial prediction+output value), but according to part 1&2, I think it should be (initial prediction+eta*output value), am I right about this?
@statquest
@statquest 2 года назад
To keep that math simple, just let eta = 1.
@TennyZ-mw2jb
@TennyZ-mw2jb Год назад
@@statquest Thanks for your clarification. I guess you may need to mention this in the video next time if you simplify sth coz I got confused in this part too.
@salhjasa
@salhjasa 3 года назад
This channel is awesome. After searching and searching for somewhere to explain this clearly this is just perfect.
@statquest
@statquest 3 года назад
Thank you very much! :)
@rushilv4102
@rushilv4102 3 года назад
Your videos are really really helpful and easy to comprehend. Thank you so much!
@statquest
@statquest 3 года назад
Glad you like them!
@ParepalliKoushik
@ParepalliKoushik 3 года назад
Thanks for the detailed explanation, Josh. Why doesn't XGBoost need feature scaling even though it uses gradients?
@statquest
@statquest 3 года назад
Because it is based on trees.
@Vivekagrawal5800
@Vivekagrawal5800 2 года назад
Amazing Video!! Makes the Maths of XGBooost super simple. Thank you for your efforts...
@statquest
@statquest 2 года назад
Thank you very much! :)
@davidd2702
@davidd2702 3 года назад
Thank you for your fabulous video! I enjoyed it and understood it well! Could you tell me whether the output from the XGB classifier, giving 'confidence' in a specific output (allowing you to assign a class), is functionally equivalent to the statistical probability of an event occurring?
@statquest
@statquest 3 года назад
Although the output from XGBoost can be a probability, I don't think there is a lot of statistical theory behind it.
@palebluedot8733
@palebluedot8733 4 года назад
I can't get past the intro. It's so addictive and I'm not kidding lol.
@statquest
@statquest 4 года назад
BAM! :)
@mikhaeldito
@mikhaeldito 4 года назад
I couldn't give a lot but I am a proud patron of your work now! I hope others who are financially capable would also donate to StatQuest. BAM!
@statquest
@statquest 4 года назад
Thank you very much!!! Your patronage means a lot to me.
@ilia8265
@ilia8265 2 года назад
Can we have a study guide for XGBoost plz plz plz plz 😅
@statquest
@statquest 2 года назад
I'll keep that in mind for sure.
@cici412
@cici412 2 года назад
Thanks for the video. I have one question that I'm struggling with. At 7:58, why is the new predicted value not (0.5 + learning rate X Output Value)? Why is the "learning rate" omitted when computing the new predicted value?
@statquest
@statquest 2 года назад
If we included the learning rate at this point, then the optimal output value would end up being scaled to compensate for its effect. By omitting it at this stage, we can scale the output value (make it smaller) to prevent overfitting later on.
@cici412
@cici412 2 года назад
@@statquest Thank you for the reply! appreciate it.
@sayantanmazumdar9371
@sayantanmazumdar9371 2 года назад
If I am right here, then finding the output value is just like gradient descent on the loss function, like we do in neural networks.
@statquest
@statquest 2 года назад
It's a little different here, in that we have an exact formula for the output value and do not need to iterate to find it.
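(A sketch of why no iteration is needed: after the 2nd-order Taylor approximation, each leaf's objective is a simple quadratic in the output value, $\tilde{L}(O) = \big(\sum_i g_i\big) O + \tfrac{1}{2}\big(\sum_i h_i + \lambda\big) O^2$, so setting the derivative to zero gives the exact minimizer $O^* = -\sum_i g_i / \big(\sum_i h_i + \lambda\big)$ in one step, with no gradient-descent loop.)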
@abhashpr
@abhashpr Год назад
Wonderful explanation. I did not see this sort of thing anywhere else.
@statquest
@statquest Год назад
Thanks!
@rrrprogram8667
@rrrprogram8667 4 года назад
MEGAA BAAMMMMMM is backkk...
@statquest
@statquest 4 года назад
Hooray!!! :)
@zahrahsharif8431
@zahrahsharif8431 4 года назад
Maybe this is basic, but the Hessian matrix is a matrix of what exactly? It's the partial second order derivatives of the loss function, but with respect to what, just the log odds? Just trying to see the bigger picture here, applying it to training data.
@zahrahsharif8431
@zahrahsharif8431 4 года назад
Also you are looking at one feature in this example. If we have say 20 features how would the above be different??
@statquest
@statquest 4 года назад
If you have multiple features, you calculate the gain for all of them. The Hessian is the partial second order derivatives. In the case of XGBoost, those derivatives can be with respect to the log(odds) (for classification) or the predicted values (in regression).
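(A short sketch of those derivatives for the boolean classification case covered in the video, assuming the log-loss written in terms of the log(odds); this is illustrative, not XGBoost's source code.)
import numpy as np

def logloss_grad_hess(y, log_odds):
    # loss = -y*log(odds) + log(1 + e^log(odds))
    p = 1.0 / (1.0 + np.exp(-log_odds))  # previously predicted probability
    grad = p - y                         # 1st derivative w.r.t. log(odds)
    hess = p * (1.0 - p)                 # 2nd derivative w.r.t. log(odds)
    return grad, hess
With 20 features, these per-observation g and h values do not change; only the set of candidate splits that gets scored with them grows.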
@RishabhJain-u9r
@RishabhJain-u9r Месяц назад
Why would XGBoost have tree depth as a hyper-parameter with a default value of 6 when all we use is Stumps with a depth of 1!
@statquest
@statquest Месяц назад
The stumps are used to just provide examples of how the math is done. Usually you would use larger trees.
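(For readers wondering where that hyper-parameter lives in practice, a minimal, hedged example using the scikit-learn style XGBoost API; the values shown match the documented defaults as far as I know, and the training data is a placeholder.)
import xgboost as xgb

# Stumps (depth 1) keep the worked examples small; real models usually go deeper.
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.3, reg_lambda=1.0, gamma=0.0)
# model.fit(X_train, y_train)  # X_train / y_train stand in for your own data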
@xiaoyuchen3112
@xiaoyuchen3112 4 года назад
Fantastic video! I have a small question: if we calculate the similarity based on the gradient and second order gradient, how can these similarities be additive? That is to say, why can we add similarities in different leaves and compare them with the similarity in the root?
@statquest
@statquest 4 года назад
The formula for calculating similarity scores is just a scoring function. For more details, see: arxiv.org/pdf/1603.02754.pdf
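(To make that concrete: leaf similarity scores are never summed across the whole tree; each candidate split is scored against its own parent through the Gain, and gamma only enters when deciding whether to keep the branch. A tiny sketch with illustrative names, assuming the scores from Parts 1 and 2:)
def gain(sim_left, sim_right, sim_root):
    # How much better the two children describe the residuals than their parent alone.
    return sim_left + sim_right - sim_root

def keep_branch(gain_value, gamma):
    # Prune the lowest branch when Gain - gamma is negative.
    return gain_value - gamma > 0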
@sureshparit2988
@sureshparit2988 4 года назад
Thank Josh ! could you please make a video on LightGBM or share the difference between LightGBM and XGBoost.
@statquest
@statquest 4 года назад
It's on the to-do list.
@RishabhJain-u9r
@RishabhJain-u9r Месяц назад
can you possibly proof-read this, please. Step 1: Calculate a structure score for all three nodes in the stump. Structure score is given by sum of square of residuals of the observations divided by the number of observations, plus a factor called 𝛾 which is used to avoid overfitting. (As 𝛾 increases, the effect of the score of each single tree decreases in getting the outcome of the model.) Step 2: Calculating the Gain, which is the difference between the above structure score for the parent node and the sum of the structure scores for child nodes. Step 3: If the gain is positive, then the above split is made. The true derivation of this structure score is quite interesting and can be found in the original paper by University of Washington researchers.
@praveerparmar8157
@praveerparmar8157 3 года назад
Thank God you skipped the fun parts 😅😅. They were already much fun in the Gradient Boost video 😁😁
@statquest
@statquest 3 года назад
bam! :)
@karannchew2534
@karannchew2534 3 года назад
21:50 Why does the highest point of the parabola give the Similarity Score please? What exactly is the Similarity Score definition?
@statquest
@statquest 3 года назад
We derive this starting at 22:18
@iOSGamingDynasties
@iOSGamingDynasties 3 года назад
I am learning XGBoost and this has helped me greatly! So thank you Josh. One question, at 7:18, in the loss function, the term p_i^0 is the total value from previous trees? That being said, p_2^0 would be initial value 0.5 + eta * (output value of the leaf from the first tree), am I right?
@statquest
@statquest 3 года назад
I believe you are correct.
@iOSGamingDynasties
@iOSGamingDynasties 3 года назад
@@statquest Yay! Thank you :)
@laveenabachani
@laveenabachani 2 года назад
Amazing! The human race thanks you for making this video.
@statquest
@statquest 2 года назад
Thank you! :)
@kunlunliu1746
@kunlunliu1746 4 года назад
Hi Josh, great videos, learned a ton. Are you gonna talk about the other parts of XGBoost, like quantile? Looking forward it!
@statquest
@statquest 4 года назад
Yes, that comes up in Part 4, which should be out soon.
@vinayak186f3
@vinayak186f3 4 года назад
I watch your videos , get the subtitles downloaded and make notes from it. I'm really enjoying doing so . THANKS FOR EVERYTHING. 😊
@statquest
@statquest 4 года назад
BAM! :)
@damp8277
@damp8277 3 года назад
Watching this video with the original paper open is like deciphering forgotten texts. Thanks so much!
@statquest
@statquest 3 года назад
Glad it was helpful!
@jingyang2865
@jingyang2865 Год назад
This is the best resource I can find online on explaining XGboost! Million thanks to you!
@statquest
@statquest Год назад
Glad you think so!
@dr.kingschultz
@dr.kingschultz 2 года назад
your videos are awesome
@statquest
@statquest 2 года назад
Thank you so much 😀!
@АлександраРыбинская-п3л
This series about XGBoost is marvellous! Thanks!
@statquest
@statquest Год назад
Thank you very much!
@alessio.c7538
@alessio.c7538 9 дней назад
Fantastic. Thank you.
@statquest
@statquest 9 дней назад
:)
@abhzz3371
@abhzz3371 7 месяцев назад
7:43, how did you get 104.4? I'm getting 103.... could anyone explain?
@statquest
@statquest 7 месяцев назад
That's a typo! It should be 103. When I did the math I forgot to subtract 0.5 from each residual.
@sheenaphilip6444
@sheenaphilip6444 4 года назад
Thank you so much for this series of videos on XG boost!! Has helped so much..esp in understanding the original paper on this, which can be very intimidating at first glance!
@statquest
@statquest 4 года назад
Thanks! :)
@mostafakhalid8332
@mostafakhalid8332 4 месяца назад
Is the second order Taylor polynomial used only to simplify the math? Is there another objective?
@statquest
@statquest 4 месяца назад
It makes the unsolvable non-linear equation solvable.
@christianrange8987
@christianrange8987 Год назад
Great video!! Very helpful for my current bachelor thesis!🙏 Since I want to use the formulas for the Similarity Score and Gain in my thesis, how can I reference them? Do you know if there is any official literature, like a book or paper, where they are mentioned, or do I have to show the whole math in my thesis to get from Tianqi Chen's formulas to the Similarity Score?
@statquest
@statquest Год назад
You can cite the original manuscript: arxiv.org/pdf/1603.02754.pdf
@pratt3000
@pratt3000 10 месяцев назад
I understand the derivation of the Similarity Score but didn't quite get the reasoning behind flipping the parabola and taking the y coordinate. Could someone explain?
@statquest
@statquest 10 месяцев назад
You know, I never really understood that either. So, let me know if you figure something out.
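(One way to see it, sketched from the derivation at 22:18 and equation 6 of the manuscript: the leaf's approximate loss is the parabola $\tilde{L}(O) = G\,O + \tfrac{1}{2}(H + \lambda)O^2$ with $G = \sum_i g_i$ and $H = \sum_i h_i$. Its minimum sits at $O^* = -G/(H+\lambda)$ and has the value $-\tfrac{1}{2}\,G^2/(H+\lambda)$. Multiplying by $-1$ flips the parabola, so the y-coordinate of its top, $\tfrac{1}{2}\,G^2/(H+\lambda)$, is the reduction in loss that leaf can achieve; dropping the common factor of $\tfrac{1}{2}$, which does not change which split wins, leaves the Similarity Score, e.g. $(\sum_i r_i)^2/(n+\lambda)$ for regression. In other words, the height of the flipped parabola is used because it measures the best possible loss reduction for that leaf.)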
@rubyjiang8836
@rubyjiang8836 3 года назад
cool~~~
@statquest
@statquest 3 года назад
Bam! :)
@RishabhJain-u9r
@RishabhJain-u9r Месяц назад
Hey Josh, is the similarity score essentially the irregularity score from the original paper, with a negative sign in front of it? Thanks!
@statquest
@statquest Месяц назад
I'm not sure what you are referring to as the irregularity score, but the similarity score in the video refers each term that includes summations in equation 7 on page 3 of the original manuscript. Although they refer to that equation as L_split, in the algorithm sections of the manuscript they call it "gain". To see "gain" in action, see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-OtD8wVaFm6E.html
@mohammadelghandour1614
@mohammadelghandour1614 2 года назад
At 18:06, how did (1-yi) log(1-pi) end up like this: log(1+e^(log(odds)))?
@statquest
@statquest 2 года назад
See: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html
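(For readers who want the algebra without leaving this page, a sketch assuming $p_i = e^{\log(\text{odds})_i}/(1 + e^{\log(\text{odds})_i})$ and writing $z = \log(\text{odds})_i$: then $\log(p_i) = z - \log(1+e^{z})$ and $\log(1-p_i) = -\log(1+e^{z})$, so $-[\,y_i \log(p_i) + (1-y_i)\log(1-p_i)\,] = -y_i z + y_i\log(1+e^{z}) + (1-y_i)\log(1+e^{z}) = -y_i z + \log(1+e^{z})$. The $(1-y_i)\log(1-p_i)$ piece contributes $(1-y_i)\log(1+e^{z})$, which merges with the $y_i\log(1+e^{z})$ term into the single $\log(1+e^{\log(\text{odds})})$.)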
@FedorT54
@FedorT54 Год назад
Hi! Can you please explain: for a model with number of trees = 1 and depth = 1, is the similarity score its log-loss minimum value?
@statquest
@statquest Год назад
You can get a sense of how this would work by watching my video on Gradient Boost that uses the log likelihood as the loss function: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html
@Gabbosauro
@Gabbosauro 3 года назад
Looks like we just apply the L2 Ridge reg param but what about the L1 Lasso regularization parameter? Where is it applied in the algorithm?
@statquest
@statquest 3 года назад
The original manuscript only includes the L2 penalty. However, presumably the L1 penalty is included in an elastic-net style.
@ineedtodothingsandstuff9022
Hello, thanks for the video. Just one question: are the splits always done based on the residuals, or on the gradients? For instance, if I use a different loss function, the gradient might have a different calculation. In that case, do we still use residuals to do the splits, or do we use the respective gradients from the given loss function? Thanks a lot!
@statquest
@statquest Год назад
I believe you would use the gradients.
@nishalc
@nishalc 2 года назад
Thanks for the great video. I'm wondering how other regression methods such as poisson, gamma and tweedie relate to what is shown in the video here. I imagine the outputs of the trees in these cases are similar to the case of regression, as we are estimating the expected value of the distribution in question. On the other hand, the loss function would be the negative log likelihood for the distribution in question. If anyone has any details of how these methods work it would be much appreciated!
@statquest
@statquest 2 года назад
In the context of "xgboost" and pretty much all other machine learning methods, the word "regression" doesn't refer to linear regression specifically, but simply to any method that predicts a continuous value. So I'm not sure it makes sense to compare this to poisson regression specifically, other than to say that XGBoot's "regression" does not depend on any specific distribution.
@nishalc
@nishalc 2 года назад
@@statquest thanks for the reply! So with these methods would xgboost simply use the negative log likelihood of the distribution in question as the loss function and take the derivative to be the output of each tree?
@statquest
@statquest 2 года назад
@@nishalc XGBoost does not use a distribution.
@nishalc
@nishalc 2 года назад
@@statquest hmm in that case how do these specific (gamma/poisson/tweedie) regressions work?
@statquest
@statquest 2 года назад
@@nishalc en.wikipedia.org/wiki/Poisson_regression
@helenjude3746
@helenjude3746 4 года назад
I would like to point out that the hessian in XGBoost for Multiclass Softmax is not exactly pi(1-pi). It is actually twice that. See source code: github.com/dmlc/xgboost/blob/master/src/objective/multiclass_obj.cc See here: github.com/dmlc/xgboost/issues/1825 github.com/dmlc/xgboost/issues/638
@statquest
@statquest 4 года назад
Thanks for the clarification. In this video we're only talking about boolean classification, as described at 1:29
@karthikeyapervela3230
@karthikeyapervela3230 Год назад
@statquest I am trying to work out a problem with pen and paper, but with 4 features instead of 1. Once the split is made on one feature, does it proceed to another feature? What happens next?
@statquest
@statquest Год назад
At each potential branch, each feature and threshold for that feature are tested to find the best one.
@wibulord926
@wibulord926 2 года назад
It's easy to understand Parts 1 and 2, but Part 3 seems overwhelming to me lol !!!!
@statquest
@statquest 2 года назад
Yes, this video assumes that you are already familiar with Gradient Boost and all of its details. If you can understand Gradient Boost at this level, this video will be easy. To learn about Gradient Boost (and Ridge Regression), see: Gradient Boost Part 1: Regression Main Ideas: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-3CC4N4z3GJc.html Gradient Boost Part 2: Regression Details:ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-2xudPOBz-vs.html Gradient Boost Part 3: Classification Main Ideas: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-jxuNLH5dXCs.html Gradient Boost Part 4: Classification Details: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html ...and Ridge Regression: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Q81RR3yKn30.html
@Stoic_might
@Stoic_might 2 года назад
How many decision trees should there be in our XGBoost algorithm? And how do we calculate this?
@statquest
@statquest 2 года назад
I answer this question in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-GrJP9FLV3FE.html
@zahrahsharif8431
@zahrahsharif8431 4 года назад
Do you need to remove outliers in your data first to reduce the loss function?
@statquest
@statquest 4 года назад
Dealing with outliers is always a good thing to do before putting your data into a machine learning algorithm.
@zahrahsharif8431
@zahrahsharif8431 4 года назад
But aren't tree based methods insensitive to outliers?
@aditya4974
@aditya4974 4 года назад
Triple Bam with part 3!! Thank you so much.
@statquest
@statquest 4 года назад
Thanks! :)
@VBHVSAXENA82
@VBHVSAXENA82 4 года назад
Great video! Thanks Josh
@statquest
@statquest 4 года назад
Thanks! :)
@niguan7776
@niguan7776 3 месяца назад
Great job Josh! The most wonderful video among all the others I can find. Just a question about the output value for each node. Why is the regularized loss function l(y, p0+O) instead of l(y, p0+eta*O) when doing the 2nd order Taylor approximation? I agree that if you set eta equal to 1, the output value will be (sum of residuals)/(number of residuals + lambda), but if I take eta into account, the output value is actually eta*(sum of residuals)/(eta^2 * number of residuals + lambda), if I am correct.
@statquest
@statquest 3 месяца назад
What time point, minutes and seconds, are you asking about specifically?
@niguan7776
@niguan7776 3 месяца назад
@@statquest it’s at 6:42:)
@statquest
@statquest 3 месяца назад
@@niguan7776 If the question is why I left eta out of the regularized loss function, it is because it was also omitted from the derivations in the original manuscript: arxiv.org/pdf/1603.02754
@jhlee8796
@jhlee8796 4 года назад
Thanks for great lecture. Where can I get your beamer pdf file?
@henry7434
@henry7434 4 года назад
bam
@statquest
@statquest 4 года назад
:)
@midhileshmomidi2434
@midhileshmomidi2434 3 года назад
So to get the output value and similarity score, this huge amount of calculation (second derivatives) is required. No wonder XGBoost takes a lot of training time. One doubt, Josh: while running the model to calculate the output value and similarity score, does it just evaluate the formulae, or does it go through this whole huge process?
@statquest
@statquest 3 года назад
Th whole process described in this video derives the final formulas that XGBoost uses. Once derived, only the final formulas are used. So the computation is quite fast. To see the final formulas in action, see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-OtD8wVaFm6E.html and ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-8b1JEDvenQU.html
@mohammadelghandour1614
@mohammadelghandour1614 2 года назад
At 11:10, I understood that you used the second order Taylor approximation because of the term "Pi+Ovalue", which makes differentiation with respect to Ovalue difficult. Similarly, in Gradient Boost Part 2, ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-2xudPOBz-vs.html , at 19:03 a similar term existed, "F(X)+Gamma". However, there you preferred to replace F(X) with its value "73.3", and that made differentiation with respect to "gamma" simple and easy. Why didn't you use the same simple substitution method with the XGBoost loss function instead of the second order Taylor approximation?
@statquest
@statquest 2 года назад
So, one big difference between XGBoost and Gradient Boost is that, old, normal, Gradient Boost has one derivation for regression, and a different derivation (which is more like XGBoost's) for classification (see: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html ). In contrast, XGBoost uses a single, unified derivation for both regression and classification.
@mohammadelghandour1614
@mohammadelghandour1614 2 года назад
@@statquest Yea Thank you. I'm just wondering why the gradient boost for regression doesn't use the second order Taylor approximation for optimization of loss function.
@statquest
@statquest 2 года назад
@@mohammadelghandour1614 It's a good question. The unified approach that XGBoost uses makes more sense to me.
@maddoo23
@maddoo23 2 года назад
At 5:27, wouldn't gamma also decide if a node gets built (going by the original paper, not able to post a link)? You wouldn't have to prune a node if you don't build it.
@statquest
@statquest 2 года назад
If you look at equation 6 in the original paper, it shows, in theory, how 'T' could be used to build the optimal tree. However, that equation isn't actually used because it would require enumerating every single tree to find the best one. So, instead, we use a greedy algorithm and equation 7, which is the formula that is used in practice for evaluating the split candidates, and equation 7 does not include 'T'. Now, the reason we don't prune as we go is that when using the greedy algorithm, we can't know if a future split will improve the trees performance significantly. So we build all the branches first.
@maddoo23
@maddoo23 2 года назад
@@statquest 'T' is not there in eq 7 but 'gamma' is there in equation 7 (deciding whether or not to split). For positive gamma, it always encourages pruning. I couldnt find anything in the paper about not using 'gamma' to build tree because it might lead to counterproductive pruning in the greedy approach. However, I agree with your point that 'gamma' should be avoided while building the tree. Thanks!
@vinodananddixit7267
@vinodananddixit7267 4 года назад
Hi, at 18:17, I can see that you have converted pi = e^log(odds)/(1+e^log(odds))? Can you please let me know how it has been converted? I am stuck at this point. Any reference/help would be appreciated.
@statquest
@statquest 4 года назад
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-BfKanl1aSG0.html
@ShubhanshuAnand
@ShubhanshuAnand 4 года назад
The expression for the output value looks very similar to GBDT's gamma with an L2 regularizer. We could always use the first order derivative with SGD optimization to find the minimum, as we do in other optimization problems, so why use the Taylor expansion? Does the Taylor expansion give faster convergence?
@statquest
@statquest 4 года назад
SGD only takes a step towards the optimal solution. The Taylor expansion *is* (an approximation of) the optimal solution. The difference is subtle, but important in this case.
@rcdag-b5z
@rcdag-b5z 2 года назад
Hi! Amazing video, thank you for this great content! I have a question, maybe it's stupid but it's just for having everything's clear: after that you proved that the optimal output value is computed on a minimization of a Loss + L2 penalty made by approximating this function with a 2nd order taylor approximation, I still don't get the next step when you improve the prediction by creating nodes such that the gain in terms of similarity is bigger. Of course I know that by building such a tree you would improve the optimal output value since the first guess comes from a 2nd order approximation, but I still don't get how do you prove this mathematically. Thank you again!
@statquest
@statquest 2 года назад
What time point, minutes and seconds, are you asking about?
@rcdag-b5z
@rcdag-b5z 2 года назад
@@statquest at minute 20:09 when you said that we need to derive the equations of the similarity score so we can grow the tree. Maybe I didn't explain well, my question is related on how the loss optimization is connected with the tree growth algorithm based on gain in similarity that you explained in Part 1 and Part 2. Is this a procedure that helps us to refine the optimal output value guess (done by minimizing the 2nd order approximation of the loss function ) ?
@statquest
@statquest 2 года назад
@@rcdag-b5z The derivation of the similarity scores that we do in this video results in the similarity scores that I introduced in parts 1 and 2. In those videos, I just said, "Hey! This is the similarity score that we are going to use!". In contrast, in this video, I'm saying "Hey! Remember those similarity scores? Well, this is where they come from."
@amitbisht5445
@amitbisht5445 3 года назад
Hi @JoshStarmer, Could you please help me in understanding, how taking the second order gradient in taylor series helped in reducing the loss function?
@statquest
@statquest 3 года назад
At 10:58 I say that we use the Taylor Series to approximate the function we want to optimize because the Taylor Series simplifies the math.
@Maepai_connect
@Maepai_connect 4 месяца назад
Love the channel always! QQ - why is the initial prediction 0.5 and not an average of all observations? 0.5 could be too far fetched with continuous data for regressions.
@statquest
@statquest 4 месяца назад
That was just the default they set it to when XGBoost first came out. The reasoning was that the first few trees would significantly improve the estimate.
@Maepai_connect
@Maepai_connect 3 месяца назад
@@statquest thank you for answering! Does it now use average?
@statquest
@statquest 3 месяца назад
@@Maepai_connect I believe it does now, but it might also be configurable.
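(If you would rather control this than rely on the default, the initial prediction is exposed as the base_score parameter; a minimal, hedged example, where the training-mean alternative is just one reasonable choice and y_train is a placeholder.)
import xgboost as xgb

model = xgb.XGBRegressor(base_score=0.5)  # the long-standing default initial prediction
# model = xgb.XGBRegressor(base_score=float(y_train.mean()))  # y_train is a placeholder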
@hampirpunah2783
@hampirpunah2783 3 года назад
I have a question: I did not find your formulas in the Tianqi Chen XGBoost paper. Can you explain the original XGBoost formulas?
@statquest
@statquest 3 года назад
Which formula in the Tianq Chen paper are you asking about? Most of them are in this video.
@hampirpunah2783
@hampirpunah2783 3 года назад
@@statquest In the Tianqi Chen paper, formula number (2); and also your similarity score and output value formulas, which I can't find in the paper.
@statquest
@statquest 3 года назад
@@hampirpunah2783 Equation 2 refers to a theoretical situation that can not actually be solved. It assumes that we can find a globally optimal solution and, to quote from the manuscript: "The tree ensemble model in Equation 2 includes functions as parameters that cannot be optimized..." Thus, we approximate equation 2 by building trees in an additive manner (i.e. boosting) and this results in equation 3, which is the equation that XGBoost is based on. Thus, in order to explain XGBoost, I start with equation 3. Also, the similarity score in my video is equation 4 in the manuscript and the output value is equation 5 in the manuscript.
@Stoic_might
@Stoic_might 2 года назад
How many decision trees should there be in our XGBoost algorithm?
@statquest
@statquest 2 года назад
I answer this question in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-GrJP9FLV3FE.html
@Stoic_might
@Stoic_might 2 года назад
@@statquest ok thank you
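(The linked video covers this in depth. As a practical sketch, the count of trees is the n_estimators hyper-parameter and is usually chosen with early stopping on a validation set; the exact place the early-stopping argument goes varies by xgboost version, so treat this as an assumption-laden example.)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Toy data just to make the example self-contained.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=20)  # recent versions take it here
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
# model.best_iteration reports how many trees were actually worth keeping.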
@李娜-r6j
@李娜-r6j 11 месяцев назад
100 years later ai will come to this Channel to Learn what their grand grand father looks like😊
@statquest
@statquest 11 месяцев назад
Ha! BAM! :)
@FranciscoLlaneza
@FranciscoLlaneza 8 месяцев назад
🤩
@tulanezhu
@tulanezhu 4 года назад
Really helped me a lot in understanding the math behind XGB. This is awesome! For regression, you said XGB uses a 2nd order Taylor approximation to derive the leaf output, while general gradient boost uses a 1st order Taylor. From what I understand, other than the lambda regularization term, they just end up with the same answer, which is sum of residuals/number of residuals in that leaf node, right?
@statquest
@statquest 4 года назад
That's what I recall. The difference in the taylor expansion is due to the regularization term.
@strzl5930
@strzl5930 3 года назад
Are the output values for the trees that are denoted as O in this video equivalent to the output values denoted as gamma in the gradient boosting videos?
@statquest
@statquest 3 года назад
No. XGBoost includes regularization in the calculation of the output values. Regular gradient boost does not.
@tingtingli8904
@tingtingli8904 4 года назад
Thank you so much for the videos. And I have a question. The pruning can be done after the tree is built: if the difference between the gain and gamma is negative, we can remove the branch. Could you explain this? Can we reach this conclusion from the math details? Thank you
@statquest
@statquest 4 года назад
I'm pretty sure I show this in the video. What time point are you asking about?
@rahelehmirhashemi5213
@rahelehmirhashemi5213 4 года назад
love you man!!! :D
@statquest
@statquest 4 года назад
Thank you! Part 4, which covers XGBoost's optimizations, will be available for early access in one week.
@wenzhongzhao627
@wenzhongzhao627 4 года назад
Thanks Josh for the great series of ML videos. They are really "clearly explained". I have a question regarding the calculation of g_i and h_i for the XGBoost classification case, where you used log(odds) as the variable to take the first/second derivatives. However, you used p_i as the variable to perform the Taylor expansion. Will that cause any issue? I assume that in the classification case you have to use log(odds) to perform the Taylor expansion and variable update instead of p_i as in the regression case.
@statquest
@statquest 4 года назад
If you want details about how we can work with both probabilities and logs without causing problems, check out the video: Gradient Boost Part 4, Classification Details: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-StWY5QWMXCw.html
@georgeli6160
@georgeli6160 4 года назад
Hi guys, can someone provide some insight into why the similarity score is the maximum of -1*the loss function?
@statquest
@statquest 4 года назад
It's just defined that way.
@venkateshmunagala205
@venkateshmunagala205 2 года назад
Can you please help me understand why we multiplied the equation by negative one (-1) to get the similarity score, which makes the parabola inverted? @ time stamp 21:29
@statquest
@statquest 2 года назад
I'm not sure I understand your question. We multiply the equation by -1 to flip it over, so that the problem becomes a maximization problem, rather than a minimization problem.
@venkateshmunagala205
@venkateshmunagala205 2 года назад
@@statquest Thanks for the reply. I need to know the specific reason for flipping it to make it a maximization problem. Btw, I bought your book but I don't see GBDT and XGBoost in it.
@statquest
@statquest 2 года назад
@@venkateshmunagala205 To be honest, you might be better off asking the guy who invented XGBoost. I can only see what he did and can only guess about the exact reasons. Perhaps he wanted to call the splitting criteria "gain", and in that case, it makes sense to make it something we maximize.
@venkateshmunagala205
@venkateshmunagala205 2 года назад
@@statquest BAM Thank you Josh.
@NaimishBaranwal
@NaimishBaranwal 3 года назад
This is a hell of a lot of math
@statquest
@statquest 3 года назад
Yes.