Tuning XGBoost using tidymodels

Julia Silge

Published: 28 Aug 2024

Comments: 85

@kemikao 4 years ago

Your videos are very informative! I love that you take the time to show the data first and explain what the variables are. And the fact that you explain the tidy functions and even repeat a bit of what you said in earlier videos is great! You use just the right amount of detail for me at least. Thank you.

@mygeorgyboy 2 years ago

Very nice example. You show all the process, very illustrative. Thank you Julia

@mathteacher1729 4 years ago

Thank you so much for this video (and for all your videos). I've been using R for about two or three years and this was just the right amount of detail and exposition for me. Your workflow is clean and easy to follow, I like how you used the help function, and your overall layout is nice too (console in the top right). I look forward to trying XGBoost on some data sets now! :)

@mehdi1270 3 years ago

Thank you so much Julia for all your tutorial videos. They are easy to follow and very informative.......just great! Please keep posting them. I hope you can find some time to post a video on neural network optimization with Keras in R. I can even start a petition for that. LOL

@user-sb9oc3bm7u 16 days ago

Would be amazing if you do a video using nested data (instead of having a nominal variable, nest it and generate a model for each of the levels for example), also using the map_workflow etc.. great as always!

@davidjackson7675 4 years ago

I always learn something from your videos.

@BonifaceMakone 4 years ago

These videos are super informative. Keep them coming. Thanks

@flachboard84 2 years ago

Very helpful video! I look forward to following this example in a future project!

@erickcohen1876 4 years ago

Hi Julia, this video was amazing and very informative! Would you be able to help us find resources for (or post a video about :) ) the math behind these models? I.e. gradient descent for XGBoost models. Thank you very much for posting these videos! I am learning a ton!

@talitabac 3 years ago

Amazing video, super clear! Thank you, Julia!

@alanjiang2930 3 years ago

Watched more than half of your videos within one week. Don't even want to blink! Saw you plotted XGB importance - wonder if there is tidymodel way to plot SHAP values from XGB. Thanks, Julia!

@JuliaSilge 3 years ago

If you are only doing xgboost, you might try the SHAPforxgboost package: cran.r-project.org/package=SHAPforxgboost (it takes a bit of munging the model to get it to work with that package). For modeling in general, I like DALEX for explainability, which also supports tidymodels: modeloriented.github.io/DALEXtra/reference/explain_tidymodels.html We have a chapter in process on explainability in our upcoming book, so keep your eyes out for that: www.tmwr.org/

@alanjiang2930 3 years ago

@JuliaSilge Got it. Thanks for the direction! Again, amazing video series! Really really tidy.

@haraldurkarlsson1147 3 years ago

Very nice presentation of xgboost by the way.

@haraldurkarlsson1147 3 years ago

Julia, I ran this model on a new Mac mini and it produced results in about 7 minutes. Much faster than my old Mac desktop, which I did not dare run it on.

@PA_hunter 2 years ago

Similar time here

@luisfernandocuestasanchez4343 3 years ago

You are the most amazing person I've ever come across. Thanks a lot. Blessings =)

@edneideramalho2363 1 year ago

You are the best!

@JoseAyerdis 4 years ago

If RStudio crashes when you fit the final workflow, with a message about initializing libomp.dylib but finding libomp.dylib already initialized, you can use this workaround on OSX: Sys.setenv(KMP_DUPLICATE_LIB_OK = TRUE)

@raminziaei6411 4 years ago

Thanks a lot Julia. I really love your videos. Do you have any plans for making a video on neural network and tuning it in tidymodels? That would be awesome if possible. Please continue these videos. They are really great. Cheers

@faiazrummankhan5589 3 years ago

All your videos are such a great learning resource for real world EDA and modelling. I was just wondering what theme you are using in rstudio ?

@JuliaSilge 3 years ago

It's one of the themes available via rsthemes: www.garrickadenbuie.com/project/rsthemes/

@badrGamer11 2 years ago

Always amazing content, thank you

@JerryWho49 10 months ago

Great video, thanks. But I’ve got a question. Say, my local computer is too small to fit a model fast enough. How would I train a model in the cloud? Do you have any best practices?

@JuliaSilge 10 months ago

One of the easiest ways to go is to use RStudio on SageMaker: posit.co/blog/getting-started-rstudio-sagemaker/

@geilin2394 4 years ago

These vids are great. Can we see a classification model with calibration curves, and then recalibrate it, within the tidymodels framework? How long did the hyperparameter tuning take here?

@lucaskramer438 4 years ago

Great explanation, but I have one question: when you call last_fit() you make use of your split object. In my particular case I was only provided with the train and test sets initially, so I don't have a split object. Is there any way to call last_fit() nevertheless? Thanks!

@JuliaSilge 4 years ago

You can't call last_fit() directly if you don't have the split, but you *can* manually do what it is a wrapper for, which is train one last time on the training data and then evaluate one last time on the testing data.
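
A minimal sketch of that manual route, assuming the training and testing sets arrive as two separate data frames. The simulated data, the column names (`x1`, `x2`, `result`), the helper `make_data()` and the logistic regression model are all made up for illustration; in the video's setting the workflow would hold the finalized XGBoost specification instead.

```r
library(tidymodels)

set.seed(123)

# Hypothetical helper that simulates a small two-class data set
make_data <- function(n) {
  tibble(x1 = rnorm(n), x2 = rnorm(n)) %>%
    mutate(result = factor(
      if_else(x1 + x2 + rnorm(n) > 0, "win", "lose"),
      levels = c("win", "lose")
    ))
}

train_df <- make_data(200)  # stands in for the provided training set
test_df  <- make_data(100)  # stands in for the provided testing set

# Any workflow works here; logistic regression keeps the sketch light
wf <- workflow() %>%
  add_formula(result ~ x1 + x2) %>%
  add_model(logistic_reg())

# Step 1: train one last time on the training data
final_fit <- fit(wf, data = train_df)

# Step 2: evaluate one last time on the testing data
test_preds <- augment(final_fit, new_data = test_df)

test_metrics <- bind_rows(
  accuracy(test_preds, truth = result, estimate = .pred_class),
  roc_auc(test_preds, truth = result, .pred_win)
)
test_metrics
```

The testing data is touched exactly once, at the very end, which is the same discipline last_fit() enforces.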

@haraldurkarlsson1147 3 years ago

I should mention that the mini ran this quietly and I heard no noise from an overworked fan. The unit is also cool to the touch.

@angvl8793 1 year ago

Hi Julia! Great video as always :) Can I ask you something, please? At around 34:08, if we don't want to use the xgb_grid you are using and instead pass something else to the grid argument of tune_grid(), let's say grid = 50, is this OK? I mean, is it generally OK to set grid to a number? Thank you very much!

@JuliaSilge 1 year ago

Yes, that argument can take a couple of different kinds of values, either a dataframe or an integer value: tune.tidymodels.org/reference/tune_grid.html You can read a bit more about this here: www.tmwr.org/grid-search.html#evaluating-grid
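
A small sketch of both forms of the `grid` argument, under assumptions: it uses a decision tree on `mtcars` rather than the video's XGBoost model so that it runs quickly, and the object names (`tree_wf`, `res_int`, `res_df`) are illustrative only.

```r
library(tidymodels)

set.seed(234)
folds <- vfold_cv(mtcars, v = 3)

tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(tree_spec)

# Option 1: an integer; tune generates up to that many candidates itself
res_int <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = 5,
  metrics = metric_set(rmse)
)

# Option 2: a data frame with one column per tuned parameter
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 2)
res_df <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = tree_grid,
  metrics = metric_set(rmse)
)

collect_metrics(res_df)
```

The integer form is convenient for a first pass; a data frame gives full control over which candidate values are tried.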

@angvl8793 1 year ago

@JuliaSilge Thank you again! :)

@gkuleck 1 year ago

Hi Julia! Great video. Have you done a video on multiclass classification? I am struggling to find guidance for this type with text classification. Thanks!!

@JuliaSilge 1 year ago

Check out these two:
- juliasilge.com/blog/nber-papers/
- juliasilge.com/blog/multinomial-volcano-eruptions/

@gkuleck 1 year ago

Thank you!

@haraldurkarlsson1147 3 years ago

Julia, I was able to follow along and everything looked fine until the final ROC curve. I get a mirror image of your curve. I have combed through the code and found nothing wrong. The confusion matrix outcome is similar to yours, etc. It seems like a systematic error. When I looked at the data that generates the curve, I noticed that my numbers for specificity are somehow switched. While your table starts with a specificity of 1, mine starts at zero, so in my case the values look more like 1 - specificity. I am puzzled.

@JuliaSilge 3 years ago

You can look at the first comment at the relevant blog post here: juliasilge.com/blog/xgboost-tune-volleyball/ Since I published this blog post, there was a change in yardstick in version 0.0.7: github.com/tidymodels/yardstick/blob/master/NEWS.md#yardstick-007 that changed how to choose which level (win or lose) is the "event". You can change this by using the `event_level` argument for functions like `roc_curve()`: yardstick.tidymodels.org/reference/roc_curve.html
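
A short illustration of the `event_level` argument, using the `two_class_example` data that ships with yardstick rather than the volleyball data from the video (the object names are illustrative). Scoring the same probability column against the other level mirrors the curve, which is the flipped plot described above.

```r
library(yardstick)

data(two_class_example)

# The outcome is a factor whose first level is "Class1"
levels(two_class_example$truth)

# Default: the first factor level is treated as the event
auc_first <- roc_auc(two_class_example, truth, Class1)

# Same probability column, but the second level is treated as the event
auc_second <- roc_auc(two_class_example, truth, Class1, event_level = "second")

# roc_curve() takes the same argument; autoplot() on these shows mirrored curves
curve_first  <- roc_curve(two_class_example, truth, Class1)
curve_second <- roc_curve(two_class_example, truth, Class1, event_level = "second")

auc_first$.estimate + auc_second$.estimate
```

If your curve bends below the diagonal, check that the probability column you pass matches the level that `event_level` selects.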

@briancostello939 4 years ago

Great video! Is there any difference between “pivot_longer” and “gather”? They look identical to me, just with the arguments having different names, but want to make sure I’m not missing something.

@JuliaSilge 4 years ago

You can read this blog post that introduced the pivot verbs: www.tidyverse.org/blog/2019/09/tidyr-1-0-0/

@briancostello939 4 years ago

Julia Silge oh awesome thanks!

@amahoela730 3 years ago

Does anyone know how you can save the workflow for later use? I have problems with it since it is not of format 'xgb.booster', whereas using the function saveRDS might result in compatibility issues in case of future package versions.

@Matthew-px9nu 4 years ago

Julia thank you for these great videos keep it up ! Quick question once using last_fit if wanting to predict on NEW data what are the workflow steps ? Last_fit doesn’t really work on new data that wasn’t in the original split. Thank you !

@JuliaSilge 4 years ago

Once you get to last_fit(), check out the objects that are inside of it. One of the columns contains a *fitted model* that can be used on new data. In fact, that fitted model is used on the testing data to compute the metrics!

@Matthew-px9nu 4 years ago

@JuliaSilge Thank you Julia! Last quick Q: I noticed you always run the commands from the Rmd notebook in the console. What button do you click to run in the console instead of in the notebook?

@JuliaSilge 4 years ago

@Matthew-px9nu That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line. In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.

@vincentpepe1064 4 years ago

@JuliaSilge Hi Julia! Where exactly do I find this? The columns I have are splits, id, .metrics, .notes, .predictions, .workflow. I can't find the fitted model in .workflow either, so I'm not sure where it is. Thanks!

@JuliaSilge 4 years ago

@vincentpepe1064 The .workflow is a *fitted* workflow at this point. For example, try tidying it or predicting on it. I show how to tidy it here: juliasilge.com/blog/palmer-penguins/
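
A minimal sketch of predicting on new data from a last_fit() result, assuming a recent tidymodels release where `extract_workflow()` is available. It uses a linear model on `mtcars`, and `new_cars` is made-up data; none of this is the code from the video.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars)

wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

final_res <- last_fit(wf, car_split)

# The fitted workflow lives in the .workflow list-column;
# extract_workflow() is a convenience for final_res$.workflow[[1]]
fitted_wf <- extract_workflow(final_res)

# Hypothetical new observations that were never part of the split
new_cars <- tibble(wt = c(2.5, 3.2), hp = c(110, 150))
preds <- predict(fitted_wf, new_data = new_cars)
preds
```

The extracted object behaves like any fitted workflow, so it can also be tidied or saved for later use.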

@deltax7159 4 months ago

What appearance theme are you using here?

@JuliaSilge 4 months ago

I use one of the themes from rsthemes: www.garrickadenbuie.com/project/rsthemes/ I think Oceanic Plus? There are lots of nice ones available in that package.

@haraldurkarlsson1147 3 years ago

Julia, I do like Markdown but for testing out code I prefer R script simply because I make a lot of mistakes. So I am curious to know why you work in Markdown. Is it so because you have already written and debugged your code and would like to save the lesson in a nicer format?

@JuliaSilge 3 years ago

No, I work in R Markdown regularly. In R I basically am either building package code or I am working in R Markdown. I'm a huge believer in the idea of "literate programming" as a real way to work. I make a lot of mistakes too, but I don't think that reduces the value of combining narrative and code in one document.

@haraldurkarlsson1147 3 years ago

I am working on setting up a class for students in my department and am quite torn on whether to go the Markdown or R script route. Since most of the class work will be around coding and simply learning how to R I am inclined to start with the regular setup (script) and then move on to Markdown later. Thanks.

@JuliaSilge 3 years ago

@haraldurkarlsson1147 The person I know who has thought the most about this is Mine Çetinkaya-Rundel; you can see one of her resources for teaching here: datasciencebox.org/ She recommends teaching R Markdown to emphasize reproducible analyses.

@haraldurkarlsson1147 3 years ago

I see. Thanks a lot for the tip.

@haraldurkarlsson1147 3 years ago

Julia, I will have a deeper dive into the datasciencebox. However, I will be teaching grad students that should have some inkling of what the basic statistics concepts are. Most have already worked with data, done some data processing, and generated tables and graphs. I would like to teach them R to simplify their lives and give them hopefully a new valuable skill for the current or future work. As grad students the science part is covered.

@dudeadulto 4 years ago

Hi, I'm getting a warning/error when the tune_grid function runs: ! Fold01: model 1/20: The `x` argument of `as_tibble.matrix()` must have colum... I found in a GitHub issue that it's related to "name repairing". Do you have any idea if it really affects the results of the tuning process, or if there's an update/solution for it?

@JuliaSilge 4 years ago

Hmmmm, do you want to make sure all your packages are updated? That sounds like a message from an older version of the packages. If you are still getting that warning, I recommend creating a reprex and posting on RStudio Community: community.rstudio.com/c/ml/15

@dudeadulto 4 years ago

@JuliaSilge After reading your response, I updated all my packages, and the error still occurs, but the process seems to keep running. I will let it finish and see if it affects the results of tune_grid.

@Simonsayztaga 4 years ago

Do you have a course on tidymodels?? Video Course or Tutorials?

@JuliaSilge 4 years ago

You can check out this interactive course on tidymodels: supervised-ml-course.netlify.app/

@artathearta 3 years ago

@JuliaSilge Amazing resource, thank you

@wecsleyprates3205 4 years ago

Hey Julia, congrats again. This error shows up: xgb_res

@JuliaSilge 4 years ago

You need to *install* xgboost, actually; you don't have the package installed: install.packages("xgboost")

@wecsleyprates3205 4 years ago

@JuliaSilge Yeah... but I don't know what is happening. When I try to install the xgboost package, I get an error telling me that xgboost is not available for my R version. My RStudio is the current version.

@JuliaSilge 4 years ago

@wecsleyprates3205 Ah, a classic problem that folks run into when things get borked! Check out this SO question + answers: stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa

@wecsleyprates3205 4 years ago

Thanks @JuliaSilge... Do you know what the error below means? Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘predict’ for signature ‘"xgb.Booster"’

@JuliaSilge 4 years ago

@wecsleyprates3205 That sounds like xgboost still isn't getting loaded correctly to me. Could you try creating a reprex showing your problem and posting on RStudio Community? rstd.io/tidymodels-community

@shamsulhoquekhan933 1 year ago

Can someone tell me why we used sample_prop inside the search grid?

@JuliaSilge 1 year ago

It's what proportion of the total available sample is used for modeling within one boosting iteration: dials.tidymodels.org/reference/trees.html#details
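
A brief sketch of how `sample_prop()` can be placed in a search grid, assuming the dials package. It uses `grid_random()` with only two parameters for brevity, which is not the grid built in the video. Naming the argument `sample_size` makes the column line up with a `sample_size = tune()` placeholder in `boost_tree()`.

```r
library(dials)

set.seed(345)

# Named argument: the grid column is called sample_size,
# but its values are drawn as proportions via sample_prop()
xgb_grid <- grid_random(
  learn_rate(),
  sample_size = sample_prop(),
  size = 5
)

xgb_grid
range(xgb_grid$sample_size)
```

Every `sample_size` value in this grid is a fraction of the training rows rather than a row count.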

@tamaraabzhandadze2712 3 years ago

Thank you for the great tutorial. I have been having a problem with the ROC curve. Namely, when I run the code "final_res_r %>% collect_predictions() %>% roc_curve(dependent_var, .pred_dependent_var) %>% autoplot()", I get the error: Can't subset columns that don't exist. x Column `.pred_dependent_var` doesn't exist. I cannot understand how to solve the problem. What am I doing wrong?

@JuliaSilge 3 years ago

Hmmmm, do you see the column with the predicted class probability in it, after you run `collect_predictions()`? You can check out the documentation for `roc_curve()` here: yardstick.tidymodels.org/reference/roc_curve.html And if you continue to have trouble, I recommend creating a reprex and posting it on RStudio Community: rstd.io/tidymodels-community It's often easier to get help with coding problems in a format like that rather than comments.

@tamaraabzhandadze2712 3 years ago

@JuliaSilge Dear Julia! Just amazing to read your response :). I have solved that problem :). However, another problem that I could not solve is related to the variable importance. I managed to create a figure, but I cannot get the actual values per variable. I tried varImp(model_name) and xgb.importance(model = model_name), but I am just getting lovely red text, without the results :)

@JuliaSilge 3 years ago

@tamaraabzhandadze2712 I typically use the vip package for variable importance, as I show in this blog post: juliasilge.com/blog/xgboost-tune-volleyball/

@tamaraabzhandadze2712 3 years ago

@JuliaSilge Thank you! I actually posted the question there as well :). I read your answer and got the results :). Now I just have to decide on the cutoff for choosing some variables out of ten features. P.S. I did factor analysis as well and could identify 3 variables with good loadings, but there it was a bit easier, as there are cutoffs for loadings :). For XGBoost I have no idea what to do :)

@artathearta 3 years ago

48:44 my autoplot was flipped along the X = Y axis, I wonder why.

@JuliaSilge 3 years ago

It's because of a global change in how yardstick finds the "first" or base level event: juliasilge.com/blog/xgboost-tune-volleyball/#comment-5015180544