Tuning random forest hyperparameters with tidymodels 

Julia Silge
15K subscribers · 18K views · Published 28 Aug 2024

Comments: 85
@lucianobatista4554 · 4 years ago
It is just incredible the domain you have with R, thanks for sharing quality content like this.
@SadatQuayiumApu · 4 years ago
Wonderful tutorial as usual. Thanks a ton Ms. Silge.
@shahrdadshadab1115 · 3 years ago
Great job Julia, really useful work; you answered many of my questions.
@kelvinwaweru2238 · 2 years ago
Wonderful lady and awesome teacher
@vivi1311 · 4 years ago
I really like your videos! You're helping me a lot in my masters lol It'd be awesome if you made one about bagging with decision trees and tuning parameters as well. Keep up the great work!
@coolghoul9 · 2 years ago
finally an R user with a dark theme
@TheTeksan · 4 years ago
Thank you so much! Best video about tuning RF.
@ASA-ho9ro · 1 year ago
This is an excellent tutorial. I'm having trouble with the hyperparameter tuning:

Creating pre-processing data to finalize unknown parameter: mtry
Error in `check_installs()`:
! Some package installs are required: • 'ranger', 'ranger'
Backtrace:
 1. tune::tune_grid(tune_wf, resamples = trees_folds, grid = 20)
 2. tune:::tune_grid.workflow(tune_wf, resamples = trees_folds, grid = 20)
 3. tune:::tune_grid_workflow(...)
 4. tune::check_workflow(workflow, pset = pset)
 5. tune:::check_installs(hardhat::extract_spec_parsnip(x))
@JuliaSilge · 1 year ago
You need to install the ranger package, via `install.packages("ranger")`
@guanzhenhua7501 · 4 years ago
Excellent, solved my problems. Thank you!!
@haraldurkarlsson1147 · 3 years ago
Very interesting and informative exercise. I have one question: why not use a package like VIM to get a quick overview of missing values in the dataset, like the plot the aggr() function generates in VIM?
@yangyang6008 · 1 year ago
Hi Julia, thank you for sharing the great tutorial! Is it possible to tune hyper-parameters using genetic algorithm in Tidymodels?
@JuliaSilge · 1 year ago
I don't believe so, as of today. I would recommend suggesting this as a feature request on the tune repo: github.com/tidymodels/tune/issues
@yangyang6008 · 1 year ago
@JuliaSilge Thank you Julia, and I just posted a feature request on the tune repo.
@oliverarmstrong1813 · 3 years ago
Hi Julia - when you down sample in recipes, is this pre processing also being passed to 'new data' when you use the predict function?
@JuliaSilge · 3 years ago
When you predict on a workflow, in most situations, both the preprocessing and modeling are applied to new data. However, for subsampling, we want the test set or other holdout sets to look like new data, not downsampled or upsampled. You can read more here: www.tmwr.org/recipes.html#row-sampling-steps
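This behavior can be seen directly in a small recipe; a minimal sketch, where `train_df`, `test_df`, and the `class` outcome are placeholder names rather than objects from the video:

```r
library(tidymodels)
library(themis)   # provides step_downsample()

# step_downsample() has skip = TRUE by default, so it is applied when the
# recipe is prepped on training data but NOT when baking holdout/new data
rec <- recipe(class ~ ., data = train_df) %>%
  step_downsample(class)

prepped <- prep(rec)
bake(prepped, new_data = NULL)      # processed training data, downsampled
bake(prepped, new_data = test_df)   # test rows pass through un-downsampled
```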
@hansmeiser6078 · 3 years ago
When and where in our code can/should we say `finalize(mtry(), trainData)`? This seems not to be the intended location... daGrid
@JuliaSilge · 3 years ago
You should be able to `finalize()` a parameter like mtry based on your data: dials.tidymodels.org/reference/finalize.html If you are having trouble and running into errors, I recommend creating a reprex and posting on RStudio Community: rstd.io/tidymodels-community
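A sketch of what `finalize()` does here; `train_df` and `outcome` are placeholder names:

```r
library(tidymodels)

# mtry()'s upper bound is unknown until it sees the data, because its
# maximum is the number of predictor columns
mtry()

# finalize the parameter against only the predictor columns:
finalized_mtry <- finalize(mtry(), select(train_df, -outcome))

# then use the finalized parameter when building a grid
rf_grid <- grid_regular(finalized_mtry, min_n(), levels = 5)
```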
@kaihennig2323 · 4 years ago
Hey Julia, thank you for this video. This is helping me a lot with a course in my bachelor's degree! I am using your code right now and I have been stuck for at least 3 hours at the first training part with grid = 20. Do you know any method to track the progress? I would really love to know if it will take another few hours or days. It would be cool if you could give a quick answer and maybe tell me how long the training took for the case in the video. Thank you and have a great day.
@hunlee2488 · 3 years ago
Hey, huge thanks for your work! Just some questions about 53:35: why did you set importance to "permutation"? And what kind of fit (algorithm or mathematical model) is used to rank the variable importance?
@JuliaSilge · 3 years ago
That is actually how variable importance is computed, via permuting the variables! You can read more about this here: christophm.github.io/interpretable-ml-book/feature-importance.html
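In practice that means telling the ranger engine to compute permutation importance at fit time and then plotting it, e.g. with the vip package; this sketch uses the built-in iris data rather than the video's dataset:

```r
library(tidymodels)
library(vip)

# permutation importance: each predictor is shuffled in turn, and the drop
# in out-of-bag performance is recorded as that predictor's importance
rf_fit <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(Species ~ ., data = iris)

rf_fit %>%
  extract_fit_engine() %>%   # pull out the underlying ranger object
  vip()
```

`extract_fit_engine()` is the current parsnip helper; older code reached into the fit with `$fit` instead.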
@mkklindhardt · 3 years ago
Hi Julia and everyone, hope you had a good weekend. I am working with large geo-referenced point data with biophysical variables (e.g. soil pH, precipitation, etc.) and would like to test for spatial autocorrelation in my data. Any tidy-friendly way to test for this particular autocorrelation? As it is important for my further Machine Learning regression analysis, I would need to take it into account so as not to produce algorithms/models that are erroneous in predicting the outcome variable. Thank you
@matthewryan8939 · 4 years ago
Hey Julia, You mentioned in there something about tuning parameters for recipes. Is there an easy way to do this in tidymodels, and if there is, would you be able to point me in the right direction for a good reference? Thank you!
@JuliaSilge · 4 years ago
Yes, this example shows tuning a recipe parameter: tidymodels.github.io/tune/articles/extras/text_analysis.html
@matthewryan8939 · 4 years ago
@JuliaSilge Thank you!
@normannborg · 4 years ago
Very useful questions and answers. Thanks
@mkklindhardt · 3 years ago
Hi again Julia, I was wondering if tidymodels has some specific framework in place for dealing with spatial (or temporal) autocorrelation in data?
@JuliaSilge · 3 years ago
We recently started a package for resampling methods appropriate for spatial data, which lets you create resampling folds with observations that are close together in the same fold: spatialsample.tidymodels.org/
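A quick sketch of that package in use, with its bundled `boston_canopy` example data; exact function arguments may differ across spatialsample versions:

```r
library(spatialsample)

set.seed(123)
# cluster nearby observations so each resampling fold holds out
# one spatial region, rather than randomly scattered points
folds <- spatial_clustering_cv(boston_canopy, v = 5)
folds
```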
@abdelouahebhocine5237 · 3 years ago
Hello Julia, I have a question: when I fit my random forest model on the test data set, some levels of the factor variables don't exist. How can I solve it? Thanks for everything you've shared with us; very helpful, we enjoy it :)
@JuliaSilge · 3 years ago
I don't believe you should be fitting your model on the test data; when you *predict* on the test data, it should be OK that there are some factor levels in the training set that are not in the test set. If you want to explicitly handle new factor levels at *prediction* time, check out this recipe step for data preprocessing: recipes.tidymodels.org/reference/step_novel.html
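A hedged sketch of that recipe step in context; `train_df` and `outcome` are placeholder names:

```r
library(recipes)

rec <- recipe(outcome ~ ., data = train_df) %>%
  # factor levels seen at prediction time but not during training get
  # recoded to a "new" level instead of producing NA or an error
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())
```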
@wayanadolan2696 · 2 years ago
Thank you for such a helpful tutorial! If I have a new dataset (with the same column setup) and I want to apply the model I built to that dataset, what should I do?
@JuliaSilge · 2 years ago
You will want to `predict()` using your trained model and your new data. You can see some examples of that with random forest models here: parsnip.tidymodels.org/articles/articles/Examples.html#rand-forest-models
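A sketch, assuming `final_fit` is your trained workflow or model and `new_df` has the same column setup as the training data:

```r
# hard class predictions (one row per row of new_df)
predict(final_fit, new_data = new_df)

# class probabilities, for classification models
predict(final_fit, new_data = new_df, type = "prob")
```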
@wayanadolan2696 · 2 years ago
@JuliaSilge Wonderful, thanks!
@Bigmoney703 · 2 years ago
What is the value of tidymodels vs something like caret? It seems like running the final model requires another function call on top of more processing. Plus, getting variable importance also requires a model re-run.

Some things that are abstracted away in an unclear manner are the downsampling and new factor levels in test. Is the data downsampled for the cross-validation folds? When you show the cv-folds, they have the original non-downsampled counts (same on the test data). Is that downsampled when the model / cv metrics are calculated? I assume it has to be, but then the performance should be different between train / test datasets unless the test set is also downsampled (that would be a big no-no).

For new levels, I've had these packages break completely when there is a new level in test that's not in training, and it's quite annoying to fix.
@JuliaSilge · 2 years ago
- Max and I are doing a keynote at rstudio::conf to answer your first question in more detail! Very shortly, tidymodels functions are more modular and more maintainable, exchanging a large number of arguments for a large number of functions. We believe this is better in the long run, but caret is not going anywhere if you want to keep using it!
- It is true that tuning your model and the final fitting of your model are separate in tidymodels. This is by design, for better understanding, clarity, efficiency, modularity, etc. Feel free to keep using caret if your preference is for "one giant function".
- When using any subsampling approach, the subsampling *only* happens to the training/analysis set, not the testing/assessment set. It is bad practice to apply it to any holdout data, because you need to estimate your metrics as you will experience them on new data. You can read more here: www.tmwr.org/recipes.html#skip-equals-true
- We provide support for handling new factor levels here: recipes.tidymodels.org/reference/step_novel.html
@juliantagell1891 · 4 years ago
When you showed the map at 14:27... it got me thinking of using a neural network modelling... like in playground.tensorflow.org/ but I don't know how to implement such a thing... Seems like certain regions are certainly more prone to be a certain type of tree... as a more complex way of utilising the Lat Lon coordinates. Any thoughts?
@eddytheflow · 4 years ago
@ 50:00, this is a VIP based on the tuned model using the *training data*. Isn't it good practice to create a final model using *all data*? And if that is the case wouldn't it be more beneficial to report the VIP using that new model? Does last_fit() also create a new model using *all data*?
@JuliaSilge · 4 years ago
The last_fit() function fits one time on the training data and evaluates one time on the testing data because we consider that best practice for most practitioners in most situations; you typically need an unbiased performance estimate for your final model which requires a test set. Certainly you may have different constraints and it's not unreasonable in some circumstances to fit a final model to all data. You can use regular old fit() for that, and then you can compute variable importance for that model.
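The two options contrasted in this reply look roughly like this; `final_wf`, `data_split`, and `all_data` are placeholder names:

```r
library(tidymodels)

# best practice: fit once on training data, evaluate once on the test set
final_res <- last_fit(final_wf, split = data_split)
collect_metrics(final_res)        # unbiased performance estimate

# if you later want a model trained on every row (e.g. for deployment),
# use regular fit() on all the data; note you no longer have an
# untouched test set to estimate this model's performance
deployed_fit <- fit(final_wf, data = all_data)
```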
@jasonosman3209 · 3 years ago
Thank you for the tutorial, and I do have a question: how do I plot the final ROC curve?
@JuliaSilge · 3 years ago
Check out the `roc_curve()` function: yardstick.tidymodels.org/reference/roc_curve.html
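For example, starting from the predictions collected after `last_fit()`; the `outcome` and `.pred_positive_class` column names below are placeholders, so substitute your own outcome column and the `.pred_*` probability column for the event level of interest:

```r
library(tidymodels)

final_res %>%                 # result of last_fit()
  collect_predictions() %>%
  roc_curve(truth = outcome, .pred_positive_class) %>%
  autoplot()                  # quick ggplot2 ROC curve
```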
@brendang8610 · 4 years ago
I learn more and more with each video! Something I've been trying to figure out, though: can a workflow object contain more than one model spec at the same time, so I can choose which spec I want to use to fit the data? Or is a workflow limited to one model spec (so each model has its own workflow)?
@JuliaSilge · 4 years ago
A workflow can contain one model spec, or *no* model spec. If you want to try out different models, you can set up a workflow with only the recipe (or formula), and then use `add_model()` before fitting.
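So one recipe-only workflow can be reused with several model specs; a sketch, where `tree_rec` and `train_df` are placeholder names:

```r
library(tidymodels)

base_wf <- workflow() %>%
  add_recipe(tree_rec)          # preprocessing only, no model yet

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

# add a model right before fitting; each call yields an independent fit
rf_fit  <- base_wf %>% add_model(rf_spec) %>% fit(data = train_df)
glm_fit <- base_wf %>% add_model(logistic_reg()) %>% fit(data = train_df)
```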
@Scrabyard · 3 years ago
When I try different hyperparameters for a random forest (regressor), the best model on my test set (e.g. 80% accuracy) mostly has around 99% accuracy on my training set. So it massively overfits, but when I reduce this overfitting by hyperparameter tuning, my accuracy on the test set also decreases. So is overfitting in some cases OK if the model performs best on the test set?
@JuliaSilge · 3 years ago
It sounds like you may have been using the test set to choose or compare models, which is a practice that will lead, in general, to poor performance on new data. The purpose of the test set is to estimate performance on new data; you can't use it to compare or choose or tune models. We want to use resampled datasets for comparing or tuning models. You might want to check out these two chapters of TMwR: www.tmwr.org/splitting.html www.tmwr.org/resampling.html
@mkklindhardt · 3 years ago
Dear Julia, I highly appreciate your elaborate and fun tutorials on your channel. I have come across a few questions on my steep learning curve with R, tidyverse, etc. (btw I am a fan!). 1) In this video, why are you downsampling, and perhaps more importantly, what is happening behind the down_sample function? 2) I am following the tutorial with my own dataset, and at some point I get the following error: "There are new levels in a factor: NA" (repeated three times), then "Error: logRR should be a factor variable." For your info, logRR (response ratios) is my outcome variable. Why is it complaining about this? 3) I have started learning the mlr package for machine learning exercises on the huge dataset I'm currently working on for my internship. What advantages and disadvantages are there between tidymodels and mlr, and how could I select the right package features for best performance? Do you maybe have information/materials about this? Thank you! Kind regards
@JuliaSilge · 3 years ago
You can read more about subsampling at these two links: www.tidymodels.org/learn/models/sub-sampling/ www.tmwr.org/recipes.html#skip-equals-true If you have new levels in testing data that aren't in your training data, you might check out using `step_novel()`: recipes.tidymodels.org/reference/step_novel.html The mlr3 ecosystem has a lot to recommend it, and is very similar in its goals to tidymodels. You will get similar performance from both (used correctly/as designed) because the underlying algorithms are the same; choosing between them is a matter of choosing an interface that you prefer. I would select one vs. the other based on which one you feel is the best fit for your preferences or the constraints of your organization.
@mkklindhardt · 3 years ago
Appreciate your detailed answers, @JuliaSilge. For point 2), now I only get the error: "logRR should be a factor variable".
@mkklindhardt · 3 years ago
@JuliaSilge For some reason, it seems to be related to step_dummy(). When I comment it out, the problem does not appear. Instead, I did the conversion of characters to factors when creating the data frame. What is your exact explanation of why we need to perform step_dummy(all_nominal(), -all_outcomes()) in the recipe if we are already converting characters to factors in the data frame, prior to the recipe? Thank you! Have a great day
@JuliaSilge · 3 years ago
@@mkklindhardt It's hard to say what's going on based on this description. To get detailed help like this, I'd recommend creating a reprex and posting on RStudio Community: rstd.io/tidymodels-community In general, you create dummy/indicator variables when you don't want a factor but instead want a numeric representation of the information that a factor variable holds: recipes.tidymodels.org/reference/step_dummy.html#examples
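A tiny self-contained illustration of the factor-vs-dummy distinction described in that reply (not code from the video):

```r
library(recipes)
library(tibble)

df <- tibble(y     = c(1, 2, 3),
             color = factor(c("red", "blue", "red")))

# step_dummy() replaces the factor column with numeric 0/1 indicator
# columns, which models that require numeric inputs can consume
recipe(y ~ ., data = df) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep() %>%
  bake(new_data = NULL)
```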
@jsst28 · 3 years ago
I'm using regression instead of classification and I'm getting this error: "All models failed. See the `.notes` column." In the notes it says, "unknown projection function." Any idea what the issue might be? Thanks
@JuliaSilge · 3 years ago
Hmmmmm, hard to say with only that much information. Can you put together a reproducible example and post on RStudio Community? We will be able to help you out there: rstd.io/tidymodels-community
@jsst28 · 3 years ago
@JuliaSilge I actually figured out the issue with the random forest! I made a mistake in the step_other code. Thanks!
@AndreaDalseno · 3 years ago
Hi Julia, thank you so much for your videos. I have a question for you: is there a way to restrict the values in grid_regular() to a specific set, e.g. c(500, 1000, 1500) or something like seq(400, 2400, by = 200)? It looks like the param functions expect a range with min and max only. I'm pretty sure there is a way, but even after googling I could not figure it out. Can you help me?
@AndreaDalseno · 3 years ago
I'm pretty confident in python but new to R and your videos are teaching me so much! 👍
@JuliaSilge · 3 years ago
@AndreaDalseno The various parameter functions from dials are meant to be helpful convenience functions, but you can also just use a dataframe/tibble to specify the grid. Here is one example that does this: www.tidymodels.org/learn/work/tune-text/#grid-search It's a pretty complicated model overall, but focus on how the grid is specified as a plain tibble with `crossing()` from tidyr, then used as an argument to `tune_grid()`.
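Applied to the question above about a specific set of values, that looks something like this; `tune_wf` and `trees_folds` follow the object names used in the video, and the `mtry` values are illustrative:

```r
library(tidyr)
library(tune)

# a plain tibble works as a grid: one column per tuned parameter,
# one row per candidate combination
rf_grid <- crossing(
  trees = seq(400, 2400, by = 200),
  mtry  = c(3, 5, 8)
)

tune_res <- tune_grid(tune_wf, resamples = trees_folds, grid = rf_grid)
```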
@AndreaDalseno · 3 years ago
@JuliaSilge It works perfectly. For example I did: rf_grid
@AndreaDalseno · 3 years ago
In such a case a racing search like the one you showed in your video about XGBoost may be a better choice, though.
@alitahsili · 2 years ago
expand_grid()
@jamespaz4333 · 2 years ago
How long did you wait? I am curious :) Mine is taking too long... 38:30
@JuliaSilge · 2 years ago
Gosh, honestly I can't remember but I would guess not longer than 30 min? None of these Tidy Tuesday datasets are too big.
@coolghoul9 · 2 years ago
Julia, what pieces would I use to make predictions on test data that we didn't train on / the final test submission where we don't know the results, like in a Kaggle competition?
@coolghoul9 · 2 years ago
I feel like I'm close, but wouldn't the recipe not work on new data, since it would choose different factor levels in the threshold part?
@JuliaSilge · 2 years ago
@@coolghoul9 The threshold is *learned* from the training data and then will be applied to testing data or new data. No new thresholds are computed with new data. This is one of the real benefits of using tidymodels!
@JuliaSilge · 2 years ago
If you want to predict on new data that doesn't have the outcome, I would recommend using `augment()`: parsnip.tidymodels.org/reference/augment.html Or `predict()`: parsnip.tidymodels.org/reference/predict.model_fit.html
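A sketch of the difference between the two; `final_fit` and `new_df` are placeholder names:

```r
library(parsnip)

# augment() returns new_df with the prediction columns
# (e.g. .pred_class, .pred_* probabilities) bound on as extra columns
augment(final_fit, new_data = new_df)

# predict() returns only the prediction column(s) themselves
predict(final_fit, new_data = new_df)
```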
@coolghoul9 · 2 years ago
@JuliaSilge When I try to predict on the new data, it says values are missing in... and names pretty much all the columns that are made by the preprocessing, so I'm not sure, because even when I bake the recipe on the new data it shows there are no missing values. Really confusing.
@JuliaSilge · 2 years ago
@coolghoul9 I suggest that you create a reprex for this and post it on RStudio Community: rstd.io/tidymodels-community That is a better forum for specific debugging questions.
@DzorDzi83 · 4 years ago
I have a problem when I get to the tune_grid stage: "recipe: Error: could not find function \"all_nominal\"". All models fail; this happens every time, even in the lasso tutorial. I tried removing packages and reinstalling them, but to no avail.
@JuliaSilge · 4 years ago
Hmmmm, that is somewhat strange. You can check out that function here: tidymodels.github.io/recipes/reference/has_role.html You might try to build a very small example using that function, i.e. a reprex, and see if you can find where it is going wrong.
@kaosavenger · 4 years ago
I had the same issue, but on one of Julia's other videos. I found that it was doParallel::registerDoParallel() that was causing my problem. I didn't use that and it worked for me.
@DzorDzi83 · 4 years ago
@JuliaSilge Now it's working after updating all the packages to the latest versions. I always have problems when I'm tuning a model; it always says that some formula in the preprocessing (preparation) step is having a problem. Now, after 2 months and updates, it's working.
@hotchord · 4 years ago
Your videos are wonderful. Great teaching tool. Could you post your code, please?
@JuliaSilge · 4 years ago
That's available here on my blog: juliasilge.com/blog/sf-trees-random-tuning/
@ruoguliu6072 · 10 months ago
Super cool vid! I cannot find step_downsample; anyone know why?
@ruoguliu6072 · 10 months ago
Also, if I add in the step_dummy, I get: "Error in `step_dummy()`: Caused by error: ! cannot allocate vector of size 30749.5 Gb".
@JuliaSilge · 10 months ago
You can find that recipe step here: themis.tidymodels.org/reference/step_downsample.html
@JuliaSilge · 10 months ago
@@ruoguliu6072 That's a memory problem. Are you working with a really enormous dataset? You might try training a model with only 10% of your data and see if you can get it working before using your whole dataset.
@ruoguliu6072 · 9 months ago
@JuliaSilge Actually it was the ID column; I solved it with step_dummy(all_nominal(), -all_outcomes(), -the_ID_col_original_name).
@ruoguliu6072 · 9 months ago
@JuliaSilge Thank you! I found that the downsample step was deprecated in recipes and now lives in the themis package.
@hansmeiser6078 · 3 years ago
*What does 'just enough trees' mean?*
@JuliaSilge · 3 years ago
You can read more here in what I think is a good explanation about how the number of trees behaves in a random forest: stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred/36183
@hansmeiser6078 · 3 years ago
@JuliaSilge hm.
@normannborg · 4 years ago
How can you say you didn't overfit just by looking at the test metrics?
@JuliaSilge · 4 years ago
Comparing the training and testing metrics is helpful, but yep, not the whole story!
@normannborg · 4 years ago
@JuliaSilge Thanks for the answer. Do you have any lecture, reading, or anything else you'd suggest about overfitting detection? (I'm facing this problem, which came up with a nested CV applied to random forest.)
@JuliaSilge · 4 years ago
@normannborg Hmmm, one resource is this section in Feature Engineering: www.feat.engineering/selection-overfitting.html