To downsample or not? Handling class imbalance in bird feeder observations

2.8K views

Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders, both with and without downsampling, to find out. Check out the code on my blog: juliasilge.com...

Published: Aug 28, 2024

Comments: 23

@wouldntyaliktono 1 year ago

One way I like to think about this question of downsampling is whether it alters the bias term of my model. Rebalancing the data will force the model to assume that the global average probability of SQUIRREL is 50%, but that isn't the case in the empirical data. And that can affect how successful my models are when they're deployed to production.
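The point about the bias term can be made concrete with a small sketch. For a logistic-style model, training on rebalanced data mainly shifts the log-odds by the difference between the rebalanced and the original base rates, so a predicted probability can be mapped back with a prior correction. The 80% base rate below is an assumption taken from the roughly 4-to-1 ratio mentioned elsewhere in this thread, and `correct_for_resampling` is a hypothetical helper, not part of tidymodels.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z: float) -> float:
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def correct_for_resampling(p_balanced: float, train_rate: float, true_rate: float) -> float:
    """Shift a probability from a model fit on rebalanced data back to the original base rate.

    train_rate: share of the positive class in the rebalanced training data.
    true_rate:  share of the positive class in the original data (an assumption here).
    """
    offset = logit(true_rate) - logit(train_rate)
    return inv_logit(logit(p_balanced) + offset)


true_rate = 0.8   # assumed: about 4 squirrel sites for every non-squirrel site
train_rate = 0.5  # after downsampling to a 1:1 ratio

for p in (0.2, 0.5, 0.9):
    print(p, round(correct_for_resampling(p, train_rate, true_rate), 3))
```

A prediction of 0.5 from the downsampled model corresponds to the original 80% base rate after correction, which is the mismatch the comment describes.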

@JuliaSilge 1 year ago

Love this!

@natarajanlalgudi 1 year ago

Downsampling will have an impact in production, as it affects the model's ability to generalize to unseen data. A weighted loss function approach could actually yield much lower variance and much better model performance on unseen data outside of the training and validation process.

@JuliaSilge 1 year ago

@natarajanlalgudi In tidymodels, a similar/related approach is tuning using a custom cost function for classification: yardstick.tidymodels.org/reference/classification_cost.html
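To illustrate the idea of a cost-based metric, here is a minimal sketch that averages the predicted probability placed on the wrong class, weighted by a per-error cost. It is a simplified, hypothetical stand-in for a two-class problem, not yardstick's implementation; the function name, the cost values, and the toy data are all assumptions.

```python
def expected_cost(truth, prob_positive, cost_fn=1.0, cost_fp=1.0):
    """Average misclassification cost, weighted by predicted probabilities.

    truth:         list of 1 (positive) / 0 (negative) labels.
    prob_positive: predicted probability of the positive class for each row.
    cost_fn:       cost of missing a true positive (assumed value).
    cost_fp:       cost of flagging a true negative (assumed value).
    """
    total = 0.0
    for y, p in zip(truth, prob_positive):
        if y == 1:
            total += (1.0 - p) * cost_fn  # probability mass on the wrong class
        else:
            total += p * cost_fp
    return total / len(truth)


# Toy data with a 4:1 imbalance (made up for illustration).
truth = [1, 1, 1, 1, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4]

print(expected_cost(truth, probs))               # equal costs
print(expected_cost(truth, probs, cost_fp=5.0))  # errors on the minority class cost more
```

Raising the cost of errors on the minority class penalizes a model that leans toward the majority, without discarding any training rows.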

@xxXXCarbon6XXxx 1 year ago

I love squirrels, they are so cute so I could never be a hater. We were in Washington at the Vietnam memorial wall & my brother-in-law offered a squirrel a piece of banana. It bit his finger and I laughed so hard (yes they may have rabies!). Adorable.

@alexandroskatsiferis 1 year ago

Nice demonstration showing the complexity of imbalanced classes. An issue with choosing specificity, sensitivity, and similar metrics is that they all depend on the decision threshold (in this case 0.5), which further complicates decision making.
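The threshold dependence is easy to see with a toy example. The sketch below computes sensitivity and specificity for the same predicted probabilities at two cutoffs; the labels and probabilities are made up, and `sens_spec` is a hypothetical helper that assumes both classes are present.

```python
def sens_spec(truth, probs, threshold):
    """Return (sensitivity, specificity) for a given probability cutoff.

    Assumes truth contains at least one positive (1) and one negative (0).
    """
    tp = fn = tn = fp = 0
    for y, p in zip(truth, probs):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 8 positives, 2 negatives (illustrative only).
truth = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
probs = [0.95, 0.9, 0.85, 0.8, 0.7, 0.65, 0.6, 0.45, 0.7, 0.3]

for threshold in (0.5, 0.75):
    sensitivity, specificity = sens_spec(truth, probs, threshold)
    print(threshold, sensitivity, specificity)
```

The model and its probabilities are unchanged between the two rows; only the cutoff moves, yet sensitivity and specificity trade places.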

@CaribouDataScience 1 year ago

Thanks for sharing!!

@517127 1 year ago

Excellent work. I learn a lot from your videos.

@yangyang6008 1 year ago

Hi Julia, how can we define class imbalance? In the example, there are 4 times as many "squirrels" as "no squirrels". If there are only 1.5 times as many "squirrels" as "no squirrels", is it still called imbalance?

@JuliaSilge 1 year ago

I think anything other than perfect balance (i.e. the categories are equal) is imbalance, but in typical modeling projects you don't start having problems until you have proportions like 5-to-1 or 10-to-1.

@yangyang6008 1 year ago

@JuliaSilge Thank you for your help, Julia!

@natarajanlalgudi 1 year ago

@JuliaSilge 4:1 is on the borderline of "serious imbalance", I'm guessing. Some learners may tune better with resampling or penalties, and some may not.

@shauryamehta5339 1 year ago

Hi, I have a question: if I use two or more different models in my workflow set with two different specifications, how many models in total will be computed? For example, say I want to compute two models, one using regularized regression and the other a tree-based model, with two different specifications, one without downsampling and the other with downsampling. Will 4 models be computed in total, two for regularized regression and two for, say, random forest? Thanks

@JuliaSilge 1 year ago

If I'm understanding you correctly, it sounds like you will have 4 models (logistic regression + downsampling, logistic regression without, tree-based + downsampling, tree-based without). When you decide to compare them, they will be fit to your resamples. If you have 10 folds, then you will fit 40 models to understand which will be the right one for you.
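The arithmetic in this answer can be sketched as a cross of preprocessing options and model types, multiplied by the number of folds. The labels are plain strings for illustration, and the count assumes no hyperparameter tuning, which would multiply the number of fits further.

```python
from itertools import product

# Each workflow pairs one preprocessing option with one model type.
preprocessors = ["no downsampling", "downsampling"]
models = ["logistic regression", "random forest"]
n_folds = 10  # as in the 10-fold example from the answer

workflows = list(product(preprocessors, models))
total_fits = len(workflows) * n_folds  # each workflow is fit once per fold

for preprocessor, model in workflows:
    print(f"{model} + {preprocessor}")
print("workflows:", len(workflows), "total fits:", total_fits)
```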

@ismaelmontero4811 1 year ago

Hi Julia, thank you very much for your videos. I have a question. I have a dataset that only has nominal variables stored as factors (it's a classification problem). However, when I try to use your code, I get an error: "Some columns are non-numeric. The data cannot be converted to numeric matrix: 'ICode_Weather', 'ICode_Gender', 'ICategory_Age', 'iCode_Accident_Category', 'ICategory_Vehicle', 'ICategory_Time', 'BDrugs', 'BAlcohol', 'Week_Day', 'IZone'. There were issues with some computations A: x1". Can you give some advice? Thank you very much.

@JuliaSilge 1 year ago

You'll want to convert those to dummy or indicator variables using `step_dummy()`. Read more about this here: recipes.tidymodels.org/articles/Dummies.html
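For readers unfamiliar with dummy variables, the sketch below shows the general idea in a language-agnostic way: each level of a nominal column, except a reference level, becomes its own 0/1 column, so the model receives a numeric matrix. `dummy_encode` is a hypothetical helper and not the recipes API, and the `Week_Day` values are made up.

```python
def dummy_encode(values, levels=None):
    """Turn one nominal column into 0/1 indicator columns.

    The first level is dropped and acts as the reference level.
    Returns (names of the kept levels, list of indicator rows).
    """
    if levels is None:
        levels = sorted(set(values))
    kept = levels[1:]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Illustrative values for a column like 'Week_Day' (made up).
week_day = ["Mon", "Tue", "Mon", "Wed"]
names, rows = dummy_encode(week_day, levels=["Mon", "Tue", "Wed"])

print(names)
for row in rows:
    print(row)
```

A row of all zeros stands for the reference level, here "Mon".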

@ismaelmontero4811 1 year ago

@JuliaSilge Thank you for the information you shared, it was helpful. Do you know of any ways I could obtain the marginal effects?

@JuliaSilge 1 year ago

@ismaelmontero4811 Many of the typical methods for getting marginal effects will work just fine. Here is an example of generating partial dependence profiles: www.tmwr.org/explain.html#building-global-explanations-from-local-explanations
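As a rough sketch of what a partial dependence profile computes: for each value on a grid, set one feature to that value in every row, predict, and average the predictions. The `toy_model` function, its coefficients, and the feature names `alcohol` and `night` are all hypothetical stand-ins for a fitted classifier; this is not the code from the linked chapter.

```python
import math


def toy_model(row):
    """Hypothetical fitted classifier returning a probability from two 0/1 features."""
    z = -1.0 + 0.8 * row["alcohol"] + 0.3 * row["night"]
    return 1.0 / (1.0 + math.exp(-z))


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    profile = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        profile.append((value, sum(predictions) / len(predictions)))
    return profile


# Made-up observations for illustration.
rows = [
    {"alcohol": 0, "night": 0},
    {"alcohol": 1, "night": 1},
    {"alcohol": 0, "night": 1},
]

profile = partial_dependence(toy_model, rows, "alcohol", [0, 1])
for value, average in profile:
    print(value, round(average, 3))
```

The difference between the two averaged predictions gives one rough reading of the feature's average effect on the predicted probability.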

@yangyang6008 1 year ago

Hi Julia, thank you for the amazing tutorial! I wonder if it is possible to include Extreme Learning Machines in Tidymodels? Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural network (SLFN), which converges much faster than traditional methods and yields promising performance. The algorithm is currently included in the R package "elmNNRcpp" and "ELMR". Thank you.

@JuliaSilge 1 year ago

Not currently, no! You might be interested in learning how to create a parsnip model for it, like this: www.tidymodels.org/learn/develop/models/ Feel free to ask on GitHub or RStudio Community if you run into problems!

@yangyang6008 1 year ago

@JuliaSilge Thank you, Julia. I will try to create a parsnip model for ELM. Hopefully tidymodels will be updated to include the algorithm in the future, as ELM is very popular nowadays in machine learning.