To downsample or not? Handling class imbalance in bird feeder observations 

Julia Silge

Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders both with and without downsampling to find out. Check out the code on my blog: juliasilge.com...

Published: Aug 28, 2024

Comments: 23
@wouldntyaliktono 1 year ago
One way I like to think about this question of downsampling is whether it alters the bias term of my model. Rebalancing the data will force the model to assume that the global average probability of SQUIRREL is 50%, but that isn't the case in the empirical data. And that can affect how successful my models are when they're deployed to production.
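The bias-term point can be made concrete with a little arithmetic. The following is a minimal Python sketch, not code from the video (which uses R and tidymodels). It assumes rows were dropped at random within each class and that the model is well calibrated on the resampled data; the function and parameter names (`correct_probability`, `keep_pos`, `keep_neg`) are hypothetical, and the 4-to-1 ratio is an assumed example.

```python
def correct_probability(p_sampled, keep_pos=1.0, keep_neg=1.0):
    """Map a probability from a model trained on resampled data
    back to the original class balance.

    keep_pos / keep_neg are the fractions of positive / negative rows
    kept when the training data was downsampled (hypothetical names).
    Assumes 0 < p_sampled < 1.
    """
    odds_sampled = p_sampled / (1.0 - p_sampled)
    # Downsampling multiplies the training odds by keep_pos / keep_neg,
    # so dividing by that factor undoes the shift in the bias term.
    odds_original = odds_sampled * keep_neg / keep_pos
    return odds_original / (1.0 + odds_original)


# Assumed example: positives outnumber negatives 4 to 1,
# so balancing the classes keeps 1/4 of the positive rows.
p_balanced = 0.5
p_original = correct_probability(p_balanced, keep_pos=0.25)
print(round(p_original, 3))  # 0.8
```

Under these assumptions, a prediction of 50% from the balanced model corresponds to 80% at the original class balance, which is the gap the comment describes.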
@JuliaSilge 1 year ago
Love this!
@natarajanlalgudi 1 year ago
Downsampling will have an impact in production, as it will affect the model's ability to generalize to unseen data. A weighted loss function approach could actually yield far lower variance and far better model performance on unseen data outside of the training and validation process.
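As an illustration of what a weighted loss means here, the sketch below computes a class-weighted log loss in plain Python. It is only a toy under stated assumptions: the data are made up, the function name `weighted_log_loss` and its parameters are hypothetical, and it says nothing about how any particular library implements case weights or about the resulting variance.

```python
import math


def weighted_log_loss(y_true, p_pred, weight_pos=1.0, weight_neg=1.0):
    """Weighted mean binary cross-entropy with one weight per class."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        w = weight_pos if y == 1 else weight_neg
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += w * loss
        weight_sum += w
    return total / weight_sum


# Made-up data: four positives and one negative (a 4:1 imbalance),
# with a model that always predicts 0.8 for the positive class.
y = [1, 1, 1, 1, 0]
p = [0.8, 0.8, 0.8, 0.8, 0.8]

unweighted = weighted_log_loss(y, p)
# Weight the rare class by the imbalance ratio so that both classes
# contribute the same total weight, without discarding any rows.
balanced = weighted_log_loss(y, p, weight_pos=1.0, weight_neg=4.0)
print(unweighted < balanced)  # True
```

The weighted version penalizes the mistake on the rare class more heavily while still using every row, which is the main contrast with downsampling.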
@JuliaSilge 1 year ago
@natarajanlalgudi In tidymodels, a similar/related approach is tuning using a custom cost function for classification: yardstick.tidymodels.org/reference/classification_cost.html
@xxXXCarbon6XXxx 1 year ago
I love squirrels, they are so cute so I could never be a hater. We were in Washington at the Vietnam memorial wall & my brother-in-law offered a squirrel a piece of banana. It bit his finger and I laughed so hard (yes they may have rabies!). Adorable.
@alexandroskatsiferis 1 year ago
Nice demonstration showing the complexity of imbalanced classes. An issue with choosing specificity, sensitivity, and similar metrics is that they all depend on the decision threshold (in this case 0.5), which further complicates decision making.
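The threshold dependence is easy to see on a toy example. The sketch below is plain Python with made-up labels and predicted probabilities, not the bird feeder data; the helper name `sens_spec` is hypothetical.

```python
def sens_spec(y_true, p_pred, threshold=0.5):
    """Sensitivity and specificity when predicting positive for p >= threshold."""
    pairs = list(zip(y_true, p_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels and predicted probabilities, for illustration only.
y = [1, 1, 1, 1, 0, 0]
p = [0.9, 0.7, 0.6, 0.4, 0.55, 0.2]

for t in (0.3, 0.5, 0.65):
    sens, spec = sens_spec(y, p, threshold=t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The same predicted probabilities give different sensitivity and specificity at each cutoff, so comparing models on these metrics also means choosing a threshold.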
@CaribouDataScience 1 year ago
Thanks for sharing!!
@517127 1 year ago
Excellent work. I learn a lot from your videos.
@yangyang6008 1 year ago
Hi Julia, how do we define class imbalance? In the example, "squirrels" is 4 times as common as "no squirrels". If "squirrels" were only 1.5 times as common as "no squirrels", would that still be called imbalance?
@JuliaSilge 1 year ago
I think anything other than perfect balance (i.e. the categories are equal) is imbalance, but in typical modeling projects you don't start having problems until you have proportions like 5-to-1 or 10-to-1.
@yangyang6008 1 year ago
@JuliaSilge Thank you for your help Julia!
@natarajanlalgudi 1 year ago
@JuliaSilge I'm guessing 4:1 is on the borderline of "serious imbalance". Some learners may tune better with resampling or penalties, and some may not.
@shauryamehta5339 1 year ago
Hi, I have a question: if I use two or more different models in my workflow set, each with two different specifications, how many models will be computed in total? For example, say I want to fit two models, one regularized regression and one tree-based, each with two specifications, one without downsampling and one with. Will 4 models be computed in total, two for regularized regression and two for, say, random forest? Thanks.
@JuliaSilge 1 year ago
If I'm understanding you correctly, it sounds like you will have 4 models (logistic regression + downsampling, logistic regression without, tree-based + downsampling, tree-based without). When you decide to compare them, they will be fit to your resamples. If you have 10 folds, then you will fit 40 models to understand which will be the right one for you.
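The counting in this answer can be sketched as a simple cross product. This is plain Python rather than tidymodels code, and the labels below are hypothetical stand-ins for the preprocessors and model specifications.

```python
from itertools import product

# Hypothetical labels standing in for the preprocessors and model specifications.
preprocessors = ["no_downsampling", "downsampling"]
models = ["logistic_regression", "random_forest"]
n_folds = 10

# Every preprocessor is crossed with every model specification.
workflows = [f"{prep}_{model}" for prep, model in product(preprocessors, models)]

# Each workflow is fit once per resample.
total_fits = len(workflows) * n_folds
print(len(workflows), total_fits)  # 4 40
```

If any of the workflows also has tuning parameters, the number of fits grows further with the number of candidate values tried.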
@ismaelmontero4811 1 year ago
Hi Julia, thank you very much for your videos. I have a question. I have a dataset that only has nominal variables transformed as factors (it's a classification problem), however, when I try to use your code, I get an error: error: Some columns are non-numeric. The data cannot be converted to numeric matrix: 'ICode_Weather', 'ICode_Gender', 'ICategory_Age', 'iCode_Accident_Category', 'ICategory_Vehicle', 'ICategory_Time', 'BDrugs', 'BAlcohol', 'Week_Day', 'IZone'. There were issues with some computations A: x1 Can you give some advice? Thank you very much.
@JuliaSilge 1 year ago
You'll want to convert those to dummy or indicator variables using `step_dummy()`. Read more about this here: recipes.tidymodels.org/articles/Dummies.html
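For readers unfamiliar with dummy or indicator variables, here is a plain-Python sketch of the idea. It is not tidymodels code and does not reproduce what `step_dummy()` does internally; the helper name `dummy_encode` is hypothetical, the weekday values are made up, and dropping the first level as a reference is a choice made for this sketch.

```python
def dummy_encode(values, drop_first=True):
    """Turn a list of category labels into 0/1 indicator columns."""
    levels = sorted(set(values))
    # Dropping one reference level avoids a redundant column
    # in models that already include an intercept.
    kept = levels[1:] if drop_first else levels
    return {level: [1 if v == level else 0 for v in values] for level in kept}


# Made-up values for a weekday column like the one in the question.
week_day = ["Mon", "Tue", "Mon", "Sun"]
print(dummy_encode(week_day))
# {'Sun': [0, 0, 0, 1], 'Tue': [0, 1, 0, 0]}
```

Each factor column becomes a set of numeric columns, which is what a model that requires a numeric matrix needs.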
@ismaelmontero4811 1 year ago
@JuliaSilge Thank you for the information you shared, it was helpful. Do you know of any ways I could obtain the marginal effects?
@JuliaSilge 1 year ago
@ismaelmontero4811 Many of the typical methods for getting marginal effects will work just fine. Here is an example of generating partial dependence profiles: www.tmwr.org/explain.html#building-global-explanations-from-local-explanations
@yangyang6008 1 year ago
Hi Julia, thank you for the amazing tutorial! I wonder if it is possible to include Extreme Learning Machines in tidymodels? Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural networks (SLFN), which converges much faster than traditional methods and yields promising performance. The algorithm is currently included in the R packages "elmNNRcpp" and "ELMR". Thank you.
@JuliaSilge 1 year ago
Not currently, no! You might be interested in learning how to create a parsnip model for it, like this: www.tidymodels.org/learn/develop/models/ Feel free to ask on GitHub or RStudio Community if you run into problems!
@yangyang6008 1 year ago
@JuliaSilge Thank you Julia, I will try to create a parsnip model for ELM. Hopefully, tidymodels will be updated to include the algorithm in the future, as ELM is very popular nowadays in machine learning.
@cuysaurus 1 year ago
You look awesome, Julia.
@joshuapooley8993 1 year ago
I am not sure if @ijessup is into data science, but if she were then this would be the video for her. #Gary