How to handle high cardinality predictors for data on museums in the UK

Julia Silge

Published: 28 Aug 2024

Comments: 19

@utubemoitral 1 year ago

Fantastic. Thanks Julia, I have been looking for something like this for a long time. I'm also hoping you will do a video on what-if analysis soon, maybe?

@olexiypukhov-KT 1 year ago

Thanks again for all your videos! You are an amazing teacher.

@DavidSmith-sd5xo 1 year ago

The best!

@mcmahonandrewjonatha 1 year ago

I love this! Thank you, Julia. I have two quick questions. 1) The embedding techniques are impressive, but should I take any precautions when using them to avoid harmful data leakage from the outcome variable? 2) Once the high-cardinality predictor is transformed to a numeric variable, can I treat its correlations with other numeric variables as informative and meaningful on their own (assuming I made an ordinary correlation matrix)? Or would that be misguided?

@JuliaSilge 1 year ago

Yes, be sure to estimate embedding techniques inside of resampling (not one time before resampling) to avoid data leakage! If you use it in a workflow like I have shown here, that is the behavior. If you want to use something like `step_corr()` after you have created an embedding, I think that would work great. For something more like a correlation matrix, as in this section: www.tmwr.org/dimensionality.html#beans you just need to remember that the outcome was used to estimate that value.
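
To make the "inside of resampling" point concrete, here is a minimal sketch in plain Python rather than the R packages used in the video. It is not how `embed` or `workflows` are implemented: the function names, the add-0.5 smoothing, the overall-log-odds fallback for unseen levels, and the toy data are all assumptions chosen for illustration. The idea it shows is that the per-level encoding is re-estimated from the analysis rows of each fold only, then applied to the held-out rows.

```python
import math
from collections import defaultdict


def fit_encoding(levels, outcomes, smoothing=0.5):
    """Estimate a log-odds value per level from the given rows only."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for level, y in zip(levels, outcomes):
        counts[level][0] += y
        counts[level][1] += 1

    def log_odds(pos, total):
        # Assumed smoothing: keeps the value finite for all-0 or all-1 levels.
        return math.log((pos + smoothing) / (total - pos + smoothing))

    encoding = {lvl: log_odds(p, n) for lvl, (p, n) in counts.items()}
    # Assumed fallback for levels not seen in these rows: overall log-odds.
    default = log_odds(sum(outcomes), len(outcomes))
    return encoding, default


def apply_encoding(levels, encoding, default):
    """Map each level to its encoded value, using the fallback if unseen."""
    return [encoding.get(lvl, default) for lvl in levels]


def encode_within_folds(levels, outcomes, n_folds=3):
    """Yield (fold, encoded analysis rows, encoded held-out rows)."""
    n = len(levels)
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        train_idx = [i for i in range(n) if i not in held_out]
        test_idx = sorted(held_out)
        # The encoding is fit on the analysis rows of this fold only.
        encoding, default = fit_encoding(
            [levels[i] for i in train_idx],
            [outcomes[i] for i in train_idx],
        )
        yield (
            fold,
            apply_encoding([levels[i] for i in train_idx], encoding, default),
            apply_encoding([levels[i] for i in test_idx], encoding, default),
        )


# Toy data, made up for illustration (1 = accredited, 0 = unaccredited).
subjects = ["war", "art", "war", "local", "art", "war", "rural", "art", "local"]
accredited = [1, 0, 1, 0, 1, 1, 0, 0, 0]

for fold, train_x, test_x in encode_within_folds(subjects, accredited):
    print(fold, [round(v, 2) for v in test_x])
```

In the first fold the level "rural" appears only in the held-out rows, so it receives the fallback value instead of a value learned from its own outcome. Fitting the encoding once on all rows before splitting would let that held-out outcome leak into the predictor.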

@mcmahonandrewjonatha 1 year ago

@JuliaSilge Very helpful. Thanks 🙏

@alelust7170 1 year ago

Awesome

@Pablo-ln9nd 1 year ago

Hey @JuliaSilge! Great video as always :) I have a question: what do we do when we have a high cardinality target variable (let's say 17 categories)? Can we still use tree-based models (random forest, xgboost, etc.)? Thank you! :)

@JuliaSilge 1 year ago

If your outcome has that many levels, you have two main options that I have seen work out well in practice. You can a) train 17 separate models that are each 1 level vs. all others and at prediction time see which option has the highest probability, or b) train one model to predict for all levels at once. You can see here for an example of the second: smltar.com/mlclassification.html#mlmulticlass
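
Option a) can be sketched in a few lines of plain Python. This is only an illustration of the prediction-time rule described above, not code from the video: `predict_one_vs_rest` and the three stand-in "models" are hypothetical, and a real setup would use 17 fitted binary classifiers in their place.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_rest(binary_models, x):
    """Pick the class whose one-vs-rest model gives the highest probability.

    binary_models maps each class label to a function returning
    P(this class vs. all others | x).
    """
    scores = {label: model(x) for label, model in binary_models.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical stand-ins for fitted binary classifiers (three classes for brevity).
toy_models = {
    "a": lambda x: sigmoid(2.0 - x),
    "b": lambda x: sigmoid(1.0 - abs(x - 3.0)),
    "c": lambda x: sigmoid(x - 4.0),
}

label, scores = predict_one_vs_rest(toy_models, 3.0)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Note that the per-class probabilities come from separate models, so they need not sum to 1; only their ranking is used to choose the predicted class.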

@Pablo-ln9nd 1 year ago

@JuliaSilge Thank you so much for the info, Julia. Keep up the great content!

@mrsnakesss 1 year ago

Thanks for this video! I have never used embed for categorical predictors before. Why is the default value for a new level -0.909 and not 0? What does it represent? Thanks!

@JuliaSilge 1 year ago

This would be like the mean value of the outcome, or like the intercept or bias (if this predictor were the only one being used in the model). It's not zero because there are more unaccredited than accredited museums.
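
As a rough check on that reading, the value can be converted back to a probability. This sketch assumes the encoded value is on the log-odds scale, as it would be for a logistic-regression-style encoding of a binary outcome; which outcome class the probability refers to depends on how the outcome levels are ordered, which the thread does not state.

```python
import math


def inverse_logit(log_odds):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


new_level_value = -0.909  # the default value quoted in the question above
p = inverse_logit(new_level_value)
print(round(p, 3))  # 0.287
```

A log-odds of 0 would correspond to a probability of 0.5. A negative value means the class in question is the less common one, which fits the reply above if that class is "accredited".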

@mrsnakesss 1 year ago

@JuliaSilge I see, thanks!

@AnkeetSingh-gt9fm 4 months ago

Hey Julia, great tutorial. I had a question. Here you used Subject_Matter as the only high cardinality variable. If we have a dataset with multiple high cardinality columns, can the recipe method be used for all of them?

@JuliaSilge 4 months ago

Yes, you sure can! You will need to keep in mind how much data you have vs. how many predictors you are trying to encode in this way, and definitely keep in mind that you are using the **outcome** in your feature engineering. You can read more here: www.tmwr.org/categorical

@AnkeetSingh-gt9fm 4 months ago

@JuliaSilge Great, I’ll keep that in mind. Thank you!

@AnkeetSingh-gt9fm 4 months ago

@JuliaSilge Hi, I had a follow-up to my previous question. Would we have to define a separate recipe for each column? And when creating the workflow, how would you add the recipes for multiple columns (since a workflow only allows one recipe)? I was unable to find resources for this online. Any help would be appreciated!

@JuliaSilge 4 months ago

@AnkeetSingh-gt9fm Oh, you don't need a separate recipe for different columns, just separate steps. So you could do `step_lencode_glm()` then pipe to another `step_lencode_glm()`, etc.

@AnkeetSingh-gt9fm 4 months ago

@JuliaSilge Thank you, that’s what I figured, and I ran the code. I received an error: Error in - dsy2dpoC.Msymfrom) : not a positive definite matrix (and positive semidefiniteness is not checked). It looks like I need to assess some variables in my model. You are very helpful with your prompt replies, and I really appreciate it. Thank you!