Lasso regression with tidymodels and The Office

Julia Silge

12K views

Published: 28 Aug 2024

Comments: 44

@RAPmastaGBLASCO63 4 years ago

Every time I watch one of your videos I learn something new and become more confident in my modeling. Thank you so much for them!

@rrmaximiliano 4 years ago

Thanks, Julia for the video. Really interesting how you approached the cleaning and models in comparison to David. Pretty nice you keep making these videos. They are super helpful.

@neomorphicduck 4 years ago

Can't agree with you more!

@hesamseraj 1 year ago

I am reviewing all the videos and adding the tree episode names as some sort of homework for myself.

@erickknackstedt3131 4 years ago

Love it! Finding this channel has made my day.

@iugaMovil 3 years ago

Great video, Julia. It was a refresher on add_count and geom_col, because I stopped using them for some reason.

@minhnguyenbui6827 4 years ago

Oh wow, it's so amazing. I know you from the Text Mining with R book. Finding David's and your channels is a memorable milestone in my process of learning R :D

@alexnoble17 4 years ago

This is super interesting. I would love to do this analysis with Doctor Who (specifically New Who!)

@ethanthealien 4 years ago

This was fantastic! It got me really excited about tidymodels =)

@k.d0721 4 years ago

you are the best, I should put your name in my PhD thesis

@luisfernandobaldanfechio8958 3 years ago

Thanks a lot, excellent material. I'm getting a different result from the fitted workflow (@ 27:00). I get a tibble: 31 x 3 with only one intercept, while yours is a tibble: 1,563 x 5 with many intercepts. I copy/pasted the code from the blog post.

@JuliaSilge 3 years ago

Ah, I believe there has been a change in parsnip since this video was published that you only get the lambda you actually specified, not the whole path of lambdas: github.com/tidymodels/parsnip/blob/master/NEWS.md#parsnip-013

@juliantagell1891 4 years ago

Thanks Julia, this is great. Just one question at 4:20. The other day I realised I can put pipes inside a mutate to get something like below... do you reckon using this is a good idea (I don't see it much but it feels really efficient)? transmute(episode_name = title %>% str_to_lower() %>% str_remove_all(remove_regex) %>% str_trim(), imdb_rating)

@hesamseraj 3 years ago

Amazing thank you very much.

@brendang8610 4 years ago

Awesome and informative video as always! I have a question and hope you can help clarify - I noticed when you did the bootstrap resampling you used office_train as the dataset, which is the unmodified training data. In another video (the hotel bookings one) you used the juiced recipe as the dataset when creating the monte carlo cross validation resamples. Is there a best practice on which dataset to use when resampling with tidymodels - the un-processed training data vs the pre-processed & juiced recipe data? Thanks!

@brendang8610 4 years ago

oh! wait is it because here you're using a workflow() and in the hotel bookings video you weren't? and if so, is the workflow applying the recipe, prepping and juicing in the resampling step for you?

@JuliaSilge 4 years ago

@@brendang8610 Yes, that's basically it! A workflow that includes a recipe will apply that recipe. Generally it is probably better practice to do resampling on the unmodified training set, because otherwise you can get LEAKAGE from your preprocessing steps and then overly optimistic results from resampling.

@vincentpepe1064 4 years ago

Hi Julia, Love the video! I was wondering how you would compare the accuracy of the model to the testing data? I need to submit a report with both the predicted and actual values and cannot seem to find it.

@iqu3261 2 years ago

Thanks so much, Julia, for the valuable videos. I'm trying to evaluate LDA topic modelling on tweets using NPMI. Do you have an idea how to implement it in R? Thanks, Sam

@muttbane1072 4 years ago

Great video! Love it!

@hoschie211 4 years ago

Very nice video! Well explained and above all: 30:18 :-)

@AdrianaCastilloC 1 year ago

Julia, this is great!! It's so well explained (: ... Do you know by any chance how to do exactly this for spatial (polygon) data?

@JuliaSilge 1 year ago

You might check out the spatialsample package: spatialsample.tidymodels.org/ And here is a blog post where I walk through how to use it: juliasilge.com/blog/drought-in-tx/

@AdrianaCastilloC 1 year ago

@@JuliaSilge oh, my god! This is GREATTTT!!! many many thanks!!

@TheFrankyguitar 4 years ago

Thanks for the great video, Julia! I learned a lot. If we use a GLM, we might want to use a univariate filter to keep only relevant variables in the model, since GLMs don't have built-in variable selection. Is there a way to do this with tidymodels? Maybe with recipes?

@JuliaSilge 4 years ago

Not currently, but we're interested in recipes supporting feature selection like that in the future!

@TheFrankyguitar 4 years ago

That's great! Thank you.

@mindlessgreen 3 years ago

Thanks for the nice tutorial. At 22:30, office_prep was created. What was that about? It was never used downstream. In general, I don't get the use of prep and bake.

@JuliaSilge 3 years ago

I think it *is* useful to know how to use `prep()` and `bake()` if you are going to be a tidymodels user, in order to debug and problem solve when things don't go right with your recipes. It's a way to check out how your recipe will preprocess your data for modeling. You can read about what the two functions do here: www.tmwr.org/recipes.html#using-recipes

@drinks3544 2 years ago

What does the value used to indicate "importance" on the x-axis mean? Is that R^2?

@JuliaSilge 2 years ago

In the vip package, what "importance" is varies from model to model. You can look more at the documentation but for a linear model like a lasso regularized model, it is just literally the coefficients from the model itself (similar to coefficients from `lm()`). You can check out documentation for vip here: koalaverse.github.io/vip/

@vladimirmijatovic883 1 year ago

Hi @julia - great video! Funny - I tried tuning hyperparameters with two different values of trees. When I tune the model with trees = 100 and with trees = 1000, the order of variable importance changes. With trees = 100 the most important variable is mhi_2018, followed by one_race_a, while with trees = 1000 the most important variable is one_race_a (followed by mhi_2018). How is this possible? Where could this be coming from?

@JuliaSilge 1 year ago

I think you may be asking about a different video in this comment? But yes, maybe I should have been more clear that the variable importance I show is for *that model specifically*. The hyperparameters you choose for your algorithm often have an impact on variable importance. (And if you use variable importance to do feature selection, then that will change the hyperparameters you choose!) There is some related discussion here: stats.stackexchange.com/questions/264533/how-should-feature-selection-and-hyperparameter-optimization-be-ordered-in-the-m

@vladimirmijatovic883 1 year ago

@@JuliaSilge OMG, how embarrassing :), indeed it is related to another video of yours. The question was about this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-OMn1WCNufo8.html (Predict Childcare Costs), but RU-vid kept rolling to the next video while I was waiting for my model to be trained :). However, I was surprised that a hyperparameter such as the number of trees could impact the order of variable importance. I guess my intuition was wrong.

@ryankirk574 4 years ago

What RStudio theme are you using? I could not find that in the default appearances.

@JuliaSilge 4 years ago

It's one of the themes from the rsthemes package: github.com/gadenbuie/rsthemes

@ryankirk574 4 years ago

@@JuliaSilge Thank you for the quick reply! Watched and now reading through the blog explanation for further understanding.

@travisknoche5639 4 years ago

Hi Julia, thanks for the video! I am getting the error: "All models failed in tune_grid(). See the `.notes` column." when running tune_grid(). My code is identical to yours and I'm also using a mac. Any ideas?

@travisknoche5639 4 years ago

all of the .notes say "model 1/1 (predictions): Error in cbind2(1, newx) %*% nbeta: invalid class 'NA' to dup_mMatrix_as_dgeMatrix"

@JuliaSilge 4 years ago

@@travisknoche5639 Is this using the same code/data as in my blog post? juliasilge.com/blog/lasso-the-office/ Or different data?

@travisknoche5639 4 years ago

@@JuliaSilge Yep!

@JuliaSilge 4 years ago

@@travisknoche5639 Does the first fit work, when you are not tuning?