Julia Silge
Empirical Bayes for Doctor Who episodes (24:05, 7 months ago)
Topic modeling for Taylor Swift Eras (25:10, 8 months ago)
Evaluate the performance of GPT detectors (28:41, 11 months ago)
Topic modeling for Spice Girls lyrics (30:29, 2 years ago)
Comments
@dinohadjiyannis3225 · 1 month ago
Julia, if I'm using a topic model on YouTube comments to determine which video best explains topic modeling, how can I decide whether your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?
@JuliaSilge · 1 month ago
HAHA I can't tell if this is serious or not 🙈 In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
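In case a concrete sketch helps, here is roughly what that comparison could look like with topicmodels and tidytext. This is a minimal sketch rather than code from the video; `comments_dtm`, the choice of `k = 6`, and the idea that topic 3 is the "topic modeling" topic are all assumptions for illustration.

```r
# Sketch: rank videos by how much of their comment text lands in one topic.
# Assumes `comments_dtm` is a document-term matrix whose document names are
# video IDs (all comments for a video concatenated into one document).
library(topicmodels)
library(tidytext)
library(dplyr)

lda_fit <- LDA(comments_dtm, k = 6, control = list(seed = 123))

video_gamma <- tidy(lda_fit, matrix = "gamma") |>
  filter(topic == 3) |>                  # hypothetical "topic modeling" topic
  group_by(document) |>                  # here, document == video ID
  summarise(mean_gamma = mean(gamma)) |>
  arrange(desc(mean_gamma))

video_gamma                              # videos ranked by topic prevalence
```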
@dinohadjiyannis3225 · 29 days ago
@JuliaSilge If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which gives the top word probabilities per topic), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But then, if I manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it. Does this make sense, or am I misinterpreting gamma/beta? (Assume I have concatenated all comments into one corpus per video, and each corpus is linked to a video ID.)
@JuliaSilge · 28 days ago
@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍
@dinohadjiyannis3225 · 28 days ago
@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇
@rosiedavies7708 · 1 month ago
Does this work in the same way with regression problems?
@rosiedavies7708 · 1 month ago
Also, thanks for this video; it's very helpful and clear.
@JuliaSilge · 1 month ago
Yep, you would use `set_mode("regression")` in that case
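For anyone who wants to see that change in context, here is a minimal sketch of a regression-mode workflow. The outcome name `price`, the data object `my_data`, and the xgboost spec are illustrative assumptions, not the exact code from the video.

```r
# Sketch: the same kind of boosted-tree spec, switched to regression mode.
library(tidymodels)

xgb_spec <- boost_tree(trees = 500) |>
  set_engine("xgboost") |>
  set_mode("regression")        # instead of set_mode("classification")

xgb_wf <- workflow() |>
  add_formula(price ~ .) |>
  add_model(xgb_spec)

# fit(xgb_wf, data = my_data)
```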
@mxm8900 · 1 month ago
Wow, great video. I have nothing to do with text analysis, but I still watched the whole video.
@smomar · 1 month ago
All Hail the Dino! Now quickly get it some food, or else ... Thanks for the video. It was very informative.
@andreacierno4642 · 2 months ago
Thank you, Julia. Can this work if my version of 'type' has 5-8 categories, where the final output is "more like 'X'" for each category label 'X'? And is there a way to get more words in each prediction fold, so the final output could show, say, 3 words for each "more like"? Thank you again.
@JuliaSilge · 2 months ago
I recommend that you check out this chapter of my book with Emil Hvitfeldt: smltar.com/mlclassification#mlmulticlass
@andreacierno4642 · 2 months ago
@JuliaSilge Will do, and thank you.
@emredunder9108 · 2 months ago
You are the queen of data analysis. Thanks for the video!
@kevingiang · 2 months ago
Hi @JuliaSilge, thanks for your wonderful and helpful videos. I am trying to replicate your code with my own dataset, and I get the following error when I start tuning the model:

> xgb_rs <-
+   tune_race_anova(
+     object = xgb_wf,
+     resamples = dens_folds,
+     grid = 15,
+     control = control_race(verbose_elim = TRUE)
+   )
ℹ Evaluating against the initial 3 burn-in resamples.
i Creating pre-processing data to finalize unknown parameter: mtry
Error in `tune::tune_grid()`:
! Package install is required for xgboost.
Run `rlang::last_trace()` to see where the error occurred.

It says that a package install is required. Any idea what package may be missing? I installed the 'tune' package and it still gives me the same error. Any thoughts are appreciated. Thanks, Kevin
@JuliaSilge · 2 months ago
It's the xgboost package that needs to be installed: CRAN.R-project.org/package=xgboost
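In other words, a one-time install from the R console should clear that error (a minimal sketch; repos and versions left at their defaults):

```r
# Install the missing engine package once, then re-run the tuning code.
install.packages("xgboost")
```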
@kevingiang · 2 months ago
@JuliaSilge I just figured it out... thanks so much for answering back! You rock!
@gsonbiswas9765 · 2 months ago
Nice explanation. You could have used the searchK() function to show us how to select the range for K.
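For anyone curious what that would look like, here is a minimal sketch of using stm::searchK() to compare a range of K values. The data frame `df` with a `text` column is a hypothetical input, not the dataset from the video.

```r
# Sketch: diagnostics (held-out likelihood, residuals, semantic coherence)
# across several candidate numbers of topics.
library(stm)

processed <- textProcessor(df$text, metadata = df)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

k_search <- searchK(
  documents = prepped$documents,
  vocab     = prepped$vocab,
  K         = c(3, 4, 5, 6, 8, 10)
)

plot(k_search)   # compare the diagnostics across values of K
```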
@deltax7159 · 2 months ago
What appearance theme are you using here?
@JuliaSilge · 2 months ago
I use one of the themes from rsthemes: www.garrickadenbuie.com/project/rsthemes/ I think Oceanic Plus? There are lots of nice ones available in that package.
@danielhallriggins9008 · 3 months ago
Thanks Julia, love your videos! To get a more accurate sense of performance, would it be helpful to use {spatialsample} to account for spatial autocorrelation?
@JuliaSilge · 3 months ago
That would be a great thing to do! This dataset doesn't have explicitly spatial information in it (just county FIPS code) so you would need to join some spatial info together with the original dataset.
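Here is a rough sketch of how that join plus {spatialsample} could fit together. The data frame `model_df`, its `county_fips` column, and the use of {tigris} for county geometries are assumptions for illustration, not code from the video.

```r
# Sketch: join county geometries by FIPS, then build spatially grouped
# resamples instead of plain vfold_cv().
library(tigris)
library(spatialsample)
library(dplyr)
library(sf)

counties_sf <- counties(cb = TRUE, year = 2021) |>
  select(GEOID)

model_sf <- counties_sf |>
  inner_join(model_df, by = c("GEOID" = "county_fips"))

set.seed(123)
spatial_folds <- spatial_block_cv(model_sf, v = 10)

# spatial_folds can then be passed to fit_resamples() / tune_grid()
# in place of regular cross-validation folds.
```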
@AnkeetSingh-gt9fm · 3 months ago
Hey Julia, great tutorial. I had a question: here you used Subject_Matter as the only high-cardinality variable. If we have a dataset with multiple high-cardinality columns, can the recipe method be used in such a case for all of them?
@JuliaSilge · 3 months ago
Yes, you sure can! You will need to keep in mind how much data you have vs. how many predictors you are trying to encode in this way, and definitely keep in mind that you are using the **outcome** in your feature engineering. You can read more here: www.tmwr.org/categorical
@AnkeetSingh-gt9fm · 3 months ago
@JuliaSilge Great, I'll keep that in mind. Thank you!
@AnkeetSingh-gt9fm · 3 months ago
@JuliaSilge Hi, I had another question following on from my previous one. For each column, would we have to define a separate recipe? And when creating the workflow, how would you add the recipes for multiple columns (since a workflow only allows one recipe)? I was unable to find resources for this online. Any help would be appreciated!
@JuliaSilge · 3 months ago
@AnkeetSingh-gt9fm Oh, you don't need a separate recipe for different columns, just separate steps. So you could do `step_lencode_glm()` then pipe to another `step_lencode_glm()`, etc.
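A minimal sketch of that idea, with hypothetical column names (`subject_matter` and `agency` as the high-cardinality predictors, `rating` as the outcome, `train_data` as the training set):

```r
# Sketch: one recipe, two likelihood-encoding steps from the embed package.
library(tidymodels)
library(embed)

rec <- recipe(rating ~ subject_matter + agency + budget, data = train_data) |>
  step_lencode_glm(subject_matter, outcome = vars(rating)) |>
  step_lencode_glm(agency, outcome = vars(rating))

# prep(rec) |> bake(new_data = NULL) to inspect the encoded columns
```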
@AnkeetSingh-gt9fm · 3 months ago
@JuliaSilge Thank you, that's what I figured, and I ran the code. I received an error: "Error in dsy2dpo(C, Msym, from) : not a positive definite matrix (and positive semidefiniteness is not checked)", so it looks like I need to assess some variables in my model. You are very helpful with your prompt replies; I really appreciate it. Thank you!
@olexiypukhov-KT · 3 months ago
I always love your videos, Julia! I learn so much every time. Thank you for all the screencasts! Hopefully you haven't stopped, and I am looking forward to more!
@Ejnota · 4 months ago
With a left join you keep all the rows of the left table.
@cb5231 · 4 months ago
Thanks for this video, Julia <3
@omoniyitemitope6113 · 4 months ago
Hi, I have data with 35 variables and want to run some regressions (RF, xgboost, etc.) on it. I am new to R and want to know if you have any special online training that I can register for?
@JuliaSilge · 4 months ago
I recommend that you work through this: www.tidymodels.org/start/ And then take a look at this book: www.tmwr.org/ Good luck!
@omoniyitemitope6113 · 4 months ago
@JuliaSilge Thanks so much for your response. I followed one of your screencasts and got an R-squared of 0.37 for the RF model; is there anything I can do to improve the fit of my model?
@JuliaSilge · 4 months ago
@omoniyitemitope6113 This definitely depends on the specifics of your situation! I recommend that you check out a resource like *Tidy Modeling with R* for digging deeper on the model building process: www.tmwr.org/
@omoniyitemitope6113 · 4 months ago
@JuliaSilge Thanks for your response; I will go through it. I did something whose statistical implication I did not know: I took the log of my dependent variable and fit a random forest, and to my surprise the % variance explained was 99.74, which looks too good to be true to me.
@mohamedhany2513 · 4 months ago
Could you make a video explaining how to deploy a model with Shiny?
@JuliaSilge · 4 months ago
You may find this demo from Posit Solutions Engineering helpful: solutions.posit.co/gallery/bike_predict/
@zapbesttowatch2660 · 5 months ago
Good explanation.
@eileenmurphy7044 · 5 months ago
Thank you, Julia, for another excellent video. I have been trying to replicate your methods to generate a Canadian provincial map with the outlines of the provinces. For some reason map_data doesn't include provincial borders the way the US states data is set up. Do you have any ideas how I can use map_data to do this? I am using map_data("world", "Canada") to get the provinces.
@eileenmurphy7044 · 5 months ago
Hi Julia, I just answered my own question. The R package mapcan works like map_data, except it includes Canadian provincial borders.
@JuliaSilge · 5 months ago
@eileenmurphy7044 Ah, that is good to hear! 🙌
@nabereon · 5 months ago
This channel is pure gold.
@teorems · 5 months ago
Bitters are good for health!
@G2Mexpert · 5 months ago
After recently having hacked my way through a similar attempt to visualize data by state, this was like receiving "the answer sheet" from your teacher. Really appreciate the smart use of the usmap information, building the tibble to facilitate the join of state abbreviations to state (lowercase), and the ever-useful window functions in dplyr!
@Y45HV1N · 6 months ago
(I'm very very new to all this) Getting the prior and the empirical/posterior from the same data seems counterintuitive to me and a bit like confirmation bias.
@JuliaSilge · 6 months ago
If you'd like some more conceptual background and theory behind this, I recommend the writing that Bradley Efron has done on it, the 1985 paper by Casella, and, for a more practical approach, the book _Introduction to Empirical Bayes: Examples from Baseball Statistics_ by my collaborator David Robinson.
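On the "prior from the same data" question, here is a minimal sketch of the usual beta-binomial empirical Bayes recipe in the spirit of those references (not code from the video): the prior is estimated from the pooled behaviour of all groups, and each individual group is then shrunk toward it. The toy `hits`/`n` data are made up for illustration.

```r
# Sketch: method-of-moments beta prior, then shrunken per-group rates.
library(dplyr)

df <- tibble::tibble(
  group = c("a", "b", "c", "d"),
  hits  = c(3, 40, 7, 120),
  n     = c(10, 100, 15, 400)
)

# 1. Estimate the prior from all groups together.
rate   <- df$hits / df$n
m      <- mean(rate)
v      <- var(rate)
alpha0 <- m * (m * (1 - m) / v - 1)
beta0  <- (1 - m) * (m * (1 - m) / v - 1)

# 2. Shrink each group's raw rate toward that shared prior.
df |>
  mutate(eb_rate = (hits + alpha0) / (n + alpha0 + beta0))
```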
@CaribouDataScience · 6 months ago
Thanks, another interesting video.
@angvl8793 · 6 months ago
We see Julia, we click Like!! :)
@KK-tt5jz · 6 months ago
Brilliant work, thank you Julia!
@rayflyers · 6 months ago
I've been watching your text mining videos recently. I'm about to start a project mining case notes for info on clients' relatives. I'm hoping to build a model that can predict if a case note contains that info or not. Any tips would be appreciated!
@JuliaSilge · 6 months ago
That sounds like it may be doable! Overall, I recommend this book I wrote with Emil for advice on predictive modeling for text data: smltar.com/
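A minimal sketch of the kind of binary text-classification pipeline described in that book; `case_notes`, with a `text` column and a `has_relative_info` label, is a hypothetical dataset standing in for the real case notes.

```r
# Sketch: tokenize, filter, tf-idf, then a regularized logistic regression.
library(tidymodels)
library(textrecipes)

set.seed(123)
split <- initial_split(case_notes, strata = has_relative_info)
train <- training(split)

rec <- recipe(has_relative_info ~ text, data = train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 1000) |>
  step_tfidf(text)

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg(penalty = 0.01) |> set_engine("glmnet"))

# fit(wf, data = train), then evaluate on resamples and the test set
```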
@elvinceager · 6 months ago
Yes, a new video! It's been a while. Love this content.
@trevorschrotz · 6 months ago
Excellent, thank you.
@seyisonade1194 · 6 months ago
As usual, brilliant work from a brilliant data scientist...
@forheuristiclifeksh7836 · 6 months ago
0:08
@wapsyed · 6 months ago
Your videos are therapeutic haha
@mocabeentrill · 6 months ago
Clearly explained and straight to the point! Thank you, Julia.
@trevorschrotz · 6 months ago
Thanks for the tip on using type.predict = "response" in the broom::augment function. I learn something new from each of your videos, so thanks for the work that you put into making these.
@ColonelHathi · 6 months ago
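A minimal sketch of that tip, using a made-up logistic regression on mtcars rather than the model from the video: with `type.predict = "response"` the fitted values come back as probabilities instead of values on the link (log-odds) scale.

```r
library(broom)

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

augment(fit)                              # .fitted on the log-odds scale
augment(fit, type.predict = "response")   # .fitted as probabilities
```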
@ColonelHathi · 6 months ago
YES! This is evidence supporting my hypothesis that Jodie Whittaker is a great actress, but Chris Chibnall is a terrible writer 😅. It isn't conclusive obviously. We would need observations from when they worked separately to get better proof.
@reshmilb2527 · 6 months ago
Please avoid a black background colour; use an eyesight-friendly colour.
@Jakan-sf3xj · 7 months ago
Thank you for the great video. I have one question: assuming the best model was one of the tuned random forest models, how would we extract the parsnip object to see the tuned hyperparameters, i.e. mtry and min_n?
@JuliaSilge · 7 months ago
You might check out the different "extract" functions in tidymodels. You can do `extract_fit_parsnip()` but you can also do `extract_parameter_set_dials()` to get the hyperparameters directly: hardhat.tidymodels.org/reference/hardhat-extract.html
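A minimal sketch of one common way to get at the winning hyperparameters; `rf_res`, `rf_wf`, and `train` are assumed to exist from earlier tuning code, and "roc_auc" is just an example metric.

```r
library(tidymodels)

best_params <- select_best(rf_res, metric = "roc_auc")
best_params                      # a tibble with the chosen mtry and min_n

final_fit <- rf_wf |>
  finalize_workflow(best_params) |>
  fit(data = train)

extract_fit_parsnip(final_fit)   # fitted parsnip model with mtry/min_n filled in
```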
@user-sb9oc3bm7u · 7 months ago
Hey Julia. Writing to you here although it's not related to this specific video. I am using tidylo::bind_log_odds for a project, but the fact that the order of set/feature is different from tidytext::bind_tf_idf (which takes term and then document) makes it hard to encode generically for Shiny (with f <- type_of_algorithm; f(docs/sets, terms/fts, n = n), for example). Any chance you would change the tidylo argument order? Obviously it can be done with a simple if () {} else {}, but it is much cleaner to use the `f <- select_algo` approach :)
@JuliaSilge · 7 months ago
Can you open an issue over at tidylo with an example/reprex showing what you mean? github.com/juliasilge/tidylo/issues
@Jackeeba · 7 months ago
The extra brackets from Copilot can be really annoying! When I turn it on, I've started using Enter for RStudio's autocomplete (Tab just takes Copilot's often-incorrect suggestion). Does anyone have a better solution?
@JuliaSilge · 7 months ago
Looks like the RStudio team is tracking the extra parentheses here: github.com/rstudio/rstudio/issues/13953 Feel free to thumbs up or add additional detail!
@manueltiburtini6528 · 7 months ago
Amazing analysis! :O
@dantshisungu395 · 7 months ago
Great episode, as always! Just a little whim from my side: would you mind doing videos about real-life TDA applications? And how can one use Spark with models that aren't in parsnip? Thank you 😅
@matthewcarter1624 · 7 months ago
Hey Julia, thanks for your video! I really like these. I had one question about the std_var per writer: why did you divide it by the number of observations?
@ariskoitsanos607 · 7 months ago
Hey, it's because the standard error of the average is sigma/sqrt(n), so the variance of the average would be sigma^2/n.
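A quick numerical check of that statement, for anyone who wants to see it (an illustrative simulation, not code from the video):

```r
# The variance of a sample mean shrinks as sigma^2 / n.
set.seed(123)
sigma <- 2
n <- 25

means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

var(means)      # simulated variance of the average
sigma^2 / n     # theoretical value: 4 / 25 = 0.16
```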
@soylentpink7845 · 7 months ago
Interesting! Could you do more practical applications of Bayesian statistics? I see it asked about and required more and more by bigger tech companies.
@AlexLabuda · 7 months ago
So fun! Thanks for the video
@wilrivera2987 · 7 months ago
Big fan of Dr Who too
@nosinz753 · 7 months ago
Adding this to my EDA toolbox.
@sr4823 · 7 months ago
Sometimes copilot feels more like a burden than a companion. Thanks Julia, I always learn a ton with these videos.
@manueltiburtini6528 · 7 months ago
Amazing work! You're so inspiring; thanks for sharing!
@JorgeThomasM · 7 months ago
Hi @JuliaSilge! Would volume = height * width * depth be a sort of interaction / new variable? Thanks so much for all these wonderful sessions.
@JuliaSilge · 7 months ago
Yeah, for sure! We'd call that "feature engineering" because you are creating a custom feature from the original variables based on your domain knowledge of how furniture works. 😄
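A minimal sketch of adding that engineered feature inside a recipe; `height`, `width`, `depth`, `price`, and `furniture_train` are hypothetical names, not the exact ones from the video.

```r
library(tidymodels)

rec <- recipe(price ~ height + width + depth, data = furniture_train) |>
  step_mutate(volume = height * width * depth)

# prep(rec) |> bake(new_data = NULL) to see the new `volume` column
```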
@gkuleck · 8 months ago
Hi Julia, nice video on a topic that I find intrinsically interesting as a baseball AND tidymodels fan. I did run into an error when executing tune_race_anova():

Error in `test_parameters_gls()`:
! There were no valid metrics for the ANOVA model.

I am not sure how to fix this, and I have been careful to follow the scripts. Any idea what might be causing the error?
@JuliaSilge · 8 months ago
When you see an error like that, it usually means your models are not able to fit/train. If you ever run into trouble with a workflow set or racing method like this, I recommend trying to just plain _fit_ the workflow on your training data one time, or use plain old `tune_grid()`. You will likely get a better understanding of where the problems are cropping up.
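A minimal sketch of that debugging advice; `xgb_wf`, `train`, and `folds` are assumed to exist from the earlier tuning code.

```r
library(tidymodels)

# 1. Fit the workflow once on the training data; a real fitting problem
#    surfaces here as a plain error instead of being hidden by racing.
single_fit <- fit(xgb_wf, data = train)

# 2. Or run an ordinary grid search, whose notes/errors are easier to inspect.
plain_res <- tune_grid(xgb_wf, resamples = folds, grid = 5)
collect_notes(plain_res)
```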
@EsinaViwn9 · 8 months ago
Dear Julia, I was trying to find an example of tidymodels usage for time series forecasting (I want to create my own pipeline that can be used for any time series data as a first step: just run it and see what happens). I am interested in the cross-validation options for time series that tidymodels offers (probably it inherits something from caret). Do you have any videos on that (I failed to find an appropriate one)? Maybe you can share some links with code examples? I expect you have touched on this previously when demonstrating the power of tidymodels.

Another problem I encountered was constructing forecasts for future dates with a model that uses lagged dependent variables as predictors: is there a routine in tidymodels, or somewhere else, that automatically generates the appropriate data for future predictions? (Say, if I want to predict for t+2, my lagged dependent variable will be the prediction for t+1.) I had to code my own manipulations with a loop (create a new row, fill in the values of the necessary variables, then pass it to predict(model, newdata = this_new_row)); I am sure there are better solutions, maybe you are familiar with one. Part of the problem was that among my variables I also had an indicator for whether the day is a weekend, created with a dplyr pipeline. Is there a way to tell tidymodels "look at this pipeline, this is how I create all the predictors for my model, please use this pipeline for predict()"?
@JuliaSilge · 8 months ago
I have 2 suggestions for things to check out:
- The first is the modeltime package: business-science.github.io/modeltime/
- The second is the time-based resampling approaches in rsample: rsample.tidymodels.org/reference/slide-resampling.html
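A minimal sketch of the second suggestion, time-based resampling with rsample; `daily_sales` with a `date` column is a hypothetical daily series, not data from any of the videos.

```r
# Sketch: expanding-window resamples where each assessment set is the
# month that follows the analysis window.
library(rsample)
library(dplyr)

daily_sales <- tibble(
  date  = seq(as.Date("2022-01-01"), as.Date("2023-12-31"), by = "day"),
  sales = rnorm(730, mean = 100, sd = 10)
)

ts_folds <- sliding_period(
  daily_sales,
  index       = date,
  period      = "month",
  lookback    = Inf,   # use all available history for each analysis set
  assess_stop = 1      # assess on the next month
)

ts_folds   # pass to fit_resamples() / tune_grid() like any other rset
```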