
Topic modeling with R and tidy data principles 

Julia Silge
15K subscribers
62K views

Watch along as I demonstrate how to train a topic model in R using the tidytext and stm packages on a collection of Sherlock Holmes stories. In this video, I'm working in IBM Cloud's Data Science Experience environment.
See the code on my blog here: juliasilge.com/blog/sherlock-...
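A minimal sketch of the workflow shown in the video, assuming the gutenbergr, tidytext, quanteda, and stm packages; the exact code lives in the blog post linked above, so treat details such as the Gutenberg ID and the choice of K = 6 as assumptions to check there.

library(gutenbergr)
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)
library(quanteda)
library(stm)

# Download "The Adventures of Sherlock Holmes" (Project Gutenberg ID 1661)
sherlock_raw <- gutenberg_download(1661)

# Label each line with the story it belongs to, then drop the collection title
sherlock <- sherlock_raw %>%
  mutate(story = if_else(str_detect(text, "ADVENTURE"), text, NA_character_)) %>%
  fill(story) %>%
  filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
  mutate(story = factor(story, levels = unique(story)))

# Tokenize, remove stop words, and drop the name "holmes" itself
tidy_sherlock <- sherlock %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "holmes")

# Cast to a document-feature matrix (one document per story) and fit an STM
sherlock_dfm <- tidy_sherlock %>%
  count(story, word, sort = TRUE) %>%
  cast_dfm(story, word, n)

topic_model <- stm(sherlock_dfm, K = 6, init.type = "Spectral", verbose = FALSE)

# Tidy the word-topic (beta) probabilities for inspection or plotting
td_beta <- tidy(topic_model)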

Science

Published: 7 Aug 2024

Comments: 93
@lightspd714 · 5 years ago
Julia you are a great teacher. I love your text mining with R book but it is nice to see the concepts come to life in video.
@RosieOutdoors · 4 years ago
Thank you so much for this video. As a complete newcomer to R and topic modelling, this was so well explained.
@djcfb2889 · 3 years ago
Wow! This is probably the best R tutorial I've seen like forever!
@gabriellakountourides6726 · 3 years ago
You directed me to topic modeling after I asked a Q on stackoverflow, thank you so much! Thank you for this amazing, amazing resource!
@learningstuffs5718 · 4 years ago
I am learning R and just got past the basics, and I am trying to implement it in projects. Your channel is a fantastic place for people like me to learn, please keep teaching. Thank you.
@happylearning-gp · 1 year ago
Excellent contribution, so fast, very clear, error-free, well explained
@robertc2121 · 6 years ago
Julia, this is amazing. Love your book, and I had been tempted by DataCamp for months before finally signing up because of your course. What a help they both have been. Thank you!!
@JuliaSilge · 6 years ago
HA you are so welcome! I'm really glad these resources are helpful. 👍
@Mrsandis89 · 3 years ago
Julia, you're an angel. I have to do my dissertation with STM, and, thanks to you, I can literally complete it in the 2 weeks before the April 4th deadline.
@edutimqiu1168 · 3 years ago
Amazing work, incredibly helpful. All the best!
@jianzhang9157 · 4 years ago
I really like your Introduction! It's great.
@Dawgs10100 · 2 years ago
Thank you for this great video. I hope there are more to come! :)
5 years ago
That was an awesome teaching, thanks so much!
@jadesweeney1690 · 3 years ago
This was so helpful to me during my research placement on tidytext data mining, thank you!
@XJRULO · 4 years ago
I took one or two of your DataCamp courses, but making this available with no fees is remarkable and generous work, thanks a lot!!!
@hesamseraj · 1 year ago
This is again very helpful. I hope you keep sharing more videos on any new topic that interests you.
@mxm8900 · 2 months ago
Wow great video. I have nothing to do with text analysis, but I still watched the whole video
@kaswin6527 · 6 years ago
Most fabulous explanation I have ever seen... Thank you sooooo much
@toshiyukihasumi825 · 2 years ago
Thank you so much for your video. It's the ONLY tutorial I've found that talks about STM! Please keep them coming; I truly appreciate your video!
@JuliaSilge · 2 years ago
I've also got this blog/screencast that demonstrates how to use STM: juliasilge.com/blog/spice-girls/
@swazy1777 · 4 years ago
You are an amazing teacher!
@DanTaninecz · 6 years ago
Great work. Very clear video. This type of solid instruction is all too rare in data science. Generally this type of stuff is just dumped on the user.
@prabhacar · 2 years ago
thanks for such a nice explanation. loved the demo!
@RajatSrivatava · 6 years ago
Hi ma'am, your presentation and teaching skills are so good. Thanks so much.
@stewartli5395 · 6 years ago
great insights in a tidy way. like it very much. thanks.
@morzaq123 · 6 years ago
Amazing Video. Looking Forward to more videos on Text Mining
@samuelholt7775 · 4 years ago
Please do more! This was a brilliant introduction with perfect pace, I learned so much in less than 30 min! Hopefully this tip helps you as much as this demonstration helped me: Ctrl+Shift+M (or Cmd+Shift+M) is a handy shortcut for the dplyr pipe. Thank me later ;)
@donataamato3418 · 2 months ago
THANK YOU so much!!!
@2108966 · 6 years ago
Julia you are amazing!!! Thanks!!!
@englianhu · 6 years ago
I used quanteda for my professional certificate a few years ago. The tidytext and stm packages that you introduce will be more suitable for natural language processing. 😉
@lrschm · 3 years ago
Awesome video - super helpful! :)
@GustavoMontanha · 4 years ago
thanks julia, loved it
@odhiambogigs2829 · 5 years ago
nice work....this was very helpful
@Mrsandis89 · 3 years ago
And of course, I’ve read your work. You’re brilliant.
@TerezaS · 4 years ago
Thank you so much for this video! And I love your book :)) If you consider doing more videos, I would love aspect-based sentiment analysis as a topic :))))
@entrepreneuriatrecherchesetcon
@Tereza S have a look at my video on sentiment analysis on many documents: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-rU97L9Tu7Dg.html
@biaoyang6207 · 5 years ago
Great! Thanks for sharing!
@avijitnandy6662 · 6 years ago
Ma'am, we need more videos like this.
@janidelemmanuelcastaneda8318 · 4 years ago
Awesome content
@paulmm6878 · 3 years ago
I love your videos 😃 greetings from Ecuador ✌️
@terraflops · 3 years ago
****PLEASE ZOOM IN**** in the future, please! I _love_ this, thank you so much!
@entrepreneuriatrecherchesetcon
Nice presentation. I suggest increasing the font size via Tools > Global Options > Appearance and choosing, for instance, 16 or 18. The code will be clearer.
@celloharper · 6 years ago
Thanks for the video. Please post more. How does one find your blog?
@vikrantnag86 · 4 years ago
Thank you Julia. Can you please share some knowledge on how to do sentiment analysis in R? It would be very helpful.
@vm2321 · 3 years ago
She's written a book about it, bro lol. Here's the link: www.tidytextmining.com/
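For the sentiment analysis question above, a minimal sketch with tidytext, assuming the Bing lexicon and a tokenized data frame like the tidy_sherlock object sketched under the video description:

library(dplyr)
library(tidyr)
library(tidytext)

# Join tokenized words to the Bing lexicon, then tally sentiment by story
sherlock_sentiment <- tidy_sherlock %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(story, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)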
@renatacavalcanti8297 · 4 years ago
A more than perfect video
@bbbbraveheart · 2 years ago
thank you so much~~~~
@dianaszabo3875 · 3 years ago
Thank you :)
@emilierademakers70 · 6 years ago
Hi Julia, thanks for sharing this tutorial! It was exactly what I needed. I am working on recovering latent dimensions in job descriptions and I am using topic modelling in R to gain insight. I have two questions.

1. I first started working on my data using Text Mining with R and got acquainted with the LDA methods. I see there are similarities with the stm package; however, the documentation states that without covariates (which is what I am doing at the moment), STM reduces to a logistic-normal topic model, often called the Correlated Topic Model. What would you say are the main differences between CTM and LDA? And apart from it being fast (indeed!), what would you say is the main motivation for using the stm package (with spectral initialization)?

2. Would you recommend first filtering out synonyms using e.g. the wordnet package in R? Or should the co-occurrence of these words with other words in documents solve this more or less?

Many many thanks! Emilie
@JuliaSilge · 6 years ago
I don't think you need to filter out synonyms before implementing topic modeling, because that is one of the things topic modeling is doing during the modeling process: finding the latent topics. Relatedly, you might even want to consider whether stemming is useful for your domain space: transacl.org/ojs/index.php/tacl/article/view/868

I have had consistent, excellent results with STM, which is one of the reasons I recommend it to folks. LDA models are based on the Dirichlet distribution (if you draw a sample from a Dirichlet distribution, you get a positive vector that sums to one); these models put priors over topics/words, and then you solve for the (approximate) posterior. CTM is a different approach, which models that one topic can be correlated with another (LDA assumes they are independent). Instead of the Dirichlet, it uses the logistic normal distribution, as I understand it. If you want to read the original paper for CTM, it is here: arxiv.org/pdf/0708.3601.pdf

As far as spectral initialization, it is a good place to start and nice for getting quick and reasonable results. If I need something very robust, then I do all the work that is laid out in the stm package vignette. I am working on some tidy tooling around that, and hope to get it out sometime soon!
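A minimal sketch of the kind of model checking the stm vignette walks through, assuming the sherlock_dfm object from the sketch under the description; searchK() refits the model at several values of K and reports diagnostics such as held-out likelihood and semantic coherence:

library(stm)
library(quanteda)

# Convert the dfm to stm's native format once, then reuse it
sherlock_stm <- convert(sherlock_dfm, to = "stm")

# Compare several candidate numbers of topics
k_result <- searchK(documents = sherlock_stm$documents,
                    vocab = sherlock_stm$vocab,
                    K = c(4, 6, 8, 10),
                    init.type = "Spectral")

plot(k_result)  # held-out likelihood, residuals, semantic coherence, lower bound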
@shilpasuresh641 · 4 years ago
How do you text mine a lot of URLs stored in a CSV file? Or, in other words, how do you do topic modeling on them?
@botswithabeat · 6 years ago
Great video! I am hoping to do some topic modeling on some 19th-century German texts with your approach. I am still unsure how I will import German stop words, but I will do some digging. One critique: it is difficult to type along while you are talking, especially when you are entering things into the console so quickly. Maybe slow down by 5%. Thanks a lot for the great website and video.
@botswithabeat · 6 years ago
Thanks a lot for the quick reply and very useful info!
@hkia7893 · 3 years ago
You can reduce the playback speed
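For the German stop words question above, a minimal sketch assuming the Snowball lexicon that tidytext's get_stopwords() exposes (it wraps the stopwords package); german_text is a hypothetical tibble with a text column:

library(dplyr)
library(tidytext)

german_stop_words <- get_stopwords(language = "de", source = "snowball")

tidy_german <- german_text %>%
  unnest_tokens(word, text) %>%
  anti_join(german_stop_words, by = "word")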
@delando983 · 5 years ago
Nice video!! I am getting an error, not sure if it's me... more likely it is :(

sherlock_tf_idf %>%
  mutate(word = reorder(word, tf_idf, story)) %>%
  ggplot(aes(word, tf_idf, fill = story)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ story, scales = "free", ncol = 3) +
  scale_x_reordered() +
  coord_flip() +
  theme(strip.text = element_text(size = 11)) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Sherlock Holmes short stories",
       subtitle = "Individual stories focus on different characters and narrative elements")

Error in mutate_impl(.data, dots) :
  Evaluation error: object 'FUN' of mode 'function' was not found.
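Regarding the error above: reorder() treats its third argument as a summary function (FUN), so passing story there fails. Since the plot already calls scale_x_reordered(), the intended helper is probably tidytext's reorder_within(); a sketch of the likely fix, assuming sherlock_tf_idf as built in the blog post:

library(dplyr)
library(ggplot2)
library(tidytext)

sherlock_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, story)) %>%  # reorder words within each story
  ggplot(aes(word, tf_idf, fill = story)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ story, scales = "free", ncol = 3) +
  scale_x_reordered() +  # strips the within-facet suffix from axis labels
  coord_flip()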
@pe66o · 5 years ago
Dear Julia, how can I create a topic model when I have a dataset as follows: column 1 is the word, column 2 is the frequency of the word in the texts, column 3 is the main class, and column 4 is the subclass? The topics should be the classes and the subclasses. I have already made a dictionary with the classes and subclasses. Thank you
@jacobbonsell4776 · 6 years ago
Is there a way to get the frequency counts next to the betas in the topic-word distribution? I wanted to either use mutate or join somehow but I don't know where to retrieve the counts.
@jacobbonsell4776 · 6 years ago
Thank you
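For the question above about putting raw counts next to the betas, a minimal sketch assuming the tidy_sherlock and topic_model objects from the earlier sketches; the counts come from the tokenized data, not from the model itself:

library(dplyr)
library(tidytext)

# Total count of each word across the corpus
word_counts <- tidy_sherlock %>%
  count(word, name = "n_total")

# Word-topic probabilities from the fitted model: topic, term, beta
td_beta <- tidy(topic_model)

td_beta_with_counts <- td_beta %>%
  left_join(word_counts, by = c("term" = "word"))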
@sonabaghdasaryan1198 · 6 years ago
Hi, an amazing video. But I still have a problem from the very beginning: I get an error while downloading gutenbergr. Error: No package with the name gutengergr. Which RStudio version do you use in this video? Thanks in advance ^^
@sonabaghdasaryan1198 · 6 years ago
Everything is fine, thx. After restarting my computer my code is running ^^ Julia, u r great ^^ You inspired me to do TM.
@davidizquierdogomez · 5 years ago
Hello Julia... very nice video, thanks a lot. I have a question... in my network graph of bigrams, I get nodes without names... does it mean that I haven't cleaned the white spaces properly? Thank you very much.
@davidizquierdogomez · 5 years ago
Thanks for the response... I double-checked and it is not a problem related to white spaces. I wrote code to get an igraph of bigrams and I get bigrams which are alone in two-node associations. Instead of a bigram, there is a number on the empty node...
@abdulrahmanabdulkadri4825 · 4 years ago
This is great and very helpful! I would like to ask, how might we know which documents fall under which topic? Might there also be a data visualization for this? We only see how many documents fall under which topic, but not specifically which document.
@JuliaSilge · 4 years ago
Yes, check out the topic modeling section of the workshop I taught at rstudio::conf this year: github.com/rstudio-conf-2020/text-mining
@abdulrahmanabdulkadri4825 · 4 years ago
@@JuliaSilge Amazing! Thank you very much!
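A minimal sketch of the gamma (document-topic) side discussed above, assuming the topic_model and sherlock_dfm objects from the earlier sketches; the highest-probability topic per story answers "which document falls under which topic":

library(dplyr)
library(tidytext)
library(quanteda)

td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = docnames(sherlock_dfm))

# The most probable topic for each story
top_topic_per_story <- td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  arrange(topic)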
@knowledgeispower7007 · 3 years ago
Thank you so much for this video. I'm very new to R and to STM. I'm working on a paper and trying to analyze press releases to formulate my hypotheses and find relevant topics. The press releases are stored in a Word document. Could you please help/guide me on where to start and how to go about this? I'm trying to find latent variables and I heard that STM is a great modeling approach to use for this purpose. I appreciate your help 🙏
@JuliaSilge · 3 years ago
The first thing you need to do is read the Word files into R, because Word files are a special format that requires specific handling. One package I like for dealing with Word and other Office files is officer: davidgohel.github.io/officer/ You can look at some of the other options folks use here: stackoverflow.com/questions/50439684/how-to-extract-plain-text-from-docx-file-using-r
@knowledgeispower7007 · 3 years ago
@@JuliaSilge thank you so much for your prompt response and for the resources you provided 🙏 I will definitely try them
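A minimal sketch of the officer route suggested above; docx_summary() flattens a Word document into a data frame with one row per element, from which the paragraph text can be pulled (the file name is hypothetical, and the exact columns are worth checking against the officer docs):

library(officer)
library(dplyr)

doc <- read_docx("press_releases.docx")  # hypothetical file name

press_text <- docx_summary(doc) %>%
  filter(content_type == "paragraph", nzchar(text)) %>%
  select(doc_index, text)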
@Yi-cu7ie · 4 years ago
Hi, thank you for your video, which helps me a lot. I have a question. I have raw text in PDF and Word form; how can I transform it into a data frame like sherlock_raw and sherlock in the program? Thank you so much for your time and consideration!!!
@JuliaSilge · 4 years ago
For PDFs, my favorite tool for reading text into R is the pdftools package: docs.ropensci.org/pdftools/ I have less experience reading in .docx files, but I have occasionally used the textreadr package: github.com/trinker/textreadr Good luck!
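A minimal sketch of the pdftools route mentioned above; pdf_text() returns one character string per page, which can then go into a tibble shaped like sherlock_raw (the file name is hypothetical):

library(pdftools)
library(tibble)
library(dplyr)
library(tidytext)

pages <- pdf_text("my_document.pdf")  # one element per page

pdf_df <- tibble(page = seq_along(pages), text = pages)

tidy_pdf <- pdf_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")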
@ilCapotasto · 6 years ago
cast_dfm has been moved from quanteda to tidytext, correct?
@justinwallace1304 · 5 years ago
Ol
@dr.tarunsengupta6248 · 1 year ago
The gutenbergr package is not available in the new version of R. Please change the code accordingly so that the analysis can be done from any text or PDF document.
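If gutenbergr will not install, a minimal sketch of building the same kind of one-line-per-row data frame from a local plain-text file instead (the file name is hypothetical; a PDF version is sketched a few comments up):

library(tibble)
library(dplyr)
library(tidytext)

lines <- readLines("sherlock.txt")  # any plain-text file on disk

text_df <- tibble(line = seq_along(lines), text = lines)

tidy_text <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")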
@bistanz · 3 years ago
Thanks for the video! One small question. Don't you need sherlock %>% filter(!is.na(story)) to remove all NA rows?
@JuliaSilge · 3 years ago
It's been a while since I looked at this, but I don't believe there are any NA rows, at least as of how the data was formatted when I originally created this video/post back in 2018. You can see that in the tf-idf plot, no NA story facet: juliasilge.com/blog/sherlock-holmes-stm/
@bistanz · 3 years ago
@@JuliaSilge Thanks for replying. Don't we select only the top 10 words in each document to plot tf-idf? Oh! Apparently NA is not that frequent. You are right, we may not need to remove NAs. Thanks again for the amazing material.
@user-ro9ex5im2p · 1 year ago
Great
@PatriciaRiosblog · 5 years ago
Hi Julia, would stm work nowadays for Twitter or Facebook content? Thanks
@JuliaSilge · 5 years ago
Yep! This example shows using stm for topic modeling with long documents (books) but this approach also works with shorter documents. If you want to see an example of this, I have a blog here implementing topic modeling with Hacker News posts: juliasilge.com/blog/evaluating-stm/
@puspa_indah · 5 years ago
How do you calculate theta and beta in structural topic modeling manually? Does anyone know the formula or concept?
@puspa_indah · 5 years ago
@Julia Silge yes, I've already checked that paper but I can't find specific information related to the formula I mentioned. Is the algorithm for estimating the theta and beta matrices similar across topic modeling methods (i.e. LDA, CTM, STM, etc.)? Thanks for the previous reply btw :)
@hkia7893 · 3 years ago
Thanks Julia for this interesting implementation of topic modelling. So in the end we get 6 topics with the probabilities of 7 words each, and we do not know which story belongs to which topic.... 🤔
@JuliaSilge · 3 years ago
If you look at the gamma probabilities, you can see how the stories are related to topics. Check out the plot "Distribution of document probabilities for each topic" here: juliasilge.com/blog/sherlock-holmes-stm/
@hkia7893 · 3 years ago
@@JuliaSilge thanks, I'm gonna check that out...
@Jaji1948 · 2 years ago
Resolution too low. Can’t read the screen. Can you send me a link to a higher res version?
@srisreshtan1471 · 4 years ago
When I am trying to install the 'Guttenberger' package, I am getting a message package ‘guttenberger’ is not available (for R version 3.6.3)
@JuliaSilge · 4 years ago
I think you're dealing with some typos there; there's just one "t" and no "e" at the end: cran.r-project.org/package=gutenbergr
@srisreshtan1471 · 4 years ago
Yes. My mistake. Apologies. Thanks for correcting it.
@dinohadjiyannis3225 · 1 month ago
Julia, if I'm using a topic model on RU-vid comments to determine which video best explains topic modeling, how can I decide if your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?
@JuliaSilge · 1 month ago
HAHA I can't tell if this is serious or not 🙈 In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
@dinohadjiyannis3225 · 1 month ago
@@JuliaSilge If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which gives you the top word probabilities), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But wait... then, if I manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it. Does this make sense, or am I misinterpreting gamma/beta? ***Assume I have concatenated all comments into one corpus per video; each corpus is linked to a video ID.
@JuliaSilge · 1 month ago
@@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍
@dinohadjiyannis3225 · 1 month ago
@@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇
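A minimal sketch of the comparison described in this thread, assuming a fitted STM called comment_model and a lookup table comment_videos with document and video_id columns (all hypothetical names); the idea is to average each video's gamma for the topic of interest:

library(dplyr)
library(tidytext)

td_gamma <- tidy(comment_model, matrix = "gamma")  # columns: document, topic, gamma

video_topic_share <- td_gamma %>%
  filter(topic == 3) %>%  # hypothetical: the topic that looks like "topic modeling"
  left_join(comment_videos, by = "document") %>%
  group_by(video_id) %>%
  summarise(mean_gamma = mean(gamma), n_comments = n()) %>%
  arrange(desc(mean_gamma))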