
Text analysis in R. Demo 1: Corpus statistics 

Kasper Welbers
2.4K subscribers · 19K views

This demo is part of a short series of videos on text analysis in R, developed mainly for R introduction workshops.
A more detailed tutorial for the code discussed here can be found on our R course material GitHub page:
github.com/ccs-amsterdam/r-co...

Published: 5 Oct 2020

Comments: 51
@zbear404 · 1 year ago
I'm literally crying at how well you compile and explain text analysis. Thank you!
@data_kom · 1 year ago
Thanks for making such a great video. I have been reading about text analysis, and your video is the first I found easy to practise along with on YouTube. The explanation of what every code chunk means and does was the magic for me. Thanks again.
@davidrogerdat · 1 year ago
Thank you for this!! Gracias por esto!!
@user-ro9ex5im2p · 11 months ago
This is amazing. Thank you
@jiaqianhe6870 · 1 year ago
This is sooooo useful!!!! AWESOME!!!!
@doctortito6696 · 1 year ago
Hi! Thank you so much for your great help! This is exactly what I was looking for. Many of the lines of code are deprecated though :( Can someone teach me how to update them for the latest R version? Thanks so much!!
@kasperwelbers · 1 year ago
Hi, nice to hear that it's of help to you! Regarding the deprecated code, the link in the description with the tutorial is (should be?) up to date. More generally, we maintain our teaching material on this GitHub page: github.com/ccs-amsterdam/r-course-material
@learning.data.science · 1 year ago
Thank you for the informative text analysis videos. I am just a beginner in text analysis and R, and I started with your videos. I have a question about 12:13: kwic() needs tokens(), so I applied tokens first.
@kasperwelbers · 1 year ago
Yes, you're correct. The quanteda API has seen some changes since this video was recorded. You can still pass a corpus directly to kwic(), but it will now throw a warning that this is deprecated. That means it still works for the moment, but at some point in the (near) future it will be mandatory to tokenize a corpus before using kwic().
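
A minimal sketch of the now-required order of operations, using quanteda's built-in inaugural-speeches corpus (the pattern "america*" is just an illustrative choice):

library(quanteda)

# tokenize first, then run the keyword-in-context search
toks <- tokens(data_corpus_inaugural)
kwic(toks, pattern = "america*", window = 5)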
@abhipsatripathy3934 · 3 years ago
Hi, thank you for the video, it's really helpful. Just a doubt: what if we have to use a text file consisting of stories or something descriptive? The file is not in CSV format; it has only paragraphs, with no columns or rows. How do we handle such a data file?
@kasperwelbers · 3 years ago
It is possible, but depending on how the file is structured (and mainly, whether it's structured consistently) it might take a bit more work. If the file does not use a common structure, you'd have to tell R how to split the text into documents (rows) and their contents (columns). So you'd read the file into R as a single text (read_file from the readr package works), then use string operations to split the text into documents, and then split the contents of each document. The bad news is that you might have to learn a bit about regex to do this nicely. The good news is that this is a really useful skill to have. I'd recommend looking into the stringr package, in particular str_split and regular expressions. This link might give a good idea: alexluscombe.ca/blog/parsing-your-.pdfs-in-r/ (it starts with a PDF, but that's only about reading the text into R; splitting the text into parts is the same deal).
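
A minimal sketch of that workflow, assuming a hypothetical file stories.txt in which documents are separated by blank lines (the split pattern must be adapted to the file's actual structure):

library(readr)
library(stringr)
library(quanteda)

full_text <- read_file("stories.txt")             # the whole file as one string

# split into documents on blank lines; [[1]] unwraps the result list
docs <- str_split(full_text, pattern = "\\n\\s*\\n")[[1]]

corp <- corpus(docs)                              # one document per story
summary(corp)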
@toddcoode9598 · 3 years ago
Awesome video, Kasper! Thx a lot! I have some questions about the comparison of PDF files. I want to compare German party manifestos over time, which I have already gathered (for EP and for regional elections). I want to compare their Europe salience (how often the term europ* occurs, and whether it occurs more often when elections are combined: EP 2014, Regional 2015 vs EP + Regional 2019), and I cannot run the code like in the is_obama example. I'd need something like "is_CDU = docvars(dtm)$CDU == 'europ*'", if you know what I mean? I think comparing these manifestos should not take that much time/code, but I am somewhat hopeless and I hope you can help me. U got me as a sub for life! :D
@kasperwelbers · 3 years ago
Hi Todd! The is_obama comparison is a bit different, because there you compare how often each word occurs for different groups in your data. The groups are then based on some selection of document variables, such as a specific president. In your case, you want to compare the frequency of a specific selection of terms (europe, european, etc.). I think it would make more sense to use a dictionary here (around 17:00). Probably the easiest way is to just look up the terms, and then add the results as a column to your document variables. Then you'd have a data.frame of your manifesto data with Europe salience as a variable. It'd look something like this:

library(quanteda)
dtm = data_corpus_inaugural %>% tokens() %>% dfm()    ## prepare dtm
dict = dictionary(list(europe = "europ*"))            ## create dictionary
dtm_europ = dfm_lookup(dtm, dict)                     ## look up dict terms
d = cbind(docvars(dtm_europ), as.matrix(dtm_europ))   ## combine docvars with dict results
@toddcoode9598 · 3 years ago
@@kasperwelbers Woah, thank you so much! Dude, I would like to tip you some money for helping me, do you have a Patreon or something? You really helped me out here 🙏✌️ Is there a way of hiring you or something? I am sure my work would take not even an hour for someone experienced. For me as a beginner it will take days to figure this out, but I'll try anyway. Thank you so much and God bless 🙏
@toddcoode9598 · 3 years ago
@@kasperwelbers Ohh, and also I just cannot figure out how to turn my VCorpus (my PDF files) into a dtm, because the dfm function outputs an error if I try to compute it on the VCorpus, saying "dfm() only works on character, corpus, dfm, tokens objects." Am I really that dumb? ;D I just don't get it, but I keep trying. Greez and God bless.
@kasperwelbers · 3 years ago
@@toddcoode9598 The VCorpus is a thing from the 'tm' package (tm and quanteda are competitors, of sorts). In the future I would recommend using the readtext package to read texts, but we can also convert the VCorpus to a quanteda-style corpus. The only thing that changes is that you first use the corpus() function. Also, a bit inconveniently, quanteda recently had a big update due to which some of my example code no longer works. The new appropriate way is to first use corpus(), then tokens(), and then dfm(). So if your VCorpus were called vc, you'd get: dtm = vc %>% corpus() %>% tokens() %>% dfm()
@kasperwelbers · 3 years ago
@@toddcoode9598 Thanks for offering, but really, this is just something I do on the side. Most of these videos are made for my work at the university, but it just makes sense to keep stuff open and spread the R gospel where possible
@yimeilong5518 · 3 years ago
Hi, thank you for your video. I have a question: while creating the dictionary, what if I have a long keyword list? Should I type the keywords in manually? That's hard. Do you have any idea? Thank you.
@kasperwelbers · 3 years ago
Hi Yimei, as long as the data has a clear structure it should be possible to transform it into a quanteda style dictionary. Could you give a small example of what your list looks like?
@yimeilong5518 · 3 years ago
@@kasperwelbers Thank you so much for your reply. There are several categories in my dictionary, and each category contains some keywords or phrases. For example, communication skills and critical thinking skills are two categories; the items in brackets are keywords: communication skills (and_communication, communicate_effectively, effectively_communicate, etc.), critical thinking skills (critical_thinking, problem_solving, solutions_oriented, etc.)
@kasperwelbers · 3 years ago
@@yimeilong5518 That should certainly be possible. In what format do you have this dictionary? Is it already a list in R, or is this literally what the dictionary looks like (in plain text)?
@yimeilong5518 · 3 years ago
@@kasperwelbers That's great! The dictionary is in Excel now: 8 categories and 200+ keywords in total. I created the dictionary in WordStat and tried to import the .cat format directly into quanteda, but it didn't work well. I'm trying to find some tutorials; do you have any suggestions? Thank you so much.
@yimeilong5518 · 3 years ago
@@kasperwelbers Hi, I think I found the problem. When I import the dictionary from WordStat (a .cat file), the entries contain an underscore, like "critical_thinking", so when I use the dfm_lookup function the machine cannot find it, since the text shows "critical thinking". Any suggestion to solve this problem? Thank you so much.
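
The thread has no reply here, so as a hedged sketch only: two common ways to reconcile underscore-style dictionary entries with plain-text tokens, using a small stand-in dictionary and quanteda's built-in corpus in place of the user's data:

library(quanteda)

# stand-in for the WordStat export, with underscore entries
dict_underscore <- dictionary(list(
  critical_thinking_skills = c("critical_thinking", "problem_solving")
))

toks <- tokens(data_corpus_inaugural)   # stand-in for the user's corpus

# Option A: write the entries with spaces; quanteda then treats them
# as multi-word phrases during lookup
dict_phrases <- dictionary(list(
  critical_thinking_skills = c("critical thinking", "problem solving")
))
dfm_a <- dfm(tokens_lookup(toks, dict_phrases))

# Option B: compound matching token pairs into underscore tokens first,
# after which the original underscore dictionary matches directly
toks_c <- tokens_compound(toks, pattern = phrase(c("critical thinking", "problem solving")))
dfm_b <- dfm_lookup(dfm(toks_c), dict_underscore)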
@syedabdullah4607 · 3 years ago
@kasperwelbers How could we use quanteda for bibliographic data obtained from Scopus in .csv format, or from the Web of Science database in plain-text format? Please help me with that process. I would be most thankful.
@kasperwelbers · 3 years ago
In general I would recommend looking into the readtext package for importing all sorts of common text formats. But whenever you have a specific format from a big party such as Scopus, your best bet is to first just google for "import scopus R" or something like that. Chances are high that someone has already built a special package that does what you need. At a quick glance I see that there is an rscopus package, and the fulltext package also seems useful.
@syedabdullah4607 · 3 years ago
@@kasperwelbers Could you please make a short video on reading text files from Scopus?
@kasperwelbers · 3 years ago
@@syedabdullah4607 I'm afraid that's a bit too specific, and I haven't worked with Scopus data myself (yet).
@syedabdullah4607 · 3 years ago
@@kasperwelbers I think that would be a most interesting and appreciated task, because you already have experience in running R effectively. In case you want a sample dataset from Scopus, I would be happy to share one.
@kasperwelbers · 3 years ago
@@syedabdullah4607 What I mean is that not many people will be interested in working with Scopus specifically. I might do something more general on importing data into R. However, at the moment I really don't have any free time (teaching in times of Corona is tough)
@zolzayaenkhtur8309 · 1 year ago
Thanks for the video! How do you define the documents for the corresponding president, such as Obama? Does R do it automatically? How? Thanks in advance.
@kasperwelbers · 1 year ago
Hi Zolzaya. The corpus function automatically adds any columns in the input data frame (except the 'text' column) as document variables. So the way we tell R that there is a president column is simply by importing a csv file that has this column. Hope that clarifies it!
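
A minimal sketch of that mechanism (inaugural.csv is a hypothetical file with President, Year, and text columns):

library(quanteda)
library(readr)

d <- read_csv("inaugural.csv")           # e.g. columns: President, Year, text
corp <- corpus(d, text_field = "text")   # every other column becomes a docvar
head(docvars(corp))                      # the President column is available here
is_obama <- docvars(corp)$President == "Obama"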
@pragatisharma3602 · 2 years ago
I am a bit confused about my question. My text also contains digits, in weight format like 120g, 130g, etc. I need to remove them, and I have to categorize the column into three categories: potato_chips, not_potato_chips, not_chips. Could you please help me? Or any hint? :)
@kasperwelbers · 2 years ago
Hi Pragati. There are several ways to do this, and which makes sense depends on what your goal is. If the only thing you are interested in is whether texts are about potato chips (potato_chips), non-potato chips (not_potato_chips), or do not even mention chips (not_chips), then I think you'd best just make a dictionary for "potato" and "chips". This would give you two columns, based on which all the other columns you mention can be made: not_potato_chips would just be the cases where both potato and chips are zero (if this means NOT potato AND NOT chips), and not_chips all cases where chips is zero. But maybe I'm completely missing the point? :)
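
A hedged sketch of that suggestion; the corpus object (my_text_corpus) and the reading of the three categories are assumptions, since the thread leaves both open:

library(quanteda)

toks <- tokens(my_text_corpus)                                        # hypothetical corpus
toks <- tokens_remove(toks, pattern = "\\d+g", valuetype = "regex")   # drop 120g, 130g, ...

dict <- dictionary(list(potato = "potato*", chips = "chip*"))
counts <- convert(dfm_lookup(dfm(toks), dict), to = "data.frame")

# one plausible reading: no "chips" at all -> not_chips; otherwise split
# on whether "potato" also occurs
counts$category <- ifelse(counts$chips == 0, "not_chips",
                   ifelse(counts$potato > 0, "potato_chips", "not_potato_chips"))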
@67lobe · 1 year ago
Hello, I can't find the moment where you speak about Word documents. I have Word documents from which I want to create a corpus.
@kasperwelbers · 1 year ago
Hi @67lobe, I don't think I discuss Word files in this tutorial. But I think the best ways are to use the 'readtext' package, or 'antiword'. The readtext package is probably the better one to learn, because it provides a unified interface for various file types, like Word, PDF and CSV.
@marcosechevarria6237 · 1 month ago
The dfm function is defunct unfortunately :(
@sakifzaman · 1 year ago
At 4:51 it shows this message: "Error in textplot_wordcloud(dtm, max_words = 50) : could not find function "textplot_wordcloud"". I have all the relevant packages but am still getting this. Do you know why, and how to solve it?
@kasperwelbers · 1 year ago
Hi Sakif, that error message typically means that the package containing the function is not loaded. Are you sure you ran library(quanteda)? Also, if you installed some packages afterwards, R might have restarted, in which case you might have to run that line again.
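
One extra point worth noting here, as an assumption beyond what the thread states: in recent quanteda versions (3.x and later) the plotting functions live in the separate quanteda.textplots package, so loading quanteda alone is not enough. A sketch of a working setup:

library(quanteda)
library(quanteda.textplots)   # textplot_wordcloud moved to this package

dtm <- dfm(tokens(data_corpus_inaugural))
textplot_wordcloud(dtm, max_words = 50)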
@roxyioana · 8 months ago
Cannot use dfmat_inaug.
@kasperwelbers · 8 months ago
Hi @roxyioana, please check the link to the tutorial in the description. We keep that repository up to date with changes (and at some point I'll hopefully find the time to re-record some videos).
@user-ui8uz6nf7y · 10 months ago
What about importing text from multiple PDF/DOCX files?
@kasperwelbers · 9 months ago
I think the easiest way would be to use the readtext package. This allows you to read an entire folder ("my_doc_files/") or use wildcards ("my_doc_files/article*_.txt"). cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html#microsoft-word-files-.doc-.docx
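
A minimal sketch with readtext (the folder name is hypothetical):

library(readtext)
library(quanteda)

txts <- readtext("my_doc_files/*.docx")   # wildcard over a folder of Word files
corp <- corpus(txts)                      # readtext output goes straight into corpus()
summary(corp)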
@redietgebretsion8249 · 3 years ago
Please prepare a video on fastRText, the text classification library by Facebook... the documentation is poor.
@kasperwelbers · 3 years ago
I haven't used it yet, but the port to R (Rcpp) looks pretty solid! At the moment I'm struggling a bit to find time besides work, but if I can fit it into a course or workshop I might have a good excuse to play around with it.
@redietgebretsion8249 · 3 years ago
@@kasperwelbers Thanks a lot