Text analysis in R. Part 1: Preprocessing

This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-tex...), closely related to our R course material GitHub page (github.com/ccs-amsterdam/r-co...), and a 42% love letter to quanteda.
#### Useful links ####
Low-level string processing:
A good place to start is by learning how to use the stringr package. (I personally prefer the stringi package because I'm used to it, but stringr is probably more accessible to most, as it has this tidyverse flair).
stringr vignette:
cran.r-project.org/web/packag...
Another great resource on stringr is the R for Data Science book, which also covers regular expressions in more depth:
r4ds.had.co.nz/strings.html
Character encoding:
'What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text' by David C. Zentgraf: kunststube.net/encoding/
'String encoding and R' by Kevin Ushey: kevinushey.github.io/blog/201...

Published: 5 Oct 2020

Comments: 16

@asterixklang7213 3 years ago

This is so well explained. Thank you very much for sharing this!

@kasperwelbers 3 years ago

Thanks!

@haraldurkarlsson1147 3 years ago

Very nice coverage of text analysis and the main concepts.

@obakeng4287 2 years ago

Legendary 👌

@larszijm5882 3 years ago

You saved me man, thanks a lot!!

@murielmoyahabo6078 2 years ago

I really love this. I would love to see your documents before you convert them into a corpus. I need to see the structure and what you have there.

@kasperwelbers 2 years ago

Hi Muriel. Could you clarify which corpus you mean? In general, I think the easiest way to make a corpus is by using a data.frame as input, as also described here: tutorials.quanteda.io/basic-operations/corpus/corpus/
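
As an illustration of the data.frame route mentioned above, here is a minimal sketch. The data and the column names (doc_id, text, source) are made up for the example and are not taken from the video; doc_id and text happen to be the defaults that quanteda's corpus() looks for when given a data.frame.

```r
library(quanteda)

# Hypothetical example data: one row per document, the text in one column,
# and any other columns kept as document-level variables (docvars).
docs <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("This is the first document.", "And this is the second one."),
  source = c("blog", "news"),
  stringsAsFactors = FALSE
)

# Inspect the structure before converting
str(docs)

# Convert to a corpus; text_field names the column that holds the text
corp <- corpus(docs, text_field = "text")

summary(corp)   # one line per document, including the docvars
docvars(corp)   # the remaining columns (here: source)
```

Calling str(docs) before the conversion shows the structure that goes in: a plain table with one document per row.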

@kobeoncount 10 months ago

Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on the Turkish language, and I don't know how to handle stopwords and stemming. There are compatible files for TR that work with quanteda, but I don't know how to actually make them work. Could you please give some hints about that as well?

@kasperwelbers 10 months ago

Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda. Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages are supported. Importantly, you also need to set a 'source' that stopwords uses. The default (snowball) doesn't support Turkish (which I assume is TR), but it seems nltk does:
    library(stopwords)
    stopwords_getsources()
    stopwords_getlanguages(source = 'nltk')
    stopwords('tr', source = 'nltk')
Similarly, for stemming it uses SnowballC. Same kind of process:
    library(SnowballC)
    library(quanteda)  # char_wordstem and dfm_wordstem are quanteda functions
    getStemLanguages()
    char_wordstem("aslında", language = 'turkish')  # (same should work for dfm_wordstem)
So, not sure how well this works, but it does seem to be supported!

@kobeoncount 10 months ago

@kasperwelbers This is so helpful, thank you!!

@gabrielbriziou1602 2 years ago

I love you

@rubenurbizagastegui36 3 years ago

How do you remove accents in different languages? Could you please give us some examples?

@kasperwelbers 3 years ago

Hi Ruben, I think you're looking for transliteration. Simply put, we can translate text into the ASCII encoding, which doesn't have accents. This is available in base R (the iconv function), but I prefer using the stringi package:
    library(stringi)
    your_text = 'Der größte soufflé'
    stri_trans_general(your_text, "any-ascii")
This is vectorized, so your_text can also be a vector with many texts. Note that this might fail, because depending on your system and how you imported or entered the text, you might need to specify the encoding. The transliteration from 'any' into 'ascii' is a bit rough, but surprisingly it often just works.
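
For comparison, a minimal sketch of the base R route mentioned above, using iconv. This is an illustration under assumptions, not code from the video: it assumes the input is (or can be marked as) UTF-8, and the exact replacement characters depend on the iconv implementation of your platform, so no expected output is shown. On some systems a character that cannot be converted yields NA.

```r
# Base R only, no packages needed
your_text <- "Der größte soufflé"

# enc2utf8() makes sure the string is UTF-8 before converting.
# "ASCII//TRANSLIT" asks iconv to approximate non-ASCII characters
# instead of failing on them; how it approximates is platform dependent.
ascii_text <- iconv(enc2utf8(your_text), from = "UTF-8", to = "ASCII//TRANSLIT")

print(ascii_text)
```

Like stri_trans_general, iconv is vectorized, so your_text can be a character vector with many texts.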

@rubenurbizagastegui36 3 years ago

Hi Welbers, no, I am not looking for transliteration. I am looking for a way to deal with Spanish accents when doing text analysis with quanteda. It looks like quanteda does not recognize accents. How do I deal with Spanish accents using quanteda?

@kasperwelbers 3 years ago

@rubenurbizagastegui36 But how do you then want to 'deal with Spanish accents'? Your question was how to remove accents (which is often a good solution), but that is what you'd use transliteration for. Did you check the example code in my previous comment?