
Text analysis in R. Part 1b: Advanced preprocessing 

Kasper Welbers

This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-tex...), closely related to our R course material GitHub page (github.com/ccs-amsterdam/r-co...), and a 42% love letter to quanteda.
This specific video adds some more advanced tools for preprocessing. For support in R, we recommend the spacyr and udpipe packages; a short udpipe sketch follows below the links.
spacyr: cran.r-project.org/web/packag...
udpipe: bnosac.github.io/udpipe/en/
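
As a hedged illustration of what these packages add on top of basic preprocessing, the sketch below uses udpipe to tokenize, lemmatize, and POS-tag a couple of made-up sentences. The example text and the choice to keep only nouns, verbs, and adjectives are just assumptions for the demo; the English model is downloaded on first use.

    # Minimal udpipe sketch: tokenization, lemmatization and POS tagging.
    library(udpipe)

    txt <- c(doc1 = "The cats were chasing mice in the old barns.",
             doc2 = "Text analysis in R keeps getting easier.")

    # Download and load an English UDPipe model (cached after the first call)
    model_info <- udpipe_download_model(language = "english")
    model      <- udpipe_load_model(model_info$file_model)

    # Annotate: returns a token-level table with token, lemma, upos, etc.
    anno <- as.data.frame(udpipe_annotate(model, x = txt, doc_id = names(txt)))

    # Keep only content words and use their lemmas as features
    content <- subset(anno, upos %in% c("NOUN", "VERB", "ADJ"))
    content[, c("doc_id", "token", "lemma", "upos")]

From there, one option is to collapse the lemmas back into documents (for example with split() and quanteda's as.tokens()) and continue with the dfm workflow from Part 1.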

Published: 5 Oct 2020

Comments: 4
@ethanjudah8420 11 months ago
Hi, I'm trying to do this on Reddit data, but the files I have are too large (100GB+) for only 3 months of data. They're in .zst format. Do you have any suggestions on how to deal with this and apply these techniques to this data set in R?
@kasperwelbers 11 months ago
If your file is too large to keep in memory, the only option is to work through it in batches or as a stream. So the first thing to look into would be whether there is a package in R for importing ZST files that lets you stream it in or select specific rows/items (so that you can get it in batches). But perhaps the bigger issue here is that with this much data you really need to focus on fast preprocessing, so that you'll be able to finish your work in the current decade. So first make a plan for what type of analysis you want to do, and then figure out which techniques you definitely need for it. Also, consider whether it's possible to run the analysis in multiple steps. Maybe you could first just process the data to filter it on some keywords, or to store it in a searchable database. Then you could do the heavier NLP lifting only for the documents that require it.
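To make the batching idea a bit more concrete, here is a rough sketch (not a tested recipe) that assumes the dump is newline-delimited JSON compressed with zstd, that the external zstd command-line tool is installed, and that each record has a body field; the file name, keyword, and page size are placeholders.

    # Rough sketch: stream a zstd-compressed NDJSON dump in batches and keep
    # only records that mention a keyword, so the heavy NLP runs on a small subset.
    # (In real dumps the columns can differ across batches; this is only a sketch.)
    library(jsonlite)

    infile  <- "RC_2023-01.zst"   # hypothetical Reddit dump
    keyword <- "climate"          # hypothetical filter term
    keep    <- list()             # collects the filtered batches

    con <- pipe(sprintf("zstd -dc %s", infile))

    stream_in(con, pagesize = 10000, handler = function(batch) {
      # batch is a data frame with one row per JSON line in this page
      hits <- batch[grepl(keyword, batch$body, ignore.case = TRUE), ]
      if (nrow(hits) > 0) keep[[length(keep) + 1]] <<- hits
    })

    filtered <- do.call(rbind, keep)   # much smaller subset for further analysis

Instead of collecting the hits in memory, each filtered batch could also be appended to a file or database, so nothing large ever has to be held at once.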
@katebjorklund4650 1 year ago
After I stem, I get a lot of single letter nonwords. Any advice on how to deal with those?
@kasperwelbers 1 year ago
Hi @Kate, that depends. If the words are non-informative, you could just delete all single-letter words. If the problem is that the words (before stemming) were informative, then perhaps stemming just doesn't work that well for your data (which can depend on the language you're working with). For most languages (especially non-English ones) I would generally recommend lemmatization if a good model is available for your language.
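For the first suggestion (dropping uninformative single-letter tokens after stemming), a minimal quanteda sketch could look like this; the example sentences and the two-character cutoff are just illustrations.

    # Minimal sketch: stem, then drop tokens shorter than two characters.
    library(quanteda)

    toks <- tokens(c(d1 = "She was overly happy with the analysis",
                     d2 = "They re-ran the models a few times"),
                   remove_punct = TRUE)

    toks <- tokens_wordstem(toks)              # stemming can leave very short stems
    toks <- tokens_select(toks, min_nchar = 2) # drop single-character tokens

    dfm(toks)

If lemmatization is the better route, the udpipe sketch under the video description produces a lemma column that can be used in place of the raw tokens.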