
Text analysis in R. Part 1b: Advanced preprocessing 

Kasper Welbers

This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-tex...), closely related to our R course material GitHub page (github.com/ccs-amsterdam/r-co...), and a 42% love letter to quanteda.
This specific video adds some more advanced tools for preprocessing. For support in R, we recommend the spacyr and udpipe packages; a short udpipe sketch follows below the links.
spacyr: cran.r-project.org/web/packag...
udpipe: bnosac.github.io/udpipe/en/
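
As a hedged illustration of what these packages add on top of basic preprocessing, the sketch below uses udpipe to tokenize, lemmatize, and POS-tag a couple of made-up sentences. The example text and the choice to keep only nouns, verbs, and adjectives are just assumptions for the demo; the English model is downloaded on first use.

    # Minimal udpipe sketch: tokenization, lemmatization and POS tagging.
    library(udpipe)

    txt <- c(doc1 = "The cats were chasing mice in the old barns.",
             doc2 = "Text analysis in R keeps getting easier.")

    # Download and load an English UDPipe model (cached after the first call)
    model_info <- udpipe_download_model(language = "english")
    model      <- udpipe_load_model(model_info$file_model)

    # Annotate: returns a token-level table with token, lemma, upos, etc.
    anno <- as.data.frame(udpipe_annotate(model, x = txt, doc_id = names(txt)))

    # Keep only content words and use their lemmas as features
    content <- subset(anno, upos %in% c("NOUN", "VERB", "ADJ"))
    content[, c("doc_id", "token", "lemma", "upos")]

From there, one option is to collapse the lemmas back into documents (for example with split() and quanteda's as.tokens()) and continue with the dfm workflow from Part 1.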

Published: 5 Oct 2020

Comments: 4
@ethanjudah8420 11 months ago
Hi, I'm trying to do this on Reddit data, but the files I have are too large (100GB+) for only 3 months of data. They're in .zst format. Do you have any suggestions on how to deal with this and apply these techniques to this data set in R?
@kasperwelbers 11 months ago
If your file is too large to keep in memory, the only option is to work through it in batches or as a stream. So the first thing to look into would be whether there is a package in R for importing ZST files that lets you stream it in or select specific rows/items (so that you can get it in batches). But perhaps the bigger issue here is that with this much data you really need to focus on fast preprocessing, so that you'll be able to finish your work in the current decade. So first make a plan for what type of analysis you want to do, and then figure out which techniques you definitely need for it. Also, consider whether it's possible to run the analysis in multiple steps. Maybe you could first just process the data to filter it on some keywords, or to store it in a searchable database. Then you could do the heavier NLP lifting only for the documents that require it.
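To make the batching idea a bit more concrete, here is a rough sketch (not a tested recipe) that assumes the dump is newline-delimited JSON compressed with zstd, that the external zstd command-line tool is installed, and that each record has a body field; the file name, keyword, and page size are placeholders.

    # Rough sketch: stream a zstd-compressed NDJSON dump in batches and keep
    # only records that mention a keyword, so the heavy NLP runs on a small subset.
    # (In real dumps the columns can differ across batches; this is only a sketch.)
    library(jsonlite)

    infile  <- "RC_2023-01.zst"   # hypothetical Reddit dump
    keyword <- "climate"          # hypothetical filter term
    keep    <- list()             # collects the filtered batches

    con <- pipe(sprintf("zstd -dc %s", infile))

    stream_in(con, pagesize = 10000, handler = function(batch) {
      # batch is a data frame with one row per JSON line in this page
      hits <- batch[grepl(keyword, batch$body, ignore.case = TRUE), ]
      if (nrow(hits) > 0) keep[[length(keep) + 1]] <<- hits
    })

    filtered <- do.call(rbind, keep)   # much smaller subset for further analysis

Instead of collecting the hits in memory, each filtered batch could also be appended to a file or database, so nothing large ever has to be held at once.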
@katebjorklund4650 1 year ago
After I stem, I get a lot of single letter nonwords. Any advice on how to deal with those?
@kasperwelbers 1 year ago
Hi @Kate, that depends. If the words are non-informative, you could just delete all single-letter words. If the problem is that the words (before stemming) were informative, then perhaps stemming just doesn't work that well for your data (which can depend on the language you're working with). For most languages (especially non-English ones) I would generally recommend lemmatization if a good model is available for your language.
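For the first suggestion (dropping uninformative single-letter tokens after stemming), a minimal quanteda sketch could look like this; the example sentences and the two-character cutoff are just illustrations.

    # Minimal sketch: stem, then drop tokens shorter than two characters.
    library(quanteda)

    toks <- tokens(c(d1 = "She was overly happy with the analysis",
                     d2 = "They re-ran the models a few times"),
                   remove_punct = TRUE)

    toks <- tokens_wordstem(toks)              # stemming can leave very short stems
    toks <- tokens_select(toks, min_nchar = 2) # drop single-character tokens

    dfm(toks)

If lemmatization is the better route, the udpipe sketch under the video description produces a lemma column that can be used in place of the raw tokens.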