Thanks! I mostly used OBS, which is an open source tool for recording and streaming. I found it quite intuitive (with some tutorials), and as someone without any editing experience I was able to set up a good, simple system for switching and layering windows. (Though to be honest, this was amid early-pandemic despair over how to manage online teaching, so I probably did spend quite some time on it.) For the weather-person effect of talking in front of a screen, I bought a pull-up greenscreen, though automatic background filtering has come a long way since then, so a greenscreen might no longer be needed. I also used Kdenlive for editing. In my case I only used it for cutting and pasting pieces of recordings, which didn't take long to figure out, but I think the tool also supports more advanced editing.
If I recall correctly, the correlated topic model mostly differs in that it takes the correlations between topics into account when fitting the model. It probably adds a covariance matrix, but there should still be posterior distributions for document-topic and topic-word, so you should still be able to visualize the correlations of topics and documents (or topics with topics) in a heatmap. Though depending on what package you use to compute them, extracting the posteriors might work differently.
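For instance, here is a minimal base-R sketch of the heatmap idea. The simulated `theta` matrix stands in for the document-topic posterior means; how you extract the real one depends on your package, so everything here is illustrative:

```r
set.seed(1)
## 'theta' stands in for document-topic posterior means (100 docs, 5 topics)
theta = matrix(runif(100 * 5), nrow = 100,
               dimnames = list(NULL, paste0("topic_", 1:5)))
theta = theta / rowSums(theta)   ## normalize rows, like topic proportions

topic_cor = cor(theta)           ## topic-by-topic correlation matrix
heatmap(topic_cor, symm = TRUE)  ## base R heatmap of topic correlations
```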
This is a brilliant tutorial on GLMs in R, with a very good step-by-step breakdown of all the information that is understandable for a beginner.
Hi! Do you mean vastly different results, or very small differences? I do think some of the multilevel results could potentially differ due to randomness in how the model converges, but if so, the differences should be really minor.
Ahaha, not sure whether that's a question or a burn 😅. This is just a Blue Yeti mic in the home office I set up during the COVID lockdowns. The room itself has pretty nice acoustic treatment, but I was still figuring out in a rush how to make recordings for lectures/workshops, and it was hard to get clear audio without keystrokes coming through.
To be able to plot with textplot_wordcloud, you first need to load the "quanteda.textplots" library; otherwise it is not going to work. I guess a few things have changed after 3 years. Thanks for the video, dear Kasper.
Hi @roxyioana, please check the link to the tutorial in the description. We keep that repository up to date with changes (and at some point I'll hopefully find the time to re-record some videos).
Hello Kasper, I appreciate your great video. I have a question. Regarding your example data, what if there are two or more data points per day for each person? Let's assume that you measure reaction time 4 times each day across participants. Do you need to average those data points to make one data point for each day, or do you use all data points?
Interesting question. We can actually add more groups to the model instead of aggregating, but it depends on your question. In the example we used days as a continuous variable, because we wanted to test whether there was a linear effect on reaction time. If you also want to consider the time of day as a continuous variable, then it indeed becomes awkward how to combine them. However, maybe your reason for the four measurements is just to get more data points, so you think of them as factors rather than as continuous. While aggregating might be viable, you could also consider adding another level to your model, for whether the measurement was in the (1) morning, (2) afternoon, (3) evening, or (4) night. You could then have random intercepts, for instance to take into account that people might on average have slower reaction times in the evening due to their after-dinner dip. (Though note that with just 4 groups you might rather want to use fixed effects with dummy variables.) Perhaps more generally, what you're interested in is multilevel models with more than one group level. This is possible and very common/powerful. Groups can then either be nested or crossed — nested, for instance, when people are nested in cities.
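To sketch what that extra level could look like, here is a hypothetical lme4 example with made-up data and variable names (the time-of-day effects are invented purely for the simulation):

```r
library(lme4)
set.seed(1)

## hypothetical data: 20 subjects x 10 days x 4 times of day
d = expand.grid(subject = factor(1:20), day = 0:9,
                timeofday = factor(c("morning", "afternoon", "evening", "night")))
subj_int = rnorm(20, mean = 250, sd = 20)                       ## subject baselines
tod_eff = c(morning = 0, afternoon = 5, evening = 15, night = 10)
d$reaction = subj_int[d$subject] + 8 * d$day +
             tod_eff[as.character(d$timeofday)] + rnorm(nrow(d), 0, 15)

## time of day as an extra (crossed) group level with random intercepts
m = lmer(reaction ~ day + (day | subject) + (1 | timeofday), data = d)
summary(m)
```

With only four levels, the timeofday variance is hard to estimate (you may get a singular-fit message), which is exactly why fixed-effect dummies can be preferable here.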
I think the easiest way would be to use the readtext package. It allows you to read an entire folder ("my_doc_files/") or use wildcards ("my_doc_files/article*.txt"). cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html#microsoft-word-files-.doc-.docx
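A quick self-contained sketch with plain .txt files (the folder and file names are just for illustration; the same call works for .docx):

```r
library(readtext)

## create a small example folder with two text files
dir = file.path(tempdir(), "my_doc_files")
dir.create(dir, showWarnings = FALSE)
writeLines("first document", file.path(dir, "article1.txt"))
writeLines("second document", file.path(dir, "article2.txt"))

docs = readtext(file.path(dir, "article*.txt"))  ## wildcard import
docs$text
```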
If I understand you correctly, it is indeed possible to model a dependent variable with a tri-modal distribution with glm. Actually, you might not even need glm for that. Whether a distribution is multimodal is a separate issue from the distribution family: a tri-modal distribution might be a mixture of three normal distributions, three binomial distributions, etc. Take the following simulation as an example. Here we create a y variable that is affected by a continuous variable x and by a factor with three groups. Since there is a strong effect of the group on y, y ends up tri-modal.

## simulate 3-modal data
n = 1000
x = rnorm(n)
group = sample(1:3, n, replace=T)
group_means = c(5, 10, 15)
y = group_means[group] + x*0.4 + rnorm(n)
hist(y, breaks=50)

m1 = lm(y ~ x)
m2 = lm(y ~ as.factor(group) + x)
summary(m1)  ## bad estimate of x (should be around 0.4)
plot(m1, 2)  ## error is non-normal
summary(m2)  ## good estimate after controlling for group
plot(m2, 2)  ## error is normal after including group
Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on the Turkish language, and I don't know how to handle the stopword and stemming processes. There are compatible files for Turkish to work with through quanteda, but I don't know how to actually make them work. Could you please give some hints about that as well?
Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda. Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages are supported. Importantly, you also need to set a 'source' that stopwords uses. The default (snowball) doesn't support Turkish, but it seems nltk does:

library(stopwords)
stopwords_getsources()
stopwords_getlanguages(source = 'nltk')
stopwords('tr', source = 'nltk')

Similarly, for stemming quanteda uses SnowballC. Same kind of process:

library(SnowballC)
getStemLanguages()
char_wordstem("aslında", language = 'turkish')  ## the same should work for dfm_wordstem

So, not sure how well this works, but it does seem to be supported!
Hi, I'm trying to do this on reddit data but the files I have are too large (100gb+) for only 3 months of data. That's in .zst. Do you have any suggestions on how to deal with this and apply these techniques on this data set in R?
If your file is too large to keep in memory, the only option is to work through it in batches or by streaming. So the first thing to look into would be whether there is an R package for importing ZST files that allows you to stream it or select specific rows/items (so that you can get it in batches). But perhaps the bigger issue here is that with this much data you really need to focus on fast preprocessing, so that you'll be able to finish your work in the current decade. So first make a plan for what type of analysis you want to do, and then figure out which techniques you definitely need for it. Also, consider whether it's possible to run the analysis in multiple steps. Maybe you could first process the data just to filter it on some keywords, or to store it in a searchable database. Then you could do the heavier NLP lifting only for the documents that require it.
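The batch idea itself is simple in base R once you have a readable connection. For .zst you would still need a decompression step in front; this sketch just uses a plain text file as a stand-in:

```r
## write a stand-in file with 250 json-like lines
path = tempfile(fileext = ".jsonl")
writeLines(sprintf('{"id": %d}', 1:250), path)

con = file(path, "r")
n_kept = 0
repeat {
  batch = readLines(con, n = 100)   ## read 100 lines at a time
  if (length(batch) == 0) break
  keep = grepl('"id"', batch)       ## e.g. filter each batch on a keyword
  n_kept = n_kept + sum(keep)       ## ...or store the kept lines somewhere
}
close(con)
n_kept  ## 250
```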
Hi @67lobe, I don't think I discuss Word files in this tutorial. But I think the best ways are to use the 'readtext' package or 'antiword'. The readtext package is probably the best one to learn, because it provides a unified interface for various file types, like Word, PDF and CSV.
What if you have combinations of two different groups? For example, you measure blood pressure from volunteers after they drink a certain number of units of alcohol, and you do that in two different locations. So you want to fit a line per individual, but you also want to control for the location effect. Right?
You can certainly have multiple groups. First, you could have groups nested in groups. If you perform the same experiment in many countries across the world, your units would be observations nested in people (group 1) nested in countries (group 2). Second, you could have cross-nested (or cross-classified) groups. For example, say we want to study if the effect of more alcoholic beverages on blood pressure differs depending on the type of alcoholic beverage (beer, wine, etc.). In that case, each person could have observations for multiple beverages, and each beverage could have observations for multiple people.
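As a hypothetical lme4 sketch of the crossed case (simulated data, made-up numbers and variable names):

```r
library(lme4)
set.seed(42)

## every person drinks every beverage type at several doses (crossed groups)
d = expand.grid(person = factor(1:30),
                beverage = factor(c("beer", "wine", "spirits")),
                units = 0:4)
person_int = rnorm(30, mean = 120, sd = 8)       ## person baselines
bev_slope = c(beer = 2, wine = 1, spirits = 4)   ## beverage-specific effect
d$bp = person_int[d$person] + bev_slope[as.character(d$beverage)] * d$units +
       rnorm(nrow(d), 0, 5)

## person and beverage enter as crossed (not nested) random effects;
## a nested design would instead look like, e.g., (1 | country/person)
m = lmer(bp ~ units + (1 | person) + (units | beverage), data = d)
summary(m)
```

With only three beverage types the random-slope variance is hard to estimate (expect a singular-fit message); the point here is just the formula structure.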
@@kasperwelbers I see, thanks. I can imagine that having all these nested and cross-nested groups can complicate the model and its interpretation quite a lot.
Thank you for the informative text analysis videos. I am just a beginner in text analysis and R, and I started with your videos. I have a question about 12:13: kwic() needs tokens(), so I applied toks <- tokens(corp); k = kwic(toks, 'freedom', window = 5). Is that right?
Yes, you're correct. The quanteda API has seen some changes since this video was recorded. You can still pass a corpus directly to kwic, but it will now throw a warning that this is 'deprecated'. This means that at the moment it still works, but at some point in the (near) future it will be mandatory to tokenize a corpus before using kwic.
Hi Brian! LDA itself does not care about language, because it only looks at word occurrences in documents. Simply put, as long as you can preprocess the text and represent it as a document term matrix, you can apply LDA.
Hello Kasper, thanks for this great video. Just wondering where I can get the document/chapter where all the code is given — I mean the document from which you copied the code and pasted it into R. Please let me know.
Hi @Dr Dilsad. Sorry, it seems I only included the link in the first video (about GLMs). More generally, we maintain some R tutorials that we regularly use in education on this GitHub page: github.com/ccs-amsterdam/r-course-material . The multilevel one is under frequentist statistics. There is a short version in the "Advanced statistics overview" that I think is the one from this video, and also a slightly more elaborate one in the "Multilevel models" tutorial.
Again, thanks for the fine presentation. How about xpath? Have you considered covering that? I was hoping it would help with a table I was scraping, but I could not figure out what to hang my hat on. The website is very unusual: you can view a table (the one I would like to scrape), but the code returns a list of three tables, not one. What is annoying is that the HTML code has no distinct tags or marks to work with.
Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)". What can help with those types of annoyingly long paths is something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page, and it gives you either the CSS selector or the XPath.
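A tiny rvest illustration of that positional-selector idea (the HTML snippet is made up to match the example selector):

```r
library(rvest)

## a nav menu with no usable classes or ids on the target element
html = minimal_html('
  <div id="top-nav"><div><nav><ul>
    <li>Home</li><li>About</li><li>Contact</li>
  </ul></nav></div></div>')

html |>
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") |>
  html_text()
## "About"
```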
@@kasperwelbers Kasper, thanks for the information. It clearly takes a lot of experimenting. I wound up settling on these two code options for trying to extract the third table:

html_doc |>
  html_elements("table") |>
  html_table(header = TRUE) |>
  pluck(3)

pluck is from the purrr package (pull will not work here). Or using xpath:

html_doc |>
  html_elements(xpath = '//center[position() = 3]/table') |>
  html_table(header = TRUE)

The pluck method is more elegant in my mind, but xpath is clearly worth learning at one point or another. By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will. H
@@haraldurkarlsson1147 pluck indeed offers a nice solution here! There is certainly some value in learning xpath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. The xpath approach is probably slightly faster, but in web scraping the main speed bump is the HTTP requests, so in practice I think the difference in speed would hardly be noticeable.
Thanks for the video. Very clear explanations. I was wondering why you used days as a random effect slope, and didn’t add day:subject as an interaction term in your model?
Hi Michael. Good question. You could indeed fit something very similar to a multilevel model using dummies and interaction terms. Specifically, instead of random intercepts, we could have included fixed effects for every subject using dummy variables. And instead of random slopes, we could then have added interaction terms for these dummy variables with the 'days' variable. So more generally speaking, we could indeed have used fixed effects instead of random effects to model differences between subjects. There are some benefits to using random effects, though. Aside from not having your model cluttered with many dummies and interactions, using fixed effects eats up degrees of freedom. For example, if you used dummies for subjects, you could not add something like gender to the model, because there would be no df left at the subject level.
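To make the comparison concrete, here is a sketch using lme4's built-in sleepstudy data (which I believe is the same reaction-time example; if not, the structure is at least analogous):

```r
library(lme4)
data(sleepstudy)

## fixed-effects analogue: a dummy and an interaction term per subject
m_fixed = lm(Reaction ~ Days * Subject, data = sleepstudy)
length(coef(m_fixed))  ## 36 parameters for 18 subjects

## multilevel version: a handful of variance parameters do the same job
m_rand = lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(m_rand)
```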
@@kasperwelbers Ty for the reply, that clears up some of my confusion. I'm new to multilevel models, but have experience with multiple linear regression. I think I'm a bit confused about when one would choose to use a covariate as a random effect versus an interaction term as a fixed effect, as well as all the other possibilities. I kind of wish there was a complex enough dummy dataset where each possible scenario could be displayed with figures. For example: 1) response ~ var_1; 2) response ~ var_1:var_2; 3) response ~ var_1 + (1 | var_2); etc.
@@Maicolacola A great way to learn how these models work is actually to create your own dummy data. You can think of fitting models as trying to find the data generating process, so if you understand the model, you can generate data that will fit in a certain way. I should perhaps do a video on this, but here's a quick example:

## simulate data about the effect of doing homework on grade.
## students are nested in classes, which have different average grades, and
## a different effect (random slope) of doing homework.
n = 10000       ## students
groups = 100    ## classes
homework = rnorm(n, mean=10, sd=5)      ## simulate time spent on homework
group = sample(1:groups, n, replace=T)  ## assign random groups

## generate the random parts of the model
ri = rnorm(groups, mean=10, sd=3)  ## random intercepts (group level)
rs = rnorm(groups, mean=2, sd=1)   ## random slopes (group level)
e = rnorm(n, mean=0, sd=10)        ## individual level variance

## simulate the grade. Note that we can use the 'group' integers as index of ri and rs
grade = ri[group] + rs[group] * homework + e

## put the data together
d = data.frame(homework, grade, group)

## Now run a random intercepts+slopes model to see if we can recover the parameters
## we plugged in above (mean intercept is 10, mean homework slope is 2)
library(lme4)
library(sjPlot)
m = lmer(grade ~ homework + (1 + homework | group), data=d)
tab_model(m)
Kasper, just working through your tutorial this week and it is excellent. It is obviously some time since you did the video, and the code on the IMDb site has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.
Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!
Thanks! About the identity function: I think it's uncommon to use matrix notation for link functions because many are nonlinear. So I prefer to just think of the identity link more generally as an identity function. But maybe I'm missing something?
Thanks for the video! How do you define the documents for the corresponding president, such as Obama? Does R do it automatically? How? Thanks in advance.
Hi Zolzaya. The corpus function automatically adds any columns in the input dataframe (except the 'text' column) as document variables. So we do need to tell R that there is a president column, which here is done by importing a csv file that has this column. Hope that clarifies it!
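A minimal quanteda sketch of that mechanism (with made-up texts instead of the csv import):

```r
library(quanteda)

d = data.frame(president = c("Obama", "Trump"),
               text = c("yes we can", "make america great again"))

## the 'text' column becomes the document text; other columns become docvars
corp = corpus(d)
docvars(corp, "president")
## "Obama" "Trump"
```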
At 4:51 it's showing this message: "Error in textplot_wordcloud(dtm, max_words = 50) : could not find function "textplot_wordcloud"". I have all the relevant packages but am still getting this. Do you know why, and how to solve it?
Hi Sakif, that error message typically means that the package containing the function is not loaded. Are you sure you ran library(quanteda)? (In newer versions of quanteda, textplot_wordcloud has moved to the quanteda.textplots package, so you may also need to run library(quanteda.textplots).) Also, if you installed some packages afterwards, R might have restarted, in which case you might have to run that line again.