Thanks! I mostly used OBS, which is an open source tool for recording and streaming. I found it quite intuitive (with some tutorials), and as someone without any editing experience I was able to set up a good, simple system for switching and layering windows. (Though to be honest, this was amid early-pandemic despair over how to manage online teaching, so I probably did spend quite some time on it.) For the weather-person effect of talking in front of a screen, I bought a pull-up greenscreen, though automatic background filtering has come a long way since then, so a greenscreen might no longer be needed. I also used Kdenlive for editing. In my case I only used it for cutting and pasting pieces of recordings, which didn't take long to figure out, but I think the tool also supports more advanced editing.
If I recall correctly, the correlated topic model mostly differs in that it takes the correlations between topics into account when fitting the model. It probably adds a covariance matrix, but there should still be posterior distributions for document-topic and topic-word, so you should still be able to visualize the correlations of topics and documents (or topics with topics) in a heatmap. Though depending on what package you use to compute them, extracting the posteriors might work differently.
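For instance, here is a minimal base-R sketch of the heatmap idea. The simulated `theta` matrix stands in for the document-topic posterior means; how you extract the real one depends on your package, so everything here is illustrative:

```r
set.seed(1)
## 'theta' stands in for document-topic posterior means (100 docs, 5 topics)
theta = matrix(runif(100 * 5), nrow = 100,
               dimnames = list(NULL, paste0("topic_", 1:5)))
theta = theta / rowSums(theta)   ## normalize rows, like topic proportions

topic_cor = cor(theta)           ## topic-by-topic correlation matrix
heatmap(topic_cor, symm = TRUE)  ## base R heatmap of topic correlations
```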
This is a brilliant tutorial on GLMs in R, with a very good step-by-step breakdown of all the information that is understandable for a beginner.
Hi! Do you mean vastly different results, or very small differences? I do think some of the multilevel results could potentially differ due to randomness in how the model converges, but if so, the differences should be really minor.
Ahaha, not sure whether that's a question or a burn 😅. This is just a Blue Yeti mic in the home office I set up during the COVID lockdowns. The room itself has pretty nice acoustic treatment, but I was still figuring out in a rush how to make recordings for lectures/workshops, and it was hard to get clear audio without keystrokes coming through.
To be able to plot with textplot_wordcloud, you first need to load the "quanteda.textplots" library; otherwise it is not going to work. I guess a few things have changed after 3 years. Thanks for the video, dear Kasper.
Hi @roxyioana, please check the link to the tutorial in the description. We keep that repository up to date with changes (and at some point I'll hopefully find the time to re-record some videos).
Hello Kasper, I appreciate your great video. I have a question. Regarding your example data, what if there are two or more data points per day for each person? Let's assume that you measure reaction time 4 times each day across participants. Do you need to average those data points to make one data point for each day, or do you use all data points?
Interesting question. We can actually add more groups to the model instead of aggregating, but it depends on your question. In the example we used days as a continuous variable, because we wanted to test whether there was a linear effect on reaction time. If you also want to consider the time of day as a continuous variable, then it indeed becomes awkward how to combine them. However, maybe your reason for the four measurements is just to get more data points, so you think of them as factors rather than as continuous. While aggregating might be viable, you could also consider adding another level to your model, for whether the measurement was in the (1) morning, (2) afternoon, (3) evening, or (4) night. You could then have random intercepts, for instance to take into account that people might on average have slower reaction times in the evening due to their after-dinner dip. (Though note that with just 4 groups you might rather want to use fixed effects with dummy variables.) Perhaps more generally, what you're interested in is multilevel models with more than one group level. This is possible and very common/powerful. Groups can then either be nested or crossed — nested, for instance, when people are nested in cities.
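To sketch what that extra level could look like, here is a hypothetical lme4 example with made-up data and variable names (the time-of-day effects are invented purely for the simulation):

```r
library(lme4)
set.seed(1)

## hypothetical data: 20 subjects x 10 days x 4 times of day
d = expand.grid(subject = factor(1:20), day = 0:9,
                timeofday = factor(c("morning", "afternoon", "evening", "night")))
subj_int = rnorm(20, mean = 250, sd = 20)                       ## subject baselines
tod_eff = c(morning = 0, afternoon = 5, evening = 15, night = 10)
d$reaction = subj_int[d$subject] + 8 * d$day +
             tod_eff[as.character(d$timeofday)] + rnorm(nrow(d), 0, 15)

## time of day as an extra (crossed) group level with random intercepts
m = lmer(reaction ~ day + (day | subject) + (1 | timeofday), data = d)
summary(m)
```

With only four levels, the timeofday variance is hard to estimate (you may get a singular-fit message), which is exactly why fixed-effect dummies can be preferable here.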
I think the easiest way would be to use the readtext package. It allows you to read an entire folder ("my_doc_files/") or use wildcards ("my_doc_files/article*.txt"). cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html#microsoft-word-files-.doc-.docx
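A quick self-contained sketch with plain .txt files (the folder and file names are just for illustration; the same call works for .docx):

```r
library(readtext)

## create a small example folder with two text files
dir = file.path(tempdir(), "my_doc_files")
dir.create(dir, showWarnings = FALSE)
writeLines("first document", file.path(dir, "article1.txt"))
writeLines("second document", file.path(dir, "article2.txt"))

docs = readtext(file.path(dir, "article*.txt"))  ## wildcard import
docs$text
```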
If I understand you correctly, it is indeed possible to model a dependent variable with a tri-modal distribution with glm. Actually, you might not even need glm for that. Whether a distribution is multimodal is a separate issue from the distribution family: a tri-modal distribution might be a mixture of three normal distributions, three binomial distributions, etc. Take the following simulation as an example. Here we create a y variable that is affected by a continuous variable x and by a factor with three groups. Since there is a strong effect of the group on y, y ends up tri-modal.

## simulate 3-modal data
n = 1000
x = rnorm(n)
group = sample(1:3, n, replace=T)
group_means = c(5, 10, 15)
y = group_means[group] + x*0.4 + rnorm(n)
hist(y, breaks=50)

m1 = lm(y ~ x)
m2 = lm(y ~ as.factor(group) + x)
summary(m1)  ## bad estimate of x (should be around 0.4)
plot(m1, 2)  ## error is non-normal
summary(m2)  ## good estimate after controlling for group
plot(m2, 2)  ## error is normal after including group
Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on the Turkish language, and I don't know how to handle the stopword and stemming processes. There are compatible files for Turkish to work with through quanteda, but I don't know how to actually make them work. Could you please give some hints about that as well?
Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda. Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages are supported. Importantly, you also need to set a 'source' that stopwords uses. The default (snowball) doesn't support Turkish, but it seems nltk does:

library(stopwords)
stopwords_getsources()
stopwords_getlanguages(source = 'nltk')
stopwords('tr', source = 'nltk')

Similarly, for stemming quanteda uses SnowballC. Same kind of process:

library(SnowballC)
getStemLanguages()
char_wordstem("aslında", language = 'turkish')  ## the same should work for dfm_wordstem

So, not sure how well this works, but it does seem to be supported!
Hi, I'm trying to do this on reddit data but the files I have are too large (100gb+) for only 3 months of data. That's in .zst. Do you have any suggestions on how to deal with this and apply these techniques on this data set in R?
If your file is too large to keep in memory, the only option is to work through it in batches or by streaming. So the first thing to look into would be whether there is an R package for importing ZST files that allows you to stream it or select specific rows/items (so that you can get it in batches). But perhaps the bigger issue here is that with this much data you really need to focus on fast preprocessing, so that you'll be able to finish your work in the current decade. So first make a plan for what type of analysis you want to do, and then figure out which techniques you definitely need for it. Also, consider whether it's possible to run the analysis in multiple steps. Maybe you could first process the data just to filter it on some keywords, or to store it in a searchable database. Then you could do the heavier NLP lifting only for the documents that require it.
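The batch idea itself is simple in base R once you have a readable connection. For .zst you would still need a decompression step in front; this sketch just uses a plain text file as a stand-in:

```r
## write a stand-in file with 250 json-like lines
path = tempfile(fileext = ".jsonl")
writeLines(sprintf('{"id": %d}', 1:250), path)

con = file(path, "r")
n_kept = 0
repeat {
  batch = readLines(con, n = 100)   ## read 100 lines at a time
  if (length(batch) == 0) break
  keep = grepl('"id"', batch)       ## e.g. filter each batch on a keyword
  n_kept = n_kept + sum(keep)       ## ...or store the kept lines somewhere
}
close(con)
n_kept  ## 250
```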
Hi @67lobe, I don't think I discuss Word files in this tutorial. But I think the best ways are to use the 'readtext' package or 'antiword'. The readtext package is probably the best one to learn, because it provides a unified interface for various file types, like Word, PDF and CSV.
What if you have combinations of two different groups? For example, you measure blood pressure from volunteers after they drink a certain number of units of alcohol, and you do that in two different locations. So you want to fit a line per individual, but you also want to control for the location effect. Right?
You can certainly have multiple groups. First, you could have groups nested in groups. If you perform the same experiment in many countries across the world, your units would be observations nested in people (group 1) nested in countries (group 2). Second, you could have cross-nested (or cross-classified) groups. For example, say we want to study if the effect of more alcoholic beverages on blood pressure differs depending on the type of alcoholic beverage (beer, wine, etc.). In that case, each person could have observations for multiple beverages, and each beverage could have observations for multiple people.
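As a hypothetical lme4 sketch of the crossed case (simulated data, made-up numbers and variable names):

```r
library(lme4)
set.seed(42)

## every person drinks every beverage type at several doses (crossed groups)
d = expand.grid(person = factor(1:30),
                beverage = factor(c("beer", "wine", "spirits")),
                units = 0:4)
person_int = rnorm(30, mean = 120, sd = 8)       ## person baselines
bev_slope = c(beer = 2, wine = 1, spirits = 4)   ## beverage-specific effect
d$bp = person_int[d$person] + bev_slope[as.character(d$beverage)] * d$units +
       rnorm(nrow(d), 0, 5)

## person and beverage enter as crossed (not nested) random effects;
## a nested design would instead look like, e.g., (1 | country/person)
m = lmer(bp ~ units + (1 | person) + (units | beverage), data = d)
summary(m)
```

With only three beverage types the random-slope variance is hard to estimate (expect a singular-fit message); the point here is just the formula structure.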
@@kasperwelbers I see, thanks. I can imagine that having all these nested and cross-nested groups can complicate the model and its interpretation quite a lot.
Thank you for the informative text analysis videos. I am just a beginner in text analysis and R, and I started with your videos. I have a question about 12:13: kwic() needs tokens(), so I applied toks <- tokens(corp); k = kwic(toks, 'freedom', window = 5). Is that right?
Yes, you're correct. The quanteda API has seen some changes since this video was recorded. You can still pass a corpus directly to kwic, but it will now throw a warning that this is 'deprecated'. This means that at the moment it still works, but at some point in the (near) future it will be mandatory to tokenize a corpus before using kwic.
Hi Brian! LDA itself does not care about language, because it only looks at word occurrences in documents. Simply put, as long as you can preprocess the text and represent it as a document term matrix, you can apply LDA.
Hello Kasper, thanks for this great video. Just wondering where I can get the document/chapter where all the code is given — I mean the document from which you copied the code and pasted it into R. Please let me know.
Hi @Dr Dilsad. Sorry, it seems I only included the link in the first video (about GLMs). More generally, we maintain some R tutorials that we regularly use in education on this GitHub page: github.com/ccs-amsterdam/r-course-material . The multilevel one is under frequentist statistics. There is a short version in the "Advanced statistics overview" that I think is the one from this video, and also a slightly more elaborate one in the "Multilevel models" tutorial.
Again, thanks for the fine presentation. How about xpath? Have you considered covering that? I was hoping it would help with a table I was scraping, but I could not figure out what to hang my hat on. The website is very unusual: you can view a table (the one I would like to scrape), but the code returns a list of three tables, not one. What is annoying is that the HTML code has no distinct tags or marks to work with.
Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)". What can help with those types of annoyingly long paths is something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page, and it gives you either the CSS selector or the XPath.
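A tiny rvest illustration of that positional-selector idea (the HTML snippet is made up to match the example selector):

```r
library(rvest)

## a nav menu with no usable classes or ids on the target element
html = minimal_html('
  <div id="top-nav"><div><nav><ul>
    <li>Home</li><li>About</li><li>Contact</li>
  </ul></nav></div></div>')

html |>
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") |>
  html_text()
## "About"
```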
@@kasperwelbers Kasper, thanks for the information. It clearly takes a lot of experimenting. I wound up settling on these two code options for trying to extract the third table:

html_doc |>
  html_elements("table") |>
  html_table(header = TRUE) |>
  pluck(3)

pluck is from the purrr package (pull will not work here). Or using xpath:

html_doc |>
  html_elements(xpath = '//center[position() = 3]/table') |>
  html_table(header = TRUE)

The pluck method is more elegant in my mind, but xpath is clearly worth learning at one point or another. By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will. H
@@haraldurkarlsson1147 pluck indeed offers a nice solution here! There is certainly some value in learning xpath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. The xpath approach is probably slightly faster, but in web scraping the main speed bump is the HTTP requests, so in practice I think the difference in speed would hardly be noticeable.
Thanks for the video. Very clear explanations. I was wondering why you used days as a random effect slope, and didn’t add day:subject as an interaction term in your model?
Hi Michael. Good question. You could indeed fit something very similar to a multilevel model using dummies and interaction terms. Specifically, instead of random intercepts, we could have included fixed effects for every subject using dummy variables. And instead of random slopes, we could then have added interaction terms for these dummy variables with the 'days' variable. So more generally speaking, we could indeed have used fixed effects instead of random effects to model differences between subjects. There are some benefits to using random effects, though. Aside from not having your model cluttered with many dummies and interactions, using fixed effects eats up degrees of freedom. For example, if you used dummies for subjects, you could not add something like gender to the model, because there would be no df left at the subject level.
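To make the comparison concrete, here is a sketch using lme4's built-in sleepstudy data (which I believe is the same reaction-time example; if not, the structure is at least analogous):

```r
library(lme4)
data(sleepstudy)

## fixed-effects analogue: a dummy and an interaction term per subject
m_fixed = lm(Reaction ~ Days * Subject, data = sleepstudy)
length(coef(m_fixed))  ## 36 parameters for 18 subjects

## multilevel version: a handful of variance parameters do the same job
m_rand = lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(m_rand)
```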
@@kasperwelbers Ty for the reply, that clears up some of my confusion. I'm new to multilevel models, but have experience with multiple linear regression. I think I'm a bit confused about when one would choose to use a covariate as a random effect versus an interaction term as a fixed effect, as well as all the other possibilities. I kind of wish there was a complex enough dummy dataset where each possible scenario could be displayed with figures. For example: 1) response ~ var_1; 2) response ~ var_1:var_2; 3) response ~ var_1 + (1 | var_2); etc.
@@Maicolacola A great way to learn how these models work is actually to create your own dummy data. You can think of fitting models as trying to find the data generating process, so if you understand the model, you can generate data that will fit in a certain way. I should perhaps do a video on this, but here's a quick example:

## simulate data about the effect of doing homework on grade.
## students are nested in classes, which have different average grades, and
## a different effect (random slope) of doing homework.
n = 10000       ## students
groups = 100    ## classes
homework = rnorm(n, mean=10, sd=5)      ## simulate time spent on homework
group = sample(1:groups, n, replace=T)  ## assign random groups

## generate the random parts of the model
ri = rnorm(groups, mean=10, sd=3)  ## random intercepts (group level)
rs = rnorm(groups, mean=2, sd=1)   ## random slopes (group level)
e = rnorm(n, mean=0, sd=10)        ## individual level variance

## simulate the grade. Note that we can use the 'group' integers as index of ri and rs
grade = ri[group] + rs[group] * homework + e

## put the data together
d = data.frame(homework, grade, group)

## Now run a random intercepts+slopes model to see if we can recover the parameters
## we plugged in above (mean intercept is 10, mean homework slope is 2)
library(lme4)
library(sjPlot)
m = lmer(grade ~ homework + (1 + homework | group), data=d)
tab_model(m)
Kasper, just working through your tutorial this week and it is excellent. It is obviously some time since you did the video, and the code on the IMDb site has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.
Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!
Thanks! About the identity function: I think it's uncommon to use matrix notation for link functions because many are nonlinear. So I prefer to just think of the identity link more generally as an identity function. But maybe I'm missing something?
Thanks for the video! How do you define the documents for the corresponding president, such as Obama? Does R do it automatically? How? Thanks in advance.
Hi Zolzaya. The corpus function automatically adds any columns in the input dataframe (except the 'text' column) as document variables. So we do need to tell R that there is a president column, which here is done by importing a csv file that has this column. Hope that clarifies it!
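A minimal quanteda sketch of that mechanism (with made-up texts instead of the csv import):

```r
library(quanteda)

d = data.frame(president = c("Obama", "Trump"),
               text = c("yes we can", "make america great again"))

## the 'text' column becomes the document text; other columns become docvars
corp = corpus(d)
docvars(corp, "president")
## "Obama" "Trump"
```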
At 4:51 it's showing this message: "Error in textplot_wordcloud(dtm, max_words = 50) : could not find function "textplot_wordcloud"". I have all the relevant packages but am still getting this. Do you know why, and how to solve it?
Hi Sakif, that error message typically means that the package containing the function is not loaded. Are you sure you ran library(quanteda)? (In newer versions of quanteda, textplot_wordcloud has moved to the quanteda.textplots package, so you may also need to run library(quanteda.textplots).) Also, if you installed some packages afterwards, R might have restarted, in which case you might have to run that line again.