R: Regression With Multiple Imputation (missing data handling)

How best to treat missing data in linear regression analysis? The current view is that multiple imputation by chained equations (MICE) is one of the best ways to handle missing data in regression. This multiple imputation tutorial shows you how to use the mice package in R to analyze datasets with missing data (MCAR, MAR) in a regression framework.
Here is a current journal article giving theoretical background and specific recommendations regarding the use of multiple imputation for missing data:
Austin, P. C., White, I. R., Lee, D. S., & van Buuren, S. (2020). Missing data in clinical research: a tutorial on multiple imputation. Canadian Journal of Cardiology.
www.sciencedirect.com/science...
Companion webpage with the R code:
www.regorz-statistik.de/en/r_m...
Tutorial for checking regression assumptions with multiple imputation:
• Multiple Imputation an...

Published: 28 Nov 2022

Comments: 36

@Dr-Lex 1 year ago

THANK YOU for this video with clear audio! I have been searching all over for a reference example for handling simple regressions with mice(), and so many of the videos out there sound like they were recorded via laptop mics while standing right under an air conditioner. Clear and helpful, thank you again!

@kamarularifinkasim3138 10 months ago

Thank you so much for making this video. Your explanation and code are simple and clear, which makes them easy to understand and very helpful for my dissertation analysis, where I used the simulacrum dataset.

@DariaKoksal 1 year ago

Thank you very much for the video! Could you explain please how to save the complete file?

@RegorzStatistik 1 year ago

In my code example the dataframe with the completed data is called imp.datasets. You can save that as you would any other dataframe in R, e.g. with the write.csv() function.
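
A minimal sketch of what saving could look like, assuming the mice package is installed. It uses the nhanes example data that ships with mice instead of the video's data. The object name imp and the file name are placeholders, and imp.datasets is built here with complete(), which may differ from how it is built in the video's code.

```r
library(mice)

# Impute the example data shipped with mice (m = 5 imputed datasets).
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Stack all completed datasets in long format; the .imp column
# identifies the imputation number, .id the original row.
imp.datasets <- complete(imp, action = "long")

# Save it like any other data frame (the file name is a placeholder).
out_file <- file.path(tempdir(), "imputed_long.csv")
write.csv(imp.datasets, out_file, row.names = FALSE)
```

The saved file contains all m completed datasets stacked on top of each other, not a single "final" dataset.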

@elissamsallem688 1 year ago

Thank you for this video! If I want to impute missing values for only one categorical variable in a large dataset, what should I do?

@RegorzStatistik 1 year ago

The key question is which other variables to include in order to impute the categorical variable. You should at least include all variables you are going to use in your regression model.
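
As a rough illustration only, here is a sketch with simulated data in which a single categorical variable has missing values. The variables x1, x2 and group are hypothetical stand-ins for the variables of your regression model, and the defaults of mice() are used rather than any settings from the video.

```r
library(mice)

set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical three-level categorical variable related to x1.
group <- cut(x1 + rnorm(n), breaks = c(-Inf, -0.5, 0.5, Inf),
             labels = c("low", "mid", "high"))
group[sample(n, 40)] <- NA   # make 20% of the values missing
dat <- data.frame(x1, x2, group)

imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)
imp$method            # only 'group' gets an imputation method
imp$predictorMatrix   # row 'group' shows which variables predict it
```

Variables without missing values get an empty method and are used only as predictors. Which variables act as predictors can be changed through the predictorMatrix argument of mice().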

@malithapatabendige6541 1 year ago

Thanks for this! It is crystal clear up to pooling. However, I have 3 questions. 1. How can we get a final dataset with pooled results? The combine function gives a dataset with 10 or 20 cycles. Do we need to get one final pooled dataset? 2. If we have more than one variable with missing data, do we need to do the regression model for each of these? 3. Do we need to upload the full dataset with other non-missing variables for the MICE process?

@RegorzStatistik 1 year ago

1. With multiple imputation there is no pooled dataset. The results are pooled, not the datasets. 2. During imputation more than one variable can be imputed. 3. If you want to use other variables to help with imputation then you have to upload them.
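
For reference, "pooling the results" usually means applying Rubin's rules. The notation below is an assumption for illustration and is not taken from the video: let \hat{Q}_i be the estimate of a regression coefficient in imputed dataset i and U_i its squared standard error, for i = 1, ..., m.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right) B
```

The pooled estimate is \bar{Q} and its standard error is \sqrt{T}. The between-imputation variance B is what carries the uncertainty about the missing values into the final result, which is why a single imputed dataset cannot replace the pooled result.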

@malithapatabendige6541 1 year ago

@RegorzStatistik Thanks very much for your prompt reply. 1. It means we can select one of 5 (if m = 5) datasets with imputed values for the final analysis. Am I right? 2. What is the aim of 'pooling the results'? Is it to decide whether our assumptions are correct (MNAR or MAR)? 3. What if the pooled results contain statistically significant estimates? 4. Can we use random forest for this? Many thanks

@RegorzStatistik 1 year ago

@malithapatabendige6541 1.-3. No. MI has 3 steps:
Step 1: Imputing m datasets
Step 2: Running your analysis in each of your datasets - you don't choose one dataset but you use all of them. So you get m different regression results.
Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts.
I recommend reading an introductory journal article about MI to get a theoretical understanding of the procedure. I don't know if MI works with random forests.
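
The three steps can be sketched in a few lines of R. This is only an illustration, assuming the mice package and its bundled nhanes example data; the model chl ~ bmi + age and the choice m = 5 are picked for the example and are not the video's settings.

```r
library(mice)

# Step 1: impute m datasets.
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Step 2: run the same regression in every completed dataset.
fit <- with(imp, lm(chl ~ bmi + age))

# Step 3: pool the m sets of results into one result.
pooled <- pool(fit)
summary(pooled, conf.int = TRUE)
```

The table printed by summary() is the final result to report. No single completed dataset is selected at any point.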

@malithapatabendige6541 1 year ago

@RegorzStatistik Thanks. These 3 steps are clear. But nobody has mentioned how to 'interpret' pooled results and how to get the 'final imputed data' for the analysis of the original research. Basically, once it is pooled, which imputed dataset is to be selected out of the m sets? "Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts" - the next step has not been mentioned anywhere. It is unclear what we are supposed to do with the pooled result and where we can get one single dataset with imputed data to 'start' the original analysis.

@malithapatabendige6541 1 year ago

@RegorzStatistik I think I have to compare pooled estimates, p-values, F-statistic, etc., with each of the m datasets and get the BEST GUESS of the imputed dataset out of it. Thanks.

@andreapatrignani2026 4 months ago

Thank you very much. I have a question: why do you do the pooling on the model with imputed values instead of the complete dataset? Wouldn't it be better to also have the information from the non-imputed data in the model before pooling, so that you have better data for modelling and for the pooling afterwards?

@RegorzStatistik 4 months ago

Pooling is the 3rd step, after running the model in all imputed datasets (2nd step). "Imputed datasets" does not mean that they only contain the cases with missing values; those are completed datasets. You can see that at 0:10:09 in the video - the df of the regression result correspond to a regression with all cases.

@bornaloncar2458 8 months ago

Thank you, this is very informative. Could you point me to a source or clarify: 1. How is the regression meant to be set up if more than one item/variable is missing and you want to impute? Is the dependent variable in the regression model the only variable that gets imputed? 2. How do you obtain a table that combines imputed data and original data? Thank you!!

@RegorzStatistik 8 months ago

1. I don't have a source available. But MI works the same way whether one item is missing or more (in my example, there are rows with more than one item missing - so the dependent variable is not the only variable that gets imputed). 2. Only by combining those tables by hand (e.g. with tidyverse). However, that rarely makes sense because you don't have one imputed dataset! In my example you have 50 imputed datasets, so combining those 50 datasets with the original dataset would lead to something quite large and difficult to interpret.
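
As an aside, and not something shown in the video: the complete() function in mice has an include argument that stacks the original data on top of the completed datasets. The sketch below assumes the mice package and its bundled nhanes data, with m = 5 to keep it small.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# include = TRUE puts the original data (with NAs) first, as .imp == 0,
# followed by the completed datasets .imp == 1, ..., m.
stacked <- complete(imp, action = "long", include = TRUE)

table(stacked$.imp)   # number of rows per dataset
```

The caveat from the reply still applies: with many imputations the stacked table gets large, and it is mainly useful for inspecting imputed values, not as an analysis dataset.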

@EHJ599 10 months ago

Thank you very much for this clear and helpful tutorial! Interestingly, my imputed datasets consisted of fewer rows per variable than I expected (9 to be exact). Do you have any idea what happened and how to get R to impute all missingness? Thank you in advance :).

@EHJ599 10 months ago

P.S. I checked whether the number of imputations (m) or iterations made a difference. It did not, and neither did the seed or a change of methods.

@RegorzStatistik 10 months ago

Based on that information I don't know why that happened.

@solomonwafula311 1 year ago

What if I want to impute variables before using them in PCA? Regressions may not work. Kindly suggest how to handle that.

@RegorzStatistik 1 year ago

Maybe you could look into the package missMDA. There seems to be a function you can use for imputing a PCA (but I haven't used it yet). search.r-project.org/CRAN/refmans/missMDA/html/MIPCA.html

@shadens98 4 months ago

Super interesting video. Do you have any videos or tips on how to get the pooled results of MLR after MI using SPSS? I tried to do it, but for the important values I get either no pooled values or many missing values in the pooled output, so I can't report them properly.

@RegorzStatistik 4 months ago

Unfortunately, I don't know how to do it in SPSS.

@shadens98 4 months ago

@RegorzStatistik Thanks a lot for getting back to me so quickly! I will try it out with R. Is there something extra one must do if I am importing an already imputed data file from SPSS before I run the regression and pooled regression code there?

@RegorzStatistik 4 months ago

@shadens98 I only know how to do imputation completely in R, unfortunately.

@christoph3933 6 months ago

How about auxiliary variables? Are they not needed here?

@RegorzStatistik 6 months ago

I think in this case age is an auxiliary variable since it is not used in the regression model (but during imputation).

@666dazai 25 days ago

Hello, thank you for this video but I get this error and I could not figure out how to solve it: > imp.data

@RegorzStatistik 24 days ago

This looks to me that for some of the models the regression did not converge. However, I am somewhat astonished about "glm.fit" - I would expect that message in, e.g., a logistic regression, not in a linear regression.

@666dazai 22 days ago

@RegorzStatistik I used logreg as the imputation method for my variables as they are dichotomous. I suspect that is the reason.

@RegorzStatistik 22 days ago

@666dazai That could be the case - I am not sure whether that package works with logistic regression or not (haven't tried it yet).
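
For anyone who wants to try this, here is a hedged sketch using the nhanes2 example data bundled with mice, in which hyp is a two-level factor. The model hyp ~ bmi + chl is chosen only for illustration. With a dataset this small, glm.fit warnings are possible and would not necessarily indicate a problem with the code.

```r
library(mice)

# nhanes2 ships with mice; hyp is a two-level factor (no/yes).
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)
imp$method["hyp"]   # default method chosen for a binary factor

# Logistic regression in each completed dataset, then pooling.
fit <- with(imp, glm(hyp ~ bmi + chl, family = binomial))
pooled <- pool(fit)
summary(pooled)
```

If glm.fit messages appear in your own data, it may help to check whether they come from the imputation step or from the analysis model, for example by running the glm() call on a single completed dataset obtained with complete(imp, 1).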

@666dazai 22 days ago

@RegorzStatistik Alright, thank you for your answer!