
Dealing with MISSING Data! Data Imputation in R (Mean, Median, MICE!) 

Spencer Pao

Likes: 307 · Dislikes: 2 · Updated on 01-21-2023 11:57:17 EST
Annoyed with empty, NULL, or NA values? Confused as to what imputations are? Look no further! This is a comprehensive guide in understanding what imputations are and how to apply them!
Questions? Let me know down in the comments below!
R "mice" package documentation
cran.r-project.org/web/packag...
Additional instructional material
datascienceplus.com/imputing-...
Forgot what KNN's are?
• Applying and Understan...
What are Neural Networks again?
• Understanding and Appl...
Github:
github.com/SpencerPao/Data_Sc...
0:00 - What is Imputation?
0:29 - Mean Median Imputation Pro's and Con's
2:02 - Additional imputation methods
3:13 - Steps for MICE
5:05 - Code setup, understanding data
7:05 - Mean and median imputation
7:42 - MICE Implementation
13:40 - Extracting imputed data (by specific feature)
15:32 - Using ALL Imputed data + Interpretation
17:37 - Additional Steps
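
As a quick preview of the mean/median portion of the video, here is a minimal base-R sketch (toy data frame, made-up column names):

```r
# Toy data with missing values (column names are illustrative)
df <- data.frame(age    = c(23, NA, 31, 40, NA),
                 income = c(50000, 62000, NA, 58000, 61000))

# Mean imputation: replace NAs with the column mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Median imputation: replace NAs with the column median
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
```

As discussed in the video, this is simple but shrinks the variance of the imputed columns, which is part of the motivation for MICE.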

Science

Published: 6 Jul 2024

Comments: 101
@sudeyaren4221 · 1 month ago
I would like to say, you are doing a good job. You explain the topics in detail, show the code, and more, all in a short time. Thank you!
@MrMareczek111 · 2 years ago
I really appreciate your videos - I hope your channel will grow. Keep making this great content!!!
@khushisrivastava835 · 3 months ago
I am writing my undergrad thesis, and you saved me from the multiple breakdowns I have had over the entire last week!!!
@mukhtarabdi824 · 3 years ago
Very useful, thanks Spencer. Waiting to hear more from you.
@ej9432 · 1 year ago
This video is great. Keep up the great work buddy!
@principia1372 · 2 years ago
Fantastic video man, thank you so much
@justinarends5871 · 3 years ago
Loving the content, my dude! In this context, I would love to see you take it further and validate your imputing method as discussed at the end. Looking forward to more videos!!
@SpencerPaoHere · 3 years ago
I'm glad you are liking the content :) That is a good point you've made. I'll probably validate future imputations with future models that I'd like to go through and compare and contrast down the road.
@larrygoodnews · 3 years ago
@@SpencerPaoHere Hi, thanks for the video. May I know where I can get the dataset and code you used?
@SpencerPaoHere · 3 years ago
@@larrygoodnews As requested, here is a link to the code and data github.com/SpencerPao/DataImputation
@HangwHwang · 3 years ago
Thank you for your video. Extremely helpful. I just subscribed and look forward to learning more from you.
@SpencerPaoHere · 3 years ago
Hi! Thank you for your support! :)
@AndyA86 · 11 months ago
Excellent video thank you.
@ericpenichet7489 · 1 year ago
Excellent video!
@ShivSutradhar · 1 year ago
Thanks, brother, for solving my doubt.
@simwaneh1685 · 1 year ago
Thanks dear
@Philantrope · 4 months ago
This is very instructive - thank you! One question I have: how would you proceed when you have several thousand variables in the dataset? Would you do some prior feature selection in that case? Only with complete variables? All the best to you.
@yidong7706 · 2 years ago
This is exactly what I'm looking for, thank you! I've been searching forever to find ways to impute categorical variables and this video really helps to clarify the confusion. I was wondering if this method would still be effective if 1/3 of the dataset contains NAs or if you have any recommended methods for treating datasets with numerous NAs.
@SpencerPaoHere · 2 years ago
The more NAs in the dataset, the less effective the method is, unfortunately. If you have lots of NAs, perhaps eliminate the observations in your dataset that don't make the most sense, thereby decreasing the total number of observations. However, at the end of the day, go ahead and try it out! Compare your imputed observation accuracy against whichever target variable is in question.
@hoanggiangpham9312 · 4 months ago
Thanks for sharing. I have a question: why don't you use the finished_data_imputed after completion to do the logistic regression, instead of fitting it for each imputed dataset?
@lnsyrae · 1 year ago
Super helpful! Are there strategies that you recommend for evaluating the model fit after MI with MICE?
@SpencerPaoHere · 1 year ago
The typical train/val/test will go a long way when swapping the imputed datasets. You can get a general sense on how well a model performs on all of the datasets.
@renmarbalana1448 · 1 year ago
Thank you so much! This is of great help for my data mining course! One question though: how do I set a limit on the imputed values if I don't want negative values to be generated?
@SpencerPaoHere · 1 year ago
That depends. Are you expecting negative values? If not, then you may have to do some data cleaning in your original dataset. There are some unorthodox ways of imputation where you can perhaps set a boundary for imputations. But the method behind it is sort of a mystery to me. I attached a link which may be more of use to you. stats.stackexchange.com/questions/116587/multiple-imputation-introduces-negative-values-dataset-still-valid
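One documented way to bound imputations inside mice itself (not shown in the video) is its post-processing hook combined with squeeze(). A sketch, using mice's bundled nhanes demo data; the [0, Inf) bound on chl is purely illustrative:

```r
library(mice)

imp0 <- mice(nhanes, maxit = 0)   # dry run just to grab the default 'post' vector
post <- imp0$post
# Clamp every imputed 'chl' value into [0, Inf) after each iteration
post["chl"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, Inf))"

imp <- mice(nhanes, post = post, m = 5, seed = 1, printFlag = FALSE)
min(complete(imp, 1)$chl)         # no negative imputations
```

Note that clamping can distort the imputation distribution, which is why the linked Stack Exchange thread treats it with caution.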
@ThomasMesnard · 1 year ago
Thank you so much for your video! Question: how do you create a new dataset with the estimated values of missing data from the regression? It seems like the last step is missing at the end of your video. It would be very useful to know how you deal with that. Thanks!!
@SpencerPaoHere · 1 year ago
Are you referring to 14:00 in the video? Line 80 is doing just that. (new dataset with the estimated values of missing data)
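For readers without the video open, that extraction step looks roughly like this (mice's bundled nhanes data stands in for the video's dataset):

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# New dataset with the estimated values filled in from imputation #1
finished_data_imputed <- complete(imp, 1)
anyNA(finished_data_imputed)   # FALSE once every incomplete column is imputed
```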
@briabrowne2928 · 11 months ago
Thank you for a great video! I wanted to ask if there is a rule of thumb for choosing which cycle to use when running the MICE imputation? You used the first cycle, but was there a particular reason or not?
@SpencerPaoHere · 10 months ago
No rule of thumb in particular!
@irinavalsova3268 · 2 years ago
Perfect explanation. Do you tutor? I need a specific task to complete on a dataset.
@SpencerPaoHere · 2 years ago
I'm flattered :3 I don't tutor at the moment; however, if you have any "quick" questions, I do regularly answer comments on all my videos for free.
@nithidetail7187 · 2 years ago
@Spencer: Appreciate it! Good video for kickstarting MICE concepts. The Git path URL is not working, please update.
@SpencerPaoHere · 2 years ago
Hmm. The link github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations works for me.
@asrarmostofa818 · 2 years ago
Helpful for me and o......
@elissamsallem688 · 1 year ago
Thank you for this video, really helpful! What if I want to impute only 1 categorical variable in the dataset? How can we do it? And based on what do we choose the column for the final dataset? Can we pool the columns? Thank you
@SpencerPaoHere · 1 year ago
1) You could just extract that 1 specific imputed column from the imputed dataset and replace it to the raw dataset. 2) depending on your use case, you can use whichever column you so desire to be imputed (typically when you want a filled column) 3) Pooling columns? As in merge 2 different columns together? I mean you could.....
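Option (1) can be sketched like this; nhanes2 (bundled with mice) stands in for the commenter's data, with its factor column hyp playing the single categorical variable:

```r
library(mice)

imp       <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)

raw <- nhanes2
# Overwrite only the missing entries of the one categorical column,
# leaving every other column of the raw data untouched
na_rows <- is.na(raw$hyp)
raw$hyp[na_rows] <- completed$hyp[na_rows]
```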
@felixangulo4677 · 2 years ago
Hey Spencer, awesome video on multiple imputation. Question: is it possible to run a MANOVA on the pooled imputed datasets and obtain a pooled parameter estimate? I followed your video all the way through to the 15:40 mark, and then attempted to run a MANOVA on the pooled data set but I'm running into some difficulties/errors. I'm basically trying to impute missing data and then run a MANOVA (or a repeated measures ANOVA) on the imputed datasets in order to obtain a pooled parameter estimate. I'm using two categorical (binary) predictor variables and two continuous dependent variables for my model. I normally use SPSS, but unfortunately SPSS doesn't allow to run general linear model tests on imputed data (or at least doesn't provide a parameter estimate of the pooled datasets).
@SpencerPaoHere · 2 years ago
Thanks! Yes. You should be able to run the MANOVA on the imputed datasets. However you will have to run the algorithm on each set individually. Then aggregate the model results thereafter.
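A sketch of "run the algorithm on each set individually, then aggregate", with lm() standing in for the commenter's MANOVA (mice's pool() does not digest manova objects directly, which is consistent with the error in the follow-up comment):

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Fit one model per completed dataset
fits <- lapply(seq_len(imp$m), function(i) {
  lm(chl ~ bmi + age, data = complete(imp, i))
})

# Naive aggregation: average the coefficients across the five fits
rowMeans(sapply(fits, coef))
```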
@felixangulo4677 · 2 years ago
@@SpencerPaoHere I was successfully able to run a Manova for each of my imputed datasets (m= 10). As an example, here’s the code I used to run the Manova for the 10th imputed dataset: model.10 = manova(Anxiety ~ Treatment + DepStatus + Treatment_X_DepStatus + BLanx, data = finished_imputed_data10). I’m still however having trouble aggregating the results, and I’m not sure if the code I’m using is correct. Here’s the code I’m using for this: pooled_model = with(imputed_data, manova(Anxiety ~ Treatment + DepStatus + Treatment_X_DepStatus + BLanx)). Does this seem correct? Here’s the error message I’m receiving: Error in (function (cond) : error in evaluating the argument 'object' in selecting a method for function 'summary': Problem with `summarise()` column `qbar`. ℹ `qbar mean(.data$estimate)`.✖ Column `estimate` not found in `.data`. ℹ The error occurred in group 1: term = Treatment. In addition: Warning message: In get.dfcom(object, dfcom) : Infinite sample size assumed.
@SpencerPaoHere · 2 years ago
​@@felixangulo4677 Hi! Yes. Try saving the model weights! (a model for each imputed dataset) And, once you have the model weights, you can ideally aggregate or do some form of model aggregation. You can also choose which model to go with based on best model performance. I did a video on just this: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-6pw9IDFxWFM.html
@kingraidi1578 · 1 year ago
Hey Spencer, really appreciated the video, and I am not sure if you will see this comment since the video is kind of old. However, I have two questions: 1. Regarding the way that we use the created imputations: in the first variant you just selected one of the datasets with imputed values. This means that basically, even though we are using mice, it is a single imputation method at the end of the day, right? So is this a viable method to use? 2. In the second variant you pooled all the imputations, creating a mix of all the imputed datasets (so a real multiple imputation method, I guess). I would like to replicate that, but I am not sure how to. I conducted a survey within a business (for my thesis) and I have various variables that are either interval or categorical. However, basically all of them are independent variables.
@SpencerPaoHere · 1 year ago
No worries! I am still active on this channel. It might just be a matter of comment volume, but I'll probably get around to it :) 1) Yes. I used a single dataframe that was imputed via the MICE method. 2) You can also copy and paste my code from github ! github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations Other than that, I am unsure what the question is.
@mustafa_sakalli · 2 years ago
So, for categorical values "rf" is good; what about for numerical ones? I have a dataset with a mixture of both categorical and numerical values. I want to apply rf for the categorical ones and something different for the numerical ones. Is that OK as well?
@SpencerPaoHere · 2 years ago
You can still use Random forest for a dataset that has numerical and categorical features! It would typically come down to comparing your model output to see which one performs the best and decide from there.
@mohammedabdulkhaliq8746 · 2 years ago
Hello Spencer, thank you for the tutorial. By any chance, do you know which package includes least squares imputation? Thanks in advance.
@SpencerPaoHere · 2 years ago
Least squares? That's built into the RStudio framework. Check out lm(). EDIT: for least squares imputation, check out the pcaMethods package.
@mohammedabdulkhaliq2644 · 2 years ago
@@SpencerPaoHere Thank you for the reply. Is there an equivalent Python package?
@SpencerPaoHere · 2 years ago
@@mohammedabdulkhaliq2644 Pandas has an interpolate function per column. pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
@PoetenfranLevanten · 3 years ago
Very helpful Spencer, thank you! I have a question regarding the difference between the complete function you used to finish the dataset and the pool function. Does the finished dataset combine all 5 imputations, and can I use the finished dataset in other statistical programs like SPSS and Jamovi and still claim that the missing values have been imputed by multiple imputation? Or do I always have to use the pool function?
@SpencerPaoHere · 3 years ago
Hi! I'm glad you liked it! :) The pool function averages the estimates of the complete-data model (and outputs a variety of statistics related to those features). The complete function fills in the missing values with the imputed values. AND yes, this function imputes ALL of the features based ONLY on the 1 imputed dataset that you have chosen. See complete(data, m), where m represents which imputed dataset you want to plug into your original dataset. You can use the finished dataset with other programs. All you need to do is write the file to a CSV and load the file into a different program.
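The contrast between the two functions, in sketch form (nhanes again stands in for the video's data, and the CSV filename is made up):

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# complete(): pull ONE filled-in dataset (here the 2nd of the 5),
# which can be exported to SPSS/Jamovi as a plain CSV
one_dataset <- complete(imp, 2)
write.csv(one_dataset, "imputed_m2.csv", row.names = FALSE)

# pool(): fit the model on all 5 imputed datasets via with(),
# then average the estimates across them (Rubin's rules)
fit <- with(imp, lm(chl ~ bmi + age))
summary(pool(fit))
```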
@PoetenfranLevanten · 3 years ago
@@SpencerPaoHere thanks for the clear answer. Is it possible to carry out an EFA (exploratory factor analysis) utilising the pooled function? If yes, is it possible for you to illuminate us with a tutorial on how to carry it out? :). Thanks in advance
@SpencerPaoHere · 3 years ago
​@@PoetenfranLevanten Hmm. I have done a video on Factor Analysis. You can check that out! The only thing different you would do is to apply the pooling function on whatever data you might have. THEN, go ahead and utilize the FA function as noted.
@PoetenfranLevanten · 3 years ago
@@SpencerPaoHere Thanks, what would the code look like if I apply the pooling function? I tried to do that on a dataset I have but received an error message in R.
@SpencerPaoHere · 3 years ago
@@PoetenfranLevanten Hi! you'd want to apply the pool function on a model object. i.e pool(fit) It won't work if you plug in a dataset.
@atthoriqpp · 1 year ago
Hi, thanks for the video. It was helpful to further reinforce my learning on missing values! But I have a question. You said that if the missing values in a variable are more than 20%, it's better to drop them. I recently experienced this particular scenario, and if I were to MICE impute the value, is it better than dropping them? Or are dropping the missing values better because it crosses the 20% threshold of missing values tolerance? Oh, and one other thing, when choosing the imputed data result (from 1 to 5), what is the basis for determining which is better?
@SpencerPaoHere · 1 year ago
When you have more than 20% of the rows missing for a column, it might be best to drop it, since the imputations are largely based off the rest of the rows. So you may be getting garbage imputations (it would be interesting to see, on a simulated smaller chunk of data, if costs were a factor). The basis on which imputed dataset to choose is somewhat arbitrary. I'd just do batch inferencing to see which gets better predictive results, though the resulting differences can be more or less negligible.
@atthoriqpp · 1 year ago
​@@SpencerPaoHere Well said. But what if the column is essential for the analysis?
@SpencerPaoHere · 1 year ago
@@atthoriqpp Haha, yeah. You're going to need more data. Or, you can do a 'hail mary' and see how the imputations fare. (A lot of testing will be needed to ensure you are confident about the results.)
@atthoriqpp · 1 year ago
@@SpencerPaoHere Thanks for the answer, Spencer!
@kalemguy · 1 year ago
Thank you, this is what I need. Is it possible for you to also discuss model-based treatment of missing data in R using the mdmb package? How do you think this method compares to MICE?
@SpencerPaoHere · 1 year ago
The mdmb package seems brand new. (10/13/2022) -- I am unfamiliar with it. Though from taking a peek at the documentation, it seems that the mdmb package uses MLM and/or Bayesian estimation. This is definitely more "narrow" than MICE, which has many more models to choose from. However, you may get better results? Not sure. You'd have to try it on your data.
@kalemguy · 1 year ago
@@SpencerPaoHere Thank you for your explanation. May I know what percentage of missing data is acceptable for MICE, or for other imputation methods?
@SpencerPaoHere · 1 year ago
@@kalemguy I think if ~20% of your rows have a missing value, then you can probably impute. However, if you are missing 80% of your data, it might be advisable to just drop the feature altogether.
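The ~20% rule of thumb is easy to check per column. A base-R sketch on the built-in airquality dataset, where Ozone is roughly 24% missing and so falls over the threshold:

```r
# Fraction of missing values in each column
miss_frac <- colMeans(is.na(airquality))
miss_frac

# Keep only columns at or under the 20% threshold (drops Ozone here)
cleaned <- airquality[, miss_frac <= 0.20]
```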
@kalemguy · 1 year ago
@@SpencerPaoHere Thank you very much for your insight...
@abhijitjantre8427 · 2 years ago
When I executed the md.pattern command, I got the table but not the plot that you show in the YouTube video. Please share how to get the plot.
@SpencerPaoHere · 2 years ago
My code is located here: github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations
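In current versions of mice, md.pattern() both prints the pattern table and draws the missingness grid when plot = TRUE (the default), so if only the table appears it may be a plotting-device issue. A minimal call:

```r
library(mice)

# Prints the missingness pattern table and draws the grid in the plot pane
md.pattern(nhanes, plot = TRUE, rotate.names = TRUE)
```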
@markelov · 1 year ago
Hello! Loved this video! I am interested in using MICE’s defaultMethod approach. I recognize that doing so requires that objects be specified as the appropriate data type. I recently received a CSV of variable names and only the numerical values assigned. As a result, R is reading everything in as numeric. I understand that I could change these variables one by one when importing in the preview pane, but I’d rather do it through code. Is there a more efficient way to specify variable types than on an individual basis (example below)? dataframe$object1
@SpencerPaoHere · 1 year ago
I believe you don’t even need to cast your objects to be a certain type. Try using the tidyr and dplyr package. You can define the features as a specific data type where you don’t have to cast over when reading the data.
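A sketch of the bulk conversion with dplyr; the column names are made up, standing in for the commenter's numeric-coded survey variables:

```r
library(dplyr)

# Toy stand-in for the all-numeric CSV (names are illustrative)
df <- data.frame(treatment  = c(0, 1, 1, 0),
                 dep_status = c(2, 1, 2, 2),
                 score      = c(3.2, 4.1, 2.8, 3.9))

# Convert several numeric-coded columns to factors in one shot,
# instead of casting them one by one
df <- df %>% mutate(across(c(treatment, dep_status), as.factor))
str(df)   # treatment and dep_status are now factors
```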
@markelov · 1 year ago
OK, will do! Thank you! I just have two follow-up questions if possible: 1. I got an error about logged events. My reading thus far tells me that these essentially get at perfect prediction, and that R will not impute those values when this occurs. What I am confused about, though, is that (a) the number of logged events reported is greater than the number of missing values and (b) looking at my imputed dataset shows no NA values. Do you have any read on these two pieces? 2. As opposed to pooling estimates and whatnot across all the imputed datasets, is it permissible from a statistical standpoint to select one of the imputed datasets to use for all analyses (as opposed to the pooling procedure that you did here for the regression)?
@SpencerPaoHere · 1 year ago
@@markelov 1) There can be a multitude of factors that cause the issues of logged events. Try printing out the loggedEvents of your MICE object. (Variable$loggedEvents) -- That might give a hint. Also, sometimes in a dataframe, NA values can be "null" values as well. So, you may need to run a few different checks to see if the null values actually do exist in the data. 2) You could just use one imputed dataset but it might not be a representative of all data. Pooling gives you a wider band and a concentration of outcomes in an area.
@markelov · 1 year ago
Thank you!
@TheFabricioosousaa · 2 years ago
Hi! I am using MICE to deal with my missing values. Instead of using the complete() function and choosing one number (between 1 and 5), I would like to combine these 5 options. I think that is what you have done using the with() function, but I couldn't understand the arguments there. Or does the step at 15:23 using the with() function have nothing to do with MICE imputation, and is it just another (and new) way of imputation? Thanks!
@SpencerPaoHere · 2 years ago
Yep! In layman terms, the with() is inserting each of the imputed dataframes to the glm() model. As a result, at 15:23 or so, you will have 5 glm models where each model builds off of each individual dataframe. You can then do an average of the weights (or provide standard errors with the model weights) and evaluate predictions etc..
@TheFabricioosousaa · 2 years ago
@@SpencerPaoHere Thanks for the answer! :) One last question: after using the function with(), you just used the function plot() to get some information and analyse it. But wasn't I supposed to be able to extract the "new" values to substitute my NAs using the with() function too? (The "combined" values)
@SpencerPaoHere · 2 years ago
@@TheFabricioosousaa Can you provide a timestamp? (or a line of code); But you can think of it as having different models for similar but unique datasets.
@TheFabricioosousaa · 2 years ago
@@SpencerPaoHere For example, if I have this: #INSTALL AND LOAD MICE install.packages("mice") library(mice) #IMPORT DATASET: library(readxl) data_md
@SpencerPaoHere · 2 years ago
@@TheFabricioosousaa Yep! And that should occur after your data_imp variable -- imputation should occur. (Try printing it out to see if that is what you are looking for from an imputation POV.) Then, you will use the imputed dataset(s) for your training/testing process.
@mahmoudmoustafamohammed5896 · 2 years ago
Hello Spencer, thank you so much for your video and explanation :) I have a small question: I am using this code: data_imp
@SpencerPaoHere · 2 years ago
Hmm. Yeah. It must be related to your input data. Try to one hot encode your categorical variable and see what happens.
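One NA-preserving way to one-hot encode a factor before imputation (toy data, made-up column names; a logical comparison keeps NA as NA, which the imputer needs to see):

```r
df <- data.frame(color = factor(c("red", "blue", NA, "red")),
                 value = c(1.2, 3.4, 2.2, NA))

# Manual dummy columns: an NA in 'color' stays NA in every dummy,
# so the missingness survives for the imputation step
df$color_red  <- as.integer(df$color == "red")
df$color_blue <- as.integer(df$color == "blue")
```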
@mahmoudmoustafamohammed5896 · 2 years ago
@@SpencerPaoHere I tried it and I got the same problem unfortunately. I tried to create a new dataset of these categorical variables and I imputed them and it works. But when I impute them with the main dataset I get this problem :(
@SpencerPaoHere · 2 years ago
@@mahmoudmoustafamohammed5896 What's the stack trace? There might be an issue with your other variables, perhaps. It's strange that one set of features doesn't work but the others do? And combined?
@kaili3477 · 2 years ago
Hi Spencer, could you explain the arguments in the mice command? 1. You said 'm' is the number of cycles? I'm a bit confused about that. I googled it, and it said it's the number of imputations; I don't understand what that means, since imputation is simply replacing the missing values. What happens if you increase the number 'm' compared to decreasing 'm'? Why do we generally use '5'? 2. What is the 'maxit' in the mice command? I always thought that was the number of cycles, since it's the number of iterations. 3. Do you also understand the 'seed' in the mice command? It affects the random number generator, I believe, but what happens if you increase or decrease it? Sorry, I know you didn't use 'maxit' or 'seed', but I'm having trouble understanding the R documentation's explanation.
@SpencerPaoHere · 2 years ago
1) The 'm' term refers to the number of times you want to impute your dataset. I was using the term "cycles" colloquially. So, if m = 5, you are expecting 5 different datasets with different imputed values. (more or less) 2) maxit : (in the mice package) just refers to the number of iterations taken to impute missing values. This is related to whichever objective function you utilize and thus uses the maxit as the upper ceiling for its iterations. 3) The "seed" is a deterministic number generator. It doesn't matter what value you use for seed(numeric) as long as the numeric is consistent among all your experiments. Hope that helps!
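The three arguments side by side, on mice's bundled nhanes data:

```r
library(mice)

# m: how many imputed datasets to produce; maxit: iterations of the
# chained-equations sampler per dataset; seed: fixes the random draws
# so the whole run is reproducible
imp <- mice(nhanes, m = 5, maxit = 10, seed = 42, printFlag = FALSE)

imp$m                          # 5
length(complete(imp, "all"))   # a list of the 5 completed datasets
```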
@kaili3477 · 2 years ago
That helps a lot, I understand the video a lot more now! Thank you! I know you learned a lot at post-secondary, but I'd love a video explaining your background, and tips/tools to self-learn to other people (if you have any)! We all want to be experts like you one day.
@SpencerPaoHere · 2 years ago
@@kaili3477 haha thanks I appreciate that. Maybe when this channel gets bigger, I can do a video autobiography of some sorts.
@tsehayenegash8394 · 2 years ago
If you know it, please share the MATLAB code for MICE.
@SpencerPaoHere · 2 years ago
Perhaps this might help you? www.researchgate.net/post/I-am-looking-for-a-Matlab-code-for-Multiple-imputation-method-for-missing-data-analysis-can-anybody-help-me
@tsehayenegash8394 · 2 years ago
@@SpencerPaoHere I appreciate your help.
@dgeFPS · 2 years ago
Everything's helpful, but the audio quality killed me.
@SpencerPaoHere · 2 years ago
In my later videos, I handled the background noise. Sorry for that!
@raihana3376 · 1 year ago
The MICE package does not work: Warning in install.packages : unable to access index for repository YOUR FAVORITE MIRROR/src/contrib: unable to open URL 'YOUR FAVORITE MIRROR/src/contrib/PACKAGES'
@SpencerPaoHere · 1 year ago
Strange. I've run install.packages("mice") and the library installed fine. Then run library(mice) to load the package into your environment. Perhaps updating your RStudio might do the trick?