That's correct: "missing value imputation" means that you are replacing missing values (also known as "null values") in your dataset with your best approximation of what values they might be if they weren't missing. Hope that helps!
Thanks for all the videos, they helped me a lot. However, I searched on Google for a long time but could not find an answer to my problem. I am trying to fill missing values using other columns. I mean, there are some missing values for the car body type, but there is information about the body type in another column.
Awesome video! I was wondering if you can share how the process works behind the scenes for cases where we have rows with multiple columns that are null, with respect to the iterative imputer that builds a model behind the scenes. I understand the logic when we only have a single column with null values but can't wrap my head around what will be assigned as training and test data if we have multiple columns with null values. Looking forward to your response. Thanks
Thanks for posting this. For features where there are missing values, should I be passing in the whole df to impute the missing values, or should I only include features that are correlated with the dependent variable I'm trying to impute?
This goes back to the bias-variance tradeoff. If you are adding hundreds of other columns that are likely to be uncorrelated, I would suggest not doing that, since it will likely overfit the data. You could use the parameter "n_nearest_features", which makes IterativeImputer use only the top "n" features to predict the missing values. This could be a way to include all the columns from your entire DataFrame while still minimizing the increase in variance.
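A minimal sketch of the suggestion above, limiting IterativeImputer to the most predictive features via n_nearest_features (the data here is randomly generated, just for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[::7, 0] = np.nan  # introduce some missing values in the first column

# Only the 3 features most correlated with each target column are used
# as predictors, limiting variance when many columns are irrelevant.
imputer = IterativeImputer(n_nearest_features=3, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # no missing values remain
```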
Great question! IterativeImputer was inspired by MICE. More details are here: scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
Hi, I tried encoding my categorical variables (a boolean value column) and then running the data through a KNNImputer, but instead of getting 1's and 0's I got values in between, for example 0.4, 0.9, etc. Is there anything I am missing, or is there any way to improve the prediction of this imputer?
Great question! I don't recommend using KNNImputer in that case. Here's what I recommend instead: (1) If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. (2) If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
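The second approach above can be sketched as a pipeline like this (the column and fill value are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with a missing value
X = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# Impute a constant "missing" category, then one-hot encode the result
pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder()
)
X_encoded = pipe.fit_transform(X)
print(X_encoded.toarray().shape)  # (4, 3): columns for blue, missing, red
```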
@@dataschool In that case, can we interpret the results (0.4, 0.9) as the probabilities of those values being 0 or 1? Does it make sense to assign a threshold like 0.5, and transform values below it to 0 and above it to 1?
Great question! KNNImputer can't be used for strings (categorical features), but you can use SimpleImputer in that case with strategy='constant' or strategy='most_frequent'. Hope that helps!
With KNNImputer, the features have to be numeric in order for it to determine the "nearest" rows. That is separate from using KNN with a classification problem, because in a classification problem, the target is categorical. Hope that helps!
Hey Kevin, quick question... should k in KNN always be odd? If yes, then why, and if no, then why not? I was asked this in an interview... Thanks for all your content.
Great question! For KNNImputer, the answer is no, because it's just looking at other numeric samples and averaging them (there is never a "tie"). For KNN with binary classification, then yes an odd K is a good idea in order to avoid ties. Hope that helps!
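A tiny sketch of why ties don't arise: KNNImputer fills a missing entry with the average of that feature across the k nearest rows, so any k (odd or even) produces a single number. The data here is made up:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [2.0, np.nan]])

# The missing value is replaced by the mean of the second column
# across the 2 nearest rows (by distance on the non-missing features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```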
Are you generally performing your imputation prior to any feature selection, or after ? I always see mixed reviews about performing it before and after..
Hello! Thank you very much for your interesting video! Do you know where I can find a video like this one on how to choose the number of neighbors? Thank you very much
Great question! For categorical features, you can use SimpleImputer with strategy='most_frequent' or strategy='constant'. Which approach is better depends on the particular situation. More details are here: scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
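A short sketch comparing the two strategies on a toy string column (values invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)

# 'most_frequent' fills with the mode of the column ('red' here)
mf_filled = SimpleImputer(strategy='most_frequent').fit_transform(X)
print(mf_filled.ravel())  # ['red' 'blue' 'red' 'red']

# 'constant' fills with a dedicated category of your choosing
const_filled = SimpleImputer(strategy='constant', fill_value='missing').fit_transform(X)
print(const_filled.ravel())  # ['red' 'blue' 'red' 'missing']
```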
My idea: make a line plot of the columns that have null values against other continuous columns, and a box plot against discrete ones, then impute a constant value according to the result of this process. For example, if Pclass is 2, you impute the median fare of Pclass 2 wherever Fare is missing and Pclass is 2. Basically similar to IterativeImputer, only manual work: slow, but maybe better results because of human knowledge about the problem statement. What are your thoughts on this idea?
It's an interesting idea, but a manual process is probably not practical for any large dataset, and it's definitely impractical for cross-validation (since you would have to do the imputation after each split). In general, any "manual" process (in which your intervention is required) does not fit well into the scikit-learn workflow. Hope that helps!
@@dataschool I meant finding the values in an exploratory way and then using them as constants in SimpleImputer in a pipeline during cross-validation and evaluation. A custom transformer could also be created that does the imputation according to fitted values, e.g. during transformation, find similar records and then use their median. But then that's pretty much the same as KNNImputer and IterativeImputer.
Sure, you could probably do that using a custom transformer. Or if you think you could make a strong case for this functionality being available in scikit-learn, then you could propose it as a feature request!
@@dataschool I am not sure whether this is the correct platform, but I have written a library named custom_transformers which contains transformers for handling dates, times, nulls, outliers, and some other commonly needed custom transformers. If you have time, I would greatly appreciate your feedback. This is the notebook on Kaggle demonstrating the use of the library: www.kaggle.com/susmitpy03/demonstrating-common-transformers I intend to package it and publish it on PyPI.
In the example we have only one missing value, so the imputer has an "easy" mission. What if we had more than a few missing values in this column/feature, and we were facing "randomly" missing values across different columns/features? How does the imputer decide which column to impute first? Does it then, based on that filling, advance to the "next best" column and fill in its missing values, and so on?
Great question! As I understand it, IterativeImputer starts by filling all missing values with a simple initial strategy (the mean, by default). It then cycles through the columns that have missing values, by default starting with the column that has the fewest, and models each one as a regression problem using the current (partly imputed) values of the other columns as inputs. It repeats this round-robin process for several passes, so imputed values do get refined using other imputed values, and the order can be controlled with the "imputation_order" parameter. I can't say with certainty how it handles every possible edge case because I haven't studied the imputer's code, but I hope that helps!
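A minimal sketch of IterativeImputer handling missing values scattered across more than one column at once (the data is made up: the second column is roughly 10x the first, and the third roughly 100x):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 10.0, 100.0],
              [2.0, np.nan, 200.0],
              [3.0, 30.0, np.nan],
              [4.0, 40.0, 400.0]])

# Missing values in both the second and third columns are filled:
# an initial mean fill, then round-robin regression passes refine them.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```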
What if the first column has a missing value? It is a categorical feature, and it would be better if we used multivariate regression. It has 0 or 1, but if we use KNNImputer or IterativeImputer, it imputes a float value. I think there's the same question as mine in the comments.
If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
What is the effect on the dataset after imputation? Any bias or something? I understand it's a mathematical way to insert a value into a NaN, but I feel there must be some effect from this action. So when do we need to remove NaNs, and when do we need to use imputation?
If the percentage of NaNs in a column is more than 50%, we should eliminate the column; otherwise, we should impute it using univariate methods like SimpleImputer or the multivariate methods mentioned by the author.
In my opinion, if you are imputing with a method like the median, you can impute the missing values first, but if you are imputing with a method like the mean (which outliers will affect), it is better to remove outliers first.
You can use SimpleImputer instead, with strategy='most_frequent' or strategy='constant'. Here's an example: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb Hope that helps!
I can't think of a realistic example where KNNImputer is better than IterativeImputer; IterativeImputer seems much more robust. Am I the only one thinking this?
The "no free lunch" theorem says that no one method will be better than the others in all cases. In other words, IterativeImputer might work better in most cases, but KNNImputer will surely be better in at least some cases, and the only way to know for sure is to try both!
Thank you for such an amazing video! I encoded my categorical data as numbers and then ran KNNImputer, but it's giving me an error: TypeError: invalid type promotion. Any insights into what might be going wrong?
I'm not sure, though I strongly recommend using OneHotEncoder for encoding your categorical features. I explain why in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-yv4adDGcFE8.html Hope that helps!