That's correct: "missing value imputation" means that you are replacing missing values (also known as "null values") in your dataset with your best approximation of what values they might be if they weren't missing. Hope that helps!
Thanks for all the videos, they helped me a lot. However, I searched on Google for a long time but could not find an answer to my problem. I am trying to fill missing values using other columns. I mean, there are some missing values for the car body type, but there is information about the body type in another column.
Awesome video! I was wondering if you can share how the process works behind the scenes for cases where we have rows with multiple columns that are null, with respect to the iterative imputer that builds a model behind the scenes. I understand the logic when we only have a single column with null values but can't wrap my head around what will be assigned as training and test data if we have multiple columns with null values. Looking forward to your response. Thanks
Thanks for posting this. For features where there are missing values, should I be passing in the whole df to impute the missing values, or should I only include features that are correlated with the dependent variable I'm trying to impute?
This goes back to the bias-variance tradeoff. If you are adding hundreds of other columns that are likely to be uncorrelated, I would suggest not doing that, since it will likely overfit the data. You could use the parameter "n_nearest_features", which makes IterativeImputer use only the top "n" features to predict the missing values. This could be a way to include all the columns from your entire DataFrame while still minimizing the increase in variance.
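A minimal sketch of the suggestion above, limiting IterativeImputer to the most predictive features via n_nearest_features (the data here is randomly generated, just for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[::7, 0] = np.nan  # introduce some missing values in the first column

# Only the 3 features most correlated with each target column are used
# as predictors, limiting variance when many columns are irrelevant.
imputer = IterativeImputer(n_nearest_features=3, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # no missing values remain
```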
Great question! IterativeImputer was inspired by MICE. More details are here: scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
Hi, I tried encoding my categorical variables (a boolean value column) and then running the data through a KNNImputer, but instead of getting 1's and 0's I got values in between, for example 0.4, 0.9, etc. Is there anything I am missing, or is there any way to improve the prediction of this imputer?
Great question! I don't recommend using KNNImputer in that case. Here's what I recommend instead: (1) If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. (2) If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
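The second approach above can be sketched as a pipeline like this (the column and fill value are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with a missing value
X = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# Impute a constant "missing" category, then one-hot encode the result
pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder()
)
X_encoded = pipe.fit_transform(X)
print(X_encoded.toarray().shape)  # (4, 3): columns for blue, missing, red
```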
@@dataschool In that case, can we interpret the results (0.4, 0.9) as the probabilities of those values being 0 or 1? Does it make sense to assign a threshold like 0.5, and transform values below it to 0 and above it to 1?
Great question! KNNImputer can't be used for strings (categorical features), but you can use SimpleImputer in that case with strategy='constant' or strategy='most_frequent'. Hope that helps!
With KNNImputer, the features have to be numeric in order for it to determine the "nearest" rows. That is separate from using KNN with a classification problem, because in a classification problem, the target is categorical. Hope that helps!
Hey Kevin, quick question... should k in KNN always be odd? If yes, then why, and if no, then why not? I was asked this in an interview... Thanks for all your content.
Great question! For KNNImputer, the answer is no, because it's just looking at other numeric samples and averaging them (there is never a "tie"). For KNN with binary classification, then yes an odd K is a good idea in order to avoid ties. Hope that helps!
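A tiny sketch of why ties don't arise: KNNImputer fills a missing entry with the average of that feature across the k nearest rows, so any k (odd or even) produces a single number. The data here is made up:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [2.0, np.nan]])

# The missing value is replaced by the mean of the second column
# across the 2 nearest rows (by distance on the non-missing features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```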
Are you generally performing your imputation prior to any feature selection, or after ? I always see mixed reviews about performing it before and after..
Hello! Thank you very much for your interesting video! Do you know where I can find a video like this one on how to choose the number of neighbors? Thank you very much
Great question! For categorical features, you can use SimpleImputer with strategy='most_frequent' or strategy='constant'. Which approach is better depends on the particular situation. More details are here: scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
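A short sketch comparing the two strategies on a toy string column (values invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)

# 'most_frequent' fills with the mode of the column ('red' here)
mf_filled = SimpleImputer(strategy='most_frequent').fit_transform(X)
print(mf_filled.ravel())  # ['red' 'blue' 'red' 'red']

# 'constant' fills with a dedicated category of your choosing
const_filled = SimpleImputer(strategy='constant', fill_value='missing').fit_transform(X)
print(const_filled.ravel())  # ['red' 'blue' 'red' 'missing']
```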
My idea: make a line plot of the columns that have null values against other continuous columns, and a box plot against discrete ones, then impute a constant value according to the result of this process. For example, if Pclass is 2, you impute the median fare of Pclass 2 wherever Fare is missing and Pclass is 2. Basically similar to IterativeImputer, only manual work: slow, but maybe better results because of human knowledge about the problem statement. What are your thoughts on this idea?
It's an interesting idea, but a manual process is probably not practical for any large dataset, and it's definitely impractical for cross-validation (since you would have to do the imputation after each split). In general, any "manual" process (in which your intervention is required) does not fit well into the scikit-learn workflow. Hope that helps!
@@dataschool I meant finding the values in an exploratory way and then using them as constants in SimpleImputer in a pipeline during cross-validation and evaluation. A custom transformer could also be created that does the imputation according to fitted values, e.g. during transformation, find similar records and then use their median. But then that's pretty much the same as KNNImputer and IterativeImputer.
Sure, you could probably do that using a custom transformer. Or if you think you could make a strong case for this functionality being available in scikit-learn, then you could propose it as a feature request!
@@dataschool I am not sure whether this is the correct platform, but I have written a library named custom_transformers which contains transformers for handling dates, times, nulls, outliers, and some other commonly needed custom transformers. If you have time, I would greatly appreciate your feedback. This is the notebook on Kaggle demonstrating the use of the library: www.kaggle.com/susmitpy03/demonstrating-common-transformers I intend to package it and publish it on PyPI.
In the example we have only one missing value, so the imputer has an "easy" mission. What if we had more than a few missing values in this column/feature, and we were facing "randomly" missing values across different columns/features? How does the imputer decide which column to impute first? Does it then, based on that filling, advance to the "next best" column and fill in its missing values, and so on?
Great question! As I understand it, IterativeImputer starts by filling all missing values with a simple initial strategy (the mean, by default). It then cycles through the columns that have missing values, by default starting with the column that has the fewest, and models each one as a regression problem using the current (partly imputed) values of the other columns as inputs. It repeats this round-robin process for several passes, so imputed values do get refined using other imputed values, and the order can be controlled with the "imputation_order" parameter. I can't say with certainty how it handles every possible edge case because I haven't studied the imputer's code, but I hope that helps!
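A minimal sketch of IterativeImputer handling missing values scattered across more than one column at once (the data is made up: the second column is roughly 10x the first, and the third roughly 100x):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 10.0, 100.0],
              [2.0, np.nan, 200.0],
              [3.0, 30.0, np.nan],
              [4.0, 40.0, 400.0]])

# Missing values in both the second and third columns are filled:
# an initial mean fill, then round-robin regression passes refine them.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```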
What if the first column has a missing value? It is a categorical feature, and it would be better if we used multivariate regression. It has 0 or 1, but if we use KNNImputer or IterativeImputer, it imputes a float value. I think there's the same question as mine in the comments.
If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
What is the effect on the dataset after imputation? Any bias or something? I understand it's a mathematical way to insert a value into a NaN, but I feel there must be some effect from this action. So when do we need to remove NaNs, and when do we need to use imputation?
If the percentage of NaNs in a column is more than 50%, we should eliminate the column; otherwise, we should impute it using univariate methods like SimpleImputer or the multivariate methods mentioned by the author.
In my opinion, if you are imputing with a method like the median, you can impute the missing values first, but if you are imputing with a method like the mean (which outliers will affect), it is better to remove outliers first.
You can use SimpleImputer instead, with strategy='most_frequent' or strategy='constant'. Here's an example: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb Hope that helps!
I can't think of a realistic example where KNNImputer is better than IterativeImputer; IterativeImputer seems much more robust. Am I the only one thinking this?
The "no free lunch" theorem says that no one method will be better than the others in all cases. In other words, IterativeImputer might work better in most cases, but KNNImputer will surely be better in at least some cases, and the only way to know for sure is to try both!
Thank you for such an amazing video! I encoded my categorical data as numbers and then ran KNNImputer, but it's giving me an error: TypeError: invalid type promotion. Any insights into what might be going wrong?
I'm not sure, though I strongly recommend using OneHotEncoder for encoding your categorical features. I explain why in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-yv4adDGcFE8.html Hope that helps!