Handling Missing Data Easily Explained | Machine Learning

176K views

Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.
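
The three steps listed above can be sketched with pandas as follows. This is only a minimal illustration, not the notebook from the video: the toy DataFrame and the choice of 0 as the "invalid" placeholder are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 is not a plausible value here, so it is treated
# as a placeholder for "missing" (an assumption for this sketch).
df = pd.DataFrame({
    "Age": [22.0, 0.0, 35.0, 58.0, 0.0],
    "Fare": [7.25, 71.28, 0.0, 26.55, 8.05],
})

# 1. Mark invalid values as missing.
marked = df.replace(0, np.nan)

# 2. Remove every row that contains at least one missing value.
dropped = marked.dropna()

# 3. Impute each missing value with the mean of its own column.
imputed = marked.fillna(marked.mean())

print(marked.isnull().sum())  # missing values per column
print(dropped.shape)          # rows and columns left after dropping
print(imputed)
```

Dropping rows is the simplest option but discards data; mean imputation keeps every row at the cost of shrinking the variance of the imputed column.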
Github link: github.com/krishnaik06/EDA1
You can buy my book, in which I explain in detail how to use machine learning and deep learning in finance with Python.
url: www.amazon.in/Hands-Python-Finance-implementing-strategies/dp/1789346371/ref=sr_1_1?keywords=Krish+naik&qid=1560612272&s=gateway&sr=8-1

Published:

15 Jun 2019

Comments: 77

@stevechops3226 3 years ago

Your channel is awesome, please keep going! Can't tell you how valuable your videos are when starting to learn!

@himalayasinghsheoran1255 3 years ago

Today I started working on the Titanic data. I tried to predict the missing age values but failed and was very tense. So I started watching your video in the hope of finding a way. When you opened the notebook I felt such relief: 'now it will definitely get done'. Thank you for making this video.

@rajagopal.g3533 3 years ago

What did you do with the Cabin column? We can't simply remove it by saying there are many missing values.

@equiwave80 3 years ago

Thanks Krish. I can't think of an easier explanation of a tricky topic!!! Simply superb!!!👍

@raunasur9710 2 years ago

Thank you Krish sir. I was following the Kaggle Learn course on machine learning but couldn't understand this topic even after so much hard work. Now it's all clear. Keep it up.

@hanman5195 4 years ago

Your explanation is amazing and, as usual, perfect.

@dishydez 3 years ago

Honestly, I really love your videos, simple and easy to understand. Always answering my machine learning and data science questions! I do have one though. I watched your video on standardisation and normalisation. I am trying to build a benchmark/index; would it be okay to standardise the data before creating it?

@aimenbaig6201 3 years ago

Thank you for making life so much easier for us!

@kukulaarohi9850 22 days ago

beautifully explained with the detailing!

@radifantaufik8085 3 years ago

I think there is a quantitative justification for filling the NaN values in 'Age' with the median grouped by 'Sex' and 'Pclass'. In the EDA step, we can print or visualize a heatmap of the correlations between the columns (dataset.corr().abs()). We can see that the 'Age' column has a relatively high correlation with the 'Sex' and 'Pclass' columns.
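
A minimal sketch of the grouped-median idea from this comment, assuming pandas. The miniature table is hypothetical (the values are made up), and only the imputation step is shown, not the correlation heatmap.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic table; the values are made up.
df = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Median age within each (Sex, Pclass) group, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```

If a whole group has no observed ages, its median is itself NaN, so a fallback such as the overall median would still be needed in that case.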

@bhaktibailurkar1936 1 year ago

Cleared all my doubts! Great..Thank you so much!!

@finance_tamil 5 years ago

Thought that you will also implement Regression Model for synthetic imputation. But the content is great!!

@coolsun-lifestyle 5 years ago

Thanks a lot for detailed explanation. It really helps

@konradpyrz8559 2 years ago

Great way of explaining things. I like it very much.

@strangereview2414 3 years ago

Nice explanation. The conclusion depends on your end goal and on whether dropping values or replacing them with the mean will affect your analysis; in his example he needed the age but did not need the cabin.

@saurabhtripathi62 4 years ago

A really good idea to create a separate model, thanks for sharing.

@dilipgawade9686 5 years ago

Hi Krish, Your videos are quite useful and simple to understand. My request is if you can create video on how we can deploy ML model with Flask that will be very useful..

@krishnaik06 5 years ago

Sure I will do that

@amarendrakolukula4592 1 month ago

Thank you Krish, you have explained the second option very well. Wondering how we do this for categorical columns and when values are missing from multiple fields

@jobihara 3 years ago

Thanks Kris, very helpful.

@cutecreature_san 3 years ago

Starting data science after a career gap of so many years... your videos are god level.

@gaziya8815 2 years ago

Thanks a lot for sharing your knowledge with us. Kindly clear up one confusion: do we need to impute missing values in the test data set the same way you have taught in the video?

@channelfisikaasik1124 2 years ago

Well, I appreciate the video that Mr. Krish Naik made, I love watching his videos, and I would really like to discuss how we can handle missing values. Using a separate model that exploits the relation between variables with complete data is not great, though: the value is generated by machine learning, so it is not real data and may be statistically far from the center of the population because it comes from another equation. I would rather use a statistical method like the mean, median or mode and (I don't know whether this will work) check the range of the population mean to make sure the value does not stray far from it.

@radifantaufik8085 3 years ago

But anyway, thank you for sharing. It helped me a lot to learn how to handle missing values. Nice work!

@gopi3e 4 years ago

Thanks for the video. You said that option 2 (model-based imputation) is less preferred for huge datasets. Does that mean that, in general, it is better to go with statistical imputation over model-based imputation on real-world datasets, since we get a lot of data in the real world? I am working on Home-Credit-Default-Risk (a Kaggle competition dataset). Could you comment on which imputation method to use?

@karndeepsingh 4 years ago

Sir, CAN YOU PLEASE TELL US ABOUT THE ROLE OF ROC AND CAP curve analysis for improving model performance

@ayselceferzade8587 2 years ago

great! thanks for explanation

@PriyaSharma-xb3ju 2 years ago

Hi Krish, could you please tell what to do when there are missing values in the dependent variables?

@vineethp8925 3 years ago

Hi, I want to know why you used the box plot median to replace missing values in the age column and not the mean or mode. Can you please tell me the reason?

@marsrover2754 4 years ago

What's the recommended rule for deciding whether to apply data imputation techniques or simply drop the rows that have missing values? The missing values can follow any pattern, like Missing at Random, Not Missing at Random and so on. So what should we do in that case?

@louerleseigneur4532 2 years ago

Thanks Krish

@Abhishekpandey-dl7me 5 years ago

Thank you so much

@kajalkapasiya193 4 years ago

Hi Krish, I want to understand why you chose Pclass to replace null values for Age. Why not any of the other attributes?

@scifimoviesinparts3837 3 years ago

Could you please make a video on missing value imputation using decision trees?

@dineshbaisla951 4 years ago

Hi Krish, I have a doubt. How would you treat a variable that has around 30% missing values and is important to consider? Overall there are around 550K records.

@ahmedelsabagh6990 2 years ago

Very helpful video

@Ro45256 3 months ago

Krish, I have one doubt. You are saying that we need to compute the null values by considering the other related columns. Then tell me how we can implement the same thing as a pipeline (from sklearn.pipeline import Pipeline) so that the pipeline can be used to compute the missing values of the test dataset. Please clear my doubt if anybody knows! It would be helpful for me.

@mihirkamble9095 2 years ago

Thank you so much ..

@gauravsalwatkar8324 4 years ago

Sir, can you please make a video on named entity recognition using TensorFlow Keras?

@alphonseinbaraj7602 4 years ago

By using Flask, you can do some more deployment... please, Mr. Krish.

@vishwa021094 3 years ago

Hi Krish I find your videos very useful for beginners like me. Here you have shown how to handle missing values for numbers and string fields. We also need to handle for date and time columns. Please guide us through this.

@eshaal2525 1 year ago

Hi how to deal with year like...2006 0 in same column

@rapchhos 3 years ago

Hi Krish, I think the age column in the distplot is right skewed. I do not think that it has a normal distribution.

@aikagyan999 4 years ago

Thanks sir, I was confused in this part only, about nan values and why we take sum of those nan values..

@BretskoD 3 years ago

If you're using the isnull() function, it will turn all your missing values into True (or 1) and all non-null values into False (or 0). After that you can just sum() all of the 1's to find out how many NaN values are in your dataset.
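
The counting trick described in this reply can be sketched as below, assuming pandas; the small DataFrame is a made-up example.

```python
import numpy as np
import pandas as pd

# Made-up example with missing entries in a numeric and a text column.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

mask = df.isnull()        # True where a value is missing, False otherwise
per_column = mask.sum()   # True counts as 1, so this counts NaNs per column
total = per_column.sum()  # total number of missing values in the dataset

print(per_column)
print(total)
```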

@aikagyan999 3 years ago

@BretskoD: Thank you, sir.

@NA-by7rv 4 years ago

Great !

@priyaranjanswainjitu 4 years ago

Hi Krish, I have one doubt on this case study. Why did you impute on the basis of the class column? We could also do it on the basis of the Gender column: the median/mean of male passengers and the mean/median of female passengers. Also, the age data is normally distributed, so can we apply the mean instead of the median?

@pratikrandad1990 4 years ago

Awesome

@aimenbaig6201 3 years ago

QUESTION: why did you choose the imputed value of age with respect to Pclass and not with respect to male or female?

@rachanakotha6059 5 months ago

Are these only for numerical data? Which methods can I use for characters/names or years? Please suggest, thanks!

@aimenbaig6201 3 years ago

Loved it

@isaiahdickinson9039 6 months ago

Please sir, what do I do if 80% of the values in my target variable are missing? I'm trying to predict the gross of movies, but the target variable I need to train my model has 80% missing values.

@malleswararaomaguluri6344 2 years ago

Sir, if we have missing values in the output column, how would the separate model be used?

@mvcutube 3 years ago

nice video

@PuneethSaiBhaskar 4 years ago

👍👍👍

@ezbitz23 4 years ago

How can we decide whether to use the mean, median or mode to replace a missing value?

@adityan8536 4 years ago

You have to decide based on your data.

@amalsunil4722 4 years ago

Our first priority is the mean. If we have large outliers, we go for either the mode or the median depending on the situation, as these two figures are the least affected by outliers.
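
A small illustration of the point about outliers, using only Python's standard library; the ages are made-up numbers chosen for the example.

```python
from statistics import mean, median

# Made-up ages; the second list adds a single large outlier.
ages = [22, 24, 25, 27, 29]
ages_with_outlier = ages + [95]

print(mean(ages), median(ages))                            # 25.4 25
print(mean(ages_with_outlier), median(ages_with_outlier))  # 37 26.0
```

One extreme value moves the mean from 25.4 to 37, while the median only moves from 25 to 26, which is why the median is often the safer fill value for skewed columns.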

@jaiminshah143 3 years ago

How do we handle missing (NaN) values in a column with binary data, i.e. just 0 or 1?

@RAJI11000 4 years ago

Sir, please give a suggestion regarding the cabin feature. If it has a low number of missing values, how do we deal with that type? It is a combination of categorical and numerical.

@rajagopal.g3533 3 years ago

Same doubt, bro. Do you know the answer?

@sam45330 2 years ago

First, I want to thank "Krish" for all your content. I have a NumPy array of continuous values obtained from a regression model, but I don't know how to fill the null values using that array. Can anyone help me out?

@samirelzein1095 2 years ago

I am thinking of a Netflix problem-solution type of approach to filling missing values, kind of minimizing a cost function.

@nikosterizakis 1 year ago

Why write and not just make text appear (as in: pre-typed so people can read it and use transition? )

@simonelgarrad 4 years ago

At 3:45 you said that we delete the record, but what if that variable/feature is significant?

@siyabongazungu1640 4 years ago

We can replace the null values with the mean of each column

@ankitvarma2319 4 years ago

Sir, I have a doubt about this. What if we have 50 Pclass values? It becomes really tedious to write all of them. Is there any way we can use a list of such Pclass values together with a list of the potential ages while defining the function? For example: If Pclass == list1: return age == list2

@waynelai9312 3 years ago

Just use a dictionary where key is pclass and value is the mean.
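
A minimal sketch of the dictionary suggestion, assuming pandas. The toy data is hypothetical, and building the dictionary with groupby is one possible way to avoid typing a value for every class by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages in every passenger class.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [38.0, np.nan, 30.0, np.nan, np.nan, 24.0],
})

# Build the lookup from the data itself: key is Pclass, value is the mean age.
# This works the same way for 3 classes or for 50.
age_by_class = df.groupby("Pclass")["Age"].mean().to_dict()

# Map each row's Pclass to its mean age and use it only where Age is missing.
df["Age"] = df["Age"].fillna(df["Pclass"].map(age_by_class))

print(age_by_class)
print(df)
```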

@gopalakrishna9510 4 years ago

In which scenario can we delete rows with missing values?

@khusheekapoor 3 years ago

When there is a very large data set.

@sachinborgave8094 4 years ago

Hi sir, how to fill missing values using Linear regression?

@subhajitdutta1443 3 years ago

Sir, I was unable to understand the programming part on Udemy. That is why I searched on YouTube, but here I can see both of them are exactly the same. You should at least change the digits. With all due respect, you copied it from Udemy.