Finding an outlier in a dataset using Python

In this video we will understand how we can find outliers in a dataset using Python.
ref: #medium articles
#Outlierdetection
github url: github.com/kri...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below
amazon url: www.amazon.in/...

Published: 2 Oct 2024

Comments: 118

@jakekiddall5108 3 years ago

Are there any anomaly detection videos that don't use credit card fraud as an example?

@shadrul2783 4 years ago

Here is the correction: lower bound = Q1 - 1.5*IQR and upper bound = Q3 + 1.5*IQR

@rohankupate5917 1 year ago

You mean there's a mistake in the video?

@Kishor_D7 8 months ago

Yes bro, check the statistics playlist by Krish Naik.

@vamsinadh100 3 years ago

13:57 Correction: lower bound = Q1 - IQR*1.5, upper bound = Q3 + IQR*1.5
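
The corrected bounds from the comments above can be sketched in a few lines. This is a minimal illustration, not the notebook from the video: the function name `iqr_bounds`, the sample `data`, and the use of the standard-library `statistics.quantiles` (chosen here in place of `np.percentile`) are assumptions made for the example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles() sorts the data itself; n=4 yields the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one planted outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = iqr_bounds(data)

# Anything outside the fences is flagged.
outliers = [x for x in data if x < lower or x > upper]
```

With this sample the fences come out to -3.5 and 14.5, so only 50 is flagged.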

@aggreykip2006 1 year ago

Can you use the upper bound in a histogram as a max value?

@hritwijkamble9988 1 year ago

Why threshold = 3?

@Blodia1990 3 months ago

It represents the quartile

@shujashakir9952 1 year ago

The tutorial offers a lucid explanation of the complex problem of outliers. It is well presented, with examples that made it easier to follow. However, threshold = 3 isn't working for me. I modified it to threshold = 3+std to make it work properly. Moreover, declaring outliers = [ ] outside the function causes problems if you want to use this function on another dataset in the same notebook. So declaring the outlier list inside the function would be a better approach, I think.
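
The suggestion to create the outlier list inside the function might look like the sketch below. The function name and sample data are hypothetical, not taken from the video's notebook; `pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
from statistics import mean, pstdev


def detect_outliers_zscore(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    outliers = []  # created on every call, so results never leak between datasets
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    for x in values:
        z = (x - mu) / sigma
        if abs(z) > threshold:
            outliers.append(x)  # append the data point itself
    return outliers


# Hypothetical sample: nineteen ordinary values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 100]

first_call = detect_outliers_zscore(data)
second_call = detect_outliers_zscore(data)  # same result, nothing accumulated
```

One possible reason a threshold of 3 appears not to work on a short list: with the population standard deviation, no point in a sample of n values can have |z| greater than sqrt(n - 1), so a sample of 10 or fewer values can never exceed 3.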

@smalirizvi8026 2 years ago

I have a couple of questions. 1. Is it always better to remove the outliers, or could that be a big mistake as well? You gave the example of a fraudulent transaction. An outlier is indeed a hint that the transaction was fraudulent. If I remove all such transactions in the first place, how am I going to achieve my results? 2. You did not explain how to perform outlier checks on a multivariate dataset, for example the Iris dataset. I have seen a couple of videos here and there, but no proper method comes out of them. What is the proper way to identify outliers in multivariate datasets? Thanks

@magicmushroom9670 3 years ago

Every single YouTube channel explains this from a univariate perspective. Can you please explain it for the multivariate case? There is very little information about that on the internet.

@gyapti-fctfinder3336 3 years ago

Nice content, and you explained it very well. Thank you so much.

@mridulagarwal5881 4 years ago

You have explained things well. Just one correction - it's inter-quartile range and not inter-quantile range.

@FaraazKhanfz 3 years ago

It's Inter Quartile Range

@nosseibagacem9014 2 years ago

Hello sir, I hope you are doing well. I was hoping you could help me with OD. I'm doing a thesis on the subject and I'm very new to Python and programming. I hope to hear from you, and thank you in advance.

@AbhishekMishra-mq4jw 3 years ago

What should we do with natural outliers, the ones that are expected to be there and are not caused by any artificial errors?

@yourkarma7012 3 years ago

Clustering techniques are also widely used in industry to detect outliers. Especially the isolation forest algorithm.

@parikshitgupta343 3 years ago

How can the lower bound, which you said is q1*1.5, be greater than the lower quartile, which you said is q1? The lower bound seems like something that should be less than the lower quartile.

@dikshadhiman2474 3 years ago

Thank you, sir, for this content.

@vishalb1204 5 years ago

Can you please enable English subtitles?

@dhivya_animal_lover 4 years ago

Hi Sir, a small doubt about the part of the video where you talk about the standard normal distribution. You said the graph shows a standard normal distribution, but then you said that when data falls below or beyond the 3rd standard deviation, you will not consider it. Kindly clarify.

@deeptijoshi377 3 years ago

What do we do when the outliers do not follow a Gaussian distribution and an outlier sits inside the data distribution rather than at the extremes?

@Getrocknete_Kotze_Schlabbern 3 months ago

I don't understand why we compute 1.5 * IQR. What does this 1.5 mean, and where do you get this number?

@muditmathur465 2 years ago

Why do we use 1.5 times IQR? Can we take any other number?
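
One common way to motivate the factor 1.5 is to see where the fences land for normally distributed data. The following is a worked calculation under that assumption only; the factor is a convention rather than a derived constant, and other multipliers can be used when a stricter or looser rule is wanted.

```latex
\begin{aligned}
Q_1 &\approx \mu - 0.6745\,\sigma, \qquad Q_3 \approx \mu + 0.6745\,\sigma \\
\mathrm{IQR} &= Q_3 - Q_1 \approx 1.349\,\sigma \\
Q_3 + 1.5\,\mathrm{IQR} &\approx \mu + 2.698\,\sigma \\
Q_1 - 1.5\,\mathrm{IQR} &\approx \mu - 2.698\,\sigma
\end{aligned}
```

So for normal data the 1.5 rule sits a little inside the 3-sigma rule and flags roughly 0.7% of points.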

@muhammadyazidbaihaqi1479 2 years ago

Why does your video have no subtitles? Please add them, thanks.

@bhagyaraj5506 4 years ago

In the z-score method the threshold value is given as 3. The threshold is nothing but the 3rd standard deviation, is it?

@mohitjoshi4209 4 years ago

Yes, you're correct.

@arjyabasu1311 4 years ago

Sir, shouldn't the threshold value be 3*std and not just 3? Because the rule is that a data point is considered an outlier if it falls outside the 3rd standard deviation, not just outside the value 3.

@jondoe3693 4 years ago

Do you mean when the z-score = 3? Then it is correct to use a threshold of 3, because you have standardized the data: the standard deviation of z-scored values is 1 and their mean is 0.
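
The reply above can be written out explicitly. Assuming the usual definition of the z-score, with μ and σ the mean and standard deviation of the data:

```latex
z_i = \frac{x_i - \mu}{\sigma}
\qquad\Longrightarrow\qquad
|z_i| > 3 \iff |x_i - \mu| > 3\,\sigma
```

Comparing the z-score with 3 is therefore the same test as comparing the raw value's distance from the mean with 3*std.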

@NickolayGrin 5 years ago

Using the mean is OK, but it is not the best idea for outlier detection. Median-based methods are usually more robust.
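
As one example of a median-based method, here is a sketch of the modified z-score built on the median absolute deviation (MAD). This technique is not shown in the video; the constant 0.6745 and the cutoff 3.5 are the commonly cited defaults for it, and the function name and data are hypothetical.

```python
from statistics import median


def detect_outliers_mad(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds threshold."""
    med = median(values)
    # Median absolute deviation: the median distance of the points from the median.
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    return [x for x in values if abs(0.6745 * (x - med) / mad) > threshold]


# Hypothetical sample with one planted outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
outliers = detect_outliers_mad(data)
```

Because the median and MAD barely move when a single extreme value is added, the value 50 is flagged here even though the sample is small.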

@manavagarwal9763 10 months ago

Where can I get this Jupyter notebook for revision?

@doubando 8 months ago

Amazing Krish, now I understand the concept of outliers, thanks

@terwasevictorsesugh3902 1 year ago

What if the data does not follow a normal distribution?

@sanathdas4071 4 years ago

Sir, can you please tell me the difference between anomalies and outliers? I am confused about these two. Please answer me, sir.

@muhammadmuneebkhanafridi154 4 years ago

Very well explained.

@raghavgirigiri1 3 years ago

Krish, I just wanna make a small correction: instead of saying "less than 2" or "less than 3", say "10% of the data (or whatever the data is) fall below 2 or 3". Otherwise it's great, good job!!
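
To make the wording concrete: the 10th percentile is a value below which about 10% of the data fall. A small sketch, assuming the standard-library `statistics.quantiles` with `method="inclusive"`; the sample data is made up for the example.

```python
from statistics import quantiles

data = list(range(1, 101))  # 1, 2, ..., 100

# n=10 returns the nine cut points between deciles; the first is the 10th percentile.
p10 = quantiles(data, n=10, method="inclusive")[0]

# Share of observations that fall below that cut point.
share_below = sum(x < p10 for x in data) / len(data)
```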

@yomeshyadav3407 3 years ago

Sir, I have a doubt. The threshold is nothing but the 3rd standard deviation, as you said, so it must be 3 * sigma, but here you have taken the threshold as 3. Can you please clarify this?

@somomitachattopadhyay2846 1 year ago

Yes, that's because in the standard normal distribution the standard deviation has the value 1, sigma = 1.

@ganeshkumarpatel 4 years ago

Why do such calculations and looping to find outliers? Just apply standard scaling and create a new conditional dataframe of the scaled data containing values more than 3 std away. Those are the outliers, aren't they?

@saniyamanchekar9978 4 years ago

How can I find outliers when there are many columns in a large dataset?

@Ashokkumar-sc3vt 5 years ago

Hi Krish, well explained. Can you please post a video on how to equate the outliers using any dataset? Thanks in advance.

@mohanadjibory2191 2 years ago

Thanks, I wonder how to detect outliers in a NumPy ndarray, I mean an n by m shaped array. You explained it for a 1D array; what about 2D?

@thedatascientist_me 1 year ago

Nice work mate. I also tried something similar but with Upper and Lower Bound on the Return

@jorgeeg2668 2 years ago

How to detect outliers as a function of datetime?

@sakhawathossain3812 1 year ago

Very helpful...

@AmitSharma-po1zb 4 years ago

Superb explanation... in a very simple way.

@ahmedbaheeg 1 year ago

Thanks

@mdazizulislam9653 4 years ago

Any suggestions for multivariate outliers with mixed variables (continuous and categorical)?

@bonishagarwal9315 4 years ago

In case of categorical data, it will be better to find the outlier using a scatter plot as sir explained.

@kaka83185 3 years ago

Just a correction: when calculating the z-score, you are subtracting an array from i. You should enumerate over the dataset and then subtract from i using the current index of mean and std.

@karimdandachi9200 2 years ago

mean and std are not arrays... The mean of a list of values is a single value, and so is the standard deviation.

@cliffkwok 5 years ago

Hi Krish, I just ordered your finance book on Amazon, which is the newest one on all of Amazon about Python in finance. Will you do more videos on finance?

@krishnaik06 5 years ago

Thanks Kwok for buying my book... Yes, I will be uploading more videos on finance.

@varunchandrappa5123 3 years ago

@krishnaik06 Hands-On Python for Finance is out of stock. Please let us know when it will be available for sale.

@ksoftqatutorials9251 5 years ago

I have been following your videos and I have learnt many things, Krish Naik. Could you please tell me whether you have written any data science and machine learning books? I would like to buy your books and follow your videos to clinch a data science job as soon as possible.

@krishnaik06 5 years ago

Hi Kiran, I have written a book on finance with ML and DL.

@ksoftqatutorials9251 5 years ago

@krishnaik06 Could you please share the link so that I can buy that book? Looking forward to more videos.

@sekharpink 5 years ago

Hi Krish, I like your videos a lot, very informative. Could you please put up videos related to word2vec models like skip-gram, CBOW, gensim, GloVe? Thanks in advance.

@ryando4556 6 months ago

Well explained. It would be great if you could add some plots for visualization.

@BAIBHAVPATHYBEE 1 year ago

For the z-score, how did you know the threshold value?

@chandrasekharpoluboyina8865 4 years ago

Generally we remove this noise, but for fraud detection and identifying a rare disease, outliers will be helpful. In such cases, how do we handle or use them instead of removing them?

@adarshrai22 3 years ago

@krish naik How do we remove outliers from a non-normally distributed dataset?

@jatingupta4026 3 years ago

How do we remove the values that are more than the upper bound or lower than the lower bound? Please tell us that too, sir.

@otroleonarbe 3 years ago

Thanks for sharing this video. One correction: in the loop it should be outliers.append(i), not outliers.append(y).

@aws384 4 years ago

Great video, and it is really inspiring.

@subhamasthan7294 4 years ago

Hi Krish, thank you so much for a nice video. Can you please share the link to the next video, where you applied these techniques on a Kaggle dataset?

@aashaygoel7338 3 years ago

During an ML project I came across a scenario where, when I split the dataset with train_test_split, the test set contained some categories in a categorical column that were not present in the train set during label encoding. Can you please explain what to do in this type of scenario, and also whether outliers should be detected before or after the train/test split? I have seen that you explain each topic in detail. Please help me with this.

@samarendrapradhan5067 4 years ago

Sir, please help. I have a dataset which contains 10 features, each with a date for a particular index. How can I detect and see outliers when they occur for an index in one or more features? I have 4000 fixed indexes, and the feature values are updated for each date. Thanks

@chandrasekharpoluboyina8865 4 years ago

Tell us about robust outlier

@adityapradhan8474 4 months ago

Thank you so much sir, I understood everything

@rushikeshbulbule8120 4 years ago

Excellent👍👏😆

@rizkamilandgamilenio9806 1 year ago

Are there conditions under which one method is better than the other?

@PratapO7O1 3 years ago

14:06 Here it is a single-dimension df. How do we sort a multidimensional df? We can't sort all rows at once; we need to specify one row or two. How do we do it with a multidimensional df? Thank you

@iliyasn2760 4 years ago

We need to append the 'i' value, not 'y'.

@prateeksmithpatra5796 3 years ago

outliers.append(y): y is not defined, so how did you compile it?

@niveshtayal979 5 years ago

Hi Krish, thanks for the excellent explanation. But if we get some outliers in any feature, should we remove the records containing those outliers (in which case we lose some data)? If not, how can we handle outliers? Please cover this portion also :)

@amanpreetsinghgulati2475 2 years ago

Capping (winsorization) is another way to deal with outliers: the values are imputed within the range, so the data is not lost.
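
Capping as described above might be sketched as follows. Strictly speaking, winsorization usually caps at fixed percentiles; capping at Q1 - 1.5*IQR and Q3 + 1.5*IQR is one variant chosen here for illustration. The function name `cap_outliers` and the data are hypothetical.

```python
from statistics import quantiles


def cap_outliers(values, k=1.5):
    """Clip every value into the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences are unchanged; values outside move to the nearest fence.
    return [min(max(x, lower), upper) for x in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
capped = cap_outliers(data)
```

The list keeps its length and order; only the extreme value is pulled in to the upper fence.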

@satheeshswaminathan2328 4 years ago

Hi Krish, thank you so much for the tutorial. Very clear and crisp explanation, loved it :)

@RahulKumar-hj8qk 4 years ago

If we have more than one feature and we remove the outliers, doesn't that affect the other features?

@bonishagarwal9315 4 years ago

You need to remove the whole sample containing that outlier, because if you remove only the outlier from one feature, it leaves an empty value, leading to inaccurate predictions. E.g., if you have Age, Height, and Weight as your input features and you find an outlier in your Age column, you need to remove the complete row for that outlier. Hope I have answered your question.
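
Removing the complete row, as described above, could look like this in plain Python. In practice a pandas DataFrame would be the usual tool; the list-of-dicts layout, the function name, and the records below are all assumptions made to keep the sketch dependency-free.

```python
from statistics import quantiles


def drop_outlier_rows(rows, column, k=1.5):
    """Return only the rows whose value in `column` lies within the IQR fences."""
    values = [row[column] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The whole record is kept or dropped, so the other features stay aligned.
    return [row for row in rows if lower <= row[column] <= upper]


# Hypothetical records; the age 240 is the planted outlier.
ages = [25, 30, 28, 35, 32, 27, 31, 29, 33, 240]
rows = [{"age": age, "height": 160 + i} for i, age in enumerate(ages)]

cleaned = drop_outlier_rows(rows, "age")
```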

@nabilahhannani2326 4 years ago

I've applied both methods to my dataset, but I got different results from them. Which one should I choose? Is it possible for them to give different results?
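
The two rules can indeed disagree. The sketch below applies both to the same made-up sample; the variable names are hypothetical, and the population standard deviation (`pstdev`) is assumed for the z-score.

```python
from statistics import mean, pstdev, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Z-score rule: flag points more than 3 standard deviations from the mean.
mu, sigma = mean(data), pstdev(data)
z_outliers = [x for x in data if abs(x - mu) / sigma > 3]

# IQR rule: flag points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Here the z-score rule flags nothing, because the value 50 inflates the very mean and standard deviation it is measured against, while the IQR rule flags 50.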

@aayushijain2160 4 years ago

Sir, I understood how to identify outliers using the z-score and IQR, but can you tell us how to fix them? Should we drop that column, or what else should we do to remove the outliers from the dataset?

@farazmev3430 4 years ago

Drop the rows or replace the values (mean, mode, median).
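
The replacement option might be sketched like this, using the median. Computing the median from the non-outlying values only is a design choice made for this example, as are the function name and data.

```python
from statistics import median, quantiles


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside the IQR fences with the median of the remaining values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [x for x in values if lower <= x <= upper]
    fill = median(inliers)  # assumption: the fill value ignores the outliers themselves
    return [x if lower <= x <= upper else fill for x in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
replaced = replace_outliers_with_median(data)
```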

@pratikramteke3274 3 years ago

How to find outliers in multiple linear regression?

@zehraup4722 4 years ago

codes: www.kaggle.com/c0derr/outlier-detection

@srijeetful 3 years ago

Very clear and crisp explanation, loved it

@aparnashrivastava5837 4 years ago

Thanks

@yuktikhantwal2342 5 years ago

Great video, sir. Great content, and explained in the cleanest way possible. Thanks

@satyanarayanajammala5129 5 years ago

excellent

@dhirendrajha9667 5 years ago

Hi Krish, well explained. Can you build a video on the Rasa chatbot?

@karishmaqweera3869 4 years ago

Sir, do you have handwritten notes of whatever you taught in the ML course videos? Please share them, sir.

@dineshlakshitha7309 3 years ago

Amazing video, super explanation.

@shishirdixit5996 4 years ago

Sir, once we have detected outliers using the z-score method, and if there are too many of them, how can we drop them?

@SkipperPlaysYT 4 years ago

You can use the .difference() method to do that. If A and B are two sets, then you can calculate the difference as A.difference(B), equivalent to (A-B) for sets. Similarly, (B-A) = B.difference(A). Hope this helps.
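
A short sketch of the set-difference idea, next to a list-based alternative. One caveat worth noting: converting to a set discards duplicate values and ordering, which may or may not matter for a given dataset. The sample values are made up.

```python
data = [10, 12, 12, 11, 100, 13]
outliers = [100]

# Set difference, as suggested above: duplicates and original order are lost.
remaining_set = set(data).difference(outliers)

# List comprehension: keeps duplicates and the original order.
outlier_set = set(outliers)
remaining_list = [x for x in data if x not in outlier_set]
```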

@mithunkumar7063 5 years ago

Thank you

@AmeerulIslam 4 years ago

It should be i instead of y: outliers.append(i)

@AmeerulIslam 4 years ago

I can see you have fixed it in the video but not on GitHub.

@deepquest 3 years ago

Hi Krish, How can we identify root cause of an outlier?

@newbie8051 2 years ago

Due to human error in data entry/recording or maybe due to some error/bug in the Data Pipeline

@amitsawant4961 2 years ago

Insightful for me

@LailahaillahChannel 4 years ago

Can you do RANSAC?

@ga43ga54 5 years ago

Please talk about data strategy

@jayantdikshit4181 4 years ago

Hi Krish, thanks for making such amazing content. I have a query at 09:35. As you mentioned, we can find outliers using scatter plots. But how can we find outliers if we have multiple features (more than 2)? Your views on this would be much appreciated. Thanks in advance.

@rachittoshniwal 4 years ago

You can try with any two random features from your data. You'll either see most values following a trend with a few outliers, or you'll see most values cluster in one place with a few outliers. Or maybe something else too!

@sanjaysanjay862 2 years ago

Yes, you can do it by plotting each feature against the target.

@sheetalyoutub 2 years ago

Very helpful !

@meghnasingh9941 4 years ago

great explanation, kudos !

@econdoc3000 4 years ago

Hi Krish, your definition of quantiles is wrong! If you have 0.1=F(x) with F() being the cumulative density, then its 0.1 = F(x)=P(X

@sanjaysanjay862 2 years ago

Yes, and your definition is nice.

@KNfarming882 3 years ago

It's not a data set, it's a data point which is away from >=3

@abdulaziz-lh3nb 2 years ago

What if I have a lot of outliers in the dataset (around 27%)? How do I handle that?

@newbie8051 2 years ago

If I were you, I would go for missing value treatment first, then try outlier treatment. Also, if I had to deal with such a high % of outliers, my first thought would be to treat them like normal data points, as deleting outliers would lead to the loss of too many data points. Can you share how you solved the problem?

@mashirnizami134 4 years ago

Gr8

@aakashsinghrawat3313 4 years ago

Sir, in any dataset like bank loan prediction, what if a credit score is outside its range (300-850)? Will it be considered an outlier? If yes, how do we handle it? Great fellows are welcome to help... please

@rachittoshniwal 4 years ago

If the range itself is 300-850 and you have values above or below that range, then that is a data error, and you can drop them unless you can devise a way to find the real value.
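
A range check of this kind needs no statistics at all. A minimal sketch, where the constant names and the sample scores are hypothetical and the limits 300 and 850 come from the question above:

```python
VALID_MIN, VALID_MAX = 300, 850  # documented credit score range from the question

scores = [720, 915, 640, 120, 850, 300]

# Values inside the documented range are kept (the limits themselves are valid).
valid = [s for s in scores if VALID_MIN <= s <= VALID_MAX]

# Values outside it are data errors to drop or investigate, not statistical outliers.
invalid = [s for s in scores if not VALID_MIN <= s <= VALID_MAX]
```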