Tutorial 46 - Handling Imbalanced Dataset using Python - Part 2

71K views

Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced dataset, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.
If you like my efforts, please subscribe to the channel and share with your friends
GitHub URL: github.com/krishnaik06/Handle...
Data Science Interview Question playlist: • Complete Life Cycle of...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below
Amazon URL: www.amazon.in/Hands-Python-Fi...

Published: 26 Jun 2019

Comments: 95

@generationwolves 4 years ago

For those wondering, SMOTE stands for Synthetic Minority Over-sampling Technique.

@angadjitsingh6144 2 years ago

Thanks bro

@emamulmursalin9181 3 years ago

Just to inform you all, the "ratio" keyword in RandomOverSampler has been renamed to "sampling_strategy".
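
As a rough illustration of what a float value for such a parameter means (the desired minority-to-majority ratio after resampling), here is a minimal pure-Python sketch of random oversampling for a binary problem. The helper name `random_oversample` and the toy data are hypothetical; this is not the imblearn implementation, only a sketch of the idea under that assumption.

```python
import random
from collections import Counter

def random_oversample(X, y, sampling_strategy=1.0, seed=42):
    """Duplicate random minority rows until minority/majority reaches the ratio.

    Hypothetical helper for binary labels; not the imblearn implementation.
    """
    rng = random.Random(seed)
    (majority, n_major), (minority, n_minor) = Counter(y).most_common(2)
    # Number of minority rows we want after resampling.
    target = int(sampling_strategy * n_major)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    # Existing rows are copied; no new feature values are invented.
    extra = [rng.choice(minority_rows) for _ in range(max(0, target - n_minor))]
    return list(X) + extra, list(y) + [minority] * len(extra)

# Toy data: 95 majority rows (class 0) and 5 minority rows (class 1).
X = [[i, i * 2] for i in range(100)]
y = [0] * 95 + [1] * 5

X_res, y_res = random_oversample(X, y, sampling_strategy=0.5)
print(Counter(y_res))
```

With `sampling_strategy=0.5` the minority class is grown to roughly half the size of the majority class; a value of 1.0 would balance the two classes.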

@vishalrai2859 3 years ago

thank you brother

@MrPrashanth55 5 years ago

Thanks for the explanation. Good explanation

@hokapokas 5 years ago

Thanks for the videos. It's great for beginners 🙏🙏🙏🙏

@nikhildixit5167 4 years ago

Superb explanation !!! Thanks a ton

@tejas260 3 years ago

Hey Krish, at 6:20 the sum of 284315 and 492 isn't 283781. Any explanation of how we got 283781 points?

@edwards7423 4 years ago

thank you man ... you nailed it. superb

@yashsethi2402 2 years ago

I just watched your video and you said "I have just reached 5k subscribers. Thanks for supporting". And now that I am watching your video, you have crossed 500k subscribers. That's quite an achievement. Congratulations... You deserve it. :)

@karangupta6402 4 years ago

Just a note for those who are watching: ratio has been deprecated in newer versions of imblearn, and we should use sampling_strategy instead.

@vikasrajput1957 4 years ago

Yes, it's not working for me. Can you please let me know how you solved it?

@karangupta6402 4 years ago

@@vikasrajput1957 you should use sampling_strategy instead of ratio

@girijadhawale8410 3 years ago

Thank you.

@auroshisray9140 4 years ago

Thank you for precise videos

@soumyaranjansethi1790 4 years ago

Thank you, it really helped me solve one of my problems

@madushandinusha4302 4 years ago

thanks bokka. great work keep it up

@vinayaksharma-ys3ip 1 year ago

Thanks a lot Krish Sir!!

@bhaveshrathod9898 3 years ago

Thank you very much sir for this video.

@Rishi-fo8qj 5 years ago

What is the difference between SMOTETomek and RandomOverSampler?

@arpitjaiswal1055 4 years ago

How do you perform it for multiclass classification problems?

@shefalisingh4675 2 years ago

Hi Krish, thanks for the video, but I do have a question. If we handle the imbalanced data, there is a chance of adding outliers to our dataset, so should we remove those outliers before building our model? And couldn't there be a scenario where all the new samples added are outliers? Second question: after applying any of these algorithms, if we train our model, there is a high chance of overfitting. How do we handle that scenario?

@abhinaygupta8243 3 years ago

I am trying to apply the same technique in one of my use cases, and I am seeing that my accuracy is going down. Is this applicable to large datasets?

@dipeoluayomide5097 11 months ago

You're the best Krish

@sandipansarkar9211 3 years ago

Thanks Krish. Need to practice in Jupyter notebook

@SAYYAM55 3 years ago

Hey, I am interested in buying your book, but the Amazon link says it's unavailable. Do you sell the ebook somewhere?

@babbibaljeethkaur3995 5 years ago

Thank you so much Krish 🙏, very clearly explained. Searched the whole web for this and finally found the functions. Could you please make a video on the "Bias and Variance tradeoff" using Python functions? I'll be grateful :)

@kadhirn4792 4 years ago

Please check "statquest" channel

@myownbasement 4 years ago

@Krish Naik What happens in the case of dynamically changing datasets? For example, if the majority and minority classes are evolving dynamically over time, is there a concept of dynamic sampling for such scenarios?

@yasink18 1 month ago

I guess a random sampling technique or an undersampling technique would be better

@neon91111 5 years ago

Hi Krish, I am an avid follower of your videos. I am a beginner in machine learning and I have a doubt regarding oversampling: how can you use oversampling for multiclass data?

@sudiptasen634 3 years ago

Hi Krish, thank you for the video. I wanted to check whether the oversampling method mentioned here would also work for imbalanced multi-class text classification. I was trying to use the same method on text-classification data, and there is an issue with the remarks column, since the independent variable is a string (text). Thank you.

@ntokozotyumre5352 2 years ago

Did you figure it out? If yes, please share how you did it.

@auroshisray9140 4 years ago

Will oversampling hamper the accuracy of the output?

@rashedin6356 10 months ago

Sir, you are my life, how well you teach, man!

@bikrantanand1811 4 years ago

SMOTE is different from oversampling. In oversampling the same data is copied until it reaches a threshold number, whereas SMOTE adds different data points. So what is this library actually doing?

@user-fq4nn7ve1m 8 months ago

The imblearn package shows errors when imported. Can you please show a way to fix this?

@hemantsah8567 3 years ago

How would you balance datasets in which the target column has 4 categories? You performed it on 2 categories.

@sharaththatikonda5386 2 years ago

Same approach for other class variables as well.
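
To make "same approach" concrete for a 4-category target, here is a small pure-Python sketch that randomly duplicates rows of each class until it reaches a per-class target count. The function name `oversample_to_targets`, the `targets` dictionary, and the toy labels are assumptions made for illustration; a library sampler may behave differently in its details.

```python
import random
from collections import Counter

def oversample_to_targets(X, y, targets, seed=42):
    """Randomly duplicate rows of each class until it reaches targets[label].

    Hypothetical helper; classes already at or above their target are untouched.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    X_res, y_res = list(X), list(y)
    for label, target in targets.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        n_extra = max(0, target - counts[label])
        X_res.extend(rng.choice(pool) for _ in range(n_extra))
        y_res.extend([label] * n_extra)
    return X_res, y_res

# Toy 4-class target with counts 60 / 25 / 10 / 5.
y = [0] * 60 + [1] * 25 + [2] * 10 + [3] * 5
X = [[i] for i in range(len(y))]

# Bring every class up to the size of the largest one.
largest = max(Counter(y).values())
targets = {label: largest for label in set(y)}

X_res, y_res = oversample_to_targets(X, y, targets)
print(Counter(y_res))
```

Nothing in the loop is specific to two classes: each class is simply topped up independently of the others.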

@abhinavgarg4884 1 year ago

Hey mate, I am applying the same random oversampling method. My class imbalance ratio is about 99:1 (classes 0:1). Since I'm building a model, I can only do this on the training split, to keep the test data free of noise for later predictions. When I predict on the test dataset, my precision and recall come out around 0.1 each, which could possibly be due to false negatives. How do I fix this? Help would be appreciated 🙏 @krishnaik06
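
When diagnosing numbers like these, it can help to separate the two metrics: precision falls when there are many false positives, while recall falls when there are many false negatives. The sketch below computes both from raw predictions in plain Python; the function name and the toy 99:1 labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy test set with a 99:1 imbalance: 99 negatives followed by 1 positive.
y_true = [0] * 99 + [1]
# A model that flags 9 negatives as positive and also catches the real positive.
y_pred = [0] * 90 + [1] * 9 + [1]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this toy case recall is perfect while precision is low, so the low score comes entirely from false positives rather than false negatives.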

@thepresistence5935 3 years ago

5000 subscribers, sir? You now have more than 3 lakh subscribers. Thank you so much for your video, sir

@kuldeepsharma7924 3 years ago

Will sampling techniques affect other independent features? If yes, how? If no, how?

@salmimabrouka4562 4 years ago

Can you show us how to do oversampling with SMOTE?

@srik9170 3 years ago

Hi Krish, is it mandatory to perform oversampling or undersampling for all classification problems if the data is highly imbalanced, let's say 80:20?

@ibechibuike1412 1 year ago

yes it is

@swaruppanda2842 4 years ago

Kindly make a video on Time Series analysis

@webdeveloper3116 3 years ago

Do you think oversampling is effective? If you see the same positive person again and again, will it increase your ability to recognize another positive person? If so, what is the need for lots of positive or negative data? Just take one positive and one negative sample in a for loop.

@gautamvishwakarma8651 3 years ago

Bro, SMOTE generates synthetic data points for the minority class, which is different from repeating the same data point.
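
The difference can be sketched in a few lines of plain Python: instead of copying a minority row, a SMOTE-style step picks a minority point, picks one of its nearest minority neighbours, and creates a new point somewhere on the line segment between them. This is a simplified illustration with a hypothetical `smote_like` helper and toy data, not the library's implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified, hypothetical sketch of the SMOTE idea (brute-force neighbours).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

The generated points lie between existing minority samples, so they are new rows rather than exact duplicates.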

@lakshman587 3 years ago

Is the oversampling concept similar to the data augmentation concept?

@shishirdixit5996 4 years ago

Does RandomOverSampler have a 'ratio' parameter? I was getting this error: TypeError: __init__() got an unexpected keyword argument 'ratio'.

@BBell1988 4 years ago

Not sure if you still need help, but I would simply pass the value without ratio: RandomOverSampler(1)

@SAINIVEDH 3 years ago

It's been changed to sampling_strategy

@SabyasachiMoitra 3 years ago

Is imbalanced data handling done after EDA?

@sauravsingh3153 1 year ago

Can anyone tell me what random state and random.seed actually are?

@MrMadmaggot 1 year ago

Won't you overfit the model by using oversampling?

@MANOKARANJRC 2 years ago

Nice

@_iamankitt_ 2 years ago

ty sir

@hamdallahfolashadesulaimon9565 4 years ago

Thank you so much Krish Naik, this video came in handy. Can you please make a video on how to save the new dataset after creating a balanced dataset?

@jackyhuang6034 3 years ago

You can use np.save or simply pickle them.
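
A minimal sketch of the pickle route, using only the standard library. The variable names `X_res` and `y_res`, the tiny stand-in data, and the file name are placeholders chosen for this example.

```python
import os
import pickle
import tempfile

# Stand-ins for the resampled features and labels.
X_res = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2]]
y_res = [0, 1, 0]

# Write both objects into one file (a temporary directory is used here).
path = os.path.join(tempfile.mkdtemp(), "balanced_dataset.pkl")
with open(path, "wb") as f:
    pickle.dump({"X": X_res, "y": y_res}, f)

# Later, load them back.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["y"])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.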

@folashadesulaimon8814 3 years ago

@@jackyhuang6034 Okay, Thank You

@thegorillaz4759 4 years ago

Thank u sir

@rishankjain7678 1 month ago

Thx buddy

@vinimator 4 years ago

@Krish, I wonder why the random state is always set to 42 in the code? Please help me understand.

@manishsharma2211 4 years ago

You can use anything

@ebujak1 3 years ago

42 shows up a lot in programming because 42 is "the meaning of life." See "The Hitchhiker's Guide to the Galaxy."
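
What matters is not the value 42 but that some fixed seed is used: the same seed reproduces the same "random" choices, so a resampling run can be repeated exactly. A small standard-library sketch (the function name `sample_indices` is made up for this example):

```python
import random

def sample_indices(n_rows, n_samples, seed):
    """Draw row indices with a dedicated generator seeded for reproducibility."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_samples)]

run_a = sample_indices(1000, 5, seed=42)
run_b = sample_indices(1000, 5, seed=42)  # same seed: same indices as run_a
run_c = sample_indices(1000, 5, seed=7)   # different seed: a different draw

print(run_a == run_b)
print(run_a, run_c)
```

Any integer works as the seed; fixing it only makes results repeatable between runs.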

@abhinavkumar7730 2 years ago

Why is it getting doubled to 567562?

@alphonseinbaraj7602 4 years ago

Hi Krish, what if the dependent variable has NaN? For instance, ['class'] has NaN values. How do I handle that?

@sravan9958 3 years ago

Remove that row @Alphonse Inbaraj

@alphonseinbaraj7602 3 years ago

@@sravan9958 If I remove it, important details will also be affected. Any ideas?

@sravan9958 3 years ago

@@alphonseinbaraj7602 It depends on how many total records and how many NaN records there are. If you have a lot of records, there is no problem removing the NaN ones; if you have only a few records, you have to look into it. I don't know how to assign a class for the dependent feature.

@sandipansarkar9211 3 years ago

Did my practice in Jupyter notebook but could not import NearMiss and imblearn. Please guide.

@niharjamdar4869 3 years ago

Run pip install imblearn in the Anaconda3 prompt before importing in Jupyter notebook and it will work.

@AbdulSattar-zq5yg 4 years ago

Can you please create a video tutorial on handling multi-class imbalanced data?

@santoshkumar-vw7cq 3 years ago

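from collections import Counter
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# X, y: feature matrix and 3-class labels, assumed to be defined already
strategy = {0: 33000, 1: 33000, 2: 33000}  # target sample count per class
oversample = SMOTE(sampling_strategy=strategy)
X, y = oversample.fit_resample(X, y)

# summarize the class distribution after resampling
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))

# plot the distribution
plt.bar(counter.keys(), counter.values())
plt.show()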

@MrMadmaggot 1 year ago

I have 3 outputs (Y's) and it worked

@ATULSINGH-pd6dz 3 years ago

I have a doubt here. Are we oversampling the whole dataset and not just the training data?

@adipurnomo5683 2 years ago

On the whole dataset
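
A common alternative, raised elsewhere in this thread, is to split first and oversample only the training portion, so that copies of a test row never end up in the training data. The sketch below shows that ordering in plain Python; the helper names and the toy rows of `(row_id, label)` pairs are assumptions for illustration.

```python
import random
from collections import Counter

def stratified_split(rows, test_fraction, rng):
    """Split (row_id, label) pairs so each class keeps the same test fraction."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    train, test = [], []
    for class_rows in by_class.values():
        rng.shuffle(class_rows)
        n_test = int(len(class_rows) * test_fraction)
        test.extend(class_rows[:n_test])
        train.extend(class_rows[n_test:])
    return train, test

def oversample(rows, rng):
    """Duplicate random rows of each smaller class up to the largest class size."""
    counts = Counter(label for _, label in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, n in counts.items():
        pool = [row for row in rows if row[1] == label]
        balanced.extend(rng.choice(pool) for _ in range(target - n))
    return balanced

rng = random.Random(42)
rows = [(i, 1 if i < 10 else 0) for i in range(200)]  # 10 positives, 190 negatives

train, test = stratified_split(rows, test_fraction=0.2, rng=rng)
train_balanced = oversample(train, rng)  # the test split is left untouched

print(Counter(label for _, label in train_balanced))
print(Counter(label for _, label in test))
```

The test set keeps its original imbalance, which is what the model will face in practice, while the training set is balanced.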

@mirakaddour2753 3 years ago

thanks

@anuragmishra6262 4 years ago

Do I need to perform oversampling when my split is 65% : 35%?

@salmimabrouka4562 4 years ago

No, only when the proportions are below (10%, 90%)

@mirakaddour2753 3 years ago

No need, it's fine

@MrMadmaggot 1 year ago

@@salmimabrouka4562 What if my data is 53%, 27%, 20%? I have -1, 0, 1

@bhargavasavi 4 years ago

@ 6:17, 284315 + 492 does not give 283781... Good explanation though

@adityavyas4843 4 years ago

Should we be applying these techniques if the imbalance is, say, 40% to 60%?

@harshakumarks5581 4 years ago

I think 40-60 ratio is good enough. You can proceed without changing anything!

@manishsharma2211 4 years ago

No. Don't

@ajayalex2382 4 years ago

Hi, I have 2 doubts: 1. If we have a classification problem with a dataframe that has a large number of features (columns > 100), say 20/30 of them are highly correlated, and the target column (y) is very skewed towards one class, should we first remove the imbalance using imblearn or should we drop the highly correlated columns? 2. In a classification problem, should we first standardise the data or handle the outliers?

@mirakaddour2753 3 years ago

You need deep learning algorithms

@OriginalBernieBro 4 years ago

I like to look at the results of a classifier and check that the classification_report support column is split evenly, or run an accuracy test such as print(accuracy_score(y_test, [0 for i in range(len(X_test))])) to see whether the model is truly using the balanced dataset.
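
The accuracy test described here amounts to a baseline: the score you would get by always predicting class 0. A plain-Python sketch of that check follows, with a hand-written `accuracy` function standing in for a library metric and toy labels chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced test labels: 95 negatives and 5 positives.
y_test = [0] * 95 + [1] * 5

# Baseline "model" that always predicts the majority class.
always_zero = [0 for _ in range(len(y_test))]
baseline = accuracy(y_test, always_zero)

print(baseline)
```

If a trained classifier's accuracy is no better than this baseline, it has likely learned little beyond the class imbalance itself.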