SMOTE (Synthetic Minority Oversampling Technique) for Handling Imbalanced Datasets

Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, because we have enough examples for all the different cases. For example, when learning a spam filter, we should have a good amount of data for both spam and non-spam emails.
SMOTE synthesises new minority instances between existing (real) minority instances.
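
To make the interpolation idea concrete, here is a minimal pure-Python sketch. It is not the code from the video and not the imbalanced-learn implementation; the function name smote_sketch, the toy points and the choice of k=3 are assumptions made only for illustration.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=12):
    """Create n_new synthetic points from a list of minority points (tuples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6)
print(len(minority) + len(new_points))  # 4 real points plus 6 synthetic ones
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE never produces values outside the range already covered by the minority class.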
If you do have any questions with what we covered in this video then feel free to ask in the comment section below & I'll do my best to answer those.
If you enjoy these tutorials & would like to support them then the easiest way is to simply like the video & give it a thumbs up & also it's a huge help to share these videos with anyone who you think would find them useful.
Please consider clicking the SUBSCRIBE button to be notified for future videos & thank you all for watching.
You can find me on:
GitHub - github.com/bhattbhavesh91
Medium - / bhattbhavesh91
#ClassImbalance #SMOTE #SyntheticMinorityOversamplingTechnique #machinelearning #python #deeplearning #datascience #youtube

Published: May 7, 2019

Comments: 151

@bhattbhavesh91 5 years ago

Something went wrong while using pd.crosstab! So the updated confusion matrices are as follows.
At 7:50 the correct confusion matrix is:
92303 14
1535 135
At 10:30 the correct confusion matrix is:
93798 41
40 108
Sorry for the mistake :)
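
The corrected counts can be turned into precision, recall and F1 by hand. The sketch below assumes the four numbers are read row by row, with predicted labels as rows and actual labels as columns. That layout is an inference rather than something the comment states: it is the reading under which both matrices contain the same 149 actual positives (14 + 135 and 41 + 108). The helper name metrics is hypothetical.

```python
def metrics(tn, fn, fp, tp):
    """Precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed layout: rows = predicted, columns = actual, read row by row.
at_7_50 = metrics(tn=92303, fn=14, fp=1535, tp=135)
at_10_30 = metrics(tn=93798, fn=41, fp=40, tp=108)

for name, (p, r, f1) in [("7:50", at_7_50), ("10:30", at_10_30)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under that assumption the 10:30 matrix gives precision, recall and F1 of roughly 0.73 each, while the 7:50 matrix gives high recall (about 0.91) but low precision (about 0.08).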

@sahubiswajit1996 5 years ago

Why are we using "random_state=12"?

@chrislam1341 4 years ago

@sahubiswajit1996 The specific number is just his preference; fixing it lets you get the same result from the random steps on every run.

@sumitshukla3689 4 years ago

When we apply SMOTE, the number of samples doesn't change. But as you explained, if we are adding synthetic samples, the number of training examples should also increase, right?

@KumarHemjeet 3 years ago

@sahubiswajit1996 You can take any number.

@prathameshmohite3008 4 years ago

Hi Bhavesh, Very good explanation. I was particularly confused about implementing SMOTE on the main data. But I guess you're correct that we must implement SMOTE on training data. Thank You

@carl2143 10 months ago

I'll come back to this video. Seems helpful!

@SurajSingh-pw9ew 4 years ago

Thank you Bhavesh❣️❣️. You taught this without being boring 👏🏻👏🏻👏🏻 Excellent

@dhananjaykansal8097 4 years ago

Your handwriting is pretty. Thanks for the explanation once again. Cheers!

@ddxccc 3 years ago

Most helpful and professional video I found on SMOTE. Thanks a lot!

@bhattbhavesh91 3 years ago

I'm glad you like it

@srikrshnap6036 1 year ago

Lovely Explanation! Thank you!

@kokl123ify 2 years ago

Hi Bhavesh, could you please confirm: to ensure the oversampling method doesn't reduce the accuracy of the model, should we always use hyperparameter tuning, or is there some other method to undo the damage of oversampling in logistic regression for attrition prediction?

@sparshdutta 5 years ago

Thanks for teaching new stuff.☺

@shishirdixit5996 4 years ago

Here, when fitting on the training dataset after tuning hyperparameters with GridSearchCV, why did you use X_train and y_train rather than X_train_res and y_train_res?

@bhuvneshsaini93 5 years ago

Hi, you used only two targets, 0 and 1. How do we handle more than two? Suppose target 1 has around 2000 samples, target 2 around 200, target 3 around 11, and so on.

@bishalmohari8748 3 years ago

I started watching the undersampling video for a problem and ended up watching the full series because of how well explained they are. Glad I discovered your channel! Wish I did sooner xD

@bhattbhavesh91 3 years ago

Glad it was helpful!

@KaushikJasced 2 years ago

Thank you sir for giving a wonderful lecture. Can you tell me how I can put the sampling ratio as per my choice instead of 1:1 using SMOTE?

@AizirekTolonova-od1ks 2 months ago

Thank you so much for the great explanation!

@bhattbhavesh91 2 months ago

Glad it was helpful!

@shishirdixit5996 4 years ago

I have a categorical dependent variable with 3400 records, in which the distribution of 0s and 1s is 2677 and 723 respectively. Will this be considered an imbalanced dataset? Or would it be considered imbalanced only if the 1s were less than 5% of the total records? Kindly clarify the doubt.

@siddharthkenia9089 2 years ago

Not only did you explain it really well, the illustrations were perfect for a beginner to understand what oversampling means. Thank you :)

@bhattbhavesh91 2 years ago

Glad it was helpful!

@7810 4 years ago

Quite interesting! Thanks for the lesson.

@bhattbhavesh91 4 years ago

Glad you liked it!

@karndeepsingh 4 years ago

Very well explained sir!!!

@ganeshreddypuli3101 2 years ago

If we want to normalize the data as well, should we do it before applying SMOTE?

@princeok12 5 years ago

Very well explained Thank you. Especially appreciated the explanation of nearest neighbor

@harshparikh7060 2 years ago

Thanks, Bhavesh!

@bhattbhavesh91 2 years ago

Glad you enjoyed it

@abhishekwagh8246 4 years ago

I have a sample of only 28. Unfortunately I don't have more sample. Will SMOTE work? Secondly, which logistic regression should be used? Sklearn or statsmodels? Both give different results. Please help.

@bhagyashreeln1304 2 years ago

Hi, what do we do if we have a balanced dataset but still want to increase the number of rows

@sridhar6358 3 years ago

So the idea of treating the ratio parameter in SMOTE as a hyperparameter is to get better results, is that correct? In general, is it a good option to tune the ratio rather than fixing it to 1?

@Nirja3 3 years ago

When I tried to set the SMOTE ratio, I got "invalid ratio parameter" for SMOTE. Can you help?

@bintehawa7712 9 months ago

Thanks for explaining with notes, it helped me a lot.

@AnupKumar-nz2qq 4 years ago

After generating the synthetic data, in which kinds of situations can this data be useful? Are there any limitations to this type of data?

@WordofSpirit 2 years ago

Looks like the weights are also not working with SMOTE. Is there an alternative way to test different weights?

@TejaDuggirala 5 years ago

Good work bro.. thank you

@sirvachjumani7215 3 years ago

Hi Bhavesh, very nicely explained. Could you please point me to the literature for these examples? Thanks

@nesrinehadjamar2197 1 year ago

Thank you ! Simple and clear explanation

@bhattbhavesh91 1 year ago

Glad it was helpful!

@dipankarrahuldey6249 2 years ago

With SMOTE, can we achieve higher f1 in practice? I saw that f1 was around 0.72

@shandou5276 3 years ago

This is very well done :) Nothing overly flashy and yet very clear.

@bhattbhavesh91 3 years ago

Glad you enjoyed it

@Asma-cx8uc 2 years ago

Hello Sir ! Could you please describe how SMOTE technique can be used to balance data images

@danielniels22 2 years ago

At 6:20, which library did you import before declaring the SMOTE() class?

@MrFcapri 2 years ago

Kindly tell me: I have an imbalanced dataset with 5 classes. Will SMOTE work for a multi-class dataset?

@dhananjaykansal8097 4 years ago

Shouldn't it be generate_auc_roc_curve(pipe, X_test)? If not, Bhaveshbhai, could you or anyone else please explain?

@makhboulame9654 3 years ago

Can SMOTE be used for Multi label classification dataset ? Thank you

@bhargav7476 4 years ago

You have no idea how helpful that was

@bhattbhavesh91 4 years ago

Thank you so much :)

@priyas8871 2 years ago

Can you please tell how SMOTE can be applied to streaming data in a test-then-train framework?

@ankushjamthikar9780 3 years ago

Very good explanation. But can we use this method for a multi-class problem? Also, does SMOTE lead to overfitting issues?

@shwetasharma1996 4 years ago

Nice content! I would like to compare some oversampling techniques. Could you please help me get a from-scratch implementation of SMOTE rather than the packaged one? Thanks

@MY_PARIDE 21 days ago

Great Explanation....👏

@charmilam920 3 years ago

Thank you for this video. Understood SMOTE very well. Please make videos more often and How do you explain things so effortlessly with such clarity ? Where is this clarity coming from ? Great job

@bhattbhavesh91 3 years ago

Thank you! Will do!

@adityaraikwar6069 1 year ago

very informative video, simple and to the point keep it up

@bhattbhavesh91 1 year ago

Glad you liked it!

@EcommerceAdvices 3 years ago

Thanks a lot. You make it so simple :) Liked and subscribed, bro.

@bhattbhavesh91 3 years ago

Thanks and welcome

@0SIGMA 3 years ago

You are some DOPE shit brother and by that i mean youre really good ! explained the important stuffs like only on train set beautifully ! really great !

@achyuthvishwamithra 2 years ago

When the final ratio came out to be 0.005, doesn't it imply that we are going to be generating a very small number (0.005 * majority) of samples for the minority class? How will the length of minority class samples ever be equal to that of majority class?
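
The arithmetic behind this question can be sketched directly. In current imbalanced-learn releases a float sampling_strategy is the minority/majority ratio after resampling, so a small value only tops the minority class up to that fraction and the classes do not end up equal. The sketch assumes the older ratio argument was interpreted the same way, which is not confirmed here; the class counts and the helper name synthetic_rows_needed are hypothetical.

```python
def synthetic_rows_needed(n_minority, n_majority, ratio):
    """Rows SMOTE must add so that minority / majority reaches `ratio`."""
    target_minority = int(n_majority * ratio)
    return target_minority - n_minority


# Hypothetical class counts, chosen only to illustrate the arithmetic.
n_majority, n_minority = 200_000, 350

for ratio in (0.005, 0.5, 1.0):
    added = synthetic_rows_needed(n_minority, n_majority, ratio)
    print(f"ratio={ratio}: add {added} rows, minority becomes {n_minority + added}")
```

Only a ratio of 1.0 makes the two classes the same size; a grid search that selects a small ratio is choosing a mild amount of oversampling rather than full balancing.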

@syedshaulhameed 3 years ago

How do I split my data into training and testing if my data is imbalanced?
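
One common option for this is a stratified split, which keeps the class proportions the same in the training and test parts. The sketch below uses scikit-learn's train_test_split with its stratify argument on a made-up 950/50 label vector; the data is an assumption for illustration, and any oversampling would then be applied to the training part only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Illustrative labels: 950 majority rows and 50 minority rows.
y = [0] * 950 + [1] * 50
X = [[i] for i in range(len(y))]

# stratify=y preserves the 95/5 class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12
)

train_counts, test_counts = Counter(y_train), Counter(y_test)
print(train_counts, test_counts)
```

Without stratification, a random split of a heavily imbalanced dataset can leave the test set with very few minority rows, or none.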

@DanielWeikert 4 years ago

if we use smote in the pipeline, is it only upsampling on training or also on testing when we call predict? Thanks

@hieunguyenvan6590 1 year ago

Do you need to remove outliers of dataset if you SMOTE?

@clintpaul6653 2 years ago

Can I apply sampling to the test set too? Because it's also very unbalanced. Please reply.

@mirroring_2035 2 years ago

in your crosstab function you have y_test[target]. What is that? why is target used to index the y_test object?

@spadbob24 3 years ago

thank you so much - very informative video

@bhattbhavesh91 3 years ago

Glad it was helpful!

@sadiaafrin7143 3 years ago

Good work man! Thanks

@bhattbhavesh91 3 years ago

Glad it helped!

@VINODKUMARIYA 1 year ago

Thank you sir !

@bhattbhavesh91 1 year ago

Most welcome!

@MarsLanding91 3 years ago

Thank you for this video! 2 thumbs up! Question - at 4:06 you selected KNN = 3 but I didn't see you applying that concept in the code section. Can you please elaborate on where you set KNN as 3 in the code section? Did I misunderstand something?

@IykeDx 4 months ago

When the number of neighbours (k_neighbors) is not stated, the default is 5.

@channel-lk6xz 6 months ago

I don't understand how we infer from auc roc. What are we seeing there and what are the values plotted here.

@hosseinroosta5154 1 year ago

Really, thanks♥️

@bhattbhavesh91 1 year ago

You're welcome 😊

@elaf8256 3 years ago

How can we overcome the problem of overlapping when using SMOTE?

@harshavardhansvlkkb2290 2 years ago

Can we use SMOTE on the target column in a dataset?

@debatradas1597 2 years ago

Thank you so much Sir

@bhattbhavesh91 2 years ago

Most welcome

@alanblitzer744 4 years ago

You are great bro

@rishisolanki554 29 days ago

Really helpful

@jampavy6446 2 years ago

Nice explanation

@thomasayele5389 1 year ago

Excellent explanation!

@bhattbhavesh91 1 year ago

I'm glad you liked it

@randomforrest9251 3 years ago

how does smote work with categorical data?

@helll5894 3 years ago

What if there are more than 2 classes? In your video, Sir, there are only 2 classes. For example, I want to use 3 classes. How can I implement SMOTE for 3 classes in Python? Thank you, Sir

@advaitshirvaikar4751 3 years ago

Hey, when I try using make_pipeline(SMOTE(), SVC()) it gives me an error: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1, out_step='deprecated', random_state=None, ratio=None, sampling_strategy='auto', svm_estimator='deprecated')' (type ) doesn't. What's going wrong here?

@bhattbhavesh91 3 years ago

The SMOTE function has changed after I created this video! Please refer to the documentation!

@travelsome 4 years ago

Perfection

@powellmenezes584 4 years ago

I have this doubt too: "Hi, you used only two targets, 0 and 1. How do we handle more than two? Suppose target 1 has around 2000 samples, target 2 around 200, target 3 around 11, and so on."

@TheRaviraaja 3 years ago

arxiv.org/pdf/1106.1813.pdf - check out the algorithm; the neighbours do matter.

@JT2751257 4 years ago

Cello Pointec - reminded me of my childhood :)

@sourishmukherjee2404 3 years ago

The final ratio for the final model after GridSearchCV was SMOTE = 0.0005. Does that imply that the ratio (minority class / majority class) = 0.005? Then how is the minority class getting oversampled to the same proportion as the majority class?

@ashishraj5882 3 years ago

again ROC auc curve is used ??

@saptarshibhattacharya1253 1 year ago

can u elaborate with a random forest algorithm in google colab?

@akhilyeduresi8145 2 years ago

Getting these errors: "__init__() got an unexpected keyword argument 'ratio'" and "AttributeError: 'SMOTE' object has no attribute 'fit_sample'"

@Eny11111 3 years ago

Thanks 👍

@bhattbhavesh91 3 years ago

Welcome 👍

@hamzaraouia8975 4 years ago

I have got this error when trying to run the smote: __init__() got an unexpected keyword argument 'ratio' any clues ? Thank you

@GurunathHari 4 years ago

You must have figured it out by now. I am only a student. It has been deprecated, as the video is 1 year old. Try using this: sm = SMOTE(random_state=42, sampling_strategy='minority')

@bhattbhavesh91 3 years ago

Thanks Gurunath for sharing this!

@jgubash100 3 years ago

Well explained

@bhattbhavesh91 3 years ago

Thank you!

@deepikadusane9051 4 years ago

Hi Bhavesh, I used your SMOTE code but I am getting an error about ratio, i.e. "invalid parameter ratio for estimator SMOTE". How do I resolve this?

@bhattbhavesh91 4 years ago

I guess the function has changed! Do have a look at the documentation to learn more about it!

@OriginalBernieBro 4 years ago

The smote ratio parameter is deprecated, my off balanced dataset sklearn classification_report is off balanced in the support column even after smoting.

@bhattbhavesh91 3 years ago

The SMOTE function has changed after I created this video! Please refer to the official documentation!

@mramesh7085 2 years ago

Nice explanation

@atwinemugume 5 years ago

Thanks

@anshumanagrahri7816 4 years ago

Hiii, can you please tell how to use SMOTE on time series and sequential data

@bhattbhavesh91 4 years ago

you are a google search away for an answer!

@kavanalipanahi3505 3 years ago

True positives are 0 in the confusion matrix (by the formula, precision and recall should both be zero). So how did you get such a high number (over 70%)?

@bhattbhavesh91 3 years ago

Please read the pinned comment!

@kavanalipanahi3505 3 years ago

@bhattbhavesh91 I like your videos. :)))

@soumyadeeparinda1692 3 years ago

Can you please share the notebook with us using google colab?

@dhananjaykansal8097 4 years ago

Lovelyyyyyyy

@wenhongzhu8637 4 years ago

Hi~can you share the data set

@akhilthekkedath1850 5 years ago

Sir, could you please make a video on outlier detection?

@bhattbhavesh91 5 years ago

I have already created a video on outlier detection. Link - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-2Qrost474lQ.html

@bhagwatchate7511 4 years ago

Nice

@burhanrashidhussein6037 5 years ago

Does smote guarantee to improve classifier performance ?

@bhattbhavesh91 5 years ago

Nope! It doesn't; it only upsamples your data by generating artificial samples! How well the model performs depends on how well separated your classes are!

@dastola8330 4 years ago

what is the use of defining random_state ?

@bhattbhavesh91 4 years ago

ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-c249O4giblM.html

@deeptigupta518 4 years ago

Can SMOTE only be used with logistic regression, or with any classification model?

@bhattbhavesh91 4 years ago

any classification algorithm!

@niyazahmad9133 3 years ago

smote__ratio is not a parameter of SMOTE. Please help me out.

@bhattbhavesh91 3 years ago

The SMOTE function has changed after I created this video! Please refer to the official documentation!

@guico3lho 1 year ago

At the end of the video, how did all 4 metrics score above 70% if the model did not correctly predict any of the samples labelled 1? There were 0 true positives and 63 false negatives!