Tutorial 45-Handling imbalanced Dataset using python- Part 1

Views: 130K

Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced dataset, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.
If you like my efforts, please subscribe to the channel and share it with your friends.
GitHub URL: github.com/krishnaik06/Handle...
Data Science Interview Question playlist: • Complete Life Cycle of...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below.
Amazon URL: www.amazon.in/Hands-Python-Fi...

Published: 26 Jun 2019

Comments: 98

@ruchitaphalak3603 3 years ago

Hello there. Firstly, a very big thank you, Krish Naik, for these videos. A major part of my ML coursework for my master's degree was possible almost entirely due to your theory and practical sessions. You are doing such good work. Can't thank you enough. These videos are a big saviour. No kidding, but none of my master's professors could match your way of teaching. Alas! I have to pay so much to the university only to learn from YouTube. Thank you once again :)

@tahfim37 4 years ago

Thanks, the explanations are easy to understand, constructive and well organized. Keep doing it 💯💯👏👏

@smrahman4595 2 years ago

Hey Krish Naik, your explanation is just awesome. Salute.. 👍👍👍

@rajeshreddy3133 4 years ago

@6:53 You can get rid of the for loop in the list comprehension with:
columns = data.columns.to_list()
target = columns[-1]
columns = columns[:-1]

@drishtisharma6843 4 years ago

yes... :)

@mohammedameen3249 3 years ago

X = df.drop(labels='Class', axis=1)
y = df['Class']

@BiranchiNarayanNayak 5 years ago

Nice tutorial on imbalanced dataset.

@sandipansarkar9211 3 years ago

Thanks Krish. Great Explanation

@drishtisharma6843 4 years ago

superb! so helpful... thank you so much :)

@btsislife9659 2 months ago

Superb and amazing explanation... the teacher has explained it very well, really really amazing 🎉

@alricliew714 3 years ago

Suggestion: it would be more convincing if you included a before-and-after performance comparison.

@dipeoluayomide5097 11 months ago

You're the best Krish

@patelajay1010 3 years ago

I have one doubt. What if the data contains NaN values and you want to do undersampling? If you impute the NaN values with the mean, there will be information leakage, because the data is imputed before it is split into train and test sets. Could you please tell me what a possible solution would be in this case?
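One common way around the leakage described in this question is to split first, learn the fill value from the training rows only, and reuse that value on the test rows. Any undersampling would then be applied to the imputed training rows alone. The sketch below uses only the standard library; the helper names (`split_indices`, `fit_mean`, `impute`) and the toy column are hypothetical, and `None` stands in for NaN.

```python
import random
from statistics import mean

def split_indices(n, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def fit_mean(values):
    """Mean of the observed (non-None) values."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace every missing entry with the given fill value."""
    return [fill if v is None else v for v in values]

# Toy single-feature column with two missing entries.
feature = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0]
train_idx, test_idx = split_indices(len(feature))

train_values = [feature[i] for i in train_idx]
test_values = [feature[i] for i in test_idx]

train_mean = fit_mean(train_values)            # statistic learned from train only
train_filled = impute(train_values, train_mean)
test_filled = impute(test_values, train_mean)  # reused, never re-fitted on test
```

Because the test rows never contribute to `train_mean`, nothing about them leaks into the preprocessing step.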

@Alhamdou_lilah 3 years ago

Hello, thank you for the video! Please tell me, how can I plot a histogram for the new balanced data? And how can I get the new X and Y? Thank you!

@thealgorithm7633 5 years ago

Very helpful

@prajothshetty6848 4 years ago

Sir, should I do undersampling if I am using the random forest algorithm for a classification problem?

@shreyasb.s3819 3 years ago

Nice work.

@rajivrsrivastava6379 4 years ago

It was really nice. Could you please make a video on sentiment analysis?

5 years ago

Krish Super fast express has just arrived in RU-vid station!

@aguyfromparalleluniverse 1 year ago

thankyou krish sir

@vinayaksharma-ys3ip 1 year ago

thanks a lot Sir!!

@yasserothman4023 2 years ago

Why is the undersampling applied before the train-test split rather than on the training set alone?

@samisoomro1324 4 years ago

Can you please make a video on image classification for multiple classes with imbalanced classes

@babbibaljeethkaur3995 5 years ago

Thank you so much Krish 🙏, very clearly explained. I searched the whole web for this and finally found the functions. Could you please make a video on the "bias and variance tradeoff" using Python functions? I'll be grateful :)

@learnenglish699 2 years ago

hello bibi have u got the job hun ta do sal hogee kithe jee>

@magelauditore333 4 years ago

Sir, what if 90% of the data in a column of an independent feature belongs to the same category? Should I drop the column or not?

@ameyakaranjkar9359 4 years ago

Hi sir, I am just a beginner in ML, so this question might be very naive, but in general is over-sampling better than under-sampling, as in under-sampling we are essentially decreasing the size of our data set which in turn would affect our accuracy?

@Ayra_Is_Cool_lol 2 years ago

He just said that you should undersample only when you have a large dataset, and oversample when your dataset is small.

@reshamsundar6941 4 years ago

Sir could you please tell me that in an imbalanced dataset whether is it better to do train_test_split before performing under or over sampling or is it better to do after under or over sampling?

@Arjun147gtk 3 years ago

you can do it before. That will be better.

@shubhamgalande3315 3 years ago

If you apply train_test_split to an imbalanced dataset, your training data might end up with only negative or only positive samples.

@mohammedameen3249 3 years ago

Yes, it is better to split the data first and then apply under- or over-sampling on the training data only, to avoid data leakage.
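The order suggested in this thread (split first, resample the training part only) can be sketched without any library. This is a minimal illustration, not the video's code: `stratified_split` and `undersample` are hypothetical helpers, and the 95/5 toy labels are made up. Splitting per class also addresses the worry above that a plain split might leave one class out of the training data.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test while keeping each class's share."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def undersample(indices, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_min))
    return kept

# Toy labels: 95 majority rows (0) and 5 minority rows (1).
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels)
balanced_train_idx = undersample(train_idx, labels)  # test rows are untouched

train_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

The test set keeps the original imbalance, so the evaluation reflects the class mix the model would see in practice.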

@kennytuwiro7561 3 years ago

nice one

@louerleseigneur4532 3 years ago

Thanks Krish

@AhmedKhan-fh8fq 3 years ago

Hello sir, what kind of algorithm should I apply to convert the diabetes dataset into a readable format? Please guide me.

@renuverma5633 4 years ago

I am not able to import NearMiss from imblearn.under_sampling. ImportError: cannot import name 'six'. What should I do?

@eduardodimperio 2 years ago

What's the difference between this and slicing as many rows from class 0 as exist in class 1?

@sandipansarkar9211 3 years ago

I did my practice in a Jupyter notebook but could not import NearMiss and imblearn. Please guide.

@priyankadurairajan887 4 years ago

Is this technique applicable to multilabel classification or not?

@anupamborah2551 3 years ago

One question, At which stage do we use imblearn? Should we use it after feature engineering or right at the start and then go for feature engineering, feature selection, model training etc?

@shanbhag003 2 years ago

I think just before train test split

@shishirdixit5996 4 years ago

In my dependent categorical variable, I have 80% 0s and 20% 1s out of 3400 records. Will it be considered an imbalanced dataset?

@bytblaster 4 years ago

yes

@thegorillaz4759 4 years ago

Now how the train test split will work on that?

@MOHITBARTHWAL 2 years ago

First of all, thanks for bringing such nice video lectures on machine learning. I have two queries regarding downsampling. 1. As we have drastically downsampled the data belonging to one class, will it not result in the loss of a significant amount of information? 2. How do we decide how far the downsampling should go? Here in the video we downsampled the dominant class and made it equal in size to the other one.

@satyamshukla4209 2 years ago

It will. This is a major disadvantage of undersampling: you may lose valuable information. Undersampling is done on the majority class. Oversampling is done on the minority class.
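The trade-off in this exchange, and the question of how far to downsample, can be made concrete with a small sketch. It assumes a two-class problem and plain random resampling, which is not the NearMiss method used in the video; the helper names and the `ratio` parameter (desired minority-to-majority ratio) are hypothetical.

```python
import random
from collections import Counter

def split_classes(labels):
    """Return (majority_indices, minority_indices) for a two-class label list."""
    (major, _), (minor, _) = Counter(labels).most_common(2)
    majority = [i for i, y in enumerate(labels) if y == major]
    minority = [i for i, y in enumerate(labels) if y == minor]
    return majority, minority

def undersample_majority(labels, ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of the majority rows."""
    majority, minority = split_classes(labels)
    n_keep = min(len(majority), round(len(minority) / ratio))
    kept = random.Random(seed).sample(majority, n_keep)
    return sorted(kept + minority)

def oversample_minority(labels, ratio=1.0, seed=0):
    """Keep every row and add random duplicates of minority rows."""
    majority, minority = split_classes(labels)
    n_target = max(len(minority), round(len(majority) * ratio))
    extra = random.Random(seed).choices(minority, k=n_target - len(minority))
    return sorted(majority + minority + extra)

def class_counts(indices, labels):
    return Counter(labels[i] for i in indices)

# Toy labels: 900 majority rows (0) and 100 minority rows (1).
labels = [0] * 900 + [1] * 100
under = undersample_majority(labels)                  # fully balanced
under_half = undersample_majority(labels, ratio=0.5)  # milder downsampling
over = oversample_minority(labels)                    # no rows discarded

print(class_counts(under, labels))
print(class_counts(under_half, labels))
print(class_counts(over, labels))
```

A ratio below 1.0 discards fewer majority rows at the cost of leaving some imbalance, while oversampling discards nothing but repeats minority rows.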

@AmeerulIslam 4 years ago

will it work for regression problem?

@kesavae9552 3 years ago

I have a doubt: by performing up- or down-sampling we are actually making the probability of fraud almost equal to that of genuine in the dataset, which is not true. Then the model learns that fraud and genuine are equally likely events...? If I'm wrong, please tell me why.

@leeroyndlovu7983 3 years ago

Hmm 🤔 interesting point, but that’s actually not true. What we’re interested in is learning the characteristics of defaults and non defaults. So we’re not saying all are equally likely possibilities. We’re saying what attributes does a defaulter have and what attributes does a non defaulter have so when we see a new dataset we’d be able to properly classify it

@kesavae9552 3 years ago

@@leeroyndlovu7983 thanks budd, really helpful ✌️
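If calibrated probabilities matter and not just the predicted class, the shift in class balance raised in this thread can be undone after training. The sketch below assumes the majority class was randomly undersampled and that `beta` is the fraction of majority rows kept. The function name is hypothetical, and the 284,315 genuine rows are an assumption about the Kaggle credit card dataset linked in the comments; 492 is the fraud count mentioned in the video.

```python
def correct_probability(p_sampled, beta):
    """Map a probability from a model trained on undersampled data back to
    the original class balance (beta = fraction of majority rows kept)."""
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)

# Assumed counts: 492 fraud rows, 284,315 genuine rows, undersampled to 492 each.
n_fraud, n_genuine = 492, 284_315
beta = n_fraud / n_genuine

# A score of 0.5 on the balanced data maps back to the original fraud rate.
p_original = correct_probability(0.5, beta)
print(p_original)
```

The ranking of predictions does not change under this mapping, only the scale of the probabilities.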

@sreeramsaravanan8132 4 years ago

Krish make a video on multicollinearity

@winyourself553 3 years ago

12:13 What about the remaining original data, after having 492 in each class?

@abhishekjn3390 3 years ago

Why did you set random_state to 42 in NearMiss?

@boringhuman9427 3 years ago

Error: cannot unpack non-iterable NearMiss object
### Implementing Undersampling for Handling Imbalanced
nm = NearMiss()
X_res, y_res = nm.fit_resample(X, Y)
y_res.value_counts()

@sandeepdesale6395 2 years ago

TypeError: __init__() got an unexpected keyword argument 'random_state'. This error occurs when I am using NearMiss. What should I do?

@guganr9321 2 years ago

I am getting this error, sir, could you help? 'NearMiss' object has no attribute 'fit_sample'

@chdhc9922 2 years ago

Can undersampling lead to underfitting?

@yuvrajverma6832 3 years ago

fit_resample gives an error: MemoryError: Unable to allocate 1.00 GiB for an array with shape (236, 568630) and data type float64. How can I solve it?

@apurvakulkarni7725 2 years ago

Hey Krish, thanks for the video, but while implementing undersampling for the imbalanced data I am getting the error "TypeError: __init__() got an unexpected keyword argument 'random_state'" and am unable to find a solution. It would be great if you could assist.

@sahiltamboli7371 1 year ago

Did you find a solution?

@omsonawane2848 1 year ago

@@sahiltamboli7371 Actually, the function to use is nm.fit_resample. Also remove the random_state argument from the NearMiss constructor.
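A likely reason the call in this reply works without a seed is that NearMiss selects rows by distance rather than at random, so there is nothing to seed. The sketch below illustrates the NearMiss-1 idea with the standard library only. It is not imblearn's implementation; `near_miss_1`, its `n_neighbors` default and the toy data are assumptions made for illustration. It keeps the majority rows whose mean distance to their nearest minority rows is smallest.

```python
import math
from collections import Counter

def near_miss_1(X, y, n_neighbors=3):
    """Keep all minority rows plus the majority rows closest, on average,
    to their n_neighbors nearest minority rows (two-class sketch)."""
    (major, _), (minor, n_minor) = Counter(y).most_common(2)
    minority_rows = [x for x, label in zip(X, y) if label == minor]
    k = min(n_neighbors, n_minor)

    def mean_dist_to_nearest_minority(row):
        nearest = sorted(math.dist(row, m) for m in minority_rows)[:k]
        return sum(nearest) / k

    majority_idx = [i for i, label in enumerate(y) if label == major]
    majority_idx.sort(key=lambda i: mean_dist_to_nearest_minority(X[i]))
    minority_idx = [i for i, label in enumerate(y) if label == minor]
    keep = sorted(majority_idx[:n_minor] + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy one-dimensional data: six majority rows (0), two minority rows (1).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (4.0,), (5.0,)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = near_miss_1(X, y)
print(X_res, y_res)
```

Running it twice gives the same rows, which is consistent with the advice above to drop random_state.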

@rahulsharma-dk5jf 3 years ago

Why is random_state set to 42? Is there a specific reason for that?

@94fuckmylife 3 years ago

This helps in reproducibility. If you run the code in your terminal with random_state = 42, it will give the same results as the one he has got in the video.

@TrNz21 3 years ago

I had an error like this: 'NearMiss' object has no attribute 'fit_sample'. Please tell me what to do.

@saitejagoud4143 3 years ago

I'm also facing same problem

@mohammedameen3249 3 years ago

nm.fit_resample(X,y)

@user-ly9eu7rx7w 1 year ago

Could you please share the dataset? It is not available at the link.

@thanveenjbrox 3 years ago

Suppose we have a dataset which is imbalanced and also has missing values. Which should be treated first?

@CatBlack01 3 years ago

I have the same question. Can anyone answer it? My instinct is to try both ways and see if I get different results.

@priyanshugupta2104 1 year ago

What's the problem with this? You can handle the missing values.

@DonutTechBites 4 years ago

Sir, where is tutorial 44? Please do upload it.

@imayushthakur 3 years ago

I couldn't find the related data for practicing anywhere, not even at the provided links, so please share the source.

@jayjagani5998 3 years ago

Dataset Link: www.kaggle.com/mlg-ulb/creditcardfraud

@samriddhlakhmani284 3 years ago

Haha, I had already worked on this data set.

@ashwinchavan6391 1 year ago

I am getting this error, what should I do: TypeError: __init__() got an unexpected keyword argument 'random_state'

@abhinandanpawar.7880 2 years ago

How do I get this data?

@souravsaha7751 2 years ago

Sir imblearn is not installing.

@harshiths5140 1 year ago

Is there any contact information for joining the subscription plan for projects? Please let me know.

@akshatrailaddha5900 1 year ago

I'm getting a "No module named 'imblearn'" error even after installing it from the Anaconda prompt. Can anyone assist me with this?

@akshatrailaddha5900 1 year ago

Even after successfully installing the library from the command prompt and updating Python, it is still not working.

@abhinavpratapsingh4445 3 years ago

_init_ got an unexpected error

@pasalapravalli5968 4 months ago

Where is the creditcard csv file?

@ahmedhelal920 2 years ago

nm = NearMiss(random_state=42) gives me the error: __init__() got an unexpected keyword argument 'random_state'

@saurabhsharma-tm3co 4 years ago

Hi, I am unable to use the imblearn package. I tried to install it with:
conda install -c conda-forge imbalanced-learn
conda install -c conda-forge/label/gcc7 imbalanced-learn
conda install -c conda-forge/label/cf201901 imbalanced-learn
The packages above installed successfully from the Anaconda prompt, but I am still getting the error:
ModuleNotFoundError: No module named 'imblearn'
Please help.

@ishitagupta6584 4 years ago

I am also facing this error. Can you please share the solution if you got it. Thanks.

@abhishekrai6803 3 years ago

Same to me
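A frequent cause of the symptom in this thread is that the notebook kernel runs a different Python interpreter from the one the package was installed into. That is an assumption about these cases, not something the commenters confirmed. The standard-library check below reports whether `imblearn` is importable from the interpreter actually running the code; `module_available` and `install_hint` are hypothetical helper names.

```python
import importlib.util
import sys

def module_available(name):
    """True if `name` can be imported by the interpreter running this code."""
    return importlib.util.find_spec(name) is not None

def install_hint(module="imblearn", package="imbalanced-learn"):
    """Describe where the module was found, or how to install it for this interpreter."""
    if module_available(module):
        return f"{module} is importable from {sys.executable}"
    return f"{module} not found; try: {sys.executable} -m pip install {package}"

print(install_hint())
```

Running this in a notebook cell shows which interpreter the kernel uses. Installing with that exact interpreter, then restarting the kernel, usually keeps the package and the kernel in the same environment.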

@kajalkapasiya193 4 years ago

Hi Krish, can you please provide the data file as well?

@generationwolves 4 years ago

www.kaggle.com/mlg-ulb/creditcardfraud

@mohammadkaif2534 6 months ago

From 5000 to 1 million!!

@riteshtripathi8626 4 years ago

__init__() got an unexpected keyword argument 'random_state'
Any expert out there who can help me here? I am using a Mac and executing:
# Implementing Undersampling for Handling Imbalanced
nm = NearMiss(random_state=42)
X_res, y_res = nm.fit_sample(X, Y)

@yashrajadventures 4 years ago

I am also getting the same error, please help me.

@yashrajadventures 4 years ago

The error is that you haven't installed the imblearn library; you may have installed imbalanced-learn instead. The two are different.

@riteshtripathi8626 4 years ago

@@yashrajadventures Thanks for the suggestion, though it hasn't helped yet. I have the imblearn library preinstalled, and when I ran pip install imblearn again it said "requirement already satisfied". Did yours work?

@yashrajadventures 4 years ago

@@riteshtripathi8626 Yes, please install via the command "pip install imblearn"; other commands don't work. After installing, close every kernel and re-run all the cells.

@riteshtripathi8626 4 years ago

@@yashrajadventures Nope, it didn't work for me. I tried again and it says "requirement already satisfied"; looks like something's amiss. Anyway, I am using RStudio to handle imbalanced datasets for my day-to-day work. I will keep digging. Thanks.

@jagupatigolguri1875 1 year ago

Sometimes this code
from imblearn.under_sampling import NearMiss
nm = NearMiss()
X_res, y_res = nm.fit_sample(X, y)
does not work, so we can use this instead, where we can also adjust the sample size:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
# Instantiate RandomUnderSampler with desired sampling strategy
rus = RandomUnderSampler(sampling_strategy={0: 4900, 1: 250})
# Perform under-sampling
X_res, y_res = rus.fit_resample(X, y)
# Check the class distribution after under-sampling
print(Counter(y_res))