Thank you, Ashok! This is an outstanding explanation of a complex subject. You make it all feel very intuitive. Awesome stuff - I will look for more DataMites videos in the future!
Firstly, thank you for sharing. I want to ask something about time series. I have a lot of data, but the series are sampled at different frequencies, and I wonder how to handle them all together. Assume the series have been resampled to the same frequency. The data don't follow a normal distribution and the classes are imbalanced, which is why I'm asking: once the series share a frequency, can SMOTE be applied to time series? If not, how should I resample my time series?
Thank you very much for this video. I have a precipitation dataset with 4 columns and 8000 rows; each column contains a lot of zeros and only a few continuous values. I would like to know if I can use SMOTE in this case?
Running into a problem: the 'support' column in sklearn's print(classification_report(y_test, y_pred)) still looks unbalanced after applying SMOTE. What gives?
Oversampling increases the number of samples in the minority class to match the majority class. Undersampling reduces the number of samples in the majority class to match the minority class.
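The two techniques described above can be sketched in a few lines of plain Python. This is a minimal illustrative toy (the dataset and variable names are made up, not from the video); libraries such as imbalanced-learn provide production implementations:

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset: 8 majority-class rows, 2 minority-class rows.
majority = [([x, x + 1], 0) for x in range(8)]
minority = [([x, x * 2], 1) for x in range(2)]

# Oversampling: draw minority rows with replacement until both classes match.
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

# Undersampling: keep a random subset of majority rows the size of the minority.
undersampled = random.sample(majority, len(minority)) + minority

print(len(oversampled), len(undersampled))  # 16 4
```

Either way the classes end up balanced; oversampling keeps all the information at the cost of duplicated rows, while undersampling discards majority rows.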
Sir, I followed the recommended procedure, but my Jupyter environment raises an AttributeError saying the SMOTE object has no attribute '_validate_data'. Can you please help me with this?
Thank you for the informative video! In this video, you used SMOTE to rectify imbalance in the target label. What methods can we use to deal with class imbalance in categorical features (input) in order to make the model more robust?
Hi Sasidharan Sathiyamoorthy, that imbalance is a property of the input, so balancing the input might affect the target variable. Build two models, with and without balancing, and compare their performance.
@@DataMites Sir, if 54% of the people are cancer patients and 46% are non-cancer patients, do we need balancing? If yes, which balancing technique should be selected?
Hi sir, I have a question: how do we apply those resampling techniques in a neural network? Say we use an embedding layer and work with multiple kinds of data; would resampling make our data lose that information?
Nice video. After applying SMOTE, balanced data was obtained, but the balanced data (X_smote, y_smote) was not split (80:20) into train and test sets before reapplying the classification model. Is it necessary to split the data again, or was the original dataset itself used as the test set?
Thanks so much, bro. I have seen some data scientists apply undersampling and oversampling before splitting the dataset into training and testing. In my research paper we used the NearMiss technique to balance the dataset. I got good results using cross-validation splitting and an Extra Trees classifier as the model, and the same model to select the most important features; my results are ACC 0.97, F1 0.97, and AUC 0.99. Could these results be accepted for publishing?
Very informative. I have a question, sir: is it possible to set how many synthetic samples SMOTE creates? For example, I want to increase the number of samples to 200%; how do I set this parameter in the Python code?
While dividing the training and test data, shouldn't you be using stratify=y, to ensure the test and training sets have equal proportions of the outcome variable?
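For context on what stratification guarantees, here is a minimal, dependency-free sketch of a stratified split. It is not scikit-learn's actual implementation of train_test_split(..., stratify=y), just an illustration of the idea with a made-up 80:20 dataset:

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices so each class keeps its proportion in train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return train, test

# 80 negatives and 20 positives -> the 80:20 class ratio survives the split.
labels = [0] * 80 + [1] * 20
train, test = stratified_split(labels, test_frac=0.2)
print(sum(labels[i] for i in test), len(test))  # 4 20
```

Without stratification, a small test set can by chance end up with very few minority examples, making the evaluation metrics unreliable.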
The aim of a machine learning model is to generalize from the training set so that performance on unseen data is good. We don't care what the test data consists of; instead we try to give the algorithm more generalizable patterns.
Thank you for this nice explanation. I was making progress with the code, but when I tried to fit using the command X_train_smote, y_train_smote = smote.fit_sample(X_train.astype('float'), y_train), I got an error saying AttributeError: 'SMOTE' object has no attribute 'fit_sample'. I need urgent help please. Thank you.
Hello sir, the material you explain is very easy to understand. I want to ask about my project. I have imbalanced data, so I apply SMOTE and then model with KNN, but why does the accuracy go down after SMOTE, from 79% to 78%? Is something wrong with my data? Can you help explain this? I would be very grateful if you responded to my comment.
Using SMOTE, your model will start detecting more cases of the minority class, which increases recall but decreases precision. Accuracy is not a good measure of performance on unbalanced classes: the SMOTE technique puts more weight on the small class, biasing the model toward it, so the model will now predict the small class more reliably while the overall accuracy may decrease.
Hi, we cannot comment until we look at your data and all the approaches you have taken. One possibility is that your previous model was overfitted.
I have one doubt. What if the data contains NaN values and you want to do undersampling? If you impute the NaN values with the mean, there will be information leakage, because we impute the data before splitting it into train and test sets. Could you please tell me what the possible solution is in this case?
It can be due to multiple reasons, like using logistic regression for classification with more than 2 classes, or using a classifier when the target variable is continuous.
Thank you for the informative video! I used your code but got the error "ValueError: could not convert string to float: '5more'". Please tell me how I can resolve this error. Thanks in advance :)
@@DataMites Sir, I am predicting heart disease; in my sample 54% of the people have heart disease and the remaining 46% don't. Which method should I use for balancing?
At 8:15 you said it takes the average of centroids, which is completely wrong. SMOTE operates over the feature space. It goes like this: 1. Take the feature vector of a minority-class point. 2. Find its nearest neighbours (typically neighbours = 5) and compute the distance to one of them. 3. Multiply that distance by a random number drawn between 0 and 1. 4. Create the synthesized point at that offset along the line. Hope you got it 😀
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbours for that example are found (typically k=5). A randomly selected neighbour is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.
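The procedure described above can be sketched directly. This is a minimal toy implementation of the interpolation step for one synthetic point (function name, brute-force neighbour search, and the example points are all made up for illustration; real implementations like imbalanced-learn's SMOTE are more sophisticated):

```python
import random

def smote_point(minority, k=2, rng=None):
    """Create one synthetic sample: pick a random minority example, find its
    k nearest neighbours within the class, pick one of them at random, and
    place a new point a random fraction of the way along the connecting line."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Brute-force k nearest neighbours of `base` among the other minority points.
    others = [p for p in minority if p is not base]
    neighbours = sorted(
        others,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # random number in [0, 1)
    # Interpolate: base + gap * (neighbour - base), feature by feature.
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (8.0, 8.0)]
print(smote_point(minority, k=2))  # a point between two nearby minority examples
```

Because the new point always lies on a segment between two existing minority examples, SMOTE never invents values outside the region the minority class already occupies.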