I am not a girl who generally comments on YouTube videos, but I am learning from your videos, and this is my genuine comment: you are amazing, and your concepts in data science are very clear and to the point. I am very happy that a teacher like you is present here. Superb job, Sir!
"I am not a girl": okay, can't say these days. "Who generally comments on YouTube videos": first of all, YouTube doesn't have any comment-history data to prove this. Second, how dare you call this just another YouTube video? How dare you generalise an educational video that is free of cost, while people pay a hefty price for such content? Shame on you!
I was trying to understand NLP concepts by referring to various books and videos for the last two months, but the concepts were not clear to me. This explanation is really awesome, explained in a very easy way. Thanks Krish!
I would say that to prevent leakage we should split our data before we fit_transform on the corpus. In other words, right now we are teaching vocabulary to our model on the whole dataset, which defeats the purpose of splitting into train and test afterwards. The whole purpose of the test set is to test our model on unique data that it has never seen before. Please correct me if I am wrong! Cheers!!
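A minimal sketch of the leakage-safe order this comment describes: split first, then fit the vectorizer on the training texts only. `texts` and `labels` here are hypothetical stand-ins for the cleaned corpus and the 0/1 spam column from the video.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned corpus and its labels
texts = ["free prize winner", "see you at lunch", "win cash now", "meeting at noon"]
labels = [1, 0, 1, 0]

# Split BEFORE vectorizing, so the test texts never influence the vocabulary
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

cv = CountVectorizer(max_features=5000)
X_train = cv.fit_transform(X_train_txt)   # vocabulary learned from train only
X_test = cv.transform(X_test_txt)         # test reuses that same vocabulary

print(X_train.shape[1] == X_test.shape[1])  # feature counts match
```

Because `transform` reuses the fit-time vocabulary, train and test matrices always have the same number of columns.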
Best NLP videos of all time. A complete gist, and mind you, not for the faint-hearted. Excellent job, Krish. Initially I had given up on NLP completely, but now I have renewed vigour after such exemplary teaching.
Thank you very much, sir. Your videos are really very helpful. I am learning NLP from your channel for the first time. I don't know machine learning, which is why I am facing a little difficulty.
I am getting these accuracy values for the different combinations:
Stemming + CountVectorizer: 98.5650%
Lemmatization + CountVectorizer: 98.29596%
Lemmatization + TfidfVectorizer: 97.9372197309417%
Stemming + TfidfVectorizer: 97.9372197309417% (same as Lemmatization + TfidfVectorizer)
Hi Krish, I am the newest subscriber to your channel, and I hope this video will help me complete a project of my own. Thank you so much. Will continue to learn.
🎯 Key Takeaways for quick navigation:

00:00 📚 *Introduction to Spam Classifier Project*
- Creating a spam classifier using natural language processing.
- Overview of the dataset from UCI's SMS Spam Collection.
- Reading and understanding the dataset structure.

01:47 📂 *Exploring the Dataset and Data Preprocessing*
- Explanation of the SMS spam collection dataset.
- Reading the dataset using pandas and handling tab-separated values.
- Data cleaning and preprocessing steps using regular expressions and NLTK.

05:46 🧹 *Text Cleaning and Preprocessing*
- Using regular expressions to remove unnecessary characters.
- Lowercasing all words to avoid duplicates.
- Tokenizing sentences, removing stop words, and applying stemming.

13:52 🎒 *Creating the Bag of Words*
- Introduction to the bag-of-words representation.
- Implementation of count vectorization using sklearn's CountVectorizer.
- Selecting the top 5,000 most frequent words as features.

17:27 📊 *Preparing the Output Data*
- Converting the categorical labels (ham and spam) into dummy variables.
- Finalizing the output data with one column representing the spam category.
- Overview of the preprocessed data for training the machine learning model.

21:04 📊 *Data Preparation for Spam Classification*
- Data preparation involves creating independent (X) and dependent (Y) features.
- Explanation of the dummy variable trap in categorical features.
- Introduction to the train-test split for model training.

22:30 🛠️ *Addressing Class Imbalance and Training the Spam Classifier*
- Discussion of the class imbalance issue in the data.
- Introduction to the Naive Bayes classification technique.
- Implementation of the Naive Bayes classifier using multinomial Naive Bayes.

24:22 📈 *Evaluating Spam Classifier Performance*
- Explanation of the prediction process using the trained model.
- Introduction to the confusion matrix for model evaluation.
- Calculation of the accuracy score for the spam classifier (98% accuracy).

27:50 🔄 *Improving Spam Classifier Accuracy*
- Suggestions for improving accuracy, including the use of lemmatization.
- Mention of addressing class imbalance for better performance.
- Recommendation to explore the TF-IDF model as an alternative to count vectorization.

Made with HARPA AI
Thanks Krish, superb explanation once again. All my concepts about NLP are now crystal clear. I know a career in NLP is superb, but can you explain what its exact value is in terms of a data science career? Please guide, and feel free to reply, as I am eagerly waiting. Thanks once again.
How do you decide when to use CountVectorizer versus TF-IDF? And how do you decide whether/when to use stemming or lemmatization? For instance, in this example, why didn't you use TF-IDF instead of bag of words? And why was lemmatization not used instead of stemming?
Hi Krish, good session. I have one comment: for building the test corpus, the better practice may be to use transform only, i.e. fit_transform on train and only transform on test. And the train-test split should be done before we build the corpus. Let me know what you think.
I have two questions. First: why only MultinomialNB? Is there a specific reason? Can't we use BernoulliNB or GaussianNB? Second: if the dataset is imbalanced we could use ComplementNB, but how do we know whether the dataset is balanced or imbalanced in the first place?
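On the second question, one quick way to check balance is to look at the label proportions. A small sketch, with `messages` as a hypothetical frame shaped like the SMS dataset:

```python
import pandas as pd

# Toy stand-in: 9 ham messages and 1 spam message
messages = pd.DataFrame({'label': ['ham'] * 9 + ['spam'] * 1})

# normalize=True gives the fraction of each class rather than raw counts
proportions = messages['label'].value_counts(normalize=True)
print(proportions)  # heavily skewed toward 'ham', i.e. imbalanced
```

If one class dominates (as ham does in the real SMS dataset), accuracy alone can be misleading and techniques for imbalance become worth considering.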
BernoulliNB - used when features are binary, e.g. a presence/absence decision: if word 'X' is present, then 'spam', else 'not spam'.
GaussianNB - used when the feature values are continuous.
MultinomialNB - uses the presence of words and their frequency of occurrence to decide the decision boundary.
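A minimal sketch of the three variants described above on toy word-count data (the arrays here are illustrative, not from the video):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Each row is a document, each column a word's count
counts = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 0], [0, 0, 3]])
y = [1, 0, 1, 0]

mnb = MultinomialNB().fit(counts, y)                   # uses the raw counts
bnb = BernoulliNB().fit((counts > 0).astype(int), y)   # only presence/absence
gnb = GaussianNB().fit(counts.astype(float), y)        # treats features as continuous

print(mnb.predict(counts))
```

For bag-of-words counts, MultinomialNB is the natural fit; BernoulliNB discards the frequency information, and GaussianNB's continuous assumption is a poor match for count data.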
Hi Krish, why are we hard-coding max_features=5000? What if this code is migrated to production as-is and faces more tokens/features in live data (e.g. if the live data has 0.1 million (1 lakh) distinct words)? In this scenario, does our model fail?
I have created the model and saved it using joblib, but I am not understanding how to use the model for prediction. Is there any way I can pass the email text to the model so it can detect spam or ham? I am a newbie, please help. Thanks.
You can do it like this (where `ps`, `cv`, and `spam_detect_model` are the stemmer, fitted CountVectorizer, and trained classifier from the video):

df = pd.DataFrame(['this message is a spam'], columns=['message'])
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
X_new = cv.transform(corpus).toarray()
pred = spam_detect_model.predict(X_new)
print('Spam' if pred[0] == 1 else 'Ham')
@@yogeshprajapati7107 How does the model handle, say, 2500 features when doing predict? I believe there will be a mismatch between the number of features from the new message and the number of features the model was trained on. Can you share how to overcome this?
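There is actually no mismatch, because a fitted CountVectorizer's `transform` always emits one column per word in the vocabulary it learned at fit time, ignoring unseen words. A small sketch (the example texts are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000)
cv.fit(["free cash prize now", "lunch at noon tomorrow"])

# A new message with mostly out-of-vocabulary words
new = cv.transform(["totally unseen words plus cash"])
print(new.shape[1] == len(cv.vocabulary_))  # column count fixed by fit-time vocab
```

Only "cash" lands in a column here; the unknown words contribute nothing, so the feature count always matches what the model was trained on.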
I have gone through the 7 videos in the playlist, and everything is well explained in every video. Can you please tell me how to implement this program in a real scenario? Everyone ends their videos after only building the models, so please try to explain how we can use this model. If I have a text message, how do I find out whether it is spam or not using this model?
Hi sir, please correct me if I'm wrong. In line number 30 you are applying the fit_transform function to the whole dataset; won't that be data leakage? The transform has to be applied after splitting the data, right? Thank you.
Sir, I have tried running the code, but the shapes of X and y are not the same, so train_test_split is not working. It says: Found input variables with inconsistent numbers of samples: [11144, 5572]
Hi Krish, suppose we need to implement spam identification from scratch; how can we come up with a solution? The sample data used here has already been tagged as spam or ham by someone, sometime, somewhere. In practice, do we need to have labelled sample data upfront? Can you please advise?
Can we just use an if-else condition on the label column to derive the 0-1 (spam-ham) column? What is the purpose of using the get_dummies function for a binary class column?
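For a binary label the two approaches do coincide: a direct comparison gives the same 0/1 column without get_dummies. A quick sketch, with `messages` as a hypothetical frame shaped like the video's dataset:

```python
import pandas as pd

messages = pd.DataFrame({'label': ['ham', 'spam', 'ham'],
                         'message': ['hi there', 'win cash', 'see you']})

# Direct comparison: True/False cast to 1/0
y_direct = (messages['label'] == 'spam').astype(int)

# get_dummies route: keep only the 'spam' indicator column
y_dummies = pd.get_dummies(messages['label'])['spam'].astype(int)

print(y_direct.tolist() == y_dummies.tolist())  # both give [0, 1, 0]
```

get_dummies mainly earns its keep with more than two categories, where it produces one indicator column per class in a single call.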
At time 20:44, I think we should consider the ham column as the dependent feature Y. Because, say, the first sentence is a positive (ham) sentence, with ham=1 and spam=0: if you consider the spam column as the dependent feature, it gets the opposite meaning, treating a negative sentence as positive and vice versa. Could someone correct me if I'm wrong?
Hi Krish, nice video. Just had a question: what if I put the model in production and a new message has a word which is not part of my training dataset? Then the features won't match and the model will give an error?
Hello sir, if we have a different number of labels or categories, such as business, sports, entertainment, politics, tech, and history, then how can we get the dummy variables and the bag of words, and how do we find out which label each document belongs to?
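A minimal sketch of the multi-class case, on hypothetical news-style labels: get_dummies yields one 0/1 column per category, while most classifiers instead take a single integer label, which LabelEncoder provides.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(['business', 'sports', 'tech', 'sports'])

# One indicator column per category
dummies = pd.get_dummies(labels)
print(list(dummies.columns))  # ['business', 'sports', 'tech']

# Single integer label per document (classes sorted alphabetically)
le = LabelEncoder()
y = le.fit_transform(labels)
print(le.inverse_transform(y).tolist())  # recovers the original labels
```

The bag-of-words step is unchanged in the multi-class case; only the target encoding differs, and `inverse_transform` maps predicted integers back to category names.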