@codebasics Hello Sir, regarding the encoding approach (label encoding) used in the video: I read in the sklearn documentation that LabelEncoder should be used only on the target variable (output "y") and not on the input features ("X"). The documentation says that for input features one should use OneHotEncoder, OrdinalEncoder, or dummy variable encoding. I was also expecting you to use OneHotEncoder (OHE), since the input features (company, job and degree) are nominal and not ordinal variables. Is it best practice to use OHE for nominal variables, or does it not matter? Could you please clarify? Thank you.
This is by far the most straightforward and amazing video on decision trees I have come across! Keep making more videos, Sir! I am totally hooked on your channel :) :)
This is unbelievable. I saw someone use Random Forest, SVM, Gradient Boosting, etc. The best score on the testing data is 84%. With a simple Decision Tree, the best score would be around 82%, I think.
Thanks for this video. I have used train and test csv files of titanic. Cleaned both datasets and implemented Decision Tree Classifier and got a test score of 0.74 ❤️
Amazing video, thank you so much! I have a question. In the dummy variable video, you mentioned that when we do one-hot encoding we should always create separate columns, i.e. if Monroe Township = 1, Robbinsville = 2 and West Windsor = 3, we want to avoid confusing the model, which may assume Monroe Township < Robbinsville < West Windsor. But in this video, you're assigning the company names Google = 0, ABC Pharma = 1 and Facebook = 2. Is that the right thing to do?
Decision tree is one of those algorithms where label encoding works OK in some cases like ours, and you can save some memory by not using OHE. Check this for some insights: datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor Having said that, since the number of categories is small, we can use OHE, as there is no concern about sparsity. If I had to re-record this session, I'd probably use OHE.
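To make the two options in this exchange concrete, here is a minimal sketch that encodes the same columns both ways. The rows are toy data invented for illustration, not the tutorial's CSV, and the loop over columns is just one way to write it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows made up for this sketch (not the tutorial's dataset)
df = pd.DataFrame({
    "company": ["google", "abc pharma", "facebook", "google"],
    "job": ["sales executive", "business manager",
            "computer programmer", "business manager"],
})

# Option 1: label encoding, one integer column per feature.
# LabelEncoder assigns codes in sorted order of the category names.
label_encoded = df.copy()
for col in ["company", "job"]:
    label_encoded[col] = LabelEncoder().fit_transform(df[col])

# Option 2: one-hot (dummy) encoding, one 0/1 column per category.
one_hot = pd.get_dummies(df, columns=["company", "job"])

print(label_encoded)
print(one_hot.columns.tolist())
```

With three categories per feature the one-hot frame only grows to six columns, which is the "no concern about sparsity" case mentioned above.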
Thank you, sir. The exercises that you give at the end of your lectures help us experiment and get in-depth knowledge of the algorithm. Accuracy achieved = 0.87
@@jaihind5092 How did you achieve a score of 97.7%? I only achieved 82 :( Even after removing all NaN values from age and converting age and fare to int, my score went from 74 to 80 and finally flattened at 82! Help me improve.
How did you achieve a score of 87%? I only achieved 82 :( Even after removing all NaN values from age and converting age and fare to int, my score went from 74 to 80 and finally flattened at 82! Help me improve. Thanks
Thank you for this ML playlist. Your way of teaching is the best; anybody can understand if they watch the videos in sequence. My model score is 1. I replaced all the NaN values in age with the mean age by Pclass.
You didn't split the dataset into training and test sets, and maybe that's why it's 1: your test data is the same as your training data. Split the dataset and check the score.
A question I have: aren't we supposed to do one-hot encoding since the variables are not ordinal? Or is it that decision trees take care of it, since they don't consider the magnitude of features but rather the values of the features to determine the rules?
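A minimal sketch of the point made in this reply, assuming synthetic data from `make_classification` as a stand-in for the Titanic CSV: an unpruned tree scored on the rows it was trained on will typically report a perfect score, while the held-out score is the one that says something about generalization.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y adds label noise); illustration only
X, y = make_classification(n_samples=500, n_features=6, flip_y=0.2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Scoring on the training rows: the tree has memorized them
train_score = model.score(X_train, y_train)
# Scoring on unseen rows: the number worth reporting
test_score = model.score(X_test, y_test)

print("train:", train_score, "test:", test_score)
```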
This is a wonderful video, very clear overview, thank you! Is there a way to predict a continuous variable vs just a binary one (yes/no)? For example if I wanted to take purchase amount, gender, and whether or not they started a subscription, how much is this person likely to spend over the next year? Thanks in advance!
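For a continuous target like the one asked about here, scikit-learn offers `DecisionTreeRegressor`, which is used the same way as the classifier. The sketch below is only an illustration: the column names and all the numbers are invented to mirror the question.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical customers; every value here is made up for the sketch
df = pd.DataFrame({
    "purchase_amount": [20, 35, 50, 80, 120, 200],
    "is_female": [0, 1, 0, 1, 1, 0],
    "subscribed": [0, 0, 1, 1, 1, 1],
    "spend_next_year": [100.0, 150.0, 400.0, 650.0, 900.0, 1500.0],
})

X = df.drop(columns="spend_next_year")
y = df["spend_next_year"]

# A shallow tree; each leaf predicts the mean target of its training rows
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

new_customer = pd.DataFrame(
    {"purchase_amount": [90], "is_female": [1], "subscribed": [1]}
)
prediction = reg.predict(new_customer)
print(prediction)
```

Because each leaf predicts an average of training targets, a single tree cannot predict outside the range of spending it has seen.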
@codebasics Why did we use sklearn's LabelEncoder instead of pd.get_dummies? Since the company name, job and degree are nominal categorical data, we should have used pd.get_dummies instead of LabelEncoder. LabelEncoder should be used mostly for target variables, and even then when the data is ordinal categorical data, e.g. low < medium < high. Please help clarify my doubt.
Got an average accuracy of 97-99% on the Titanic dataset (for different test/validation sets, using different values for random_state). Features used: Age, Sex, Fare, Pclass
Maybe here is the answer: "Still there are algorithms like decision trees and random forests that can work with categorical variables just fine". datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
Sir, I have a doubt regarding the .score() method of sklearn.tree.DecisionTreeClassifier and accuracy_score() from sklearn.metrics. You computed the performance of the model with .score(). What if we compute it with accuracy_score()? Are they identical? And what if, for a certain classifier, accuracy is not the best metric for measuring performance, i.e. the best metric might be precision, recall or something else?
Should we drop the NA rows in the exercise? The ages are not correlated with each other, and in my opinion filling NA with the mean value may affect the accuracy of the final model.
You created 3 objects in cell 7 (i.e. le_company, le_job, le_degree), but you used only one object (le_company) when creating the new columns in cell 8. Is it necessary to create 3 objects, or can we get the job done with only one object, like you did?
Accuracy score is 0.748. Training score is 0.977. Replaced the null values in the Age column with the median. Dropped the unwanted features as mentioned in the video.
To convert categorical variables into numeric ones we have a few techniques: dummy variables, one-hot encoding and label encoding. My question is: here we used label encoding, so why not one of the other techniques?
ValueError: could not broadcast input array from shape (2,712) into shape (1,712). I'm getting this error whenever I try to fit (xtrain, ytrain) in the model. Can anyone please help resolve it?
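On the question above: for scikit-learn classifiers, `.score()` is documented as the mean accuracy on the given data, so it should match `accuracy_score()` computed on the model's predictions. The sketch below checks that on synthetic data (an assumption standing in for the Titanic set) and also computes precision and recall from the same predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Two routes to the same accuracy number
score_from_model = model.score(X_test, y_test)
score_from_metrics = accuracy_score(y_test, y_pred)

# Other metrics, useful when classes are imbalanced or errors have unequal cost
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(score_from_model, score_from_metrics, precision, recall)
```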
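The options that come up in this thread for missing ages (dropping the rows, filling with the median, filling with the per-Pclass mean) can be compared side by side. This is a minimal sketch on a six-row toy frame whose values are invented; only the column names `Pclass` and `Age` follow the Titanic CSV.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, np.nan, 30.0, 28.0, np.nan, 20.0],
})

# Option 1: drop rows where Age is missing (loses samples)
dropped = df.dropna(subset=["Age"])

# Option 2: fill with the overall median (less sensitive to outliers than the mean)
filled_median = df.assign(Age=df["Age"].fillna(df["Age"].median()))

# Option 3: fill with the mean age of passengers in the same Pclass
class_mean = df.groupby("Pclass")["Age"].transform("mean")
filled_by_class = df.assign(Age=df["Age"].fillna(class_mean))

print(len(dropped))
print(filled_median["Age"].tolist())
print(filled_by_class["Age"].tolist())
```

Which option scores best depends on the data and the split, so it is worth trying each one with the same `random_state`.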
Decision tree is one such classifier where using labelencoder also works ok. But in general I agree, one should use OHE only. You can modify code in this tutorial using OHE and it works perfectly ok.
What is the datatype of the target variable? I executed model.fit(inputs_n, target) and it threw the error below: ValueError: Unknown label type: 'unknown'. Please help.
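One common cause of this error, though not necessarily the cause in this commenter's notebook, is a target column stored with dtype `object` even though it holds numbers. A minimal sketch of checking and converting the dtype, with a made-up target and inputs:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical inputs and a target that was loaded with dtype=object
inputs = pd.DataFrame({"feature": [0, 1, 0, 1]})
target = pd.Series([0, 1, 0, 1], dtype=object)

print(target.dtype)  # object: scikit-learn may not recognize this as class labels

# Convert to a plain integer dtype before fitting
target_clean = target.astype(int)

model = DecisionTreeClassifier(random_state=0).fit(inputs, target_clean)
fit_score = model.score(inputs, target_clean)
print(fit_score)
```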
Step by step roadmap to learn data science in 6 months: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-H4YcqULY1-Q.html Exercise solution: github.com/codebasics/py/blob/master/ML/9_decision_tree/Exercise/9_decision_tree_exercise.ipynb Complete machine learning tutorial playlist: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-gmvvaobm7eQ.html 5 FREE data science projects for your resume with code: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-957fQCm5aDo.html
Great explanation, Sir. Thanks a lot for your efforts and help. I got 97.76% accuracy. I did not map male and female to 1 and 2; instead I used the column as it is. Is it necessary to do that? Is there any significance to it?
Exercise results: accuracy 0.8229665071770335. I actually used your csv file as training data and, for test data, the test.csv provided on Kaggle. That increased my training data (which would have been smaller if I had split my data), increased accuracy (as we have more data to train on), and reduced the chance of the overfitting I would have had if I had used the same data for both training and testing. Thank you for the great video.
ACCURACY: 0.811111111. But I have a question: how do we know when to use linear regression vs. when to use a decision tree on a dataset? PLEASE ANSWER THIS
In In (8) you use the "le_company" LabelEncoder object 3 times and never use the "le_job" and "le_degree" objects. It still works, so my guess would be that you only need one LabelEncoder object to do the job.
Label encoder basically converts categorical values to numerical ones. Since job and degree are categorical, you still need them to be label-encoded. And he did encode them; look carefully at the fit_transform() calls.
Well, here it worked because Sir used fit_transform, but if he had split the data into train and test sets, he would have used transform on the remaining test set, and for that a different instance would be required for each column.
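The replies above can be sketched as follows: `fit_transform` refits the encoder every time, so reusing one object works only as long as nothing has to be transformed later. Once there is a separate test set, each column needs its own fitted encoder. The frames below are toy data, and the `encoders` dict and `_n` suffix are naming choices made for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy train/test frames, invented for illustration
train = pd.DataFrame({
    "company": ["google", "facebook", "abc pharma"],
    "degree": ["bachelors", "masters", "bachelors"],
})
test = pd.DataFrame({
    "company": ["facebook", "google"],
    "degree": ["masters", "bachelors"],
})

# One fitted encoder per column, kept so the same mapping is reused on test data
encoders = {}
for col in ["company", "degree"]:
    encoders[col] = LabelEncoder().fit(train[col])
    train[col + "_n"] = encoders[col].transform(train[col])
    test[col + "_n"] = encoders[col].transform(test[col])

print(test)
```

`transform` raises an error for a category that was not seen during `fit`, which is one more reason to keep the encoders per column. For input features, scikit-learn also provides `OrdinalEncoder`, which handles several columns in one object.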
My model got a score of 98.6%. I dropped all the Age Na values which reduced the sample size from 812 to 714. I label-encoded the Sex column and then used a test size of 0.2 with the remainder of 0.8 as the training size. I am all smiles. Thanks @codebasics
Hi sir, I am a 10th grade student learning ML, and in the exercise my model got 81% accuracy 😀. I will make many models while learning and share them with you. Thanks for the tutorials, sir.
It is OK to learn ML, but make sure you find time for outdoor activities, sports and some fun things. Childhood will never come back, so do not waste it in search of some shiny career. If you are that concerned, I would advise focusing on math and statistics at this stage and worrying about ML later.
@@codebasics Absolutely correct. It's great to learn new things, but this is not the right age to learn all of this. Make more and more memories in childhood. I am 23 and trust me, life is very painful…
In the one-hot encoding tutorial you mentioned it's better because then we don't have encoded values that are related to each other. Please clarify. These videos are teaching me a lot.
Incredible video! Thank you for sharing your knowledge. Scored 83.15%. I changed the hyperparameter "criterion" to entropy instead of gini and it consistently performed better. Looking forward to seeing how changing other hyperparameters affects accuracy.
Actually this man has made learning Machine Learning easy for everyone whereas if you will see other channels they show big mathematical equations and formulas..which makes beginners uncomfortable in learning ML. But thanks to this channel.♥️🥰
@@codebasics But then doesn't the model give a higher priority (value) to Facebook than to Google on the basis of the number assigned by label encoding? Just confused here.
Hello sir, at 7:50 a LabelEncoder is created for each of the columns (company, job and degree), but when we call fit_transform, why is only le_company used? For job and degree, do we have to write le_job.fit_transform() and le_degree.fit_transform()? Am I right? Please answer 😶
Hey, honestly I am not aware of any good resource for this. Kaggle.com is there but it is for competition and little more advanced level. Try googling it. Sorry.
Great tutorials, keep going! But I have a doubt: why haven't you used OneHotEncoder for company here, as it is a nominal variable? And please make a tutorial on what exactly these parameters are, and on random forests.
True, one-hot encoding is better than LabelEncoder, as assigning category numbers would result in errors in prediction if that feature is chosen, because a higher category is considered better than the others. So in this case, if Google = 0 and FB = 1, then FB > Google.
I have only started to learn about data science using Python and I have a question: why use LabelEncoder rather than getting dummy variables for the categorical variables? Is it more efficient to use LabelEncoder?
@@elvenkim Why are you removing the missing values when it is possible to fill them with the mean or median? Which one to use depends on the outliers present in the Age column.
Really appreciate your work. Learning a lot. Just want to confirm something from the tutorial: at 7:40 you are using fit_transform with the le_company object for all the other columns and did not use the le_job and le_degree objects. Is that OK, or should we use them? Thank you very much again.
Do you have anything related to sentiment analysis / text mining / text analysis? Please make a tutorial on text analytics, as the other videos are so good. I also request that you create charts for AUC and a model evaluation according to the CRISP-DM model.
Thank for these awesome videos. I have been learning a lot through your ML tutorials. I replaced the missing values in the 'Age' column with the median. My test set was 20% and my accuracy on test data was 99.44%.
@codebasics I have a doubt here: for the different companies and jobs we should have used get_dummies or one-hot encoding, so why did we use LabelEncoder here? Will our model not assume internally that Google is better than Facebook and the pharma company, or vice versa? Please clarify whether I understood it correctly.
It might indeed affect the accuracy. One-Hot-Encoding should be used when dealing with nominal categories - i.e. no inherent ordering. Similarly for the Titanic exercise, male or female should (in theory) also use One-Hot-Encoding and not Label Encoding.
Increasing the score is an art as well as a science. If your question is specific to decision trees only, then try fine-tuning model parameters such as criterion, tree depth, etc. You can also try some feature engineering and see if it helps.
So, in this example, why aren't we converting categorical features to numbers the same way? We did convert them to numerical values, but we are not doing one-hot encoding here like we did in one of the previous videos. Do we need to convert categorical features to separate numerical columns only in the case of linear models?
Amazing video! But I have some doubts, please help me here: 1. We made three LabelEncoder instances here. Can't we use just one to encode all three? 2. We used label encoding and not one-hot encoding; however, the latter made more sense, as our model might assume that our variables have some order/precedence. It would be great if you could clarify my doubts. Thanks!
It is necessary to understand the underlying logic of the algorithm. In regression, the algorithm tries to fit a line or curve (or a higher-dimensional object in SVM), so the relative value (the order, or where it sits on the axis) matters. In a decision tree, the algorithm is just asking yes/no questions, such as "Is the company Facebook?" or "Does the employee have only a bachelor's degree?", so the order is not significant. Therefore, the label encoder is valid for a decision tree. While it would have been possible to lump the label encoders into one, say by using powers of 10 to distinguish them, that would have given too much weight to the highest power of 10 (the algorithm understands numbers, so it is going to ask >, < or = questions), and the whole point of using a decision tree is for *the algorithm* to find the precedence of features that gives the quickest prediction. Therefore it is better to have more features (i.e. more label encoders). Then, if more features are better, one could ask again why not use one-hot encoding, which would give even more encoders. Now the issue is the tradeoff between accuracy and conciseness. Here there were only 3 companies, but there could be a problem that examines over 100 companies. Having a one-hot encoder for all the companies would get quite cumbersome.
Hello @codebasics, thank you for the video, but I have a question. Why did you use LabelEncoder instead of one-hot encoding, knowing that these values are not ordinal? If you guys know the answer, thanks for sharing it with us.
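One way to try the parameter tuning suggested here is a small grid search over `criterion` and `max_depth`. This is only a sketch: the data is synthetic (`make_classification`), and the grid values are arbitrary choices for illustration, not recommendations from the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the exercise data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate settings; the specific values are arbitrary for this sketch
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, None],
}

# 5-fold cross-validation for every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

Because every combination is scored by cross-validation rather than on one split, the comparison between gini and entropy is less dependent on a lucky split.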
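To see what questions scikit-learn's tree actually asks about a label-encoded column, the learned rules can be printed with `export_text`. In this toy sketch (data invented so that only the middle code has a positive label), the splits are threshold tests on the integer code, and isolating the middle category takes two of them. The feature name `company_n` is just a label chosen here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# One label-encoded feature with codes 0, 1, 2; only code 1 has target 1
X = np.array([[0], [1], [2]] * 4)
y = np.array([0, 1, 0] * 4)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned rules: each split is a "<= threshold" test on the code
print(export_text(tree, feature_names=["company_n"]))

depth = tree.get_depth()
train_accuracy = tree.score(X, y)
```

So the tree can still separate any single category from integer codes, but it may need more splits than it would with one-hot columns, where one split per category is enough.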
@@jayrathod2172 Yes, I was about to say that. It is also possible if you have changed the random state multiple times, so your model has seen all your data and is now overfitted.
Thank you for such amazing, well-detailed and easy-to-understand tutorials! I'm following your channel exclusively for learning ML, along with Kaggle competitions, and I'm also recommending your channel to my peers. Great work! PS: I got 75.8% as the score of my model for the exercise. Any tips to improve the score?
Re-execute the train_test_split function, as it selects rows randomly. Then fit the model again and score it. Continue this 4-5 times until you get somewhere around 95% accuracy. So this set of data is the most accurate for training the model.
Your videos are absolutely awesome. Those who want a career transition into DS usually spend more than 3k US dollars on a certification, and what they ultimately get is a diploma or degree certificate in data science, not an understanding of what is actually happening in data science. But when a scholar like you trains us, we come to know what's happening in it.
Help me understand this: I used three different methods to encode the strings in the Sex column (1: LabelEncoder, 2: get_dummies, 3: map), then filled the NaN values with mean() and kept test_size the same for all three encoding methods, BUT I got different accuracies. Tell me why?
Sir, in the exercise you used map on the Sex column and I did it using LabelEncoder. I like it when you give us a different approach to the same task. One more question, Sir: instead of the mean, why can't we use the mode for the Age column? BTW, my score is 79%.
Thanks for the awesome tutorial.... Dropped all na values in Age column which reduced the sample size from 812 to 714 and ran the model couple times, the best accuracy I got was 83.21%
Thanks so much for these tutorials! These are the best tutorials I've found so far. The code shared by you for examples and exercises are very helpful. I got score 76% for the exercise. How is it possible to get a different score for the same model and the same data? The steps followed are the same too.
train_test_split generates different samples every time, so when you run your code multiple times it will give a different score. Specify random_state in the train_test_split call, let's say 10; after that, when you run your code you get the same score. This is because your train and test samples are now the same between runs.
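A minimal sketch of the reply above, using a tiny made-up array: two calls to `train_test_split` with the same `random_state` select the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state on both calls, so the shuffling is identical
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=10)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=10)

same_split = np.array_equal(y_test_a, y_test_b) and np.array_equal(X_test_a, X_test_b)
print(same_split)
```

`DecisionTreeClassifier` also accepts a `random_state` parameter; setting it as well removes the remaining randomness in how the tree itself is built.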