NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Hi Josh! I had a question regarding why you would use One Hot Encoding instead of Label Encoding in this case. Wouldn't One Hot Encoding result in an increased number of dimensions, which could actually cause the Decision Tree algorithm to overfit?
One-hot encoding works well when you don't have too many different options (which is the case in this video). It's also the method of choice for more advanced tree-based methods, like XGBoost.
@statquest Thanks for the clarification! How many unique categories should a feature have before one should switch from One Hot to Label Encoding?
@SarveshRelekar That is a great question! Unfortunately there is no hard-and-fast rule (except for XGBoost, which recommends One Hot Encoding regardless of the number of categories).
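For anyone curious what this looks like in code, here is a minimal sketch of one-hot encoding with pandas; the column name "cp" and its values are made up for illustration and are not from the video:

```python
import pandas as pd

# Hypothetical categorical column, just for illustration
df = pd.DataFrame({"cp": [1, 2, 3, 4, 2]})

# One-hot encoding: one new 0/1 column per category,
# so no artificial ordering is imposed on the categories
one_hot = pd.get_dummies(df, columns=["cp"])
print(one_hot.columns.tolist())  # ['cp_1', 'cp_2', 'cp_3', 'cp_4']
```

Label encoding, by contrast, would map the categories onto 0, 1, 2, 3 in a single column, which imposes an ordering that may not actually exist in the data.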
Though I am aware of classical techniques like ESM, the ARIMA family, UCM, IDM, etc., I still cannot figure out how to use GBMs or neural/LSTM-based methods for time series forecasting in the univariate and multivariate cases (using endogenous variables, where sales forecasting depends on revenue, profit, campaigns, etc.). I did go through a few similar GitHub repos but somehow cannot get the concepts right...
If you subscribe with "the bell" you should get announcements about webinars. If you become a channel member or a Patreon supporter ( www.patreon.com/statquest ), you'll get priority registration.
@statquest Thank you... Also, Guruji (teacher in Hindi), if I want to access the Jupyter notebook for the decision trees you taught us here, how do I do that?
Josh, this is really great. Can you upload videos with some insights on your personal research and which methods you used? And some examples of why you prefer one method over another? I mean, not only because you get a better ROC/AUC result, but is there a "biological" reason for using a specific method?
My intro song for this channel: "It's like Josh has got his hands on Python right, he teaches ML and AI really well and tight ---- STAT QUEST". BTW, thanks brother for so much wonderful content for free.
I actually think it would be great if you created more videos for other ML algorithms. Having taught us almost every aspect of machine learning algorithms as far as the mechanics and the related fundamentals are concerned, I feel it is high time to see those in action, and Python is, of course, the best way to go.
Thank you, this video helped me a lot! For anyone else following along in 2023, the way the confusion matrix is drawn here didn't work for me anymore. I replaced it with the following code:

cm = confusion_matrix(y_test, clf_dt_pruned.predict(x_test), labels=clf_dt_pruned.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Does not have HD", "Has HD"])
disp.plot()
plt.show()
Great tutorial! One question: looking at the features included in the final tree, does it mean that only those 4 features are considered for prediction? I.e., we don't need the rest, so we could drop those columns for further use?
I wish you were my uncle, Josh, or something. I can imagine how hard I would have argued with my parents to spend time with my TRIPLE cool uncle.
Unfortunately I never got around to that webinar. The closest thing I have is a video on how to impute data with a random forest. However, this feature is only implemented in R (not python): ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-sQ870aTKqiM.html
49:44 Is each point on the plot made from one alpha for a different number of leaves? So is it an average of all the different models possible at, say, alpha = 0? Are we plotting average squared residuals against alpha?
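For what it's worth, here is a sketch of how those points can be computed (using a stand-in dataset rather than the video's heart-disease data): each candidate alpha comes from the pruning path of the full tree, and each plotted point is the mean of the 5 cross-validation scores for a tree pruned with that alpha:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; the video uses the heart-disease data instead
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate alphas come from the pruning path of the unpruned tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)
alphas = np.maximum(path.ccp_alphas[:-1], 0.0)  # drop the last alpha (root-only
# tree) and clip tiny floating-point negatives to zero

# One point per alpha: the mean of that alpha's 5 cross-validation scores
mean_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=42, ccp_alpha=a),
                    X_train, y_train, cv=5).mean()
    for a in alphas
]
```

For classification the score is accuracy rather than squared residuals (those come up in the regression-tree video), but the idea of one averaged point per alpha is the same.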
Thank you for your great effort and simple explanation. I have only one question: why did you split the data into X_train and y_train and then give it to cross_val_score? Shouldn't cross-validation work on all of X?
This is legen..... wait for it ....dary!! 😎 This detailed coding explanation of Decision Tree is hard to find but Josh you are brilliant. Thank you for such a great video.
First, thank you. You explain complicated things in the easiest way, with visualization. But you should get a better microphone to go with it. I think I am going to keep watching your videos.
Hi Josh. Loved this video. I have two questions: 1) Is there any way to save our final decision tree model to use later on unseen data without having to train it all over again? 2) Once you have decided on your final alpha, why not train your tree on the full, unsplit dataset? I know you will not be able to generate a confusion matrix, but wouldn't your final tree be better if it were trained with all the examples?
Yes and yes. You can write the decision tree to a file if you don't want to keep it in memory (or want to back it up). See: scikit-learn.org/stable/modules/model_persistence.html
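A minimal sketch of that persistence step with joblib, on a stand-in dataset (the variable and file names are illustrative, not from the video):

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the video this would be the pruned heart-disease tree
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# Save the fitted tree to disk, then load it back later for new data
dump(clf, "decision_tree.joblib")
restored = load("decision_tree.joblib")
print((restored.predict(X) == clf.predict(X)).all())  # True
```

The restored model predicts exactly like the original, so there is no need to retrain.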
Hi Josh, I request you to make more such ML videos in Python which cover all ML concepts holistically. I am sure this course will then become more popular than any of the available ML courses. Pls pls pls....
Hey Josh. One thing that bugs me about this tutorial: when you do binary classification, you need to take into account class imbalance. Accuracy is the worst metric for this. Was that neglected for a reason?
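To illustrate the point with a toy example (not from the video): with a 90/10 class imbalance, a classifier that always predicts the majority class scores 90% accuracy but only 50% balanced accuracy, which is one reason plain accuracy can mislead:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # a "classifier" that always predicts 0

acc = accuracy_score(y_true, y_pred)           # 0.9, looks great
bal = balanced_accuracy_score(y_true, y_pred)  # 0.5, reveals the problem
```

Balanced accuracy averages the recall of each class, so the useless always-negative classifier gets (1.0 + 0.0) / 2 = 0.5.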
Your videos are always very good. But today I’ll have to commend you on your fashion choice as well. Great-looking shirt! I hope you have had the opportunity to visit Brazil.
I have already commented, but I watched the video again and I have to say I am even more impressed than before. A truly fantastic tutorial: not too verbose, but with every action clarified and commented in the code, and beautifully presented (I have to work on my markdown; there are quite a few markdown formats you use that I cannot replicate... to study when I get the notebook). So all in all, one of the very top ML tutorials I have ever watched (including paid-for training courses). Can't wait for today's or tomorrow's webinars. Can't join in real time as I'm based in Europe, but I will definitely pick it up here and get the accompanying study guides/code.
@statquest At 19 minutes you say you have plans for a whole webinar on missing data! This is what I need. Where can I find it or is it still in production? :D
@statquest Thanks for replying! I can see how easy it is to forget! You have so much content it's unreal! V impressive! I just purchased your notebook through the link, but it doesn't appear to have arrived in my inbox. Can you advise? I am also strongly considering paying for your Patreon account. I currently pay for Datacamp, but your material is so much better!
@oliveryoule11 Wow! Thanks for supporting me and I'm sorry you had trouble purchasing the notebook. If you contact me through my website, I can send it to you directly: statquest.org/contact/
Great tutorial! But unfortunately, I'm struggling at minute 48. How can it be that I get a negative ccp_alpha of -2.168404344971009e-19? The y values are 0 or 1 and all X values are positive. Does someone have an idea what the reason is?
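That tiny negative value (on the order of 1e-19) is almost certainly floating-point rounding noise from the pruning-path computation, not a real negative alpha. A common workaround (my suggestion, not from the video) is to clip the alphas at zero before cross-validating, since newer scikit-learn versions reject negative ccp_alpha values:

```python
import numpy as np

# Tiny negative alphas like -2.2e-19 are floating-point rounding noise;
# clipping them to zero makes every alpha valid for DecisionTreeClassifier
ccp_alphas = np.array([-2.168404344971009e-19, 0.0, 0.003, 0.012])
ccp_alphas = np.clip(ccp_alphas, 0.0, None)
print(ccp_alphas.min())  # 0.0
```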
To see a full picture of the decision tree at 41:00, try this code:

from sklearn import tree
clf_dtree = tree.DecisionTreeClassifier(random_state=42)
clf_dtree = clf_dtree.fit(X_train, y_train)
plt.figure(figsize=(44, 20))
tree.plot_tree(clf_dtree, fontsize=10, filled=True, rounded=True,
               class_names=["No HD", "Yes HD"], feature_names=X_encoded.columns)
plt.show()

Click on the miniaturized display.
When I want to plot the confusion matrix, the following error occurs at the import stage: ImportError: cannot import name 'plot_confusion_matrix' from 'sklearn.metrics' (C:\Users\hp\Anaconda3\lib\site-packages\sklearn\metrics\__init__.py). What do I do to rectify this?
Hi Josh, I recommend your videos to all my students and love watching and learning from them 👍. Can we still download this notebook, or do we need to buy it? Regards from South Africa!
A doubt: when calculating the best alpha (scores = cross_val_score(clf_dt, X_train, y_train, cv=5)), the data used for the calculation is the training data, but I understood from the video How to Prune Regression Trees, Clearly Explained!!! that ALL the data was used to find the optimum alpha. Sorry, it's probably obvious, but I can't find the answer.
Unfortunately I am sometimes sloppy when I describe training and testing data. Sometimes I call something "testing data" when it is "validation data". So, in this case, the testing data is "validation" and "training data" represents ALL of the data that we will use to create the tree.
I have some questions: what was the methodology that you used, how do you interpret it, and did you use any descriptive statistical analysis or exploratory data analysis?
Hi Josh, I see that in sklearn all the tree-based ensemble algorithms have ccp_alpha as a tuning parameter. Is it advisable, or even feasible, to tune it for hundreds of trees (especially when the trees are randomly created), or should we tune standard parameters like the learning rate, number of trees, loss function, etc.?
@statquest Just wondering, is it possible to tune this for a random forest? Since we are creating hundreds of trees with randomly selected features for every tree, and as far as I understood, ccp is a tree-specific parameter. Please give some insight on this in your next session. Hope my query is relevant 🙂
@SaurabhKumar-mr7lx With Random Forests, the goal for each tree is different than when we just want a single decision tree. For Random Forest trees, we actually do not want an optimal tree; we only want something that gets it right a little more than 50% of the time. So in this case, we just limit the tree depth to 3 or 4 or something like that, rather than optimizing each tree with cost complexity pruning.
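For what it's worth, a sketch of that idea on a stand-in dataset (the depth limit of 3 is just an example value):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Instead of pruning each tree with ccp_alpha, a Random Forest is commonly
# regularized with a simple cap like max_depth applied to every tree
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_depth=3,
                            random_state=42).fit(X, y)

# Every individual tree in the forest respects the depth cap
print(max(tree.get_depth() for tree in rf.estimators_))  # at most 3
```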
Can we tune the hyperparameters (not only alpha) with grid search (extra cross-validation)? Maybe we can optimize the tree to work even better? So we would have the cross-validation for alpha, and we could add a grid search for max_leaf_nodes, gini vs. entropy, min samples, etc. Thanks in advance.
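Something like this should work; here is a sketch with GridSearchCV on a stand-in dataset (the grid values below are illustrative, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Cross-validated search over several parameters at once, not just ccp_alpha
X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_leaf_nodes": [None, 10, 20],
    "min_samples_leaf": [1, 5],
    "ccp_alpha": [0.0, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the winning combination of settings
```

The combinatorial grid grows quickly, so with many parameters RandomizedSearchCV is often a cheaper alternative.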
Can someone provide me with a link to the code? I am financially constrained and trying to move into Data Science, and I cannot afford to pay. Thanks and regards (love from India).
Why did you determine the alphas using the "training data" rather than the "full dataset"? As I remember from the video on pruning regression trees, you found the alphas with the full data.
I'm sorry that I was sloppy/imprecise with my terminology. "full data", I guess, refers to the full amount of data we are using to build the tree (and not some partition that we use for cross validation).