Very useful and clear, many thanks. I have a question. I usually come across assessments such as: "the correct order of steps for machine learning tasks is to first split the data into training and testing sets, and then perform any preprocessing steps, like creating dummy variables, on each set separately to avoid information leakage." But in the SVM classification and regression videos you created the dummies first and then split the data. Is there a specific reason for this choice? Or is it about using OneHotEncoder or OrdinalEncoder vs. get_dummies?
Great point Kadir. Yes, the best practice is to fit_transform(X_train) and then transform(X_test) to avoid data leakage (for any kind of data preprocessing). In this example, because there are no outliers or leverage points, I decided to do it all at once. I encourage you to try it your way as well and compare the results; a minimal sketch of the leak-free pattern is below.
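Here is a small sketch of that pattern, assuming scikit-learn's OneHotEncoder; the DataFrame and column names are made up for illustration, not the data from the videos:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "target": [0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["color"]], df["target"], test_size=0.33, random_state=42
)

# handle_unknown="ignore" guards against categories that show up only in
# the test split, which pd.get_dummies applied after the split cannot do.
enc = OneHotEncoder(handle_unknown="ignore")
X_train_enc = enc.fit_transform(X_train)  # learn categories from train only
X_test_enc = enc.transform(X_test)        # reuse the same categories on test
```

This is one reason OneHotEncoder fits the split-first workflow more naturally than get_dummies: the fitted encoder remembers the training categories and applies them consistently to the test set.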
Hi, isn't it the case that when you convert your dates to categorical variables, you break the dependency in time? So this reduces your prediction power? We need to treat the data in a way that models seasonality and trend...
Good observation Amirali. Note that if there is a time-series variable in the data (like a date) and you want to add it to the model, you do NOT transform it into categories (as you pointed out, that would break the time dependency). However, when there is a variable like weekday, month, or some other time variable with a limited number of outcomes (12 months, 7 days, etc.), we do make them categories, and this does NOT break the seasonality effect; see the sketch below. Hope that helps! In general, for time-series prediction there are better models to use, and we rarely use SVM for time-series data.
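A small sketch of the distinction, assuming pandas and a hypothetical "date" column:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=10, freq="D")})

# Bounded time variables (12 months, 7 weekdays) are safe to treat as
# categories; they capture seasonality without imposing a false ordering.
df["month"] = df["date"].dt.month.astype("category")
df["weekday"] = df["date"].dt.day_name().astype("category")

# The raw timestamp itself should NOT be binned into categories, since
# that would destroy the time ordering the model may rely on.
```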
So random search over C and gamma is both faster and more efficient? And that's with only three sampled values? If we increased that to 4 or 5, would that help find the best values, or would it potentially waste more time?
Good question Clayton. Usually we are able to make an educated guess by trying out the first couple of alternatives for C and gamma; if not, we keep guessing! There is no clear-cut answer for that. A small sketch of widening the search is below.
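For reference, here is a hedged sketch of what increasing the number of draws looks like with scikit-learn's RandomizedSearchCV; the dataset is a synthetic stand-in, not the one from the video:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Log-uniform distributions are a common choice for C and gamma because
# useful values span several orders of magnitude.
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)}

# Raising n_iter from 3 to, say, 10 samples more candidate pairs; each
# extra draw costs one more cross-validated fit, so it is a time/quality
# trade-off rather than a guaranteed improvement.
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```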