@@josejdiazcaballero1646 Thanks for the contribution! I somehow missed the prediction portion (which is incredibly important), but the code you provided should do the trick.
Good visualization of the graphical plots. I'd have liked it even more if you'd shown how to select the optimal k-value, plotted the k-values against their respective accuracy rates, and included a confusionMatrix.
I'm glad you liked it! There are definitely quite a few ways of choosing a k-value; I just stuck with the rule of thumb. Visualizing the results across k-values would have been a great addition to the video. And why didn't I think of the confusion matrix! haha. I'll keep your suggestions in mind for future content! :)
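For anyone curious, here is a minimal sketch of what that k-sweep could look like on the iris dataset, using class::knn. The split size, seed, and k range are illustrative assumptions, and the confusion matrix uses base R's table() (caret::confusionMatrix gives richer output).

```r
library(class)

set.seed(42)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
train_x <- iris[train_ind, 1:4]
test_x  <- iris[-train_ind, 1:4]
train_y <- iris[train_ind, 5]
test_y  <- iris[-train_ind, 5]

# Sweep k and record test-set accuracy for each value
ks  <- 1:25
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

# Accuracy vs. k plot, then a confusion matrix at the best k
plot(ks, acc, type = "b", xlab = "k", ylab = "Test accuracy")
best_k <- ks[which.max(acc)]
pred   <- knn(train_x, test_x, cl = train_y, k = best_k)
table(Predicted = pred, Actual = test_y)
```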
Nice video! But as far as I understand, isn't the statement at 01:45 wrong? The smaller the k value, the greater the overfit, not the underfit...?
Great catch! You are absolutely correct. I mixed up the k-value meanings: a smaller k leads to greater OVERFIT. With a small k, each prediction depends on only a handful of nearby points, so the model chases noise in the training data, and this will tremendously skew your predictions on new data.
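One way to see the k = 1 overfit in action: predicting the training set from itself with k = 1 is always perfect, because each point is its own nearest neighbor. This is a quick illustrative sketch on iris (the split size and seed are arbitrary assumptions):

```r
library(class)

set.seed(1)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
train_x <- iris[train_ind, 1:4]
train_y <- iris[train_ind, 5]

# With k = 1, every training point's nearest neighbor is itself,
# so training accuracy is a perfect 1.0 -- a hallmark of overfitting.
train_pred_k1 <- knn(train_x, train_x, cl = train_y, k = 1)
mean(train_pred_k1 == train_y)  # 1.0
```

A larger k averages over more neighbors, smoothing the decision boundary at the cost of some flexibility.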
@@SpencerPaoHere Hello Spencer, I mean I need to predict which category my species falls into. For example, I have a flower whose sepal length is 6, sepal width is 4, petal length is 1.5, and petal width is 0.4. I just want to know where this new data point will be categorised by the knn model. Logistic regression can do this, for example, but my data is non-linear… Many thanks, also for the video, Maarten
Oooh. Yes. You can use the predict() function, i.e., predict(model_name, new_observations), where new_observations has the same number of features as the training input.
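For Maarten's specific numbers, here is a sketch with class::knn, which takes the new observation directly as the test argument rather than going through a stored model and predict() (k = 5 is an assumption):

```r
library(class)

# Maarten's new flower: sepal length 6, sepal width 4,
# petal length 1.5, petal width 0.4
new_obs <- data.frame(Sepal.Length = 6, Sepal.Width = 4,
                      Petal.Length = 1.5, Petal.Width = 0.4)

# knn() classifies the new point from its 5 nearest training neighbors
pred <- knn(train = iris[, 1:4], test = new_obs,
            cl = iris[, 5], k = 5)
pred  # predicted species; short petals like these fall in setosa territory
```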
Unfortunately, the last plot is obstructed by the preview of your next videos. Otherwise, nice job! I would be curious how to deal with non-numeric data as well!
Plug the completely new dataset into your trained algorithm to obtain the predicted classifications. (Your new dataset must have the same features and feature types.)
Hello! There are several techniques that can help with imputation, and knn is one of them! I did a video on imputation which might be helpful -> ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-MpnxwNXGV-E.html
Hello Spencer, do you know how I can select the observations that were easiest and hardest to classify (i.e., the highest and lowest predicted probabilities for the correct class)?
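One possible approach, sketched below: class::knn with prob = TRUE attaches the proportion of neighbor votes received by the winning class, and sorting on that surfaces the most and least confident predictions. Note this is the vote share of the *predicted* class, not necessarily the correct one; with the true labels in hand you can restrict to correctly classified points first. The split size, seed, and k are assumptions.

```r
library(class)

set.seed(7)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
pred <- knn(iris[train_ind, 1:4], iris[-train_ind, 1:4],
            cl = iris[train_ind, 5], k = 7, prob = TRUE)

conf <- attr(pred, "prob")      # vote share of the winning class per test point
ord  <- order(conf, decreasing = TRUE)

easiest <- head(ord, 5)         # test-set indices of the most confident calls
hardest <- tail(ord, 5)         # test-set indices of the least confident calls
```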
@@gabrielhimelfarb1757 I had the same (I think) problem. The error was a length mismatch between the data and the labels in geom_text(aes(label = test_labels)). I had assigned the wrong value to test_labels: I had row_labels[train_ind] but it should be row_labels[-train_ind]. Plotting works then.
@@أسيدمحمد-ه2ز In my script, the row labels are data[,5], where data is the iris dataset (data = iris). So first you would create row_labels = iris[,5]; then you can call row_labels[-train_ind]. The reason you were getting that error is that your row_labels was never initialized.
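Putting the fix being described into one self-contained sketch (the seed and split size are arbitrary assumptions): negative indexing with -train_ind selects exactly the held-out rows, so the test labels line up with the test data.

```r
set.seed(3)
data <- iris
row_labels <- data[, 5]                  # initialize the labels first

train_ind <- sample(seq_len(nrow(data)), size = 100)
train_labels <- row_labels[train_ind]    # labels for the training rows
test_labels  <- row_labels[-train_ind]   # labels for the held-out rows

# The lengths now match their corresponding data splits,
# so geom_text(aes(label = test_labels)) no longer errors out:
length(test_labels) == nrow(data[-train_ind, ])  # TRUE
```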
@@euphoria1725 Yes! You can use categorical variables in knn models, but you will probably have to one-hot encode the categories so that the data types are numerical. Then you'd have to pick an appropriate distance function that can handle binary features.
@@SpencerPaoHere Thanks for the reply! What if the categorical variable has 3 or more values and is not ordinal, like "English", "Chinese", "Japanese"? Can I set them as 1, 2, 3?
@@euphoria1725 One-hot encoding will take care of a multi-category variable in one column. It just creates n columns with True/False values, where n is the number of categories. (Mapping them to 1, 2, 3 would impose an ordering and spacing that isn't really there.)
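In base R, model.matrix can do this one-hot encoding directly; the toy data frame below is made up for illustration.

```r
# A hypothetical non-ordinal categorical column
df <- data.frame(language = c("English", "Chinese", "Japanese", "Chinese"))

# "- 1" drops the intercept so every category gets its own 0/1 column,
# one column per level, one 1 per row
onehot <- model.matrix(~ language - 1, data = df)
onehot
```

Each row now has exactly one 1 across the three indicator columns, and Euclidean distance between rows behaves sensibly (identical categories contribute 0, different ones a fixed amount).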
@@ilhembenhenda3416 Hmm. If I'm following correctly, you are having issues streaming data through MapReduce to a knn algorithm, i.e., distributed compute from some database?