K-means Clustering From Scratch In Python [Machine Learning Tutorial]

In this project, we'll build a k-means clustering algorithm from scratch. Clustering is an unsupervised machine learning technique that groups similar data points together to find patterns in your data. K-means is one of the most popular forms of clustering.
We'll create our algorithm using Python and pandas. We'll then compare it to the reference implementation from scikit-learn.
You can find the full project code here - github.com/dataquestio/projec... .
You can download the data here - www.kaggle.com/datasets/stefa... .
Project Steps
- Write out pseudocode for the algorithm
- Code the k-means algorithm
- Plot the clusters from the algorithm
- Compare performance to the scikit-learn algorithm
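
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the video: the function name `kmeans` and its parameters are hypothetical, starting centroids are picked as random rows, and centroids are updated with the arithmetic mean (commenters below note that the video uses a geometric mean).

```python
import numpy as np
import pandas as pd

def kmeans(data: pd.DataFrame, k: int, max_iter: int = 100, seed: int = 0):
    """Minimal k-means sketch (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = data.to_numpy(dtype=float)
    # Step 1: pick k distinct rows as the starting centroids (an assumption of this sketch).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 2: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the arithmetic mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return (
        pd.DataFrame(centroids, columns=data.columns),
        pd.Series(labels, index=data.index),
    )
```

Here each row of the returned centroid frame is one cluster. The data should be scaled first (see the "Scaling the data" chapter), since the distance calculation treats all columns equally.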
Chapters
00:00 Intro
00:37 k-means overview
02:51 Loading in and cleaning FIFA data
06:11 Scaling the data
10:31 Initialize random centroids
14:20 Finding cluster labels for each data point
19:29 Update centroid values
23:30 Plotting k-means iterations
28:24 Pulling the algorithm together
35:25 Comparing our implementation to scikit-learn
37:56 Conclusion and next steps

Published: 10 Jul 2024

Comments: 83

@vikasparuchuri 1 year ago

Here's all the code for this video - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . Hope you enjoy it!

@tejasvinnarayan2887 1 year ago

Amazingly clear! Thank you so much, Dataquest!

@stevenlomon 1 year ago

From the bottom of my heart; thank you. This was so clear and easily understandable, fantastic video!

@animal40 1 year ago

This was amazing. Brilliantly explained, demonstrated and presented clearly. Helped me so much with my current bootcamp task. Thank you.

@maleck25 2 months ago

Thank you, sir. This is how tutorials should be conducted: with in-depth explanations, step-by-step implementation, and the release of all code and datasheets to enable everyone to practice and advance their own personal projects. Congrats!

@jessemunson7091 1 year ago

Awesome stuff, Vik. Thanks for sharing.

@hounddog1 1 year ago

Such good and clearly delivered material. Thanks a lot!

@mo_l9993 1 year ago

One of the best tutorials on the internet, thank you.

@amandamorrow73 1 year ago

This is THE best tutorial online. I am so grateful for this! Thank you

@krlwshu 1 year ago

Great video. Really helpful looking at implementing it manually. Thank you so much

@VaradKashmire 2 years ago

Excellent video !! Many thanks 🙏🏼

@sashagalanova818 5 months ago

very helpful and clear explanations - thank you!

@obeynjanjeni4466 2 months ago

This is amazing, keep up the good work

@elu1 5 months ago

This is a nice and powerful way to learn. Thanks for teaching.

@photoish3863 1 year ago

I have never thought that we can visualize K means by using Dimension Reduction (PCA)!! Awesome Tutorial Sir

@shreshthasingh 1 year ago

Thanks a LOT for this tutorial!😀

@MarianneHMiettinen 6 months ago

Outstanding! Thank you, man! This really helped me do my master's thesis. I really appreciate that you explained every small step, used as many visuals as possible, and focused on us being able to learn! In case others run into the same problem: with scikit-learn k-means, when using the fit(data) function, I got a "split" error message (AttributeError: 'NoneType' object has no attribute 'split'). I checked my BLAS and updated all libraries through conda, then shut everything down and opened it again. This resolved the problem, but it took a long time. (I asked ChatGPT for help.)

@vishwas5344 5 months ago

Your explanation is absolutely clear. You have the best knowledge. Keep posting new topics and encouraging us ❤

@allaguimaouia6510 8 months ago

It's a great job, the only one on YouTube that explains every part of the code 👍👍

@ahmetatasever8315 1 year ago

Thank you very much for this clearly understood video.

@user-sz3zb1rq5z 8 months ago

I can't thank you enough. Thank you for this content.

@oskeeg619 2 years ago

Thank you, thank you, thank you!!! Being able to perform and explain what runs under the hood is really important- I agree. Please keep these videos coming 🙌🏼❤️ The “From Scratch” series :)

@Dataquestio 2 years ago

That's a great idea :) I'm working on linear regression from scratch.

@user-do6zb9mt5q 1 year ago

Thanks a lot, that was a great help!

@rajeshmanjrekar3614 1 year ago

great video, you are a great teacher

@ytustatistics 4 months ago

You might be a hero... thanks a lot for the content...

@Adya_uk 2 years ago

Absolutely fantastic. Would love a similar video on PAM clustering for mixed integer and categorical variables.

@Dataquestio 2 years ago

Thanks for the suggestion :)

@adriancondie831 2 years ago

Great video!

@elvykamunyokomanunebo1441 2 years ago

Very insightful and step by step code explanation. Thank you for this excellent tutorial :)

@Dataquestio 2 years ago

Glad it was helpful! -Vik

@elvykamunyokomanunebo1441 2 years ago

@Dataquestio Vik, how do I assign new data points to a cluster? That is, once I have run my k-means clustering, how do I use it to assign clusters to new datasets, such as out-of-time datasets or testing/validation datasets? There doesn't seem to be anything online about this. Is it the case that I'd have to re-run k-means with the new data included? Thanks in advance, Elvy
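
One common approach that avoids re-running the clustering is to label each new point with its nearest existing centroid, which is what scikit-learn's `KMeans.predict` does. Below is a minimal sketch for a from-scratch version. The helper name `assign_clusters` is hypothetical, and it assumes the centroids are stored with one column per cluster and one row per feature, as the plotting code later in this thread suggests. New data would need to be scaled with the same minimum and maximum values used on the original data.

```python
import numpy as np
import pandas as pd

def assign_clusters(new_data: pd.DataFrame, centroids: pd.DataFrame) -> pd.Series:
    """Label each row of new_data with its nearest centroid (one centroid per column)."""
    # For each centroid column, compute the Euclidean distance to every row.
    distances = centroids.apply(lambda c: np.sqrt(((new_data - c) ** 2).sum(axis=1)))
    # The column name with the smallest distance is the cluster label.
    return distances.idxmin(axis=1)
```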

@TimHerrin 2 years ago

Terrific implementation! I also really liked the way you used PCA for iterative visualization... Nicely done

@Dataquestio 2 years ago

Thanks a lot, Tim! -Vik

@jagajaga6908 11 months ago

good tutorial thank you

@itsamankumar403 7 months ago

TYSM :)

@dedisupardi2815 2 years ago

Cool 👍

@HelloIamLauraa 1 month ago

I loved your video, it is so well explained!! I had only used scikit-learn, but now I understand better how it works. But I have a question: why is it not good to use height and weight as features?

@saemamiftah1669 1 year ago

More videos like these please on other algos

@payalpatel2560 11 months ago

It's a very well explained video. Just a quick question, how can we add random_state in the final model code?

@akosuakoranteng3327 1 year ago

Hi, Thanks so much for the video!! Can you please advise on how one adds a legend to the cluster scatter plots? I've been trying but can't figure it out.

@a3i3m1an 1 year ago

Thanks for the video. It is just brilliant. One of the best ones on clustering that I have seen, for sure! I just had a question. I tried using this on data with 13 variables. It worked perfectly, but when I scale the data using n. distrb or skscalar rather than min-max, I get an error after the PCA transformation code saying there are NaNs in the data variable, when there clearly were none before. I can't put my finger on what is causing this. I would appreciate any insights on your part. Thanks

@ayushadhikari2357 1 year ago

Hi, thank you so much for this clear tutorial. I need one more bit of help from you. How do we export the cluster results to a CSV file?

@virendrakhanduri4897 2 years ago

Great video. BTW, why did you use the geometric mean instead of the arithmetic mean for finding the clusters? Please make a whole series on building models from scratch.

@UkrainVsRussoReaction 1 year ago

Very insightful explanation of the code. By the way, how can I make an elbow plot of SSE vs. k, computing the SSE at each value of k? This would help me optimise k using this code... Looking forward to hearing from you.

@user-un6em6bd6h 4 months ago

Can we follow up on the identified clusters by using them as a predictor for another variable, e.g. in a logistic regression?

@soothingszelam2607 4 months ago

Thanks, teacher. Could you show how to calculate the SSE for a k-means clustering solution when you choose not to use k-means directly from the sklearn package?
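
The SSE (sum of squared errors) is the total squared distance from each point to its assigned centroid. scikit-learn exposes the same quantity as the `inertia_` attribute of a fitted `KMeans`. Below is a minimal sketch for a from-scratch version. The helper name `sse` is hypothetical, and it assumes centroids are stored with one column per cluster and one row per feature.

```python
import pandas as pd

def sse(data: pd.DataFrame, labels: pd.Series, centroids: pd.DataFrame) -> float:
    """Sum of squared distances from each point to its assigned centroid."""
    # Look up the centroid assigned to every row, with columns in the same order as data.
    assigned = centroids.T.loc[labels.to_numpy(), data.columns].to_numpy()
    return float(((data.to_numpy() - assigned) ** 2).sum())
```

For an elbow plot, one option is to run the clustering for a range of k values, record the SSE for each, and plot SSE against k.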

@anirudhpurohit2251 6 months ago

Can we also use a player's position as one of the features? If yes, then how (because that isn't numeric)?

@user-un6em6bd6h 4 months ago

Do we have to get rid of outliers beforehand?

@Anae2003 3 months ago

How do you know which 5 features to pick at the beginning?

@dataprofessor_ 1 year ago

Can you make a video implementing Local Outlier Factor (LOF) with Pandas and NumPy in Python for identifying outliers?

@user-un6em6bd6h 4 months ago

What is the maximum recommended number of variables for a clustering analysis?

@causticmonster 8 months ago

How would you include ordinal features?

@prgyagupta8079 11 months ago

If we have IP addresses in the data, should we still scale the data? I had a dataset where IP addresses and fraud transactions are given, and I converted the IP addresses to numerical data.

@sadeepmihiranga6958 1 year ago

Your explanation is great. I found that the "k" parameter of the "new_centroids" method has no effect on the application. Correct me if I'm wrong.

@NadeemAkhtar-gu4up 1 year ago

Which platform are you using for coding?

@goodnessawe4262 1 year ago

Thanks for this, I really don't get how I can possibly use it for fraud detection

@jakubharas9477 7 months ago

Could you explain the meaning of the x- and y-axis?

@sukshithshetty8349 1 year ago

I didn't understand why we took the geometric mean instead of the arithmetic mean. Can you explain that, please?

@2919091986 3 months ago

I am getting an error when calculating centroids - 'float' object has no attribute 'sqrt'..... Please help
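
One common cause of this message, which may or may not be what happened here, is that the DataFrame columns have `object` dtype rather than a numeric dtype, so NumPy tries to call a `.sqrt()` method on each Python float. A minimal sketch of the fix, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# A made-up frame whose columns were loaded as generic Python objects.
mixed = pd.DataFrame({"a": [1.0, 4.0], "b": [9.0, 16.0]}, dtype=object)

# Converting to a numeric dtype lets NumPy ufuncs such as np.sqrt work.
numeric = mixed.astype(float)
roots = np.sqrt(numeric)
```

Checking `data.dtypes` before clustering shows whether any column is stored as `object`.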

@rodneymawero9063 2 years ago

Keep sending the emails, thanks for the vids

@sukshithshetty8349 1 year ago

What does groupby() return? How can I see what groupby() has returned? Can you please also share code showing what data.groupby(labels) does?

@AbrarMuhtasim 1 year ago

Make a video on "customer segmentation and clustering in retail using machine learning" using a real retail dataset.

@ZigBehaviour 1 year ago

Please unpack what is going on in centroid = data.apply(lambda x: float(x.sample())). Without the float cast, the line returns a DataFrame with NaN values in the non-sampled/selected columns. There appears to be some voodoo magic going on here, driven by the float cast!
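
A likely explanation: `x.sample()` returns a one-element Series that keeps the row label of the sampled value. When `apply` receives a Series from each column, it assembles them into a DataFrame aligned on those row labels, which leaves NaN wherever a column did not sample that row. Casting to a scalar removes the row label, so `apply` returns one value per column instead. The sketch below uses a made-up two-column frame and `.iloc[0]` rather than `float(...)`, since recent pandas versions deprecate calling `float` on a one-element Series.

```python
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Each column yields a one-element Series that still carries its row label,
# so apply() builds a DataFrame aligned on those labels (possibly with NaNs).
raw = data.apply(lambda x: x.sample())

# Extracting the scalar first makes apply() return a Series: one value per column.
centroid = data.apply(lambda x: x.sample().iloc[0])
```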

@swayamjoshi7667 1 year ago

Can someone help with the issue at 29:48? When we use old_centroids = centroids in my code, this error comes up: 'DataFrame' object has no attribute 'equal'

@engineervol 3 months ago

It should be .equals, with an s.
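
For reference, `DataFrame.equals` returns a single boolean, which is why it suits a loop's stopping condition, whereas `==` compares element by element. A small sketch with made-up centroid values:

```python
import pandas as pd

old_centroids = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
centroids = old_centroids.copy()

# DataFrame.equals returns one bool, so it can drive a while-loop condition.
converged = centroids.equals(old_centroids)

# The == operator instead returns an element-wise DataFrame of booleans.
elementwise = centroids == old_centroids
```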

@63_mayukhdebnath22 1 year ago

Sir, how do I find the individual elements present in each cluster? For example, I'm working on a dataset of genes. How will I get the names of the individual genes that are present in each cluster?

@subhasishtripathy6933 1 year ago

I am looking for the same thing right now. Were you able to find anything? If yes, then please help me too 😊
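
One way to approach this, assuming the labels share the same index as the original rows, is to attach the labels as a new column and then group by it. The gene names and the column name `cluster` below are made up for illustration.

```python
import pandas as pd

genes = pd.DataFrame({"gene": ["gene_a", "gene_b", "gene_c", "gene_d"]})
labels = pd.Series([0, 1, 0, 1], index=genes.index)

# Attach the cluster label to each row.
result = genes.assign(cluster=labels)

# Collect the names that fall into each cluster.
members = result.groupby("cluster")["gene"].agg(list)

# With no path, to_csv returns the CSV text; pass a file path to write a file instead.
csv_text = result.to_csv(index=False)
```

If rows were dropped during cleaning (for example with `dropna`), the labels need to line up with the remaining rows' index for the assignment to be correct.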

@shreyanshkhandelwal6499 1 year ago

Please, can someone tell me how to apply the arithmetic mean instead of the geometric mean in the lambda function for getting new centroids? I am dealing with datasets containing negative values, so the geometric mean is of no use to me. Will it be like this: data.groupby(labels).apply(lambda x: np.mean(x, axis=0))

@animal40 1 year ago

Thank you, I required arithmetic mean too and your code worked for me.
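
To illustrate the difference: the geometric mean can be written as exp(mean(log(x))), which is only defined for positive values, while the arithmetic mean works with negative values too. The sketch below uses made-up data and assumes centroids are stored with one column per cluster; it is not the exact code from the video.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.0, 4.0, -2.0, -4.0], "y": [2.0, 8.0, 3.0, 5.0]})
labels = pd.Series([0, 0, 1, 1])

# Arithmetic mean per cluster; transpose so each column is one centroid.
arithmetic = data.groupby(labels).mean().T

# Geometric mean per cluster, exp(mean(log(x))), on the positive column only.
geometric = data[["y"]].groupby(labels).agg(lambda s: np.exp(np.log(s).mean())).T
```

Here `data.groupby(labels).mean()` gives the same per-cluster averages as the `np.mean(x, axis=0)` version in the comment above.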

@dataprofessor_ 1 year ago

Why did you not apply fit_transform to the centroids_2d variable as well?

@Dataquestio 1 year ago

fit_transform both computes the fit and transforms the data. In this case, we have already computed the fit on the data, and we just want to apply that same fit to the centroids, so that the points and centroids are projected onto the same axes and can be visualized together. -Vik
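
A minimal sketch of that pattern with scikit-learn's `PCA`. The random data and the placeholder centroids (simply the first three rows) are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.random((50, 5))   # made-up data: 50 points, 5 features
centroids = data[:3]         # placeholder centroids, one per row

pca = PCA(n_components=2)
# fit_transform learns the projection from the data and applies it.
data_2d = pca.fit_transform(data)
# transform reuses the projection already learned, without refitting.
centroids_2d = pca.transform(centroids)
```

Calling `fit_transform` on the centroids would instead learn a new projection from the centroids alone, so they would no longer share axes with the data points.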

@itsmitasha 5 months ago

At 10:08, how did you know row 0 belongs to Lionel Messi?

@bgizzanm 1 year ago

Amazing!! But how do you implement the scatter plot without PCA?

@animal40 1 year ago

Did you figure it out? I'd like to know too.

@akosuakoranteng3327 1 year ago

@animal40 Just leave out the PCA. Still transpose the centroids, though, and remember to use iloc. Here's my code:

    import matplotlib.pyplot as plt

    def plot_clusters(data, labels, centroids, iteration):
        # Transpose so that each row is one centroid.
        centroid_T = centroids.T
        plt.title(f'Iteration {iteration}')
        # Plot the first two feature columns directly instead of PCA components.
        plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=labels)
        plt.scatter(x=centroid_T.iloc[:, 0], y=centroid_T.iloc[:, 1])
        plt.show()

@animal40 1 year ago

@akosuakoranteng3327 Thanks very much for this. Tried a few things today but couldn't quite get it working. Will try again tomorrow with this. Appreciate it, cheers.