High Quality, High Performance Clustering with HDBSCAN | SciPy 2016 | Leland McInnes

Data clustering is a powerful tool for data analysis. It can be particularly useful in exploratory data analysis, where it helps to summarize a dataset and build intuition about it. Despite its power, clustering is used for this task far less frequently than it could be. A plethora of clustering algorithms exist, and we will survey some of the more popular options, discussing their strengths and weaknesses, particularly with regard to exploratory data analysis. Our focus, however, is on a relatively new algorithm that appears to be the best equipped to meet the needs of exploratory data analysis: HDBSCAN* has the strengths of density-based algorithms, has a small, robust set of parameters, and with a suitable implementation can be made highly scalable to large datasets. We will discuss how the algorithm works, taking a few different perspectives, and explain the techniques used for a high-performance implementation. Finally, we'll discuss ways to extend the algorithm, drawing on ideas from topological data analysis.
More info on HDBSCAN here: github.com/lmcinnes/hdbscan.
See the complete SciPy 2016 Conference talk & tutorial playlist here: • SciPy 2016: Scientific...

Science

Published: 16 Jul 2024

Comments: 20

@DouglasDuhaime 4 years ago

Leland is truly a gentleman and a scholar

@kevon217 1 month ago

Been on a Leland yt binge as of late, saw this comment, and truly agree.

@lelandmcinnes9501 8 years ago

Thanks to the great people at conda-forge, hdbscan is now available as a conda package (which is by far the easiest way to install it): conda install -c conda-forge hdbscan

@zwitter689 7 years ago

Thanks, very nicely done. I installed hdbscan and am trying to mimic the examples you give but I can't find the data for the example on "Getting More Information About a Clustering". I like to follow the examples exactly so a copy of the actual data set you used would be great, can you help me with this?

@lelandmcinnes9501 7 years ago

It's in the github repository with the notebooks: github.com/scikit-learn-contrib/hdbscan/blob/master/notebooks/clusterable_data.npy

@zwitter689 7 years ago

Thank you and especially for the quick response.

@chengchu88 6 years ago

Dr McInnes, thanks for the great video. I am using HDBSCAN on a large dataset, and I know how to set the 'memory' parameter to cache the hard computation. My question is: after I cache the computation during fitting, how do I change min_cluster_size and min_samples and re-label the same data without going through the time-consuming fitting again? Could you provide a few sample Python lines? Thank you, Cheng

@elivazquez7582 6 years ago

Great video! Great presentation - thanks for doing this!

@shyamsbox 6 years ago

Very nice! We will try HDBSCAN.

@rajeshbalakrishnan2228 3 years ago

Wow!! One of the best clustering discussions

@wexwexexort 2 years ago

great talk!

@karthik-ex4dm 5 years ago

Great video. Since clustering cannot do better in high-dimensional space, the pairwise distance matrix should be fine if we are working in high-dimensional spaces, right? But even computing the pairwise distances will be computationally expensive for very high-dimensional spaces, right? So the best choice must be to find the best features using something like forward feature selection and then perform hdbscan, right?

@grygoriyzolotarov3228 5 years ago

What is the font you use in your presentations (very appealing)?

@rednax3788 7 years ago

HDBSCAN IS KING

@Marin-ct5my 3 years ago

HDBSCAN seems to be capable of producing clusters that share overlapping nodes. Given that clustering, for me, is about identifying shared points between clusters, what would I have to do to the algorithm to get those? I was surprised that nobody asked about this and that nothing was said about it, despite it being a possible feature of the algorithm.

@andrewdennis6976 6 years ago

I am running your example code just to play around and keep getting an error: TypeError: descriptor 'get_metric' requires a 'hdbscan.dist_metrics.DistanceMetric' object but received a 'str'. Unfortunately there is not much documentation on this, so it's hard to find fixes. Any help?

@jennifermew8386 6 years ago

How do you identify noise in HDBSCAN? How does the algorithm tell the difference between outliers and noise?

@ashishkannad3021 6 years ago

The points that are not assigned to any cluster are the noise!
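
A small sketch of what that reply looks like in practice. The `hdbscan` library marks unclustered points with the label -1 in a fitted clusterer's `labels_` attribute. The label list below is made up for illustration and stands in for `clusterer.labels_`; the helper name `split_noise` is hypothetical.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label convention for points assigned to no cluster


def split_noise(labels):
    """Group point indices by cluster label, keeping noise points separate."""
    clusters = defaultdict(list)
    noise = []
    for index, label in enumerate(labels):
        if label == NOISE_LABEL:
            noise.append(index)
        else:
            clusters[label].append(index)
    return dict(clusters), noise


# Hypothetical labels, standing in for `clusterer.labels_` after fitting.
example_labels = [0, 0, 1, -1, 1, -1, 0]
clusters, noise = split_noise(example_labels)
print(clusters)  # {0: [0, 1, 6], 1: [2, 4]}
print(noise)     # [3, 5]
```

For a graded notion of "outlier" rather than a yes/no noise label, the library also exposes an `outlier_scores_` attribute on a fitted clusterer; whether that matches the distinction the question has in mind is a judgment call.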

@enthought 8 years ago

More info on HDBSCAN here: github.com/lmcinnes/hdbscan. See the complete SciPy 2016 Conference talk & tutorial playlist here: ru-vid.com/group/PLYx7XA2nY5Gf37zYZMw6OqGFRPjB1jCy6