How to use BERTopic - Machine Learning Assisted Topic Modeling in Python 

Python Tutorials for Digital Humanities
28K subscribers
36K views

Join this channel to get access to perks:
/ @python-programming
If you enjoy this video, please subscribe.
✅Be my Patron: / wjbmattingly
✅PayPal: www.paypal.com...
Article Referenced: www.ncbi.nlm.n...
GitHub Repo: github.com/wjb...
GitHub BERTopic: github.com/Maa...
If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.
If you liked this video, check out www.PythonHumanities.com, where I have Coding Exercises, Lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
You can follow me at:
/ wjb_mattingly

Published: Oct 5, 2024

Comments: 50
@muhammadarhamriaz629 2 years ago
I am brand new to DAW and software; these tutorials are excellent and very helpful for getting someone like me up and running. Appreciated.
@rvian4 1 year ago
Wow, the features this BERT approach provides really improve the explanation of topic models.
@dubey_ji 1 year ago
I found your channel today, and I must say thank you. Very good content.
@python-programming 1 year ago
Thanks so much!! =)
@wasgeht2409 2 years ago
Two questions :) 1) Could I write a sentence and, after training, get the probability of each topic for it based on the trained model? 2) Could I use, for example, customer requests for training? In this case you are using unstructured data. I hope you understand my questions :D
@bentobenack2 2 years ago
This is incredible. I subscribed to your channel today while looking for topic modeling content, and I found very good content. However, I also wanted to find something on BERTopic, and a few minutes after subscribing I received a notification from RU-vid about your channel, and I said, it can't be true! Thanks a lot!
@python-programming 2 years ago
Haha! That is so perfect! Hope this video helps!!
@bentobenack2 2 years ago
@python-programming Definitely helped!
@sarasharick5209 2 years ago
Great video. I experimented with Top2Vec after that video, so I am looking forward to experimenting with BERTopic too.
@danieleriahe-him4693 1 year ago
Thanks so much for the high-quality content you have published so far. Your playlists are a gold mine for beginners and enthusiasts in the AI field. Have you ever considered making a video explaining the principles of creating an efficient dataset for text summarization or other specific tasks? Many thanks in advance for your consideration!
@TheArnold2002 1 year ago
Best video on topic modeling I've seen so far. Can I get all documents related to a topic, instead of just the top 3?
@python-programming 1 year ago
Thanks! Indeed you can. BERTopic has changed a bit since I made this video, so I will have to check the docs, but I am certain you can.
@mmishrafaculty 1 year ago
Awesome. That was so informative. And explained so clearly. Thank you so much.
@python-programming 1 year ago
Thanks so much! I am planning a new video on BERTopic soon to cover its new features.
@hankzhong 1 year ago
Great intro, but the default produces too many topics to be useful for human understanding. Is there a way to reduce the number of topics naturally? Also, can we measure the perplexity and coherence of these topics as with LDA? Thanks.
@xevenau 9 months ago
Do you happen to have a tutorial that explains how to turn articles into a dataset for topic modeling? Thanks!
@DoreenGyamfi-i7k 1 year ago
This was so informative, thank you.
@python-programming 1 year ago
I am so glad it was helpful!
@suhasp2385 2 years ago
I simply ran the code and it works! Thanks!
@somewhereovertherainbow9550 2 months ago
Thanks!!! Very helpful!
@BillVoisine 3 months ago
Thank you!!
@raziehfadaei4801 7 months ago
Thank you for your good video. Does BERTopic need any preprocessing, such as lemmatization or tokenization, the way LDA does?
@hosseinahmadi1855 8 months ago
Greeeeeeeeat! Thanks. Another useful video.
@KR-good 7 months ago
Great presentation.
@andreasheiner3426 1 year ago
Thanks, great tutorial. A question: what's your experience with the relationship between model quality and sentence length? Short sentences don't really work (too little semantics), and long ones won't work either (too "much" semantics). Thoughts?
@python-programming 1 year ago
Thanks! And great question. If you are looking for an off-the-shelf solution, try Top2Vec, but I think you may run into similar issues. What language are your docs in? Also, how varied are they in size? A more custom solution may be necessary.
@andreasheiner3426 1 year ago
@python-programming I have standard English websites, from product reviews to travel reports. Generally a page contains some 10 paragraphs. Content on a page is highly correlated (as you'd expect), so the page content is defined by a few paragraphs. The topic of a paragraph is mostly in a single sentence; the rest is "glue". This turns out to be a reasonable assumption (from eyeballing). BERTopic supports these observations, especially if you remove paragraphs whose probability for the most dominant topic is less than some cutoff (say 0.6; the reason being that, in the worst case, another topic is present at no more than 0.4). From experience you're left with 3% unallocated documents; each allocated document has at most 3 topics. This is all nice, assuming BERTopic gives good results for both long and short paragraphs with the same hyperparameters. If my assumption is incorrect, I have a problem :( So, thoughts?
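The cutoff idea described in the comment above can be sketched in plain Python. This is only an illustration under stated assumptions: the probability rows are made up, and `filter_by_dominant_topic` is a hypothetical helper, not part of BERTopic. In practice each row would hold the per-topic probabilities a fitted model reports for one paragraph.

```python
from typing import List, Tuple


def filter_by_dominant_topic(
    probabilities: List[List[float]], cutoff: float = 0.6
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split paragraphs into allocated and unallocated by dominant-topic probability.

    Returns (allocated, unallocated), where allocated holds
    (paragraph index, dominant topic index) pairs and unallocated holds
    the indices of paragraphs whose best topic falls below the cutoff.
    """
    allocated: List[Tuple[int, int]] = []
    unallocated: List[int] = []
    for index, row in enumerate(probabilities):
        if not row:
            # No probabilities at all: treat the paragraph as unallocated.
            unallocated.append(index)
            continue
        # Index of the most probable topic for this paragraph.
        dominant = max(range(len(row)), key=row.__getitem__)
        if row[dominant] >= cutoff:
            allocated.append((index, dominant))
        else:
            unallocated.append(index)
    return allocated, unallocated


# Hypothetical probabilities for four paragraphs over three topics.
example = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.05, 0.90, 0.05],
    [0.50, 0.50, 0.00],
]
allocated, unallocated = filter_by_dominant_topic(example)
print(allocated)
print(unallocated)
```

With a cutoff of 0.6, a paragraph split evenly between two topics is left unallocated, which is the behaviour the comment relies on.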
@johnny_silverhand 2 years ago
Fantastic explanation
@yashjain2841 13 days ago
How do I run it on a dataset with more than 12k rows? It is showing a "correct_alternative_cosine" error. Please help.
@kennethgomes4727 1 year ago
Could you please explain why you didn't use UMAP, HDBSCAN and c-TF-IDF for this?
@python-programming 1 year ago
Thanks for the question! You absolutely can. I have a whole other tutorial that walks through each of those steps. I think BERTopic, LeetTopic (my library), and Top2Vec provide a simpler solution for those who may not be familiar with a custom UMAP and HDBSCAN workflow. I try to make tutorials for users at all levels, and I think these other libraries address the needs of those newer to Python/ML.
@sohinisarkar1935 7 days ago
Is it possible to define the number of topics here?
@mrtn5882 2 years ago
Nice tutorial, thank you! If I follow the video correctly, about 25% of your documents are marked as outliers. Is that normal? Can you maybe talk about this a bit in a further video?
@python-programming 2 years ago
Yeah, that is fairly normal with BERTopic. I plan to do another video that compares different topic modeling approaches, and that will be a key feature.
@mrtn5882 2 years ago
@python-programming Great, I’m looking forward to that video! 😊
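BERTopic assigns the label -1 to documents it treats as outliers, so the outlier share discussed in this thread can be checked directly from the list of assigned topics. Below is a minimal sketch, assuming a made-up `topics` list (in a real run this would be the list of topic IDs returned when fitting the model); `outlier_share` is a hypothetical helper, not a BERTopic function.

```python
from collections import Counter
from typing import Sequence

# BERTopic uses -1 as the topic ID for outlier documents.
OUTLIER_LABEL = -1


def outlier_share(topics: Sequence[int]) -> float:
    """Return the fraction of documents labelled as outliers."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[OUTLIER_LABEL] / len(topics)


# Hypothetical topic assignments for eight documents.
topics = [-1, 0, 0, 1, -1, 2, 0, 1]
share = outlier_share(topics)
print(f"{share:.0%} of documents are outliers")
```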
@luiztauffer8513 1 year ago
Thanks for the amazing content! Do you know if BERTopic could be used to train a model to identify similarity to custom, pre-defined topics?
@python-programming 1 year ago
Thanks! I would not use BERTopic, but rather spaCy for text classification. You could use BERTopic to gather data for easy labeling.
@luiztauffer8513 1 year ago
@python-programming Thanks a lot, I actually went on to search for it and found another one of your videos explaining EXACTLY what I wanted! For reference, it's this one: "The EASIEST! way to do Text Classification with spaCy and Classy Classification". Thanks again!
@python-programming 1 year ago
@luiztauffer8513 Haha! Perfect! No problem!
@emekaobiefuna4509 1 year ago
Great info!
@wasgeht2409 2 years ago
Wow
@johnny_silverhand 2 years ago
Best topic model to use for modelling 3000 documents, each having 3 pages of text?
@adambenari3944 2 years ago
BERTopic or Top2Vec will both work, but you'll need to reduce your corpus to shorter texts. You can use an introduction or conclusion as your text, or perform some summarization before you start modelling.
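One way to read the suggestion above is to keep only the opening and closing paragraphs of each long document before modelling. The sketch below illustrates that idea in plain Python. It assumes paragraphs are separated by blank lines, and `shorten_document` and the `max_words` cap are hypothetical choices, not something the commenter or either library prescribes.

```python
def shorten_document(text: str, max_words: int = 200) -> str:
    """Keep the first and last paragraphs, then cap the result at max_words words."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    if len(paragraphs) == 1:
        selected = paragraphs
    else:
        # Treat the first paragraph as the introduction and the last as the conclusion.
        selected = [paragraphs[0], paragraphs[-1]]
    words = " ".join(selected).split()
    return " ".join(words[:max_words])


# Hypothetical four-paragraph document.
document = (
    "Intro about trade routes.\n\n"
    "Details on ships.\n\n"
    "More details.\n\n"
    "Conclusion about trade."
)
short = shorten_document(document)
print(short)
```

A summarization step, the other option mentioned in the reply, would replace `shorten_document` with a call to whatever summarizer you choose.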
@tantzer6113 2 years ago
Does this work for Arabic documents?
@python-programming 2 years ago
As long as there is a BERT model for Arabic, yes. I know there is an NEH-funded project for this, but I am not sure if it is available yet. There is a lot of research in Arabic NLP, so I would be surprised if another does not already exist. I do not have Arabic, though, so I cannot validate the results.
@tantzer6113 2 years ago
@python-programming Thank you for answering.
@LearnProfessional1 2 years ago
is tNice tutorials ASMR?
@flosrv3194 4 months ago
No way to install this shit. I get errors popping up from everywhere, and when I resolve them, others appear. Unusable crap.
@olucasharp 1 year ago
Comment to say thanks and support this absolutely awesome channel 🪩 Huge thanks, this is sooo clearly explained. Good luck ⚡
@python-programming 1 year ago
Thank you so much for your support and this wonderful comment!