
Predict NYT bestsellers with wordpiece tokenization 

Julia Silge

This screencast walks through how to predict which #TidyTuesday NYT bestsellers will stay on the list for a long time vs. a short time, based on author names. We cover how to use wordpiece tokenization for these names and how to deploy the model as a REST API. Check out the code on my blog: juliasilge.com...
NOTE: I misspoke about how the wordpiece tokenization is implemented; it uses the pre-trained vocabulary from BERT.
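A minimal sketch of the workflow described above, following the usual tidymodels + textrecipes pattern. The data frame `nyt_titles`, the `author` predictor, the two-level outcome `better_seller`, and all object names here are assumptions for illustration, not taken from the video:

```r
library(tidymodels)
library(textrecipes)

set.seed(123)
book_split <- initial_split(nyt_titles, strata = better_seller)
book_train <- training(book_split)
book_test <- testing(book_split)

# step_tokenize_wordpiece() tokenizes with a pre-trained vocabulary
# (from BERT), per the NOTE above
book_rec <- recipe(better_seller ~ author, data = book_train) %>%
  step_tokenize_wordpiece(author, max_chars = 10) %>%
  step_tokenfilter(author, max_tokens = 100) %>%
  step_tf(author)

book_fit <- fit(workflow(book_rec, logistic_reg()), book_train)

# One way to serve the fitted model as a REST API: vetiver + plumber
library(vetiver)
library(plumber)

v <- vetiver_model(book_fit, "nyt-authors")
pr() %>%
  vetiver_api(v) %>%
  pr_run(port = 8080)
```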

Published: 11 Oct 2024

Comments: 10
@r.k.8322 · 2 years ago
Hello Julia, I'm glad I've found your channel. Your book on text mining helped me immensely with my bachelor's thesis and was a great introduction to new methods for someone like me, who studied in a field where the use of digital tools is always one or two decades behind. This way of gaining insight into things is a lot of fun. I'm almost sad that I didn't know about it earlier. Thank you, keep it up, and have a great day.
@terrencerussell1999 · 2 years ago
Missed you, Julia! You really do great stuff and always have insightful content.
@conlele350 · 2 years ago
Thanks Julia, long time no see! Please keep up the good work. I am still learning tidymodels through your screencasts every day and hopefully can master it at some point. Cheers!
@PA_hunter · 2 years ago
Thanks Julia! I'd be interested to see whether prediction accuracy could be improved using a deep learning model via tidymodels.
@jdonland · 2 years ago
I was a little confused about word pieces, so here's the explanation from the package documentation: 1) Put spaces around punctuation. 2) For each resulting word, if the word is found in the WordPiece vocabulary, keep it as-is. If not, starting from the beginning, pull off the biggest piece that is in the vocabulary, and prefix "##" to the remaining piece. Repeat until the entire word is represented by pieces from the vocabulary, if possible. 3) If the word can't be represented by vocabulary pieces, or if it exceeds a certain length, replace it with a specified "unknown" token. It seems as though this will be effective for finding name suffixes like "-son": if "John" is in the vocabulary but "Johnson" is not, then the latter will be tokenized as "John" and "##son". I suppose this won't go so well with prefixes like "O'-", "Mac", "de", "van", etc.
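To make that concrete, a small sketch using the wordpiece package. The exact result depends on the pre-trained vocabulary, so the tokens mentioned in the comments are illustrative, not verified output:

```r
library(wordpiece)

# Tokenize a surname with the default pre-trained (BERT-derived) vocabulary;
# the result is token ids named by the tokens themselves
wordpiece_tokenize("Johnson", vocab = wordpiece_vocab())
# If "johnson" were missing from the vocabulary but "john" were present,
# the word would come back as the pieces "john" and "##son"
```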
@pragneshmaisuria4656 · 2 years ago
You are awesome!! Thanks!!
@glhrm506 · 2 years ago
Hello Julia, is it necessary to set the seed more than once for reproducibility? If I set it only once, won't it be reused?
@JuliaSilge · 2 years ago
Some modeling algorithms use the RNG, so the next random number in the stream will be different after you use such a model. If you run the script from top to bottom (like when knitting an .Rmd), you will get the same thing every time if you only set the seed once at the top. However, I often find myself running bits of a script multiple times during interactive use, and then you want to make sure you are setting the seed every time. Putting the seed into the script before doing anything that involves the RNG is a bit of a safeguard I use. I have been bitten too many times!
@glhrm506 · 2 years ago
@JuliaSilge Thanks!
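A minimal sketch of the RNG behavior Julia describes above; rsample's bootstraps() is just one example of a step that consumes random numbers:

```r
library(rsample)

set.seed(123)
rnorm(1)                               # reproducible: first draw in the stream

set.seed(123)
boots <- bootstraps(mtcars, times = 5) # advances the RNG stream
rnorm(1)                               # differs from above: later in the stream

# A script run top to bottom with one seed at the top is reproducible, but
# re-running chunks out of order during interactive work is not, unless the
# seed is re-set before each RNG-consuming step.
```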
@PA_hunter · 2 years ago
If anyone is looking for a new pen name, you might consider Janet F. Steel