310 - Understanding subword tokenization used for NLP
Code generated in the video can be downloaded from here:
github.com/bns...
All other code:
github.com/bns...
The philosophy behind subword tokenization algorithms is:
- frequently used words should not be split into smaller subwords
- rare words should be divided into meaningful subwords.
Example: DigitalSreeni is not a real word and is a rare one (unless I get super famous). It may be divided as:
- Digital (common word)
- Sr
- e
- e
- Ni (common subword prefix - Nice, Nickel, Nimble, etc.)
Advantages of subword tokenization:
- Reasonable vocabulary sizes, while retaining the ability to learn meaningful context-independent representations.
- Rare and out-of-vocabulary words are handled by breaking them into known subword units (see the sketch below).
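For instance, here is a minimal sketch (not from the video, and assuming the transformers package is installed) of how a pretrained subword tokenizer splits a rare word into known pieces; the exact split depends on the vocabulary the tokenizer learned:

```python
# Tokenize a rare word with a pretrained GPT-2 BPE tokenizer.
# The exact split depends on the learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("DigitalSreeni"))
# Rare word -> several known subword pieces instead of an unknown token
```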
Byte Pair Encoding (BPE) reference:
arxiv.org/abs/...
BPE starts with a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, where words separated by spaces become individual tokens (e.g., GPT-2).
From the pre-tokenized words, BPE learns merge rules that form a new token by combining two tokens of the base vocabulary.
This process is iterated until the vocabulary reaches the desired size, a hyperparameter set by the user.
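As a concrete illustration, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; corpus.txt is a placeholder for your own training file:

```python
# Train a BPE tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # simple space-based pre-tokenization

# vocab_size is the user-set hyperparameter that stops the merge loop
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder file

print(tokenizer.encode("DigitalSreeni").tokens)
```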
Both ByteLevelBPETokenizer and SentencePieceBPETokenizer perform subword tokenization, but they build their vocabularies in different ways.
ByteLevelBPETokenizer, from the Hugging Face tokenizers library, learns byte-level BPE subwords: it first represents each input text as a sequence of bytes, then learns a vocabulary of byte-level subwords using the BPE algorithm. Because every possible string can be expressed with the 256-byte base alphabet, this tokenizer is particularly useful for languages with non-Latin scripts, where a character-level tokenizer may not work well.
SentencePieceBPETokenizer, on the other hand, also ships with the Hugging Face tokenizers library and replicates the BPE mode of Google's SentencePiece. Rather than relying on space pre-tokenization, it treats the input as a raw character stream, marks word boundaries with a special meta symbol (▁), and learns BPE merges over that stream. (The SentencePiece library itself additionally offers a unigram language-model algorithm, which is a separate method from BPE.) This approach can handle a wide range of languages and text types, including those without explicit word boundaries, and can learn both character-level and word-level subwords.
In terms of usage, both tokenizers are initialized and trained in a similar way.
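A minimal sketch of that shared workflow, again with corpus.txt as a placeholder training file:

```python
# Both tokenizers expose the same train/encode interface.
from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

for tok in (ByteLevelBPETokenizer(), SentencePieceBPETokenizer()):
    tok.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)
    print(type(tok).__name__, tok.encode("DigitalSreeni").tokens)
```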