
310 - Understanding sub word tokenization used for NLP 

DigitalSreeni

Code generated in the video can be downloaded from here:
github.com/bns...
All other code:
github.com/bns...
The philosophy of subword tokenization algorithms is:
- frequently used words should not be split into smaller subwords
- rare words should be divided into meaningful subwords
Example: DigitalSreeni is not a real word and is rare (unless I get super famous). It may be divided as:
- Digital (common word)
- Sr
- e
- e
- ni (a common subword, as in Nice, Nickel, Nimble, etc.)
Advantages of subword tokenization (see the sketch below):
- Reasonable vocabulary sizes, while maintaining the ability to provide meaningful context-independent representations.
- Rare and out-of-vocabulary words are handled by breaking them into known subword units.
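As a quick illustration of both points (this snippet is not from the video; it assumes the Hugging Face transformers package is installed and downloads the pretrained GPT-2 vocabulary):

from transformers import GPT2Tokenizer

# Load GPT-2's trained byte-level BPE vocabulary.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("the"))            # common word: kept whole
print(tokenizer.tokenize("DigitalSreeni"))  # rare word: split into subword pieces

The exact pieces depend on GPT-2's learned merges, but a rare word like DigitalSreeni comes out as several smaller known units.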
Byte Pair Encoding (BPE) reference:
arxiv.org/abs/...
BPE starts with a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as whitespace tokenization, where words separated by spaces become individual tokens (e.g., GPT-2).
Using the pre-tokenized words, BPE learns merge rules that form a new token from two tokens of the base vocabulary.
This process is iterated until the vocabulary reaches the desired vocabulary size, which is set by the user (a hyperparameter). A toy sketch of this merge loop follows.
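A minimal sketch of the merge loop (not the video's code; the corpus, frequencies, and merge count below are made up for illustration, following the classic BPE formulation):

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count each adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge every occurrence of the pair into a single new symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Pre-tokenized words: characters plus an end-of-word marker, with frequencies.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # stands in for the user-set vocabulary-size hyperparameter
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)

Each iteration adds one new token to the vocabulary (e.g., "es", then "est"), which is exactly the loop that stops once the target vocabulary size is reached.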
Both ByteLevelBPETokenizer and SentencePieceBPETokenizer perform subword tokenization, but they build their vocabularies in different ways.
ByteLevelBPETokenizer, from the Hugging Face tokenizers library, learns byte-level BPE subwords: it first represents each input text as a sequence of bytes, then learns a vocabulary of byte-level subwords using the BPE algorithm. Because any string can be represented as bytes, this tokenizer is particularly useful for languages with non-Latin scripts, where a character-level tokenizer may not work well.
SentencePieceBPETokenizer, also available in the Hugging Face tokenizers library, follows the SentencePiece approach: it treats the input as a raw character stream, marking word boundaries with the '▁' symbol rather than relying on whitespace pre-tokenization, and learns a BPE vocabulary on top of it. (The original SentencePiece library additionally offers a unigram language model for learning subwords.) This approach handles a wide range of languages and text types, and can learn both character-level and word-level subwords.
In terms of usage, both tokenizers are initialized and trained in a similar way, as the sketch below shows.
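A minimal sketch (not the video's code; the tiny in-memory corpus and vocab_size are placeholder values, and train_from_iterator assumes a reasonably recent release of the tokenizers package):

from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

# Placeholder corpus; in practice you would train on real files
# via tokenizer.train(files=[...]).
corpus = [
    "Subword tokenization keeps frequently used words whole.",
    "Rare words are divided into meaningful subwords.",
]

byte_tok = ByteLevelBPETokenizer()
byte_tok.train_from_iterator(corpus, vocab_size=500, min_frequency=1)

sp_tok = SentencePieceBPETokenizer()
sp_tok.train_from_iterator(corpus, vocab_size=500, min_frequency=1)

# Byte-level BPE marks word-initial spaces with 'Ġ';
# the SentencePiece-style tokenizer marks word starts with '▁'.
print(byte_tok.encode("meaningful subwords").tokens)
print(sp_tok.encode("meaningful subwords").tokens)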

Published: 21 Sep 2024

Comments: 16
@shaktisethi3863 (a year ago)
Hi Sreeni, I am keenly waiting for your NLP series.
@limon_halder (a year ago)
Please make a video on NLP-based food menu recommendation systems.
@tilkesh (a year ago)
Thx
@GeraintWang (a year ago)
Hi Sreeni, will you prepare a video about deep belief network?
@LoneRanger.801 (a year ago)
Sreeni, do you have a video on removing a watermark from a picture?
@DigitalSreeni (a year ago)
No. If the watermark has a different color you can try converting the image to HSV and playing with colors.
@LoneRanger.801 (a year ago)
@DigitalSreeni I need to remove it from a video. The idea was to write code to first separate out the individual frames, then run a model (trained on the specific watermark) over all the images, and then stitch the frames back together. I just needed some guidance on how to actually train such a model. ☺️
@DigitalSreeni (a year ago)
It is not ethical to remove watermarks from videos without the permission of the original owner, as they are often used to protect intellectual property. I am sure you are mindful of this, but I wanted to remind you. You can use image-to-image translation approaches to train a model on images (or videos) with a watermark and without. This trained model can then be used to remove the watermark from future data. I did a couple of videos on pix2pix (videos 250 and 251); maybe you will find them useful.
@pietromonti399 (a year ago)
@DigitalSreeni I saw you have deleted my comment about this topic. I would like to apologize if I suggested a method for an unethical purpose. I just answered straight away without actively thinking about the possible consequences. Once again, I am very sorry. I also take this chance to thank you for your awesome content!
@LoneRanger.801 (a year ago)
@DigitalSreeni I completely agree. Thing is, I need to remove the name of my ex-girlfriend, who happened to be in tech and who had watermarked her name in all my pictures and videos. 🤷🏻‍♂️ Thanks for your suggestions. Will check out those videos.
@limon_halder (a year ago)
1st viewer
@thechoosen4240 (11 months ago)
Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE