310 - Understanding subword tokenization used for NLP
Code generated in the video can be downloaded from here:
github.com/bns...
All other code:
github.com/bns...
The philosophy behind subword tokenization algorithms is:
- frequently used words should not be split into smaller subwords
- rare words should be divided into meaningful subwords.
Example: DigitalSreeni is not a real word and is a rare one (unless I get super famous). It may be divided as:
- Digital (common word)
- Sr
- e
- e
- Ni (common subword prefix - Nice, Nickel, Nimble, etc.)
Advantages of subword tokenization:
- Reasonable vocabulary sizes, while retaining the ability to learn meaningful context-independent representations.
- Rare and out-of-vocabulary words are handled by breaking them into known subword units (see the sketch below).
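For instance, here is a minimal sketch (not from the video, and assuming the transformers package is installed) of how a pretrained subword tokenizer splits a rare word into known pieces; the exact split depends on the vocabulary the tokenizer learned:

```python
# Tokenize a rare word with a pretrained GPT-2 BPE tokenizer.
# The exact split depends on the learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("DigitalSreeni"))
# Rare word -> several known subword pieces instead of an unknown token
```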
Byte Pair Encoding (BPE) reference:
arxiv.org/abs/...
BPE starts with a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, where words separated by spaces become individual tokens (e.g., GPT-2).
From the pre-tokenized words, BPE learns merge rules that form a new token by combining two tokens of the base vocabulary.
This process is iterated until the vocabulary reaches the desired size, a hyperparameter set by the user.
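As a concrete illustration, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; corpus.txt is a placeholder for your own training file:

```python
# Train a BPE tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # simple space-based pre-tokenization

# vocab_size is the user-set hyperparameter that stops the merge loop
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder file

print(tokenizer.encode("DigitalSreeni").tokens)
```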
Both ByteLevelBPETokenizer and SentencePieceBPETokenizer perform subword tokenization, but they build their vocabularies in different ways.
ByteLevelBPETokenizer, from the Hugging Face tokenizers library, learns byte-level BPE subwords: it first represents each input text as a sequence of bytes, then learns a vocabulary of byte-level subwords using the BPE algorithm. Because every possible string can be expressed with the 256-byte base alphabet, this tokenizer is particularly useful for languages with non-Latin scripts, where a character-level tokenizer may not work well.
SentencePieceBPETokenizer, on the other hand, also ships with the Hugging Face tokenizers library and replicates the BPE mode of Google's SentencePiece. Rather than relying on space pre-tokenization, it treats the input as a raw character stream, marks word boundaries with a special meta symbol (▁), and learns BPE merges over that stream. (The SentencePiece library itself additionally offers a unigram language-model algorithm, which is a separate method from BPE.) This approach can handle a wide range of languages and text types, including those without explicit word boundaries, and can learn both character-level and word-level subwords.
In terms of usage, both tokenizers are initialized and trained in a similar way.
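A minimal sketch of that shared workflow, again with corpus.txt as a placeholder training file:

```python
# Both tokenizers expose the same train/encode interface.
from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

for tok in (ByteLevelBPETokenizer(), SentencePieceBPETokenizer()):
    tok.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)
    print(type(tok).__name__, tok.encode("DigitalSreeni").tokens)
```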