TL;DR - from Llama 2 to Llama 3 they switched from SentencePiece to tiktoken - vocab size 32k -> 128k - ~15% fewer tokens for English, ~50% fewer for "some other languages"
Could someone from the Meta Llama 3 team please explain how to train your own tiktoken tokenizer like you did for Llama 3? There are no open-source steps to recreate this.
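Meta hasn't released their training pipeline (tiktoken itself only does inference, not training), but the underlying algorithm is byte-level BPE, which is straightforward to sketch. Below is a toy, pure-Python version of the core training loop - repeatedly merge the most frequent adjacent token pair - not Meta's actual code, and without the regex pre-splitting and parallelism a real trainer would need:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    # Byte-level: the base vocab is the 256 byte values, like tiktoken.
    ids = list(text.encode("utf-8"))
    merges = {}  # (token_id, token_id) -> new merged token id
    next_id = 256
    for _ in range(num_merges):
        # Count all adjacent pairs in the current token sequence.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent pair
        merges[pair] = next_id
        # Replace every occurrence of the pair with the new token id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", num_merges=2)
# First merge is the most frequent byte pair ("aa"); the sequence shrinks.
```

A bigger merge budget is exactly the vocab-size knob: 128k vocab means ~128k merges, so common words and multi-byte characters collapse into single tokens, which is where the token-count savings come from. For something closer to production, the Hugging Face `tokenizers` library exposes a trainable byte-level BPE.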
Classic example of a provably smart guy not being able to express his thoughts... 5 minutes of pain was all I managed to force myself to watch. A shame.