Good video, I have a question though: if I understand correctly, the tokenizer learns to map each image patch to an integer. Since there is a very large number of possible pixel combinations in each patch, won't the tokenizer's vocabulary become extremely large?