
Coding Llama 3 from scratch in PyTorch - Part 2 

Prince Canuma

In this video series, you will learn how to train and fine-tune a Llama 3 model from scratch.
The goal is to code LLaMA 3 from scratch in PyTorch and create models with 3B, 6B, 22B, 35B, and 45B parameters. In this second video, you'll learn about continued pretraining and LLM benchmarks, and you'll also get to see the results.
🤖 Models:
Llama-3-6B-v0.1: huggingface.co...
Llama-3-6B-v0.1 adapters: huggingface.co...
Llama-3-6B-v0 (Untrained): huggingface.co...
📚Papers:
LoRA: Low-Rank Adaptation of Large Language Models: arxiv.org/abs/...
QLoRA: Efficient Finetuning of Quantized LLMs: arxiv.org/abs/...
💻 To follow along, you can use this Colab notebook:
github.com/Bla...
🎥 Coding Llama 3 from scratch video series
Part 1: • Coding Llama 3 from sc...
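
Below is a minimal sketch of what a QLoRA continued-pretraining setup like the one described above could look like with Hugging Face transformers, peft, and bitsandbytes. The model ID, dataset, LoRA targets, and hyperparameters are placeholders for illustration, not the exact configuration used in the video.

```python
# Hypothetical sketch of QLoRA continued pretraining; the model ID, dataset,
# and hyperparameters below are placeholders, not the video's exact setup.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "prince-canuma/Llama-3-6B-v0"  # placeholder for the untrained base

# 4-bit NF4 quantization with double quantization, as in the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters on the attention projections (illustrative choice of targets)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Continued pretraining uses a plain causal-LM objective on raw text
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-3-6b-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        optim="paged_adamw_32bit",  # the optimizer choice discussed in the comments
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```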

Published: 15 Oct 2024

Comments: 17
@yoanijosias · 4 months ago
Very good, can’t wait to see updates to it.
@princecanuma · 4 months ago
You and me both!
@kishoretvk · 4 months ago
Thanks for committing to open source and educating people on cutting-edge knowledge.
@princecanuma · 4 months ago
Most welcome, it’s my pleasure!
@liyanan2004 · 3 months ago
Could you please make a tutorial on VLMs and how they work, from scratch, like this series of videos?
@princecanuma · 3 months ago
That’s a great idea! 💡 Will do 👌🏽
@spkgyk · 4 months ago
Why do you use a 32-bit paged optimizer when the model is being fine-tuned with QLoRA? Surely QLoRA stores the weights in 8-bit double-quantized form, so using a 32-bit optimizer makes no difference, and the weight updates need to be converted back to 8-bit anyway? Please help me understand this.
@princecanuma · 4 months ago
Additionally, 8-bit states are dequantized to 32-bit for the update anyway. huggingface.co/docs/bitsandbytes/main/en/explanations/optimizers
@spkgyk · 4 months ago
@princecanuma Thank you for the quick response. With 8-bit optimizers, large models can be fine-tuned with 75% less GPU memory without losing any accuracy compared to training with standard 32-bit optimizers. The reduced memory requirements mean 8-bit optimizers are 4x faster than a standard optimizer, and no hyperparameter tuning is required. Surely this means that using 32-bit just wastes compute power? Please correct me if I'm wrong, I'm really trying to understand the benefits. Is it because training with 32-bit means that, despite converting to 8-bit for the weight update, the conversion leads to small accuracy gains?
@princecanuma · 4 months ago
There are no accuracy gains, only reduced GPU usage and potentially some extra speed. In terms of speed, I personally didn't notice any changes: I tested it yesterday, and besides the reduced GPU usage, it took just as long as the 32-bit optimizer to complete training.
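
For reference, the optimizer variants discussed in this thread are selected by name in `TrainingArguments`. A hedged sketch of the two choices is below; the argument values are illustrative only, not the video's configuration.

```python
# Illustrative only: the paged optimizer variant is chosen by name.
# Both variants page optimizer states to CPU to avoid OOM spikes; the 8-bit
# one additionally quantizes the states block-wise, cutting optimizer memory
# further, while updates are still dequantized to 32-bit internally (see the
# bitsandbytes docs linked above).
from transformers import TrainingArguments

args_32bit = TrainingArguments(
    output_dir="out-32bit",
    optim="paged_adamw_32bit",   # full-precision optimizer states, paged
    per_device_train_batch_size=1,
)

args_8bit = TrainingArguments(
    output_dir="out-8bit",
    optim="paged_adamw_8bit",    # quantized optimizer states, paged
    per_device_train_batch_size=1,
)
```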
@leiray7465 · 4 months ago
cool
@princecanuma · 4 months ago
Awesome, I’m happy you liked it :)
@wilfredomartel7781 · 4 months ago
😊🎉
@PaoloTshiyole · 4 months ago
Your English is nice
@princecanuma · 4 months ago
Thank you very much!
@sergey_a · 4 months ago
Why are there only 3 likes, I put 4 on HF.)
@wilfredomartel7781 · 4 months ago
😊