
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | Haibin Lin 

@Scale

In this presentation, I will discuss the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to addressing them. We developed a set of diagnostic tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. We share our operational experience in identifying and fixing failures and stragglers.
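The talk covers the diagnostic tooling itself; as a rough illustration of the kind of straggler check such monitoring enables, the sketch below flags ranks whose per-step times run consistently slower than the fleet median. This is a minimal, hypothetical example under assumed data layout and threshold, not the MegaScale implementation.

```python
from statistics import median

def find_stragglers(step_times_by_rank, slowdown_threshold=1.1):
    """Flag ranks whose average step time exceeds the fleet median
    by more than `slowdown_threshold` (e.g. 1.1 = 10% slower).

    step_times_by_rank: dict mapping rank id -> list of per-step
    durations (seconds) collected by per-rank monitoring (assumed format).
    """
    avg_by_rank = {
        rank: sum(times) / len(times)
        for rank, times in step_times_by_rank.items()
        if times
    }
    fleet_median = median(avg_by_rank.values())
    # Report each straggler with its slowdown ratio relative to the median.
    return {
        rank: avg / fleet_median
        for rank, avg in avg_by_rank.items()
        if avg > slowdown_threshold * fleet_median
    }

# Hypothetical per-rank timings: rank 2 is ~20% slower than its peers.
timings = {
    0: [1.00, 1.01, 0.99],
    1: [1.02, 1.00, 1.01],
    2: [1.21, 1.19, 1.22],
    3: [0.98, 1.00, 1.02],
}
print(find_stragglers(timings))  # {2: ~1.2}
```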

Published: Sep 15, 2024
