
Scheduled Ethernet Fabric for Large-Scale AI Training Clusters

Open Compute Project
14K subscribers
508 views

Pengfei Huo, Sr. Network Architect - ByteDance
S. Kamran Naqvi, Chief Network Architect - Broadcom
Large-scale AI training clusters, hosting tens of thousands of GPUs, are designed to deliver unparalleled computational power for a variety of AI workloads. To fully unleash this power, a highly efficient network fabric connecting these GPUs is essential.
The fabric should support extensive GPU scale-out while maintaining excellent performance, handle diverse parallel workloads with efficient multi-tenancy and job segregation, be resilient to link failures and topology changes in order to reduce checkpoint-related intervention, and be grounded in an open ecosystem for innovation and adaptability.
In this presentation, we explain how the scheduled fabric addresses these essential requirements. We also discuss how ByteDance has benchmarked the fabric in their AI clusters, examining its real-world performance, their deployment plan, and thoughts on broader collaboration within the community.

Science

Published: Apr 30, 2024

Comments: 1
@idrisjafarov357 · 2 months ago
DriveNets is doing great things! The DDC clusters will be leading the AI networking in the near future.