
Broadcom Thor 2: High Performance Ethernet NIC for AI/ML 

Tech Field Day

The large scale of AI/ML clusters requires high-performance networking solutions. In this talk, we will provide an overview of Broadcom’s high-performance Ethernet NIC for AI/ML clusters. Hemal Shah, Distinguished Engineer and Architect, will describe the RoCE and congestion control features of the NIC, a reference AI/ML cluster architecture based on Broadcom switches and NICs, and the benefits of end-to-end networking.
Shah begins with a discussion of the importance of high-performance networking for AI/ML clusters. He emphasizes that as AI/ML workloads increase in complexity and scale, networking becomes crucial for efficient job completion times. Shah provides an overview of Broadcom's Ethernet NIC (Network Interface Card), which is designed to meet the demands of AI/ML clusters.
He explains that AI/ML clusters require networking that can handle large amounts of data and support high-speed, low-latency communication between nodes. Broadcom's NICs and switches are designed to work together to provide end-to-end networking solutions that address these needs.
Shah outlines the key features of Broadcom's 400G NIC, including:
- Support for RDMA over Converged Ethernet (RoCE) and congestion control, which are important for AI/ML workloads.
- The ability to sustain 400 Gb/s bidirectional line rate at low latency to ensure rapid data transfer.
- A PCIe Gen 5 x16 host interface to maintain high throughput.
- Advanced congestion control mechanisms that react to network congestion and optimize traffic flow.
- Security features such as a hardware root of trust to ensure only authenticated firmware runs on the NIC.
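As a rough sanity check on why a PCIe Gen 5 x16 host interface pairs with a 400G port, the arithmetic can be sketched as below. The PCIe figures (32 GT/s per lane, 128b/130b encoding) are standard values and not taken from the talk. The calculation ignores TLP/DLLP protocol overhead, so real usable bandwidth is somewhat lower than the number shown. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check: can a PCIe Gen 5 x16 link feed a 400 Gb/s port?
# Assumption: only line-encoding overhead is modeled; packet (TLP/DLLP)
# overhead is ignored, so this is an upper bound on usable bandwidth.

GT_PER_LANE = 32.0       # PCIe Gen 5 raw signaling rate per lane, in GT/s
LANES = 16               # x16 link width
ENCODING = 128 / 130     # 128b/130b line encoding (used since PCIe Gen 3)
ETHERNET_GBPS = 400.0    # port speed quoted in the talk summary


def pcie_gbps_per_direction(gt_per_lane, lanes, encoding):
    """Per-direction bandwidth in Gb/s after line encoding."""
    return gt_per_lane * lanes * encoding


pcie_gbps = pcie_gbps_per_direction(GT_PER_LANE, LANES, ENCODING)
headroom = pcie_gbps - ETHERNET_GBPS

print(f"PCIe Gen 5 x16: {pcie_gbps:.1f} Gb/s per direction")
print(f"Headroom over 400G: {headroom:.1f} Gb/s ({headroom / ETHERNET_GBPS:.0%})")
```

Under the same assumptions, a Gen 4 x16 link (16 GT/s per lane) works out to roughly 252 Gb/s per direction, which would not keep a 400G port busy.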
Shah also discusses the reference architecture for an AI/ML cluster that incorporates Broadcom switches and NICs, designed to scale to thousands of GPUs and provide robust networking capabilities. He concludes by highlighting the importance of end-to-end fabric management for operating large-scale networks effectively, which includes automation, performance monitoring, and diagnostic capabilities.
Recorded live in Santa Clara, CA on February 1, 2024 as part of Cloud Field Day 19. Watch the entire presentation at techfieldday.com/appearance/b... or visit TechFieldDay.com/event/cfd19/ or www.broadcom.com/products/eth... for more information.

Category: Science

Published: February 4, 2024

Поделиться:

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист
Посмотреть позже
Comments: 1
@lenkin4, 2 months ago
Does Thor 2 have deep memory buffers?