Design, Deploy, and Operate AI Clusters like a Pro with Juniper Networks 

Tech Field Day

Struggling with where to start with your on-prem AI training cluster? Juniper Validated Designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.
Jay Wilson, an architect at Juniper Networks, presented at Cloud Field Day 20, focusing on the deployment and management of AI clusters using Juniper's Apstra software. Wilson emphasized that Apstra is designed to manage data center fabrics rather than entire data centers, highlighting its ability to handle multiple fabrics and even multiple data centers from a single instance. The presentation aimed to demonstrate how Apstra's intent-based automation can simplify the complex processes involved in setting up and maintaining AI training clusters. Wilson, with his extensive background in high-performance computing (HPC), underscored the importance of intent as the foundation of Apstra, ensuring that any changes made are validated against predefined goals, thus preventing misconfigurations.
Wilson provided a detailed walkthrough of how Apstra works, particularly focusing on its telemetry and configuration management capabilities. He explained that Apstra collects custom telemetry data, such as explicit congestion notification (ECN) and priority flow control (PFC) counters, to monitor the health and performance of AI clusters. This data is crucial for maintaining a lossless environment, which is vital for AI workloads. Wilson also discussed the use of configlets, small pieces of code that allow for fine-tuned adjustments to the network configuration. These configlets are essential for tailoring the environment to meet specific needs without disrupting the overall fabric management that Apstra provides.
The presentation also covered the operational aspects of using Apstra, including its anomaly detection and rollback features. Wilson demonstrated how Apstra's single-source-of-truth model ensures that all changes are validated and committed atomically, thus maintaining the integrity of the network. He showed how Apstra can identify and troubleshoot issues in real time, using a combination of service and probe anomalies to pinpoint problems. Additionally, Wilson highlighted the importance of Apstra's ability to roll back configurations to a previous state, a feature that is particularly useful in dynamic environments where multiple teams are making frequent changes. This capability ensures that any unintended disruptions can be quickly mitigated, thereby maintaining the stability and performance of the AI clusters.
Recorded live in Sunnyvale, California on June 12, 2024 as part of Cloud Field Day 20. Watch the entire presentation at techfieldday.com/appearance/j... or visit TechFieldDay.com/event/cfd20/ or www.juniper.net for more information.

Published: June 13, 2024
