
Bill Dally | Directions in Deep Learning Hardware 

Georgia Tech ECE
7K views

Bill Dally, Chief Scientist and Senior Vice President of Research at NVIDIA, gives an ECE Distinguished Lecture on April 10, 2024, at Georgia Tech.
Abstract:
“Directions in Deep Learning Hardware”
The current resurgence of artificial intelligence, including generative AI like ChatGPT, is due to advances in deep learning. Systems based on deep learning now exceed human capability in speech recognition, object classification, and playing games like Go. Deep learning has been enabled by powerful, efficient computing hardware. The algorithms used have been around since the 1980s, but only in the last decade, when powerful GPUs became available to train networks, has the technology become practical.
Advances in deep learning are now gated by hardware performance. In the last decade, the efficiency of DL inference on GPUs has improved by 1,000x. Much of this gain came from improvements in data representation, starting with FP32 in the Kepler generation of GPUs and scaling to Int8 and FP8 in the Hopper generation.
This talk will review this history and discuss further improvements in number representation, including logarithmic representation, optimal clipping, and per-vector quantization.
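As a rough illustration of two of those ideas (a sketch under stated assumptions, not code from the talk), per-vector scaled quantization gives each short vector its own scale factor, and optimal clipping picks the clip threshold that minimizes quantization error. The NumPy functions below are hypothetical; the vector length, bit width, and brute-force threshold sweep are illustrative choices:

```python
import numpy as np

def per_vector_quantize(x, vec_len=16, bits=8):
    """Illustrative per-vector scaled quantization: each short
    vector gets its own scale, so the integer codes track the
    local dynamic range instead of one per-tensor maximum."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    v = x.reshape(-1, vec_len)
    scale = np.abs(v).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # guard all-zero vectors
    q = np.clip(np.round(v / scale), -qmax, qmax).astype(np.int8)
    return q, scale                            # dequantize with q * scale

def optimal_clip(x, bits=8, n_candidates=64):
    """Illustrative MSE-optimal clipping: a small threshold clips
    outliers but gives fine quantization steps; a large one keeps
    outliers but wastes resolution. Sweep candidate thresholds and
    keep the one with minimum mean-squared error."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    best_t, best_mse = amax, np.inf
    for t in np.linspace(amax / n_candidates, amax, n_candidates):
        step = t / qmax                        # quantization step size
        xq = np.clip(np.round(x / step), -qmax, qmax) * step
        mse = np.mean((xq - x) ** 2)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t
```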
BIOGRAPHY:
Bill Dally joined NVIDIA in January 2009 as chief scientist, after spending 12 years at Stanford University, where he was chairman of the computer science department. Dally and his Stanford team developed the system architecture, network architecture, signaling, routing and synchronization technology that is found in most large parallel computers today.
Dally was previously at the Massachusetts Institute of Technology from 1986 to 1997, where he and his team built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanism from programming models and demonstrated very low overhead synchronization and communication mechanisms. From 1983 to 1986, he was at California Institute of Technology (CalTech), where he designed the MOSSIM Simulation Engine and the Torus Routing chip, which pioneered “wormhole” routing and virtual-channel flow control.
He is a member of the National Academy of Engineering, a Fellow of the American Academy of Arts & Sciences, a Fellow of the IEEE and the ACM, and has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, and the ACM Maurice Wilkes Award. He has published over 250 papers, holds over 120 issued patents, and is an author of four textbooks.
Dally received a bachelor's degree in Electrical Engineering from Virginia Tech, a master’s in Electrical Engineering from Stanford University and a Ph.D. in Computer Science from CalTech. He was a cofounder of Velio Communications and Stream Processors.

Published: 11 Apr 2024

Comments: 7
@katechen9458 · 2 days ago
Great talk. He covered the power and bandwidth issues and tried to improve from the density point of view. Sometimes we divide things up and focus on a limited area, but forget the big picture.
@fabianwenger7133 · A month ago
“Pick some domain that is ripe for acceleration and do the hardware-software co-optimization.” A job well done, with detailed insight delivered in a down-to-earth manner.
@Wobbothe3rd · A month ago
This man deserves a congressional medal of freedom award.
@radicalrodriguez5912 · A month ago
Great hosting, talk, and questions. Thanks for uploading it.
@gesitsinggih · A month ago
A lot of useful information, but he focuses on inference compute density, while the actual bottleneck is DRAM bandwidth. You will hardly get 10% inference compute utilization on the best hardware, even when maxing out the practical batch size. The headline FLOPS number is eye-catching, but they have to be more honest about real usage.
@BlockDesignz · 13 days ago
Wrong. He's talking about a serving setting, where you'll have N users querying your service at any one time. If N is large enough (I'm talking 10^3), the problem becomes compute-bound again!
@gesitsinggih · 13 days ago
@BlockDesignz True, but in practice no one has a batch size large enough to be compute-bound. My critique is that they grew compute far more than they grew memory bandwidth.
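For readers following this thread, a back-of-the-envelope roofline check makes the disagreement concrete. The hardware numbers below are hypothetical (not from the talk or any datasheet); in weight-dominated transformer decode, each weight byte read from DRAM supports roughly 2*B flops for a batch of B users:

```python
# Toy roofline: when does decode serving flip from bandwidth-bound
# to compute-bound? All numbers are assumed, for illustration only.
peak_flops = 2e15      # hypothetical dense int8 throughput, flop/s
mem_bw     = 3e12      # hypothetical DRAM/HBM bandwidth, bytes/s

for batch in (1, 32, 1024):
    intensity  = 2 * batch                     # flops per weight byte
    attainable = min(peak_flops, intensity * mem_bw)
    print(f"batch={batch:4d}  compute utilization ~ {attainable / peak_flops:.1%}")
```

With these assumed numbers, small batches sit far below 10% utilization (the first comment's point), and utilization only saturates at batch sizes in the thousands (the reply's point).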