
Hardware-aware Algorithms for Sequence Modeling - Tri Dao | Stanford MLSys #87 

Stanford MLSys Seminars

Episode 87 of the Stanford MLSys Seminar Series!
Hardware-aware Algorithms for Sequence Modeling
Speaker: Tri Dao
Abstract:
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length.
In the first half, we describe attention approximation algorithms using sparsity and low-rank structures, as well as algorithms (e.g. FlashAttention) to achieve fast and memory-efficient exact attention. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory) one can speed up attention by 4-8x, enabling 4-16x longer context in Transformers and yielding higher quality models. We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time.
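For intuition, here is a minimal NumPy sketch of the online-softmax tiling idea behind IO-aware exact attention. The real FlashAttention is a fused CUDA kernel; the block size and the single-head, unmasked setup below are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of exact attention computed block-by-block over K/V, so the
# full N x N score matrix is never materialized (the core IO-aware idea).
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact softmax attention, processing K/V in blocks with an online softmax."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per row

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                      # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)      # rescale earlier partial results
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Matches the result of materializing the full score matrix:
Q, K, V = (np.random.randn(256, 32) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```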
In the second half, we describe recent progress on subquadratic-time architectures such as RNNs, gated convolution, and structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture (Mamba) without attention or even MLP blocks. Mamba matches or exceeds the performance of strong modern Transformers on language modeling.
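As a rough illustration of the selection mechanism, the sketch below runs a selective SSM recurrence in which the step size and the B/C matrices are functions of the input, which is what lets the model keep or discard information based on content. The dimension names, simple linear projections, and Euler-style discretization are assumptions for illustration and do not reproduce the actual Mamba code.

```python
# Hedged sequential sketch of a selective SSM: Delta, B, and C depend on x_t.
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """x: (T, D) inputs; A: (D, N) diagonal state matrix per channel. Returns (T, D)."""
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((T, D))
    for t in range(T):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # input-dependent step size, (D,)
        B = x[t] @ W_B                              # input-dependent input matrix, (N,)
        C = x[t] @ W_C                              # input-dependent output matrix, (N,)
        Abar = np.exp(delta[:, None] * A)           # discretized state transition, (D, N)
        Bbar = delta[:, None] * B[None, :]          # simplified discretization of B
        h = Abar * h + Bbar * x[t][:, None]         # selective state update
        y[t] = h @ C                                # read out through C_t
    return y

T, D, N = 16, 8, 4
rng = np.random.default_rng(0)
y = selective_ssm(rng.standard_normal((T, D)),
                  -np.abs(rng.standard_normal((D, N))),   # negative A for stability
                  rng.standard_normal((D, D)),
                  rng.standard_normal((D, N)),
                  rng.standard_normal((D, N)))
print(y.shape)  # (16, 8)
```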
Bio:
Tri Dao is an incoming Assistant Professor at Princeton University and is currently chief scientist of Together AI. He completed his PhD in Computer Science at Stanford, co-advised by Christopher Ré and Stefano Ermon. He works at the intersection of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work has received the ICML 2022 Outstanding paper runner-up award.
--
Stanford MLSys Seminar hosts: Avanika Narayan, Benjamin Spector, Michael Zhang
Twitter: @avanika15, @bfspector, @mzhangio
--
Check out our website for the schedule: mlsys.stanford.edu
Join our mailing list to get weekly updates: groups.google.com/forum/#!for...
#machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford

Science

Published: 16 Jan 2024

Comments: 4
@jjh5474 (5 months ago)
Thank you for sharing this insightful video. In the introduction of Mamba, it says "parallelizable training"; can you explain how parallel training is possible in an autoregressive model?
@robertjflynn4206 (5 months ago)
Teacher forcing
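For readers wondering what teacher forcing looks like in practice, here is a small hypothetical PyTorch sketch: the ground-truth tokens are fed as inputs with a causal mask, so the loss at every position is computed in one forward pass rather than token by token.

```python
# Toy teacher-forcing setup: inputs are the true tokens, targets are shifted by one,
# and a causal mask keeps each position from attending to the future.
import torch
import torch.nn as nn

vocab, d_model, T, batch = 100, 64, 32, 4
emb = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (batch, T + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]        # predict the next token
causal = nn.Transformer.generate_square_subsequent_mask(T)

logits = head(block(emb(inputs), src_mask=causal))     # all T positions in one pass
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                        # gradients for every position at once
```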
@icriou (5 months ago)
Follow this video and you will get a hands-on understanding of why an AR model can be trained in parallel. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-kCc8FmEb1nY.html
@matthewnorton2315 (5 months ago)
I think you might be looking for the "selective scan" part of Mamba. In section 3.3.2 of the paper arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf, they say "To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023)". In short, they use a well known parallel algorithm trick to calculate a prefix sum. See en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithms and you'll notice the similarity. Hope this helps!
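As a toy illustration of that parallel-scan trick: the linear recurrence h_t = a_t * h_{t-1} + b_t is a composition of affine maps, and composing (a1, b1) then (a2, b2) gives (a1*a2, a2*b1 + b2), which is associative, so an inclusive parallel scan computes all T states in O(log T) rounds. The NumPy sketch below is a simplified scalar version, not Mamba's selective scan over full SSM states.

```python
# Parallel prefix scan for h_t = a_t * h_{t-1} + b_t using an associative combine.
import numpy as np

def combine(left, right):
    """Compose two affine segments (a, b): apply left first, then right."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def parallel_linear_recurrence(a, b):
    """Inclusive scan over (a_t, b_t); returns h_t for h_0 = 0."""
    a, b = a.copy(), b.copy()
    T, step = len(a), 1
    while step < T:                     # log2(T) rounds; each round is parallelizable
        a[step:], b[step:] = combine((a[:-step], b[:-step]), (a[step:], b[step:]))
        step *= 2
    return b                            # with h_0 = 0, b_t now holds h_t

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 64), rng.standard_normal(64)
h, ref = 0.0, []
for at, bt in zip(a, b):                # sequential recurrence for comparison
    h = at * h + bt
    ref.append(h)
assert np.allclose(parallel_linear_recurrence(a, b), ref)
```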