
The State Space Model Revolution, with Albert Gu 

Cognitive Revolution "How AI Changes Everything"
Published: Oct 2, 2024

Comments: 10
@wwkk4964 · 3 months ago
Albert is the man! Thank you!
@mkamp · 2 months ago
That’s an awesome episode. Very high information density. I keep rewinding to hear the exact framing of the questions and answers again.
@charliesteiner2334 · 3 months ago
This one was a good'un.
@CognitiveRevolutionPodcast · 3 months ago
Thanks. Mamba🐍😄!
@mkamp · 2 months ago
At 1:15 Albert says that doubling the state size in Mamba 1 doubles the wall-clock time, and that in Mamba 2 much of the computation is not contingent on the state size. Why the latter? Is the computation time roughly constant because it’s one matmul happening in parallel as one step on the GPU?
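
To make that question concrete, here is a toy NumPy sketch (an illustration only, not the actual Mamba or Mamba-2 kernels; all sizes and variable names are made up) of where the state size N enters each formulation: in the scan, every serial step updates the full (D, N) state, while in the SSD-style quadratic form N appears only as the contraction dimension of one dense matmul, and the second matmul does not involve N at all.

```python
# Toy NumPy sketch (not the real Mamba kernels; sizes and names are illustrative)
# of where the SSM state size N enters the two formulations.
import numpy as np

B, L, D, N = 2, 64, 16, 32            # batch, sequence length, channels, state size
rng = np.random.default_rng(0)

x = rng.standard_normal((B, L, D))    # input
a = rng.uniform(0.1, 0.9, (B, L))     # per-step scalar decay (stand-in for exp(dt*A))
Bm = rng.standard_normal((B, L, N))   # input projection  ("B" in SSM notation)
C = rng.standard_normal((B, L, N))    # output projection ("C" in SSM notation)

# Mamba-1-style recurrence: a serial loop over L in which every step updates the
# full (B, D, N) state, so doubling N roughly doubles the work of every serial step.
h = np.zeros((B, D, N))
y_scan = np.empty((B, L, D))
for t in range(L):
    h = a[:, t, None, None] * h + x[:, t, :, None] * Bm[:, t, None, :]  # (B, D, N)
    y_scan[:, t] = np.einsum("bdn,bn->bd", h, C[:, t])

# Mamba-2 / SSD-style quadratic ("attention-like") form of the same map:
# N appears only as the contraction dimension of the C @ B^T matmul; the second,
# (L x L) @ (L x D) matmul does not involve N at all. Both are dense matmuls that
# a GPU runs in parallel, so growing N adds FLOPs to one parallel kernel instead
# of lengthening every step of a serial scan.
cum = np.cumsum(np.log(a), axis=1)                          # (B, L)
decay = np.tril(np.exp(cum[:, :, None] - cum[:, None, :]))  # decay[b,l,s] = product of a over (s, l]
G = decay * np.einsum("bln,bsn->bls", C, Bm)                # (B, L, L): N contracted away
y_mat = np.einsum("bls,bsd->bld", G, x)                     # (B, L, D): no N here

assert np.allclose(y_scan, y_mat)   # same output, two compute paths
```

The actual Mamba-2 algorithm applies this quadratic form within chunks and carries a small state recurrence across chunk boundaries, but the division of labor is the same, which is roughly why wall-clock time is much less sensitive to the state size than in the Mamba 1 scan.
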
@augmentos · 2 months ago
That was a two-time watch.
@mkamp · 2 months ago
When Albert says, multiple times, that they avoid materializing the state, it sounds like they don’t materialize the state at all during the forward pass in training. Does he mean exactly that? Or that they avoid materializing the full state at once, but do materialize the whole state incrementally, chunk by chunk?
@clray123 · 1 month ago
From the Mamba paper: Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switched into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time). ... Note that the recurrent mode is more flexible than the convolution mode, since the latter (3) is derived from expanding the former (2) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021). However, this would require computing and materializing the latent state h with shape (B, L, D, N), which is much larger (by a factor of N, the SSM state dimension) than the input x and output y of shape (B, L, D). Thus the more efficient convolution mode was introduced which could bypass the state computation and materializes a convolution kernel (3a) of size only (B, L, D).
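
To put rough numbers on that factor-of-N gap, here is a quick back-of-envelope calculation (the concrete sizes below are made up for illustration, not taken from the paper):

```python
# Back-of-envelope memory arithmetic for the shapes quoted above.
# The concrete sizes are made up for illustration, not taken from the paper.
B, L, D, N = 8, 2048, 2048, 16        # batch, sequence length, channels, SSM state size
bytes_per_el = 2                      # fp16/bf16

x_or_y = B * L * D * bytes_per_el         # input x or output y: (B, L, D)
full_h = B * L * D * N * bytes_per_el     # latent state at every step: (B, L, D, N)
conv_k = B * L * D * bytes_per_el         # materialized convolution kernel (3a): (B, L, D)

print(f"x or y per layer:       {x_or_y / 2**20:8.1f} MiB")
print(f"full state h per layer: {full_h / 2**20:8.1f} MiB  ({N}x larger)")
print(f"conv kernel per layer:  {conv_k / 2**20:8.1f} MiB")
```

As the Mamba paper describes its hardware-aware scan, the state h is still computed, but only inside fast on-chip SRAM, and only the (B, L, D) output is written back to GPU main memory; "not materializing the state" means not writing that N-times-larger tensor out, not skipping the state computation altogether.
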
@clray123 · 1 month ago
The interviewer should focus on interviewing instead of talking about himself.