Deep dive into how Mamba works - Linear-Time Sequence Modeling with SSMs - Arxiv Dives

Arxiv Dives is part of a reading group that gets together every Friday to dig into state of the art research that relates to Machine Learning and Artificial Intelligence. If you would like to join the live discussion we would love to have you!
Join here: lu.ma/oxenbook...
Each week we dive deep into a topic in ML/AI. Whether it is a research paper, a blog post, a book, or a YouTube video, we break down the content into a digestible format and have an open discussion with the Oxen.ai team and anyone else who wants to join. We try to cover the content at a high level so that anyone can understand it, and then dive into deeper technical details to get a clearer understanding.
This week we cover the Mamba model architecture which claims 5x faster throughput than Transformers and scales linearly instead of quadratically with the length of the sequence. Its performance shows promise on data up to million-length sequences.
The notes can all be found on the Oxen.ai blog:
blog.oxen.ai/m...

Published: Sep 8, 2024

Comments: 30

@vishalrajput9856 8 months ago

In Selective SSMs, the state update rule can change dramatically depending on the input, as the parameters of the model are input-dependent. This means the model can adapt its behavior much more fluidly to the specifics of the data it encounters at each step. Unlike LSTM/GRU, where the gating mechanisms are preset and learn to modulate information flow within a fixed framework, Selective SSMs can redefine their information processing strategy on the fly. This leads to potentially more nuanced and context-sensitive handling of sequences.
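
The input-dependent update described in this comment can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `selective_ssm`, the projection matrices `W_B`, `W_C` and `W_delta`, and the simplified discretization of B (plain `delta * B`) are assumptions made for the example.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_delta):
    """Sequential selective-SSM scan (illustrative sketch).

    x:       (L, D) input sequence
    A:       (D, N) diagonal state matrix per channel (negative entries)
    W_B:     (D, N) hypothetical projection producing B_t from x_t
    W_C:     (D, N) hypothetical projection producing C_t from x_t
    W_delta: (D, D) hypothetical projection producing the step size
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))          # one N-dimensional state per channel
    ys = np.empty((L, D))
    for t in range(L):
        # B, C and the step size are recomputed from the current input.
        B_t = x[t] @ W_B                              # (N,)
        C_t = x[t] @ W_C                              # (N,)
        delta_t = np.log1p(np.exp(x[t] @ W_delta))    # softplus, (D,)
        # Discretize with the input-dependent step size.
        A_bar = np.exp(delta_t[:, None] * A)          # (D, N)
        B_bar = delta_t[:, None] * B_t[None, :]       # (D, N), simplified
        # State update and readout for this time step.
        h = A_bar * h + B_bar * x[t][:, None]
        ys[t] = h @ C_t
    return ys

rng = np.random.default_rng(0)
L, D, N = 6, 3, 4
x = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=(D, N)))
W_B, W_C = rng.normal(size=(D, N)), rng.normal(size=(D, N))
W_delta = rng.normal(size=(D, D))
y = selective_ssm(x, A, W_B, W_C, W_delta)
print(y.shape)
```

Because `A_bar`, `B_bar` and `C_t` all depend on `x[t]`, each token decides how much of the old state to keep and how much of itself to write, which is the "selective" part.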

@axe863 8 months ago

This flexibility can be extremely dangerous especially if there's episodic transient structure in the data and one leverages the model. Test it with super exponential boom - violent reversal bust with non-exponential failure waiting time and look at its performance.

@BooleanDisorder 8 months ago

Is it possible to use selective SSM in transformers (on top of attention)?

@Jexep 8 months ago

Thank you. I tried for a while to find a way to understand the equation, and this is the easiest explanation I have found so far.

@DanielCardenas1 8 months ago

Thanks so much! The video is so helpful for beginning to understand Mamba.

@gregschoeninger 8 months ago

You're welcome! Glad you find it helpful. It was fun diving in and breaking it down.

@piratepartyftw 8 months ago

It's not really "linear time." It's actually O(N log N). The thing is that the log(N) factor is set when you build the architecture. You have to scale the dimensionality of the hidden state with log(N) of the longest expected sequence you wanna process. So at runtime, it feels linear. But really there's a sneaky log(N) in the system, because you needed to design your architecture based on expected context length. This is a good thing though. Approximating a transformer with O(N) just isn't possible, so the fact that it's O(N log N) is good, because it means this architecture might be for real, instead of smoke and mirrors like some previous "linear" transformer replacements.

@oxen-ai 8 months ago

Ah, that is a good call out. Can you go into a little more detail of where that log(N) complexity shows up in the model? Is it hidden in the SSM recurrent module or input layers or how you feed the data in?

@albertmashy8590 8 months ago

It kind of is linear because the N is somewhat constant. Additionally, a lot of tokens in a context are meaningless and useless and can be more effectively compressed into a state space.

@jonatan01i 8 months ago

@albertmashy8590 Isn't all this based on convolutions? For convolutions there is the Fast Fourier Transform, which runs in n*log(n).

@oxen-ai 8 months ago

@jonatan01i Mamba uses a conv layer outside the SSM block, right after the first linear projection, but the full Mamba block is still very similar to other RNN cells according to the paper.

@jonatan01i 8 months ago

@oxen-ai I just want to share, regarding the question in this thread, that the FFT has a log(n) "property" and that in these SSMs (H3, Hyena, ...) the whole idea is to use the FFT because of its log(n) property. I haven't read the paper from this video, but I guess that sharing what I know can be somewhat useful for the question of where this log(n) comes from.
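
The FFT route mentioned in this thread applies when the SSM parameters are the same at every time step: the recurrence then unrolls into a convolution with the kernel K[k] = C A^k B, which can be evaluated in O(L log L). A minimal sketch under that assumption (the function names are made up for this example) checks that the two routes agree. The Mamba paper notes that its selective, input-dependent variant cannot be written as a convolution and uses a scan instead.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Time-invariant SSM evaluated step by step: O(L) sequential updates."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_fft_conv(A_bar, B_bar, C, x):
    """Same SSM evaluated as a causal convolution via the FFT."""
    L = len(x)
    # Build the kernel K[k] = C @ A_bar^k @ B_bar.
    K = np.empty(L)
    v = B_bar.copy()
    for k in range(L):
        K[k] = C @ v
        v = A_bar @ v
    # Zero-pad to 2L so the circular convolution does not wrap around.
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(K, n), n)
    return y[:L]

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = np.diag(rng.uniform(0.1, 0.9, N))   # stable diagonal state matrix
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)
print(np.allclose(ssm_recurrent(A_bar, B_bar, C, x),
                  ssm_fft_conv(A_bar, B_bar, C, x)))
```

The kernel is built naively here for clarity; only the convolution step itself is O(L log L).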

@mkamp 7 months ago

Awesome video. Thanks for providing it. Looking forward to the progression in this space.

@anirudh514 7 months ago

Thank you for explaining the paper in a simple way!!

@oxen-ai 7 months ago

Welcome 😊

@Robert_McGarry_Poems 8 months ago

I think the reason that it keeps its context, or learns from its previous state so readily, is the fact that it is a layer-to-layer process.

@dr.mikeybee 8 months ago

In the context of the equation h'(t) = Ah(t) + Bx(t), h'(t) is not a traditional derivative in the sense of calculus. Instead, it’s a notation used to represent the next state of the system. Normally one would use t+1, but this system has yet to be discretized, so the notation h' is used instead. Here, h(t) represents the state of the system at time t, and h'(t) represents the state of the system at the next time step. The equation h'(t) = Ah(t) + Bx(t) describes how the current state h(t) and the input x(t) influence the next state h'(t). So, while h'(t) might look like a derivative, it’s actually a notation used in the context of state space models to represent the next state of the system.

@oxen-ai 8 months ago

Ah thank you for clarifying! I was thinking of it as how the system “updates” or “changes” so my mind immediately went to rate of change or derivative.
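
For comparison, the Mamba paper writes h'(t) = Ah(t) + Bx(t) as a continuous-time differential equation and only reaches a "next state" recurrence after discretizing it with a step size Δ (zero-order hold). Under that reading h'(t) is a time derivative. The sketch below, for a scalar state and with made-up function names, checks that the discretized update matches a fine-grained numerical integration of the ODE over one step.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization for a scalar (or diagonal) system:
    a_bar = exp(delta * a), b_bar = (exp(delta * a) - 1) / a * b."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def euler_integrate(a, b, h0, x, delta, substeps=10_000):
    """Integrate h'(t) = a*h(t) + b*x over one interval of length delta,
    holding the input x constant, using many small Euler steps."""
    dt = delta / substeps
    h = h0
    for _ in range(substeps):
        h = h + dt * (a * h + b * x)
    return h

a, b, delta, h0, x = -0.5, 1.0, 0.1, 0.3, 2.0
a_bar, b_bar = zoh_discretize(a, b, delta)
h_next = a_bar * h0 + b_bar * x           # discrete update: h_t from h_{t-1}
print(abs(h_next - euler_integrate(a, b, h0, x, delta)) < 1e-5)
```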

@mkamp 7 months ago

28:20 Thanks for the walkthrough. What I have trouble understanding is how to reconcile the self-attention approach, where input length and output length are the same, with an SSM that potentially has only a single state. How do you stack those layers if you integrate S6 layers into a transformer-like architecture? If you only return the single hidden state you throw away the history. Which is fine in terms of efficiency, but what do you do in the layer above? Then you only have this single state as input?
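
One way to picture the stacking: an SSM layer carries a single hidden state forward in time, but it still emits one output y_t = C h_t per input position, so the output sequence has the same length as the input and the next layer receives a full sequence. A minimal sketch, using fixed (non-selective) parameters for brevity and hypothetical names `ssm_layer` and `stack`:

```python
import numpy as np

def ssm_layer(x, A_bar, B_bar, C):
    """Map an (L, D) sequence to an (L, D) sequence.

    A_bar, B_bar, C: (D, N) per-channel diagonal SSM parameters.
    Returns the per-position outputs and the final hidden state.
    """
    L, D = x.shape
    h = np.zeros_like(A_bar)              # (D, N) state, carried along time
    y = np.empty_like(x)
    for t in range(L):
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = (h * C).sum(axis=1)        # one output per input position
    return y, h

def stack(x, layers):
    """Apply several SSM layers in sequence with residual connections."""
    for params in layers:
        y, _ = ssm_layer(x, *params)      # the final state is not passed upward
        x = x + y
    return x

rng = np.random.default_rng(0)
L, D, N = 5, 3, 4

def make_params():
    return (rng.uniform(0.1, 0.9, (D, N)),
            rng.normal(size=(D, N)),
            rng.normal(size=(D, N)))

layers = [make_params(), make_params()]
x = rng.normal(size=(L, D))
out = stack(x, layers)
print(out.shape)
```

The hidden state only needs to be kept when generating token by token, where each layer holds its own state between steps.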

@dr.mikeybee 8 months ago

This is a helpful video.