Mamba - a replacement for Transformers?

Samuel Albanie
20K subscribers
250K views

Published: 26 Sep 2024

Comments: 167
@qwerasdliop2810
@qwerasdliop2810 9 месяцев назад
Insane, I loved the way you went through multiple important prior papers before talking about mamba!
@looksintolasers
@looksintolasers 8 месяцев назад
Depth-first search of the dependency tree of papers :)
@shiholololo1053
@shiholololo1053 9 месяцев назад
Stanford labs are thriving right now. To think all this work is made OPEN-SOURCE in a period of hostile and fierce competition among the big tech companies.
@nikoladjordjevic4477
@nikoladjordjevic4477 9 месяцев назад
The original Transformer was open-sourced by Google. Also, GPT and GPT-2 were open source. This is no surprise to those in the community.
@8191-m8t
@8191-m8t 9 месяцев назад
2 Timothy 3:16 New World Translation of the Holy Scriptures (Study Edition) 16 All Scripture is inspired of God+ and beneficial for teaching,+ for reproving, for setting things straight,+ for disciplining in righteousness,+
@patrickangel4880
@patrickangel4880 9 месяцев назад
Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research
@peterbennett2301
@peterbennett2301 9 месяцев назад
Is not Mathematics the language of God?
@dezh6345
@dezh6345 8 месяцев назад
@@nikoladjordjevic4477 Those companies all turned closed source once money got involved.
@rabbit-hole-research
@rabbit-hole-research 9 месяцев назад
Thank you for such a good survey of the prior work! Your effort is noted and appreciated!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Much appreciated!
@Fritz0id
@Fritz0id 9 месяцев назад
Thanks for this, I feel caught up again! I've seen several papers popping up with alternatives to the transformer architecture, but I lacked a framework to grok them. The way you put this paper in a broader context, both in terms of the new benchmark for long range arenas and the emphasis on "no free lunch" w/re to LTI vs SSM was really helpful.
@triplanetary
@triplanetary 9 месяцев назад
Can you send some links to those papers on alternative transformer architectures?
@BradNeuberg
@BradNeuberg 9 месяцев назад
Always appreciate your excellent video explanations of cutting edge papers, thanks!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@Rojfos
@Rojfos 9 месяцев назад
That's really high quality content. I also really like the way you highlight the text when you read over it; this makes it easier to follow along!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@MeanGeneHacks
@MeanGeneHacks 9 месяцев назад
Hope the open source community builds on this
@dinoscheidt
@dinoscheidt 9 месяцев назад
Well, get on it. The open source community is also 🫵
@ItsRyanStudios
@ItsRyanStudios 9 месяцев назад
WE are the open source community ☺️
@rrestoring_faith
@rrestoring_faith 9 месяцев назад
The authors already keep their code open source so the work is replicable. It's common practice in ML research.
@borregoayudando1481
@borregoayudando1481 9 месяцев назад
All you need is Mambas?
@rjarpa
@rjarpa 9 месяцев назад
except for GPT-3 and 4 XD @@rrestoring_faith
@adamshaw46
@adamshaw46 9 месяцев назад
I really, really like the build-up of ideas through papers. It's a great way to introduce the idea while giving references that we can look up and trace ourselves. Coming onto the scene with no context of the last few years of research, it provides a neat overview.
@mkamp
@mkamp 8 месяцев назад
Absolutely fantastic. Personally, I would be happy to watch a much longer video: same structure, just slower and broken down a bit more. This is not a complaint. The video is awesome as it is. Just feedback.
@SethuIyer95
@SethuIyer95 9 месяцев назад
The crux of this network's performance lies in the fact that they use coefficients of Legendre polynomials as a basis, which allows the information to be highly compressed with minimal information loss. Thinking about sequence memory, it moves away from iterative or recursive processing to a more holistic, algebraic form of memory management.
@xyh6552
@xyh6552 9 месяцев назад
In line with your viewpoint, this work is actually similar to using FFT to process n-bit multiplication.
@christophkogler6220
@christophkogler6220 9 месяцев назад
@@xyh6552 I think it basically is a high dimensional FFT that's tracking location in the model's similarly high dimensional memory/association space. Should provide near-perfect representation, recall, and higher efficiency for recurrent networks.
@derghiarrinde
@derghiarrinde 9 месяцев назад
U lost me at "legendre"
@SethuIyer95
@SethuIyer95 9 месяцев назад
@@xyh6552 Yep, FFT uses a Fourier basis; this uses a Legendre basis.
@xyh6552
@xyh6552 9 месяцев назад
@christophkogler6220 Similar to your viewpoint, from the perspective of solving the Kakeya conjecture in finite fields, I believe the main idea is to utilize the rigidity of polynomials to achieve efficient compression. I speculate that the effect of utilizing the relationship between polynomials and roots in polynomial splitting fields is essentially replacing one "n" in the complexity with "logn"
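To make the Legendre-basis intuition in this thread concrete, here is a minimal NumPy sketch (illustrative only, not the HiPPO/S4 update rule from the paper): it compresses a long 1-D signal into a small, fixed number of Legendre coefficients and reconstructs an approximation from that compressed state. The signal, sizes, and error metric are made up for the example.

```python
# Minimal sketch of the idea discussed above: compress a long 1-D signal into a
# fixed number of Legendre coefficients and reconstruct an approximation.
# This illustrates the Legendre-basis intuition, not the actual HiPPO/S4 recurrence.
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(0)
T, N = 1024, 32                     # sequence length, number of coefficients kept

t = np.linspace(-1.0, 1.0, T)       # Legendre polynomials live on [-1, 1]
u = np.sin(6 * np.pi * t) + 0.1 * rng.standard_normal(T)   # a toy input sequence

coeffs = L.legfit(t, u, deg=N - 1)  # project the sequence onto N Legendre basis functions
u_hat = L.legval(t, coeffs)         # reconstruct the sequence from the compressed state

compression = T / N
error = np.sqrt(np.mean((u - u_hat) ** 2))
print(f"{compression:.0f}x compression, RMS reconstruction error ≈ {error:.3f}")
```

The point of the sketch is only that a small, fixed-size coefficient vector can summarize a much longer history with modest reconstruction error, which is the property the memory mechanism exploits.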
@alileevil
@alileevil 9 месяцев назад
Honestly, how do you make sense of these papers? I've listened to the whole video and still haven't got a clue what it is about. There are quite a lot of brilliant people out there who do work like this.
@drayg0n806
@drayg0n806 9 месяцев назад
I noticed that @havenhq had tuned a chat version of the pretrained Mamba-2.8B on huggingface. I played it on colab and it feels like a decent chatbot already. I'm very excited about the future of this architecture
@ArnavMondal14
@ArnavMondal14 9 месяцев назад
You have any code for it?
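For anyone asking about code: a rough Colab-style sketch of how one might try the chat-tuned Mamba checkpoint mentioned above. The package layout (`mamba_ssm`, `MambaLMHeadModel.from_pretrained`), the `havenhq/mamba-chat` checkpoint name, its tokenizer, and the prompt format are assumptions based on the comment and may differ from the actual repos.

```python
# Rough sketch of trying the chat-tuned Mamba checkpoint mentioned above.
# ASSUMPTIONS: the mamba_ssm package exposes MambaLMHeadModel.from_pretrained,
# "havenhq/mamba-chat" is stored in a compatible format on the Hugging Face Hub,
# and it ships a tokenizer; the zephyr-style prompt tags below may not be exact.
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("havenhq/mamba-chat")
model = MambaLMHeadModel.from_pretrained("havenhq/mamba-chat",
                                         device=device, dtype=torch.float16)

prompt = "<|user|>\nExplain state space models in one paragraph.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate a continuation; the real chat fine-tune may expect a specific chat template.
out = model.generate(input_ids=input_ids, max_length=input_ids.shape[1] + 200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```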
@johnny02199
@johnny02199 9 месяцев назад
Thanks for the video, would love to have a more detailed explanation based on the related works from before!
@Dart_ilder
@Dart_ilder 6 месяцев назад
I liked this video so much that I reached for the like button 3 times while watching it. Awesome context on S4. This is extremely helpful for getting the context and stripping the hype to get to the meaning. That's definitely a sub and I am off to watch all the other videos
@kobilica999
@kobilica999 9 месяцев назад
Man, those papers include hardcore numerical linear algebra :D
@HaganeNoGijutsushi
@HaganeNoGijutsushi 6 месяцев назад
S4 seems to go the hardest with its convolutional trick, but then everyone else goes "fuck this complicated shit, it's too constraining, let's just parallelize more!" and honestly if I had been the one coming up with that clever math I'd feel so cheated 😂.
@Ben_D.
@Ben_D. 9 месяцев назад
I need an ‘explain it like I’m five’ version of this. 😄 But I hope it means something strong is coming down the pipe.
@christophkogler6220
@christophkogler6220 9 месяцев назад
Actual ELI5: Many current AI models rely on 'MLP (Multi-Layer Perceptron)' and 'Transformer' blocks in their design. The "problematic" (but also usually the 'smart') one is the 'Transformer' block. These need more and more resources to process the context as the context size increases, making scaling up VERY difficult - for an 8x larger context you need about 64x the resources. This is because Transformers compare every part of the context to every other part of the context, every time.

The Mamba architecture swaps out both the MLP and Transformer blocks for the new 'Mamba' block. It needs the same amount of resources for an increase in context size no matter how large the context already is. For an 8x larger context, you would only need about 8x the resources. That means that - compared to a Transformer-based model - you could give it way more input at once and get way more output at once, with the same memory resources. If the method works at larger scales, Mamba could be another significant step forward for AI capabilities.

Most current public-facing LLMs, like ChatGPT, use Transformers in their architecture. Transformers include 'self-attention', which basically weighs the importance of everything against everything else, all at once. This means they process any input in approximately O(N^2) time and memory (where N is the input length). As input / context length increases, their demands scale incredibly high. Anybody with a decent GPU technically CAN run a local LLM, it's just small, slow, and dumb. To run anything decent, you end up needing tens (maybe even hundreds) of gigabytes of extremely fast memory, which means workstation GPUs that cost thousands, or even entire GPU clusters.

The Mamba architecture is basically an entirely different TYPE of AI, more similar to a Recurrent Neural Network, and is both faster and more memory efficient. It processes and considers information sequentially, instead of all at once, but can ALSO ignore unimportant information. The architecture would be able to process an input in approximately O(n+L) time and memory, where n is essentially some constant and L is input length. If it continues to work so efficiently at increasingly large scales, it means literally orders of magnitude faster output and lessened memory requirements for a given context window, which can allow model context size to be increased massively while still using less computational resources than the previous methods.

This part is basically educated guesswork, as this level of linear algebra / ML theory is a fair bit over my head: I think Legendre memory cells basically create a high dimensional Fast Fourier Transform from the Legendre polynomials (1:25), which fits neatly into my mental model of AI. In a certain mathematical sense, everything an AI knows can be represented as an incredibly complex interplay of gradients/associations between multidimensional vectors. A multidimensional FFT thus allows you to track your 'location' (the context) within this interplay as the model works, efficiently and with little (to near-zero) accuracy loss. They also allow you to accurately recompose the input from memory, thus allowing the model to efficiently recall or ignore only the parts of the input that matter. The 'importance' of any specific part of the input to the output is probably based on some calculation of distance from the FFT to the 'current context'. If it isn't a 'near' association, it probably doesn't really matter to what you're doing, and so can be ignored.
And here's a summary of the results of the paper from ChatGPT, after I had a little Q&A with it: "Summarize the benefits the Mamba architecture has over the Transformers architecture when at similar scales."

The Mamba architecture offers several significant benefits over traditional Transformer architectures, particularly when both are at similar scales. Here's a summary of these advantages:
1) Higher Throughput and Efficiency: Mamba achieves a much higher throughput in both inference and training compared to Transformers. Specifically, it has been noted to have 5× higher throughput during inference and up to 40× faster efficiency in training operations. This increased efficiency is especially beneficial when dealing with large-scale models and data.
2) Linear Scaling with Sequence Length: Unlike Transformers, which have quadratic scaling with sequence length, Mamba scales linearly. This is a substantial advantage for processing long sequences, as it ensures more predictable and manageable growth in computational requirements and memory usage as sequence length increases.
3) Improved Generation Throughput: In tasks like language modeling, Mamba not only outperforms Transformers of the same size but also matches or even exceeds the performance of Transformers that are twice its size. This indicates higher efficiency and effectiveness of Mamba in generating outputs.
4) Effective Handling of Longer Sequences: Mamba is particularly adept at handling long sequences, outperforming Transformer models in tasks involving extended contexts. Its design allows it to focus on the most relevant parts of a sequence, enhancing its ability to generalize to much longer sequences than it was trained on.
5) Simplified Architecture: By omitting attention and MLP blocks, Mamba's architecture is more streamlined than that of traditional Transformers. This simplification contributes to its efficiency, especially in dealing with long sequences.
6) Hardware Optimization: Mamba's hardware-aware algorithm makes it more compatible with modern GPU architectures, leading to better performance on current hardware platforms. This optimization is crucial for achieving faster processing speeds and more efficient utilization of computational resources.
In summary, Mamba offers significant improvements over Transformers in terms of efficiency, scalability, and effectiveness, particularly at similar scales. Its innovations in architecture and design enable it to handle longer sequences more efficiently, making it a strong candidate for various applications in fields requiring efficient sequence modeling.
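To put rough numbers on the scaling argument in the ELI5 above, here is a tiny Python sketch comparing how an attention-style quadratic cost and a scan-style linear cost grow as the context gets longer. The cost functions are deliberately simplified stand-ins (constants and hardware effects are ignored); only the ratios between rows are meaningful.

```python
# Toy back-of-envelope numbers for the scaling argument above: attention cost grows
# quadratically with context length, while a recurrent/SSM-style scan grows linearly.
def attention_cost(L: int) -> int:
    return L * L          # every token attends to every other token

def ssm_scan_cost(L: int) -> int:
    return L              # one constant-size state update per token

base = 1_000
for scale in (1, 2, 4, 8):
    L = base * scale
    print(f"context {L:>6}: attention ~{attention_cost(L) / attention_cost(base):>4.0f}x, "
          f"scan ~{ssm_scan_cost(L) / ssm_scan_cost(base):>2.0f}x")
```

At 8x the base context, the quadratic model needs roughly 64x the work while the linear one needs roughly 8x, matching the figures in the comment above.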
@nartrab1
@nartrab1 9 месяцев назад
Thank you! This was excellent.
@alexander191297
@alexander191297 9 месяцев назад
I think this answer is wonderful… and can tell it’s ChatGPT generated 😅
@kevinaud6461
@kevinaud6461 9 месяцев назад
​@@christophkogler6220I think this was more of an "explain like I have a bachelor's in CS," but that's exactly what I needed 🙂 Thanks for writing it out
@christophkogler6220
@christophkogler6220 9 месяцев назад
@@alexander191297 Only the part after I mention ChatGPT :)
@fiery_transition
@fiery_transition 9 месяцев назад
As a person new to the field, I greatly appreciate the way you presented things here!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@Kobe29261
@Kobe29261 9 месяцев назад
This does it for my 'aspiration video' of the week.
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Great.
@TobiasWeg
@TobiasWeg 9 месяцев назад
Very interesting and well explained. Thanks a lot.
@freedom_aint_free
@freedom_aint_free 9 месяцев назад
Amazing work ! Keep 'em coming !
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks, will try!
@synapsomorphy
@synapsomorphy 9 месяцев назад
Very encouraging that they included the situation in which S6 did poorly! If there are no other catches this looks incredible!
@광광이-i9t
@광광이-i9t 8 месяцев назад
Thanks for your work !! It is really helpful to look through the related works 😮😮
@michaelparis6039
@michaelparis6039 9 месяцев назад
I'm only at 7:13, right after 'spicy'. Subscribed. Great format and amazing delivery!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@JazevoAudiosurf
@JazevoAudiosurf 9 месяцев назад
Tri Dao is one hell of a contributor
@XAheli
@XAheli 9 месяцев назад
Keep these coming! Great video.
@Jeremy-e7u5y
@Jeremy-e7u5y 9 месяцев назад
Thank you for bringing this to our attention, it has been really insightful
@JerryFederspiel
@JerryFederspiel 9 месяцев назад
Just as complex numbers work well for SSMs in audio, I can't help but wonder whether split-complex numbers would help SSM performance in language tasks (considering the hyperbolic flavor of split-complex numbers and the benefits of hyperbolic embeddings when encoding hierarchical data).
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
It certainly seems plausible. In my experience, while hyperbolic embeddings make strong intuitive sense for hierarchical data, I've never seen them yield significant gains (the kinds of works I am familiar with are of this flavour: arxiv.org/abs/2304.09172). If your experience has been different, I'd be curious to hear.
@JorgetePanete
@JorgetePanete 9 месяцев назад
Remember, the RWKV mentioned is the one from its paper, RWKV v4. There isn't yet a paper for v5 and v6, but v6 is similar to Mamba. Edit: it was updated today.
@JorgetePanete
@JorgetePanete 9 месяцев назад
How similar? Well, I don't know, check it at the repo.
@BlayneOliver
@BlayneOliver 9 месяцев назад
Would this help a regression-based transformer whose data is based on the stock market's price action? Or is it more for multi-media?
@KingPowa00
@KingPowa00 9 месяцев назад
What sources do you suggest for understanding the algebra and math behind these works? I really struggled to understand most of the concepts, though I have a fairly good grasp of the math behind transformers.
@raul36
@raul36 9 месяцев назад
First of all, I recommend 3Blue1Brown's linear algebra videos. Then, if you already have solid knowledge, I would recommend the book "Linear Algebra Done Right".
@MustafaAkben
@MustafaAkben 9 месяцев назад
Great review! Looking forward to playing with it soon :)
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@Robert_McGarry_Poems
@Robert_McGarry_Poems 9 месяцев назад
This is my first time watching your channel. Impressive walkthrough. When I first heard of Q* my imagination started to build a very similar architecture... I don't follow too much of the technical, but I saw how the sandwiched gates, shown in the video, could be used almost in an analogue fashion. This is brilliant! Watching this made me grin like crazy... This might not be zero memory, but dang if it isn't a huge step in that direction. Using local memory is genius. And that token interpretation length, yes... So... physically, I guess, in my mind the next step is to localize the memory to the operation even more, but it looks like in that architecture it's as local as it's going to get... What about something like... "Sample-and-hold," from actual analogue circuits? That might be something to think about.
@vga7714
@vga7714 9 месяцев назад
great summary and even better presenting voice.
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@Shnugs
@Shnugs 9 месяцев назад
When you stand back and squint your eyes at these papers they almost have a turbo encabulator quality to them.
@colejohnson2230
@colejohnson2230 9 месяцев назад
Lol, yeah. I noticed that most fields tend towards that as you get towards the bleeding edge. Sometimes I have to stop what I'm working on and just appreciate how it looks like nonsense to an outside viewer
@NoNTr1v1aL
@NoNTr1v1aL 9 месяцев назад
Amazing video! Subscribed.
@sup5356
@sup5356 9 месяцев назад
beautifully developed narrative
@h3techsme
@h3techsme 9 месяцев назад
This also begs the question of how the hardware-aware process fares when the memory between system and GPU is fully shared...
@EigenA
@EigenA 8 месяцев назад
Great video, thanks for sharing!
@matusstiller4219
@matusstiller4219 9 месяцев назад
This video reminds me of the fact that I do not understand mathematics🙃
@iamr0b0tx
@iamr0b0tx 9 месяцев назад
Thanks
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@dfparker2002
@dfparker2002 9 месяцев назад
How is Mamba similar to or different from multi-expert models? What is the minimum card spec (memory, CUDA, tensors, whatever) to run this model?
@luizpereira7165
@luizpereira7165 6 месяцев назад
Can you use the Mamba architecture in conjunction with BitNet b1.58?
@6lack5ushi
@6lack5ushi 9 месяцев назад
Is this not somewhat a proof of, or an addition to, Lee Cronin's Assembly Theory, if you can rebuild input u from the components of m?
@grimsk
@grimsk 9 месяцев назад
Feels like it's becoming more and more similar to physics... 🙂
@baab4229
@baab4229 9 месяцев назад
Idk man, I kinda like the shapeshifting sapient robots fighting over their home planet Cybertron, why would you wanna replace them
@TheApgreyd
@TheApgreyd 9 месяцев назад
Thx YouTube for the recommendations
@TheGreatestJuJu
@TheGreatestJuJu 9 месяцев назад
This makes so much sense. So obvious..
@Verrisin
@Verrisin 9 месяцев назад
Turning an image into a flattened sequence... I wonder if they are using space-filling curves, or just line by line? ... I wonder which "regularity" would be more useful? Or something else even? - To be fair, having no implicit notion of the "relative position of 2 pixels" (which I believe brains have) seems really expensive, if it then has to fully recover that structure from just a sequence of tokens...
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Yes - this is a good point. I think the reason flattening is performed without retaining 2d structure is precisely because it makes for a particularly challenging modelling task.
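A tiny sketch of the "line by line" flattening discussed here (the array contents are illustrative): a 2-D image becomes a 1-D token sequence via row-major ordering, which is exactly why recovering 2-D neighbourhood structure becomes part of the modelling challenge.

```python
# A 2-D image turned into a 1-D token sequence by row-major ("line by line") ordering,
# so any 2-D neighbourhood structure must be re-learned by the sequence model itself.
import numpy as np

H, W = 4, 4
image = np.arange(H * W).reshape(H, W)   # pretend pixel "tokens" 0..15

row_major = image.reshape(-1)            # line-by-line scan
print(row_major)                         # [ 0  1  2 ... 15]

# Two pixels that are vertical neighbours in 2-D end up W steps apart in the sequence:
print(abs(np.where(row_major == 1)[0][0] - np.where(row_major == 1 + W)[0][0]))  # -> 4
```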
@honeymak
@honeymak 9 месяцев назад
is it conversational? can it talk to itself or several instances?
@circulartext
@circulartext 9 месяцев назад
super cool work
@aron2922
@aron2922 9 месяцев назад
I think about 8 people followed what you were saying but I appreciate the effort
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@qwertyuiop-ux6jk
@qwertyuiop-ux6jk 9 месяцев назад
thanks for the video
@shyama5612
@shyama5612 9 месяцев назад
Is Gemini based on this? The logo spiral seems to look like the Legendre polynomial graph.
@s11-informationatyourservi44
@s11-informationatyourservi44 9 месяцев назад
can’t wait for a model named kobe to come out
@watcher8582
@watcher8582 9 месяцев назад
cool presentation
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@MemesnShet
@MemesnShet 9 месяцев назад
Since the big companies are building their LLMs on transformers with all those resources and time, I doubt they'd change unless the results were dramatically better. So Mamba, while impressive, doesn't seem to be it.
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@apidas
@apidas 7 месяцев назад
god, these kids really find the cure for cancer
@ReflectionOcean
@ReflectionOcean 8 месяцев назад
- Understand Mamba's significance by exploring its efficient state space model design and selective state mechanism (00:04).
- Review the scale issues with Transformers and the emergence of efficient alternatives like Mamba for long sequence modeling (00:31).
- Examine the HiPPO recurrent memory and its application in sequence modeling for improved performance (01:29).
- Recognize the role of kernel fusion, parallel scan, and recomputation techniques in Mamba's efficient memory usage (09:55).
- Consider the empirical results showcasing Mamba's high performance on various tasks, including long sequence modeling and DNA classification (13:02).
- Analyze the trade-offs in model design, noting how selection mechanisms can impact performance on different data modalities (15:27).
- Investigate the limitations of current empirical evaluations and the need to test Mamba on larger model sizes (15:43).
- Dive into the released GitHub code to experiment with the Mamba model firsthand (15:59).
@RudyMartinInvest
@RudyMartinInvest 9 месяцев назад
Thanks!
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@JohnViguerie
@JohnViguerie 6 месяцев назад
In the real world, LeCun's and Hinton's ideas haven't yet been optimized and deployed to scale in commerce... 😂 But it's fun to try and keep up
@peteroliver7975
@peteroliver7975 8 месяцев назад
I want to see this applied to reasoning tokens
@porting400
@porting400 9 месяцев назад
Great video
@zlatanmessi2095
@zlatanmessi2095 8 месяцев назад
Added to my playlist on AI
@KeepingUp_withAI
@KeepingUp_withAI 2 месяца назад
Here after Mistral's release of their Mamba code model 😄
@Sam-ri3hr
@Sam-ri3hr 9 месяцев назад
Good video Sam
@DamaKubu
@DamaKubu 9 месяцев назад
If you are interested in doing mechanistic interpretability on the Mamba model, hit me up with a DM. I'm thinking of writing something like Neel Nanda's TransformerLens for Mamba, or some lower-hanging fruit as a start.
@qwertasd7
@qwertasd7 9 месяцев назад
Any LLM using it?
@Adovid
@Adovid 9 месяцев назад
Transformers don't scale on long sequence operations because generative AI neural networks work better spreading attention over the parameters. We shall see if Mamba can do what it claims after a large model is doing inference.
@Kram1032
@Kram1032 9 месяцев назад
finally apparently near-infinite contexts!
@Sai_r2d2
@Sai_r2d2 8 месяцев назад
Lesssgo kobe ✨️
@ekstrajohn
@ekstrajohn 9 месяцев назад
If transformers scale pretty well, I can't think of a reason why Mamba wouldn't scale. At least off the top of my head. Let's see what happens!
@luismeron4506
@luismeron4506 9 месяцев назад
Kobe and Gigi 🏀8️⃣💛💜2️⃣4️⃣🖤
@stan-15
@stan-15 9 месяцев назад
Cool beans
@Oler-yx7xj
@Oler-yx7xj 9 месяцев назад
I'm so tired that I read this title literally and it took me some time to understand why it is probably not a video about using snakes in place of ChatGPT.
@garethjax
@garethjax 9 месяцев назад
that's enough math for a lifetime. Amazing.
@reinerwilhelms-tricarico344
@reinerwilhelms-tricarico344 8 месяцев назад
Interesting. But as usual it suffers from acronym overload.
@osbernperson
@osbernperson 9 месяцев назад
Aha yes, this are the OK! 👍 becas I is smart here to, and No can be maybi. Good! Do it Now!
@dhrumil5977
@dhrumil5977 9 месяцев назад
Whattttt 😵‍💫😵‍💫😵‍💫
@rkbiri5470
@rkbiri5470 9 месяцев назад
Need an ELI5 section 😅😂
@iTXS
@iTXS 9 месяцев назад
The machines now can get epilepsy lol
@flambr
@flambr 9 месяцев назад
in the uk, mamba is the nickname for a hard drug
@bootblacking
@bootblacking 7 месяцев назад
Why would a snake replace Transformers, it can't even turn into a truck
@derghiarrinde
@derghiarrinde 9 месяцев назад
Maybe you could better explain some sentences instead of just highlighting them and reading them aloud. I get that you want a shorter video, but sometimes you could speak to us like we're 10 years old. It would help with understanding. In the worst case, generate special versions using a GPT ("explain this passage to me as if I was 15") and just read that. Thanks.
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks for the feedback!
@supperenet9090
@supperenet9090 9 месяцев назад
No, it's a replacement for conda.
@jasonandrewismail2029
@jasonandrewismail2029 9 месяцев назад
superficial and misleading
@Cineenvenordquist
@Cineenvenordquist 9 месяцев назад
Remix it with your fixed leads. 🙏🏼
@xyh6552
@xyh6552 9 месяцев назад
The technique of solving long-term memory problems using polynomial projection is somewhat similar to using FFT for multiplication. Essentially, both methods use highly efficient information representations with almost orthogonal channel capacity to represent the original information.
@krox477
@krox477 8 месяцев назад
I don't understand anything
@johnsherfey3675
@johnsherfey3675 8 месяцев назад
Yeah, but only big math heads will actually ever fully understand it.
@pierrekinbrand
@pierrekinbrand 8 месяцев назад
Ironically many of the ML concepts in the video went over my head but this Fourier analogy was more approachable for me.
@astridwilde
@astridwilde 9 месяцев назад
great video
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Thanks!
@couldntfindafreename
@couldntfindafreename 9 месяцев назад
It's 100% certain that someone out there is already training a 7B+ Mamba model, most likely even bigger.
@circulartext
@circulartext 9 месяцев назад
true
@imded4014
@imded4014 9 месяцев назад
I can't be the only one who clicked on the video expecting the other Transformers...
@belzebubukas
@belzebubukas 8 месяцев назад
what
@memenga260
@memenga260 9 месяцев назад
I remember reading a paper on this in 2021. Why wasn't it adopted earlier? Paper link in the reply.
@memenga260
@memenga260 9 месяцев назад
drive.google.com/file/d/1-67LHZbCoDmzLWYp_4ZUXNzavcbGNMGa/view?usp=drivesdk
@SamuelAlbanie1
@SamuelAlbanie1 9 месяцев назад
Good find. I guess mamba is a popular name...
@Verrisin
@Verrisin 9 месяцев назад
"infinite" context length is effectively the main thing we needed. This is very exciting.