
xLSTM Explained in Detail!!! 

1littlecoder
83K subscribers
6K views

Published: 2 Oct 2024

Comments: 25
@ThePurpleShrine · 3 months ago
People aren't using Transformers only because they're very good for LLMs; more importantly, the concept has been a big leap forward for the field of ML. I liked the new xLSTM paper and also Mamba when it came out, but I think Transformers have also been so revolutionary because of how well they align with the hardware. Karpathy had a nice discussion about this at Stanford with Dzmitry Bahdanau (who introduced the attention mechanism in 2014), though that was before Mamba was fully open-sourced.
@fontenbleau · 3 months ago
With the new ASIC accelerator cards you'll be locked into using only Transformers, but there's no real choice: general-purpose GPUs are much more expensive and slower.
@ThePurpleShrine · 3 months ago
@@fontenbleau Traditional GPUs are best for training because of their internal architecture, gate logic, and how the gradients and weights need to propagate across layers. Recently, Etched launched an ASIC ("Sohu") with the Transformer logic built into the hardware itself. This is golden for inference until something new replaces Transformers. The point I was getting at is that Transformers, although initially introduced as a good idea for language translation, turned out to be such a novel architecture that they became the base/best solution for a multitude of ML problems (with variations, of course). Therefore, it makes sense to build ASICs for Transformers given their current use cases and popularity (it's a money-grab gold mine at the moment). I would love to see a new architecture that is novel and can be the base for multiple fields of ML like Transformers, and I like that the community is super active in trying to get there. It's exciting to see these new papers (including xLSTM, Mamba/state-space models, etc.).
@ilianos · 3 months ago
🎯 Key points for quick navigation:
00:00 📝 Introduction to xLSTM and Max Beck — introduction to Max Beck and the xLSTM paper; xLSTM as an alternative to Transformers; overview of the discussion structure.
00:41 🔍 Historical Context of LSTM and Transformers — review of LSTM's performance before 2017; introduction of Transformers in 2017 and their advantages; developments in language models like GPT-2 and GPT-3.
03:08 🚀 Limitations of Transformers — drawbacks of the self-attention mechanism; issues with scaling in sequence length and GPU memory requirements; efforts to create more efficient architectures.
04:15 ⚙️ Revisiting LSTM with Modern Techniques — combining old LSTM ideas with modern techniques; overview of the original LSTM's memory cell updates and gate functions; limitations of LSTMs in tasks like nearest-neighbor search.
07:39 📈 Overcoming LSTM Limitations — how xLSTM overcomes the inability to revise storage decisions; introduction of exponential gating to improve LSTM performance; comparison of LSTM, xLSTM, and Transformer on specific tasks.
11:00 🧠 Enhancing Memory Capacity and Efficiency — addressing LSTM's limited storage capacity and parallelization issues; introduction of a large matrix memory in xLSTM; methods to improve training efficiency through new variants.
12:25 🔑 Core of xLSTM: Exponential Gating — detailed explanation of the exponential gating mechanism; introduction of new memory cell states and stabilization techniques (a short sketch follows after this summary); comparison with the original LSTM gating.
16:00 🧮 New xLSTM Variants: sLSTM and mLSTM — sLSTM with a scalar cell state and new memory mixing; mLSTM with a matrix memory cell state and a covariance update rule; differences between the two variants in memory mechanisms and parallel training.
20:57 🔍 Performance Comparison and Evaluation — evaluation of xLSTM on language experiments; comparison with other models across datasets and parameter sizes; xLSTM's superior length extrapolation and perplexity.
25:59 📊 Scaling xLSTM and Future Plans — scaling xLSTM models shows favorable performance; plans to build larger models (7 billion parameters) and write efficient kernels; potential applications and further exploration of xLSTM's capabilities.
27:37 🤔 Motivation for LSTM over Transformers — inefficiencies of Transformer models for text generation; benefits of LSTM's fixed state size for more efficient generation on edge devices; encouragement to explore recurrent alternatives to Transformers.
29:05 🎓 Research Directions and Advice — potential of recurrent alternatives for language modeling; advice for aspiring researchers to focus on making language models more efficient; mention of Yann LeCun's advice to explore beyond Transformers.
29:59 🏢 Industry Adoption and Future Trends — observations on the adoption of models like Mamba in industry; expectations of similar trends for xLSTM; mention of a company working on scaling xLSTM for practical applications.
30:50 🌐 Convincing the Industry to Switch from Transformers — challenges in shifting industry focus from Transformers to alternative architectures; need to demonstrate xLSTM's efficiency and performance to gain acceptance; importance of open-sourcing efficient kernels to facilitate adoption.
Made with HARPA AI
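For the 12:25 point, here is a minimal NumPy sketch of the stabilized exponential gating idea: a scalar sLSTM-style update in which the output gate is omitted, the exponential forget gate is one of the two options the paper allows, and the toy pre-activation values are made up purely for illustration.

# Minimal sketch of stabilized exponential gating (sLSTM-style scalar update).
# i = exp(i_tilde) and f = exp(f_tilde) can explode, so a running-max state m
# keeps the exponents bounded; m cancels out in the normalized output c / n.
import numpy as np

def slstm_step_stabilized(c, n, m, i_tilde, f_tilde, z):
    """One cell update. c: cell state, n: normalizer, m: stabilizer,
    i_tilde/f_tilde: gate pre-activations, z: cell input."""
    log_f = f_tilde                            # exponential forget gate (a sigmoid is also allowed)
    m_new = np.maximum(log_f + m, i_tilde)     # running max of the log-gates
    i_prime = np.exp(i_tilde - m_new)          # stabilized input gate
    f_prime = np.exp(log_f + m - m_new)        # stabilized forget gate
    c_new = f_prime * c + i_prime * z
    n_new = f_prime * n + i_prime
    h = c_new / n_new                          # normalized output (output gate omitted)
    return c_new, n_new, m_new, h

# Toy run with large pre-activations that would overflow a naive exp() in float32.
c, n, m = 0.0, 0.0, 0.0
for i_t, f_t, z_t in [(50.0, 40.0, 0.5), (60.0, 55.0, -0.2), (90.0, 80.0, 1.0)]:
    c, n, m, h = slstm_step_stabilized(c, n, m, i_t, f_t, z_t)
    print(h)                                   # stays finite at every step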
@po-yupaulchen166 · 29 days ago
@1littlecoder Thank you. I have a question about parallel training for mLSTM. The cell-state update is formulated as C_t = f_t * C_{t-1} + i_t * v_t * k_t^T (Eq. (19) in the paper). How can C_0, ..., C_t be obtained in parallel during training? I understand that the f_t's, i_t's, v_t's, and k_t's can be computed in parallel, but for the C_t's it doesn't seem trivial. For example, initializing C_0:
C_1 = f_1 * C_0 + i_1 * v_1 * k_1^T
C_2 = f_2 * C_1 + i_2 * v_2 * k_2^T = f_2 * (f_1 * C_0 + i_1 * v_1 * k_1^T) + i_2 * v_2 * k_2^T
Can C_0, ..., C_t be computed in parallel?
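For reference, the recurrence above has a closed-form unrolling, C_t = (prod_{s<=t} f_s) * C_0 + sum_{j<=t} (prod_{j<s<=t} f_s) * i_j * v_j k_j^T, so every C_t depends only on cumulative products of the forget gates and a masked sum of outer products, both of which can be computed without a sequential loop. Below is a rough NumPy sketch of that idea only (scalar gates, no stabilization or normalizer state; this is not the paper's actual parallel kernel).

# Rough sketch: unrolling C_t = f_t * C_{t-1} + i_t * v_t k_t^T into a parallel form
#   C_t = (prod_{s<=t} f_s) * C_0 + sum_{j<=t} (prod_{j<s<=t} f_s) * i_j * v_j k_j^T
import numpy as np

T, d = 5, 4
rng = np.random.default_rng(0)
f = rng.uniform(0.8, 1.0, size=T)      # forget gates f_1..f_T
i = rng.uniform(0.0, 1.0, size=T)      # input gates  i_1..i_T
v = rng.normal(size=(T, d))            # values v_1..v_T
k = rng.normal(size=(T, d))            # keys   k_1..k_T
C0 = np.zeros((d, d))

# Sequential reference (the recurrent form).
C_seq, C = [], C0.copy()
for t in range(T):
    C = f[t] * C + i[t] * np.outer(v[t], k[t])
    C_seq.append(C.copy())
C_seq = np.stack(C_seq)

# Parallel form: no loop over time below this line.
F = np.cumprod(f)                              # F[t] = f_1 * ... * f_t
decay = np.tril(F[:, None] / F[None, :])       # decay[t, j] = prod_{j<s<=t} f_s
C_par = np.einsum('tj,jd,je->tde', decay * i[None, :], v, k) + F[:, None, None] * C0

print(np.allclose(C_seq, C_par))               # True
# Caveat: the ratio F[t] / F[j] under/overflows for long sequences; the paper
# stabilizes the gates in log space before forming such a decay matrix.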
@Kutsushita_yukino · 3 months ago
Thanks, my left ear is satisfied.
@1littlecoder · 3 months ago
Does it not have sound in both ears?
@KevinKreger · 3 months ago
@@1littlecoder just Max, not you.
@MaJetiGizzle · 3 months ago
@@1littlecoder It's only Max talking in my left ear when wearing headphones.
@1littlecoder · 3 months ago
I'm so sorry, it's probably my mistake! I didn't listen with both sides of the headphones, otherwise I could've avoided this!
@PankajDoharey · 3 months ago
@@1littlecoder Can you reupload with stereo audio?
@knutjagersberg381 · 3 months ago
Great pokemon catch!
@1littlecoder · 3 months ago
Thank you, I was glad to get Max's time!
@klauszinser · 2 months ago
It would be interesting to take a small Transformer model and build an xLSTM in the same hardware environment to compare how the two behave.
@volkerlorrmann1713 · 2 months ago
Wow Max 🔥
@test2109-wk1zq · 3 months ago
why the re-up?
@1littlecoder · 3 months ago
Tried something!
@PankajDoharey · 3 months ago
Mono Audio.
@1littlecoder · 3 months ago
Yeah, my bad. I didn't listen with headphones, only through the computer speakers, so I didn't realize it was mono.
@Macorelppa · 3 months ago
😋
@1littlecoder · 3 months ago
🙏🏾