
Why Transformer over Recurrent Neural Networks 

CodeEmporium
121K subscribers · 75K views

#transformers #machinelearning #chatgpt #gpt #deeplearning

Published: Jan 11, 2023

Comments: 55
@IshtiaqueAman · 9 months ago
That's not the main reason. RNNs keep adding the embeddings together and hence override information that came before, whereas in the case of a transformer the embeddings are there all the time and attention can pick the ones that are important.
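A minimal numpy sketch of the contrast described above (illustrative only; the weights, sizes, and variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # embedding size
tokens = rng.normal(size=(5, d))       # 5 token embeddings

# RNN view: each step folds the new embedding into the same hidden vector,
# so earlier information is gradually overwritten.
W_h, W_x = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for x in tokens:
    h = np.tanh(W_h @ h + W_x @ x)     # only `h` survives to the next step

# Attention view: a query can look back at *all* stored embeddings and pick
# out the relevant ones with softmax weights.
query = tokens[-1]
scores = tokens @ query / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum()
context = weights @ tokens             # weighted mix of every past embedding
print(h.shape, context.shape)          # both (8,), but built very differently
```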
@NoahElRhandour · 1 year ago
That was a great video! I find learning about such things generally easier and more interesting when they are compared to other models/ideas that are similar but not identical.
@CodeEmporium · 1 year ago
Thank you for the kind words. And yep, agreed 👍🏽
@NoahElRhandour · 1 year ago
@CodeEmporium I guess, just like CLIP, our brains perform contrastive learning as well xd
@brianprzezdziecki · 1 year ago
YouTube, recommend me more videos like this plz
@GregHogg · 1 year ago
This is a great video!!
@CodeEmporium · 1 year ago
Thanks a lot Greg. I try :)
@schillaci5590 · 1 year ago
This answered a question I didn't have. Thanks!
@CodeEmporium · 1 year ago
Always glad to help when not needed!
@IgorAherne · 1 year ago
I think LSTMs are more tuned towards keeping the order because, although transformers can assemble embeddings from various tokens, they don't know what follows what in a sentence. But perhaps with relative positional encoding they might be equipped just enough to understand the order of sequential input.
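For reference, a rough sketch of the standard sinusoidal positional encoding from "Attention Is All You Need" (absolute rather than relative, but it shows how order information can be injected, since attention itself is order-blind; sizes here are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even embedding dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # sine on even dims
    pe[:, 1::2] = np.cos(angles)                  # cosine on odd dims
    return pe

embeddings = np.random.randn(10, 16)              # 10 tokens, d_model = 16
x = embeddings + positional_encoding(10, 16)      # token order is now encoded
```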
@evanshlom1 · 10 months ago
Your comment came right before GPT blew up, so maybe you wouldn't say this anymore?
@borregoayudando1481 · 1 year ago
I would like to have a more skeleton-up or foundation-up understanding (to better understand the top-down representation of the transformer). Where should I start, linear algebra?
@sandraviknander7898 · 6 months ago
An important caveat is that transformer decoders and GPT models are trained autoregressively, with no context of the words coming after.
@sreedharsn-xw9yi · 6 months ago
Yeah, its masked multi-head attention only attends left-to-right, right?
@free_thinker4958 · 4 months ago
@sreedharsn-xw9yi Yes, that's decoder-only transformers such as GPT-3.5, for example, and any text generation model.
@aron2922 · 1 year ago
You should have put LSTMs as a middle step
@CodeEmporium · 1 year ago
Good call. I just bundled them with Recurrent Neural Networks here.
@kenichisegawa3981 · 6 months ago
This is the best explanation of RNN vs Transformer I've ever seen. Is there a similar video like this for self-attention, by any chance? Thank you!
@CodeEmporium · 6 months ago
Thanks so much for the kind words. There is a full video on self-attention on the channel. Check out the first video in the playlist "Transformers from scratch".
@alfredwindslow1894 · 6 months ago
Don't transformer models generate one token at a time? It's just that they're faster, as calculations can be done in parallel.
@nomecriativo8433 · 5 months ago
Transformers aren't only used for text generation. But in the case of text generation, the model internally predicts the next token for every token in the sentence. E.g., the model is trained to map the input "This is an example phrase" to the shifted target "is an example phrase", so the training requires a single pass. Text generation models also have a causal mask: tokens can only attend to the tokens that come before them, so the network doesn't cheat during training. During inference, only one token is generated at a time, indeed. If I'm not mistaken, there's an optimization to avoid recalculating the previously computed tokens.
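A small illustrative sketch of that idea, assuming a toy scaled dot-product attention with a causal mask and shifted next-token targets (all names and sizes are invented for the example):

```python
import numpy as np

def causal_attention(Q, K, V):
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores[mask] = -np.inf                             # block attention to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

tokens = ["This", "is", "an", "example", "phrase"]
inputs, targets = tokens[:-1], tokens[1:]              # shifted by one: next-token targets
print(list(zip(inputs, targets)))
X = np.random.randn(len(inputs), 8)                    # pretend embeddings
out = causal_attention(X, X, X)                        # every position trained in one pass
```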
@ccreutzig · 4 months ago
Not all transformers use a causal mask. Encoder models like BERT usually don't - it would break the usefulness of the [CLS] token, for starters.
@free_thinker4958 · 4 months ago
The main reason is that RNNs have what we call the exploding and vanishing gradient problem.
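A toy numpy illustration of that effect: backpropagation through time multiplies by the recurrent Jacobian once per step, so its scale compounds over the sequence (the matrices here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 16, 100
for scale in (0.5, 1.5):                    # spectral radius below / above 1
    W = rng.normal(size=(d, d))
    W *= scale / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius
    grad = np.ones(d)
    for _ in range(steps):
        grad = W.T @ grad                   # one BPTT step (nonlinearity ignored)
    print(scale, np.linalg.norm(grad))      # ~0 for 0.5, huge for 1.5
```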
@UnderstandingCode · 1 year ago
Ty
@jackrayner1263 · 7 months ago
Does a decoder model share these same advantages? Without the attention mapping, wouldn't it be operating with the same context as an RNN?
@vtrandal · 1 year ago
Fantastic!
@CodeEmporium · 1 year ago
Thanks so much again :)
@jugsma6676 · 1 year ago
Can you do a Fourier Transform replacing the attention head?
@iro4201 · 1 year ago
What?
@user-vm7we6bm7x · 1 year ago
Fourier Transform?
@vastabyss6496 · 8 months ago
What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and their variations like LSTMs and GRUs are better, since each image is most closely related to the images coming directly before and after it?
@-p2349 · 7 months ago
This is done by GANs, or generative adversarial networks. This would have two CNNs: one is a "discriminator" network and the other a "generator" network.
@vastabyss6496 · 7 months ago
@-p2349 I thought that GANs could only generate an image that was similar to those in the dataset (such as a dataset containing faces). Also, how would a GAN deal with the sequential nature of videos?
@ccreutzig · 4 months ago
There is ViT (Vision Transformer), although that predicts parts of an image, and I've seen at least one example of ViT feeding into a Longformer network for video input. But I have no experience using it. GANs are not the answer to what I read in your question.
@manikantabandla3923 · 1 year ago
But there is also a version of RNN with attention.
@gpt-jcommentbot4759 · 11 months ago
These RNNs are still worse than Transformers. However, there have been Transformer + LSTM combinations. Such neural networks have the theoretical potential to create chatbots with extremely long-term context, far beyond 4000 tokens, due to their recurrent nature.
@wissalmasmoudi3780 · 9 months ago
I need your help with my NARX neural network, please.
@sreedharsn-xw9yi · 6 months ago
How can we relate this to the masked multi-head attention concept of transformers? This video seems to conflict with that. Any expert ideas here, please?
@TheScott10012 · 10 months ago
I respect the craft! Also, pick up a pop filter
@CodeEmporium · 9 months ago
I have p-p-p-predilection for p-p-plosives
@Laszer271 · 10 months ago
What I'm wondering is: why do all APIs charge you credits for input tokens for transformers? To me, it shouldn't make a difference whether a transformer takes 20 tokens as input or 1000 (as long as it's within its maximum context length). Isn't it the case that the transformer always pads the input to its maximum context length anyway?
@ccreutzig · 4 months ago
No, the attention layers usually take a padding mask into account and can use smaller matrices. It just makes the implementation a bit more involved. The actual cost should be roughly quadratic in your input size, but that's probably not something the marketing department would accept.
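A rough sketch of what this reply describes, assuming a toy attention function with a key-padding mask: the matrices are sized to the actual input length, so the score matrix grows roughly quadratically with it (all names here are illustrative, not any particular API):

```python
import numpy as np

def masked_attention(Q, K, V, key_is_pad):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (T, T), T = actual length
    scores[:, key_is_pad] = -np.inf                    # never attend to padded keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

for T in (20, 1000):                                   # 20 vs 1000 real input tokens
    X = np.random.randn(T, 64)
    pad = np.zeros(T, dtype=bool)
    pad[-2:] = True                                    # pretend the last 2 positions are padding
    _ = masked_attention(X, X, X, pad)
    print(T, "score matrix entries:", T * T)           # 400 vs 1,000,000
```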
@Userforeverneverever · 1 month ago
For the algo
@FluffyAnvil · 4 months ago
This video is 90% wrong…
@ccreutzig · 4 months ago
But presented confidently and getting praise. Reminds me of ChatGPT. 😂
@kvlnnguyieb9522 · 5 months ago
How about the new SSM in Mamba? Mamba is said to be better than the transformer.
@cate9541 · 9 months ago
cool
@CodeEmporium · 9 months ago
Many thanks :)
@AshKetchumAllNow · 6 months ago
No model understands
@cxsey8587 · 11 months ago
Do LSTMs have any advantage over transformers?
@gpt-jcommentbot4759 · 11 months ago
They work better with less text data, and they also work better as decoders. While LSTMs don't have many advantages, future iterations of RNNs could learn far longer-term dependencies than Transformers. I think LSTMs are more biologically accurate than Transformers, since they incorporate time and are not layered like conventional networks but are instead theoretically capable of simple topological structures. However, there have been "recurrent Transformers", which are basically Long Short-Term Memory + Transformers: the architecture is literally a transformer layer turned into a recurrent cell, along with gates inspired by the LSTM.
@iro4201 · 1 year ago
I understand and do not understand.
@keithwhite2986 · 3 months ago
Quantum learning, hopefully with an increasing probability towards understanding.
@iro4201 · 3 months ago
No risk, no reward... @keithwhite2986
@sijoguntayo2282 · 10 months ago
Great video! In addition to this, RNNs, due to their sequential nature, are unable to take advantage of transfer learning. Transformers do not have this limitation.