Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)

Yannic Kilcher

262K subscribers

56K views

Published: 13 Sep 2024

Comments: 128

@YannicKilcher 3 years ago

OUTLINE:
0:00 - Intro & Overview
2:20 - Built-In assumptions of Computer Vision Models
5:10 - The Quadratic Bottleneck of Transformers
8:00 - Cross-Attention in Transformers
10:45 - The Perceiver Model Architecture & Learned Queries
20:05 - Positional Encodings via Fourier Features
23:25 - Experimental Results & Attention Maps
29:05 - Comments & Conclusion

@mgostIH 3 years ago

This approach is so elegant! Unironically Schmidhuber was right that the more something looks like an LSTM the better 😆

@reesejammie8821 3 years ago

I always thought the human brain is a recurrent neural network with a big hidden state and being constantly fed data from the environment.

@6lack5ushi 3 years ago

Powerful!!!

@Gorulabro 3 years ago

Your videos are a joy to watch. Nothing I do in my spare time is so useful!

@jamiekawabata7101 3 years ago

The scissors scene is wonderful!

@RS-cz8kt 3 years ago

Stumbled upon your channel a couple of days ago, watched a dozen videos since then, amazing work, thanks!

@srikanthpolisetty7476 3 years ago

Congratulations. I'm so glad this channel is growing so well, great to see a channel get the recognition they deserve. Can't wait to see where this channel goes from here.

@bardfamebuy 3 years ago

I love how you did the cutting in front of a green screen and didn't even bother editing it out.

@sanzharbakhtiyarov4044 3 years ago

Thanks a lot for the review Yannic! Great work

@emilianpostolache545 3 years ago

27:30 - Kant is all you need

@silvercat4 3 years ago

underrated comment

@robboswell3943 1 year ago

Excellent video! A critical question: How exactly are the learned latent arrays being learned? Is there some kind of algorithm used to create the learned latent array by reducing the dimensions of the input "byte array"? They never really go into detail about the exact process they used to do this in the paper. Surprisingly, no online sources on this paper that I have found speak about the exact process either. On pg. 3, it does state, "The model can also be seen as performing a fully end-to-end clustering of the inputs with latent positions as cluster centres..." But this is a pretty generic explanation. Could you please provide a short explanation of the process they used?

@maxdoner4528 3 years ago

Good job! It's pretty great to have these topics explained by someone other than the authors. Keep it up!

@timdernedde993 3 years ago

Hey Yannic, great video as usual :) If you want some feedback, I feel like you could have covered the results a bit more. I do think the methodology is of course much more important, but it helps to have a bit of an overview of how well it performs on which tasks. Maybe give it a few more minutes in the results section next time. But anyway, I still enjoyed the video greatly. Keep up the great work!

@CristianGarcia 3 years ago

This is VERY nice! I'd love to give it a spin on a toy dataset. 😍 BTW: Many transformer patterns can be found in the Set Transformers paper, the learned query reduction strategy is termed Pooling by Attention.

@justindaniels863 1 year ago

unexpected combination of humour and intelligence!

@JTedam 2 years ago

this helps a lot to make research accessible

@Coolguydudeness1234 3 years ago

I lost it when you cut the piece of paper 😂

@cptechno 3 years ago

Yes, I like this type of content. Keep up the good work. Bringing this material to our attention is a prime service. You might consider creating an AI.tv commercial channel. I'll join.

@Daniel-ih4zh 3 years ago

Things are going so fast in the last year or two.

@ssssssstssssssss 3 years ago

I disagree... There haven't really been many major innovations in machine learning in the past two years.

@ruroruro 3 years ago

Yeah, the attention maps look really really suspicious. Almost like the network only attends to the Fourier features after the first layer. Also, the whole idea that they are feeding the same unprocessed image into the network multiple times seems really weird. The keys should basically be a linear combination of r,g,b and the same Fourier features each time. How much information can you realistically extract from an image just by attending to the low-level color and positional information? I would have expected them to at least use a simple ResNet or FPN alongside the "thin" attention branch thingy.

@reesejammie8821 3 years ago

Couldn't agree more. It's like the attention maps are far from being content-based. Also agree on the features being too low level, what does it even mean to attend to raw pixels?

@Ronschk 3 years ago

Really nice idea. I wonder how much improvement it would bring if the incoming data were converted through a "sense". Our brain also doesn't receive images directly, but instead receives signals from our eyes which transform the input image (and use something akin to convolutions?). So you would have this as a generic compute structure, but depending on the modality you would have a converter. I think they had something like this in the "one model to rule them all" paper or so...

@HuyNguyen-rb4py 2 years ago

so touching for an excellent video

@AbgezocktXD 3 years ago

One day you will stop explaining how transformers work and I will be completely lost

@herp_derpingson 3 years ago

17:30 Since you already bought a green screen, maybe next time put Mars or the Apollo landing in the background. Or a large cheesecake. That's good too. All in all: one architecture to rule them all.

@YannicKilcher 3 years ago

Great suggestion :D

@jonathandoucette3158 3 years ago

Fantastic video, as always! Around 20:05 you describe transformers as invariant to permutations, but I believe they're more accurately equivariant, no? I.e. permuting the input permutes the output in exactly the same way, as opposed to permuting the input leading to the exact same output. Similar to convolutions being equivariant w.r.t. position

@mgostIH 3 years ago

You could say those terms are just equivariant to mistakes!

@ruroruro 3 years ago

Transformers are invariant to key+value permutations and equivariant to query permutations. The reason they are invariant to k+v permutations is that for each query all the values get summed together and the weights depend only on the keys. So if you permute the keys and the values in the same way, you still get the same weights and the sum is still the same.

@jonathandoucette3158 3 years ago

@ruroruro Ahh, thanks for the clarification! In my head I was thinking only of self-attention layers, which based on your explanation would indeed be permutation equivariant. But cross-attention layers are more subtle: queries equivariant, keys/values invariant (if they are permuted in the same way).
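
The permutation behaviour discussed in this thread can be checked numerically. The following is a minimal NumPy sketch, assuming plain scaled dot-product attention with no learned projections and no positional encodings; the helper names `softmax` and `attention` and all sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (N, d), k is (M, d), v is (M, d_v).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 queries
k = rng.normal(size=(10, 8))   # 10 keys
v = rng.normal(size=(10, 8))   # 10 values
out = attention(q, k, v)

# Permuting keys and values together leaves the output unchanged (invariance).
perm_kv = rng.permutation(10)
kv_invariant = np.allclose(out, attention(q, k[perm_kv], v[perm_kv]))

# Permuting queries permutes the output rows the same way (equivariance).
perm_q = rng.permutation(4)
q_equivariant = np.allclose(out[perm_q], attention(q[perm_q], k, v))

# Self-attention: the same array plays all three roles, so it is equivariant.
x = rng.normal(size=(6, 8))
perm_x = rng.permutation(6)
xp = x[perm_x]
self_equivariant = np.allclose(attention(x, x, x)[perm_x], attention(xp, xp, xp))

print(kv_invariant, q_equivariant, self_equivariant)
```

Row-wise linear projections for queries, keys and values would not change these properties, since they act on each element independently; positional encodings concatenated to the inputs travel with their elements under the permutation.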

@anonymouse2884 2 years ago

I believe that it is permutation invariant: since you are doing a weighted sum of the inputs/context, you should "roughly" get the same results even if you permute the inputs (the positional encoder might encode different time indices slightly differently, but this should not matter a lot).

@amirfru 3 years ago

This is incredibly similar to TabNet, but with the attentive blocks changed to attention layers!

@emmanuellagarde2212 3 years ago

If the attention maps for layers >2 are not image specific, then this echoes the results of the paper "Pretrained Transformers as Universal Computation Engines" which suggests that there is a universal mode of operation for processing "natural" data

@hugovaillaud5102 3 years ago

Is this architecture slower than a resnet with a comparable amount of parameters due to the fact that it is somehow recurrent? Great video, you explain things so clearly!

@patf9770 3 years ago

Something I just noticed about the attention maps: they seem to reflect something about the positional encodings? It looks like the model processes images hierarchically, globally at first and with a progressively finer tooth comb. My understanding is that CNNs tend to have a bias towards local textural information so it'd be really cool if an attention model learned to process images more intuitively

@L9X 2 years ago

Could this perhaps be used to model incredibly long distance relationships, i.e. incredibly long term memory? As in, the latent query vector (I'll just call it Q from here) becomes the memory. Perhaps we start off with a randomly initialised latent Q_0 and input KV_0 - let's say the first message sent by a user - to the perceiver which produces latent output Q_1, and we then feed Q_1 back into the perceiver with the next message sent by the user KV_1 as an input and get output Q_2 from the perceiver and so on. Then at every step we take Q_n and feed that to some small typical generative transformer decoder to produce a response to the user's message. This differs from typical conversational models, such as those using GPT-whatever, because they feed the entire conversation back into the model as input, and since the model has a constant size input, the older messages get truncated as enough new messages are given, which means the older memories get totally lost. Could this be a viable idea? We could have M >> N which means we have more memory than input length, but if we keep M on the order of a thousand that gives us 1000 'units' of memory that retain only the most important information.
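
The loop proposed in this comment can be sketched in a few lines. This is only an illustration of the commenter's idea, not something from the paper: it assumes a single cross-attention step with a residual update and no learned weights, and the names `cross_attend`, `memory`, `num_slots` and the random "messages" are all hypothetical stand-ins.

```python
import numpy as np

def cross_attend(latent, inputs):
    # The latent rows act as queries; the input rows act as keys and values.
    scores = latent @ inputs.T / np.sqrt(latent.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Residual update so earlier content is not simply overwritten.
    return latent + weights @ inputs

rng = np.random.default_rng(0)
num_slots, dim = 32, 16
memory = rng.normal(size=(num_slots, dim))    # Q_0: randomly initialised latent

for length in (5, 40, 12):                    # messages of different lengths
    message = rng.normal(size=(length, dim))  # KV_t: stand-in for embedded tokens
    memory = cross_attend(memory, message)    # Q_{t+1}

print(memory.shape)
```

The memory keeps a fixed shape however many messages are consumed, which is the property the comment relies on. Whether such a memory would actually retain the important information is an open question that this sketch does not address.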

@ibrahimaba8966 2 years ago

17:28 best way to solve the quadratic bottleneck 😄!

@petrroll 3 years ago

There's one thing I don't quite understand. How does this model do low-level feature capture / how does it retain the information? I.e. how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features? The reason I don't quite understand it is that the amount of information that flows between the first and second layer of this and e.g. the first and second module of ResNet is quite drastically different. In this case it's essentially N*D which I suppose is way smaller than M* (not M because there's some pooling even in the first section of ResNet, but still close) in case of ResNet, simply on the account of N

@TheCreativeautomaton 3 years ago

Hey, thanks for doing this. I very much like the direction of transformers in ML. I'm newer to NLP and looking at where the direction of ML might go next. Once again, thanks.

@marat61 3 years ago

I believe there is an error in the paper (23:07): Q must be MxC, not MxD, otherwise QK.transpose() will be impossible.

@piratepartyftw 3 years ago

Very cool. I wonder if it works when you feed in multimodal data (e.g. both image and text in the same byte array).

@galchinsky 3 years ago

Proper positional encodings should somehow work

@henridehaybe525 3 years ago

It would be nice to see how the Perceiver would perform when the KV of the cross-attentions are not the raw image at each "attend" but the feature maps of a pretrained ResNet. E.g. the first "attend" KV are the raw image, the second KV is the feature maps of the second ResNet output, and so on. A pretrained ResNet would do the trick but it could technically be feasible to train it concurrently. It would be a Parallel-Piped Convolutional-Perceiver model.

@azimgivron1823 3 years ago

Are the query dimension and the latent array in figure 1 of the same dimensions? It is written that Q belongs to the space of real matrices of dimensions MxD, which does not make sense to me. I believe they meant NxD where D=C, since you need to do a dot product to compute the cross-attention between the query Q and the keys K ==> Q.Kt with Kt being the transpose of K, so it implies that the dimensions D and C are equal, doesn't it? I am kind of disappointed by the paper because this is the core of what they want to show and they do not make the effort to dive into the math and explain this clearly.
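
The shape question raised here and in the earlier comment about 23:07 can be worked through concretely. The sketch below assumes the usual linear query/key/value projections of a transformer; under that assumption the latent width D and the input width C do not need to match, because both are projected to a shared channel size first. All sizes and the names `W_q`, `W_k`, `W_v`, `d_qk`, `d_v` are illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, C = 1000, 67      # input ("byte array"): M elements with C channels
N, D = 16, 32        # latent array: N elements with D channels, N << M
d_qk, d_v = 24, 32   # shared query/key width and value width (assumed)

byte_array = rng.normal(size=(M, C))
latent = rng.normal(size=(N, D))

# Separate projections map D and C to the same width d_qk.
W_q = rng.normal(size=(D, d_qk)) / np.sqrt(D)
W_k = rng.normal(size=(C, d_qk)) / np.sqrt(C)
W_v = rng.normal(size=(C, d_v)) / np.sqrt(C)

Q = latent @ W_q       # (N, d_qk)
K = byte_array @ W_k   # (M, d_qk)
V = byte_array @ W_v   # (M, d_v)

scores = Q @ K.T / np.sqrt(d_qk)   # (N, M): one row of weights per latent
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V               # (N, d_v)

print(Q.shape, K.shape, scores.shape, output.shape)
```

With these projections, the product Q K^T is well defined for any D and C, and the attention map has N rows and M columns rather than M by M.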

@simonstrandgaard5503 3 years ago

Excellent walkthrough

@TheGreatBlackBird 3 years ago

I was very confused until the visual demonstration.

@MsFearco 3 years ago

I just finished this; it's an extremely interesting paper. Please review the Swin Transformer next. It's even more interesting :)

@pvlr1788 2 years ago

Thanks for the video! But I can't understand where the first latent array comes from...

@swoletech5958 3 years ago

PointNet++ from 2017 outperformed the Perceiver on image point clouds: 91.9 accuracy versus 85.7. See 27:19.

@neworldemancer 3 years ago

Thanks for the video, Yannic! I would imagine that the attention "lines" at 27:00 could indeed be static, but the alternative is that they are input-dependent yet too overfitted to FF, as these lines are a clear artefact.

@gz6963 1 year ago

4:10 Is this related to the puzzles we have to solve with Google Captcha? Things like "select all the squares containing a boat"

@jonatan01i 3 years ago

2:44 "And the image is of not a cat!, a house! What did you think??!.." I thought nothing; my mind was empty :(

@NextFuckingLevel 3 years ago

:( I feel you

@48956l 2 years ago

thank you for that wonderful demonstration with the piece of paper lol

@dr.mikeybee 3 years ago

Even with my limited understanding, this looks like a big game changer.

@NilabhraRoyChowdhury 3 years ago

What's interesting is that the model performs better with weight sharing.

@aday7475 2 years ago

Any chance we can get a compare and contrast between Perceiver, Perceiver IO, and Perceiver AR?

@yassineabbahaddou4369 2 years ago

Why did they use a GPT-2 architecture in the latent transformer instead of a BERT architecture?

@hanstaeubler 3 years ago

It would also be interesting to 'interpret' this model or algorithm on the music level as well (I compose music myself for my pleasure)? Thanks in any case for the good interpretation of this AI work!

@bender2752 3 years ago

Great video! Consider making a video about DCTransformer maybe? 😊

@peterszilvasi752 2 years ago

17:07 - The visual demonstration of how the quadratic bottleneck is solved was a true "Explain Like I'm Five" moment. 😀

@Kram1032 3 years ago

Did the house sit on the mat though

@notsure7132 3 years ago

Thank you.

@synthetiksoftware5631 3 years ago

Isn't the 'Fourier' style positional encoding just a different way to build a scale space representation of the input data? So you are still 'baking' that kind of scale space prior into the system.
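
For readers unfamiliar with the encoding being discussed, here is a rough NumPy sketch of a Fourier-feature position encoding. It assumes positions normalised to [-1, 1], sine and cosine features at frequencies spaced linearly from 1 up to half a chosen maximum frequency, and the raw position concatenated at the end; the function name `fourier_features` and the parameters `num_bands` and `max_freq` are illustrative and may not match the paper's exact configuration.

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    # pos: (..., d) coordinates in [-1, 1].
    # Returns (..., d * (2 * num_bands + 1)) features.
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)   # (num_bands,)
    angles = np.pi * pos[..., None] * freqs               # (..., d, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    feats = feats.reshape(*pos.shape[:-1], -1)            # flatten d and bands
    return np.concatenate([pos, feats], axis=-1)          # raw position first

# Encode every pixel position of a small 8x8 grid.
H, W = 8, 8
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
pos = np.stack([ys, xs], axis=-1)                         # (H, W, 2)
enc = fourier_features(pos, num_bands=4, max_freq=8.0)

print(enc.shape)   # 2 raw coordinates + 2 dims * 2 functions * 4 bands per pixel
```

The low-frequency bands vary slowly across the image and the high-frequency bands vary quickly, which is the multi-scale structure the comment refers to as a prior.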

@evilby 1 year ago

WAHHH... Problem Solved!😆

@maks029 3 years ago

Thanks for an amazing video. I didn't really catch what the "latent array" represents. Is it an array of zeros at first?

@xealen2166 2 years ago

I'm curious: how are the queries generated from the latent matrix, and how is the latent matrix initially generated?
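
Several comments ask where the latent array comes from. In general, a learned array of this kind is an ordinary trainable parameter: it starts as small random values and is updated by gradient descent together with the other weights, and the queries are a linear projection of it. The toy sketch below illustrates only that mechanism; the quadratic loss, the fixed projection `W_q`, the random `target` and the learning rate are made-up stand-ins, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d_qk = 8, 16, 4

latent = 0.02 * rng.normal(size=(N, D))         # trainable parameter, random init
W_q = rng.normal(size=(D, d_qk)) / np.sqrt(D)   # query projection (kept fixed here)
target = rng.normal(size=(N, d_qk))             # stand-in for a training signal

def loss_and_grad(latent):
    q = latent @ W_q                       # queries are a projection of the latent
    err = q - target
    loss = np.mean(err ** 2)
    grad = 2.0 * err @ W_q.T / err.size    # d(loss)/d(latent), derived by hand
    return loss, grad

initial_loss, _ = loss_and_grad(latent)
for _ in range(200):
    _, grad = loss_and_grad(latent)
    latent -= 4.0 * grad                   # plain gradient descent step
final_loss, _ = loss_and_grad(latent)

print(initial_loss, final_loss)
```

In a real model the gradient would come from the task loss via backpropagation through the whole network, and the projection would be trained too; the point is only that the latent array is not computed from the input by a separate algorithm.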

@Shan224 3 years ago

Thank you, Yannic

@hiramcoriarodriguez1252 3 years ago

This is huge. I won't be surprised if "Perceiver" becomes the gold standard for CV tasks.

@galchinsky 3 years ago

The way it is, it seems to be classification only

@nathanpestes9497 3 years ago

@galchinsky You should be able to run it backwards for generation. Just say my output (image/point-cloud/text I want to generate) is my latent (as labeled in the diagram), and my input (byte array in the diagram) is some latent representation that feeds into my outputs over several steps. I think this could be super cool for 3D GANs since you don't wind up having to fill 3D grids with a bunch of empty space.

@galchinsky 3 years ago

@nathanpestes9497 Won't you get O(huge^2) this way?

@nathanpestes9497 3 years ago

@galchinsky I think it would be cross attention O(user defined * huge), same as the paper (different order). Generally we have O(M*N), M - the size of input/byte-array, N - the size of the latent. The paper goes after performance by forcing the latent to be non-huge, so M=huge, N=small, O(huge * small). Running it backwards you would have a small input (which is now actually our latent, so a low-dimensional random sample if we want to do a GAN, perhaps the (actual) latent from another perceiver in a VAE or similar). So backwards you have M=small, N=huge, so O(small*huge).

@galchinsky 3 years ago

@nathanpestes9497 Thanks for pointing this out. I thought we would get a Huge x Huge attention matrix, but you are right: if we set the Q length to be Huge and K/V to be Small, the resulting complexity will be O(Huge*Small). So we want to get a new K/V pair each time, and this approach seems quite natural: (here was an imgur link but youtube seems to hide it). So there are 2 parallel stacks of layers. The first set is like in the article: latent weights, then cross attention, then a stack of transformers and so on. The second stack consists of your cross-attention layers, so it operates in the byte-array dimension. The first Q is the byte array input and K,V are taken from the stack of the "latent transformers". Then its output is fed as K,V back to the "latent" cross attention, making new K,V. So there is an informational ping-pong between "huge" and "latent" cross-attention layers.

@teatea5528 1 year ago

This may be a stupid question, but how do the authors claim their method is better than ViT on ImageNet in Appendix A, Table 7, when their accuracy is not higher?

@conduit242 3 years ago

Embeddings are still all you need 🤷

@TheJohnestOfJohns 3 years ago

Isn't this really similar to facebook's DETR with their object queries, but with shared weights?

@antoninhejny8156 3 years ago

No, since DETR is just for localising objects from features extracted via some backbone like ResNet, while this is the feature extractor. Furthermore, DETR just puts the features into a transformer, whereas this is like forming an idea about what is in the image while consulting the raw information in the form of RGB. This is however very suspicious, because a linear combination of RGB is just three numbers.

@cocoarecords 3 years ago

Yannic can you tell us your approach to understand papers quickly?

@YannicKilcher 3 years ago

Look at the pictures

@TheZork1995 3 years ago

@YannicKilcher xD so easy yet so far. Thank you for the good work though. Literally the best YouTube channel I ever found!

@Anujkumar-my1wi 3 years ago

Can you tell me why neural nets with many hidden layers require fewer neurons than a neural net with a single hidden layer to approximate a function?

@axeldroid2453 3 years ago

Does it have something to do with sparse sensing? It basically attends to the most relevant data points.

@bensums 3 years ago

So the main point is you can have fewer queries than values? This is obvious even just by looking at the definition of scaled dot-product attention in Attention Is All You Need (Equation 1). From the definition there, the number of outputs equals the number of queries and is independent of the number of keys or values. The only constraints are: 1. the number of keys must match the number of values, 2. the dimension of each query must equal the dimension of the corresponding key.

@bensums 3 years ago

(in the paper all queries and keys are the same dimension (d_k), but that's not necessary)
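
The observation in this thread, that the number of outputs follows the number of queries and not the number of keys or values, can be seen directly from the shapes. This is a small NumPy sketch of scaled dot-product attention as defined in Attention Is All You Need, Equation 1; the sizes and the names `attention` and `output_shapes` are arbitrary choices for illustration.

```python
import numpy as np

def attention(q, k, v):
    # softmax(q k^T / sqrt(d_k)) v, computed row by row over the queries.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
M, d_k, d_v = 5000, 16, 32        # many keys/values; value width may differ
k = rng.normal(size=(M, d_k))
v = rng.normal(size=(M, d_v))

output_shapes = {}
for n_queries in (1, 8, 64):
    q = rng.normal(size=(n_queries, d_k))
    output_shapes[n_queries] = attention(q, k, v).shape

print(output_shapes)
```

Each output has one row per query and d_v columns, regardless of M. The intermediate score matrix has n_queries by M entries, which is where the cost saving comes from when the number of queries is small.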

@corgirun7892 1 month ago

M = 50176 for 224 × 224 ImageNet images
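
To put that number in perspective, here is a back-of-the-envelope calculation of the attention map size. It assumes a latent size of N = 512 and 4-byte floats, and counts a single attention map only (one head, one layer, one image); N and the byte size are assumptions for the sake of the arithmetic.

```python
M = 224 * 224              # number of pixels in a 224 x 224 image
N = 512                    # assumed number of latents
BYTES_PER_FLOAT = 4        # float32

self_attn_entries = M * M      # full self-attention over pixels
cross_attn_entries = M * N     # cross-attention from latents to pixels

self_attn_gib = self_attn_entries * BYTES_PER_FLOAT / 2**30
cross_attn_mib = cross_attn_entries * BYTES_PER_FLOAT / 2**20

print(M)                                        # 50176
print(round(self_attn_gib, 2))                  # 9.38
print(cross_attn_mib)                           # 98.0
print(self_attn_entries // cross_attn_entries)  # 98
```

Under these assumptions a single pixel-level self-attention map would take roughly 9.4 GiB, while the latent cross-attention map takes 98 MiB, a factor of M / N = 98 smaller.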

@NeoShameMan 3 years ago

So basically it's conceptually close to rapid eye movement, where we refine over time the data we need to resolve recognition...

@marat61 3 years ago

Also, you did not mention the dimension size in the ablation part.

@thegistofcalculus 3 years ago

Just a silly question, instead of big data input vector and small latent vector could they have a big latent vector that they use as a summary vector and spoon feed slices of data in order to achieve some downstream task such as maybe predicting the next data slice? Would this allow for even bigger input which is summarized (like HD video)?

@thegistofcalculus 2 years ago

Looking back it seems that my comment was unclear. It would involve a second cross attention module to determine what gets written into the big vector.

@DistortedV12 3 years ago

“General architecture”, but can it understand tabular inputs??

@Deez-Master 3 years ago

Nice video

@moctardiallo2608 3 years ago

Yeah, 30 min is much better!

@GuillermoValleCosmos 3 years ago

this is clever and cool

@vadimschashecnikovs3082 3 years ago

Hmm, I think it is possible to add some GLOM-like hierarchy of "words". This could improve the model...

@LaNeona 3 years ago

If I have a gamification model is there anyone you know that does meta analysis on system mechanisms?

@brll5733 3 years ago

Performers already grow entirely linearly, right?

@seraphim9723 3 years ago

The ablation study consists of three points without any error bars and could just be coincidence? One cannot call that "science".

@freemind.d2714 3 years ago

Good job, Yannic. But I'm starting to feel like a lot of the papers you cover in your videos these days are all about transformers, and frankly they are kind of similar, and most are engineering research rather than scientific research. I hope you don't mind covering more interesting papers on different subjects.

@muhammadaliyu3076 3 years ago

Yannic follows the hype

@timstevens3361 3 years ago

attention looped is consciousness

@happycookiecamper8101 3 years ago

nice

@rhronsky 3 years ago

Clearly you are more of a fan of row vectors than column vectors, Yannic (referring to your visual demo :))

@kenyang687 1 year ago

The "hmm by hmm" is just too confusing lol

@kirtipandya4618 3 years ago

Where can we find source code?

@oreganorx7 2 years ago

Very similar to MemFormer

@Stefan-bs3gm 3 years ago

with O(M*M) attention you quickly get to OOM :-P

@allengrimm3039 3 years ago

I see what you did there

@martinschulze5399 3 years ago

Do you have any open PhD positions? ^^

@enriquesolarte1164 3 years ago

haha, I love the scissors...!!!

@AvastarBin 3 years ago

+1 For the visual representation of M*N hahah

@TechyBen 3 years ago

Oh no, they are making it try to be alive. XD

@Vikram-wx4hg 1 year ago

17:15

@allurbase 3 years ago

It's kind of dumb to input the same video frame over and over. Just go frame by frame; it will take a bit for it to catch up, but so would you.

@omegapointil5741 3 years ago

I guess curing Cancer is even more complicated than this.

@insighttoinciteworksllc1005 2 years ago

Humans can do the iterative process too. The Inquiry Method is the only thing that requires it. If you add the trial and error element with self-correction, young minds can develop a learning process. Learn How to learn? Once they get in touch with their inner teacher, they connect to the Information Dimension (theory). Humans can go to where the Perceiver can't go. The Inner teacher uses intuition to bring forth unknown knowledge to mankind's consciousness. The system Mr. Tesla used to create original thought. Unless you think he had a computer? The Perceiver will be able to replace all the scientists that helped develop it and the masses hooked on the internet. It will never replace the humans that develop the highest level of consciousness. Thank you, Yeshua for this revelation.