GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained) 

Yannic Kilcher
261K subscribers
45K views

Published: 28 Aug 2024

Comments: 114
@stalinsampras · 3 years ago
Yannic "The light Speed" Kilcher never stop the good work
@yimingqu2403 · 3 years ago
There are not many people in the whole world who can just publish a paper like this.
@444haluk · 3 years ago
And this is not a compliment :D
@Daniel-ih4zh · 3 years ago
@@444haluk How is this paper a bad idea?
@IdiotDeveloper · 3 years ago
The GLOM architecture feels similar to the neocortex column structure. Thanks for the easy explanation.
@snippletrap · 3 years ago
Directly inspired by it.
@JamesAwokeKnowing · 3 years ago
I can't shake the feeling that Hinton might feel he doesn't have much time left and so wanted to get the idea out there ASAP without waiting for a working system. That makes me sad. Then again, capsules and Boltzmann networks etc. were also always presented 'not yet working', so maybe this is Hinton in peak form. :)
@andres_pq · 3 years ago
Thought the same. Also wanted to protect the idea so Schmidhuber can't say it's his.
@Freddiriksson · 3 years ago
Mr Sampras already wrote it. Just don't stop!! I really appreciate your work.
@andrewmeowmeow · 3 years ago
Cool! Your cross-layer attention mechanism is very clever. Thank you for sharing such a clever idea and this high quality video. From a deep learning newbie😀
@BaddhaBuddha · 2 years ago
But what would be the biological correlate of this method of summing over layers?
@hemanthkotagiri8865 · 3 years ago
I just saw his tweet and I was wondering if you had uploaded yet, and here you are. Wow dude, you're crazy! I love it! Keep 'em coming! Helping me a lot!
@billykotsos4642 · 3 years ago
The man, the myth, the legend
@kevivmodi7019 · 3 years ago
the one and only
@marcelroed · 3 years ago
Yannic or Geoffrey?
@dibyanayanbandyopadhyay3018 · 3 years ago
@@marcelroed both
@abhishekmaiti8332 · 3 years ago
coldzera vs yannic
@hoangnhatpham8076 · 3 years ago
The feedback from upper layers and lateral connections remind me of Neural Abstraction Pyramid. Anyway, nice video as usual!
@suleymanemirakin · 11 months ago
Thank you so much, this video helped a lot.
@florianhonicke5448 · 3 years ago
Made my Sunday! Thanks!!!
@lm-gn8xr · 3 years ago
Does the paper discuss any relations with graph neural networks? The way features are updated by aggregating top/down/same-layer features looks a lot like what is done in graph networks to me. Thanks for the video btw, it's incredible to be able to create such content one day after the release of a 40-page paper 👏
@michaelwangCH · 3 years ago
Intuitive and interesting paper - Hinton replicates how the human brain recognizes an object in an image. AI researchers should write the code and try it out.
@vsiegel · 3 years ago
Potatoes can be much more irregularly shaped than avocados, and you have prior knowledge about the avocado shape! So it was not a good physics experiment ;) (it is not claimed to be one), which is only noticeable because of the contrast with the rest of the presentation - which is obviously great, to be clear. And the fact of including an experiment at all is brilliant, thanks for including it!
@andrehoffmann2018 · 3 years ago
Great video! I'm thinking that maybe it makes sense to gather all information in the same layer for the island-creation consensus, even if it may break the parse tree logic a little. Like, using the cat example, fur is fur, even though we may make arbitrary symbolic separations at higher levels, like "cat ear fur" or "cat neck fur".
@andrehoffmann2018 · 3 years ago
This may help when treating a video, for example. If a cat moves in the video, a lower-level channel may now receive the image of the cat, when in previous timesteps it received background. If it gathers all the information from decisions at the same level in the previous timestep, it can quickly decide "fur" (or "cat" at higher levels), because other columns at this level already processed this and had agreed on "fur". But if it ignores some information because it was at a different parse tree node (higher-level vector island), it will be harder to make this decision, because there is no information about "cat" in the "background" parse tree node that this column was part of in the previous step. Maybe this doesn't make sense, but this is how I understand it.
@nikronic · 3 years ago
Absolutely, I have the same idea too. We may call it "breaking the parse tree", but actually we can interpret it as updating the parse tree. At around 29:50 it is mentioned that lower levels of columns belonging to different higher-level nodes should not contribute to attention propagation, while including them makes it possible to create new trees or destroy previous ones depending on the state of the pixels. Even in a static image, depending on the patch size (say, bigger than a pixel), multiple patches may refer to the same object, which could be represented by a node in the higher levels of a column, if we let attention pass information between two distinct nodes (branches of trees).
@simonstrandgaard5503 · 3 years ago
This is a much better layout than the previous video. It works well with the semi-transparent text in the bottom right corner. It doesn't work with the channel icon wasting precious screen real estate throughout the video. Instead, make a circle with YK inside; it doesn't have to be a fancy font nor colorful, it's important that it's readable/recognizable.
@simonstrandgaard5503 · 3 years ago
On second thought, I think readability of the YouTube link can be improved using a white font color with a black outline. Also no YK circle, just the YouTube link.
@eladwarshawsky7587 · 3 years ago
I feel like this is basically message passing like in GNNs (plus some attention), but with patches of an image as nodes.
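For readers who want the analogy spelled out: below is a minimal, hypothetical sketch of one message-passing step where each image patch is a node and messages are weighted by dot-product attention. Everything here (names, sizes, the fully connected adjacency) is made up for illustration, not taken from the paper or the video.

```python
import torch
import torch.nn.functional as F

def message_passing_step(x, adjacency):
    # x:         (num_patches, dim) patch embeddings, i.e. the graph "nodes"
    # adjacency: (num_patches, num_patches) 0/1 mask of which patches exchange messages
    scores = x @ x.t() / x.shape[-1] ** 0.5              # attention-style similarities
    scores = scores.masked_fill(adjacency == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # normalised message weights
    return weights @ x                                   # aggregate neighbour messages

patches = torch.randn(16, 64)   # e.g. a 4x4 grid of patches, toy sizes
adj = torch.ones(16, 16)        # fully connected here; a real graph would be local
updated = message_passing_step(patches, adj)
```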
@morkovija · 3 years ago
My prediction: somebody IS going to cite the channel in their papers this year =)
@patrickh7721 · 3 years ago
34:08 haha
@RishitDagli · 3 years ago
Really Cool video, loved the way you explain stuff. I also tried to implement GLOM in code after watching this video.
@Idiomatick · 2 years ago
did it work at all?
@avishkarsaha8506 · 3 years ago
luv these vids, the best lunchtime infotainment
@priyamdey3298 · 3 years ago
The so-called columns feel like they are inspired by cortical columns, with their 4 layers being consistently present throughout the neocortex, although those are way, way more complex to understand.
@JamesAwokeKnowing · 3 years ago
It seems that way but not really. The layers would be across cortical columns with cortical columns closer to pixels/image patches
@snippletrap · 3 years ago
They are absolutely inspired by cortical columns. Hinton is frank about this in his talks.
@JamesAwokeKnowing · 3 years ago
@@snippletrap If so, it's foolish, because studies clearly show that that's not how the brain has objects/information organized. It's 100% wrong to think that the layers in the cortex correspond to higher-level features, so that all the higher concepts are on one layer and lower ones on another. Instead it's clearly organized on the other dimension, where e.g. you have an area for hands, and it has sub-areas for fingers and sub-areas for fingertips etc. Tell me, which of the layers of the cortex has all the high-level (e.g. cat, car) classes?
@charlesfoster6326 · 3 years ago
@@JamesAwokeKnowing I don't understand. That's also how GLOM works. There'd be a section of columns that map onto inputs from, say, the left eye, and build hierarchical representations (in each column) of visual inputs. The representations are modality specific, localized, but also distributed.
@snippletrap · 3 years ago
@@JamesAwokeKnowing Hinton distinguishes between levels and layers. Watch his presentation on this paper. Charles Foster is right.
@jonathanr4242 · 1 year ago
Great explanation. It seems a bit similar in some senses to the hierarchical mixture of experts model.
@andreassyren329 · 3 years ago
It might be possible to bias the network into learning an appropriate attention modulation, such as the one you proposed, by introducing positional encodings in the columns. Then columns far apart are less similar and their influence is modulated. An interesting consequence of such a learned modulation would be that, over several iterations, an "island" could arrive at an island-global position encoding in addition to the "object" encoding. This could be useful for higher-level layers, which would benefit from using the location information of lower-level islands. PS: GLOM has a distinct smell of graph nets.
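A toy sketch of what such a bias could look like, assuming simple additive positional encodings per column; the function and parameter names are invented for illustration and are not from the paper:

```python
import torch
import torch.nn.functional as F

def attention_with_positions(x, positions, pos_scale=1.0):
    # x:         (num_columns, dim) level embeddings, one per column/location
    # positions: (num_columns, dim) positional encodings of the columns
    # Adding positional encodings biases the dot product so that columns that
    # are far apart look less similar and therefore attend to each other less.
    z = x + pos_scale * positions
    scores = z @ z.t() / z.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ x

cols = torch.randn(49, 32)   # e.g. a 7x7 grid of columns, toy sizes
pos = torch.randn(49, 32)    # stand-in for e.g. sinusoidal encodings
out = attention_with_positions(cols, pos)
```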
@whale27 · 3 years ago
Love these videos
@swazza9999 · 3 years ago
Thanks Yannic! I noped out of the paper within the first few pages, but this video will help me gather the courage to tackle it again. By the way, what software are you using for the PDF drawing? (And if anyone else knows, I would love to hear from you.)
@jeshweedleon3960 · 3 years ago
OneNote, iirc
@herp_derpingson · 3 years ago
25:00 X * X.T can just be interpreted as "how many things similar to myself are near me". So if we attend to everything in the picture, it will not be very informative, I think. We need some spatial window. Speaking of which, I think we should also add position embeddings to this attention.
44:40 Can't we set the vector lengths of the embeddings to 1 by normalizing them after each step?
47:00 So we have another loss term where the loss is the deviation of individual predictions from the final summation? Is this even needed? Just because I am a pixel of cat fur doesn't mean that the pixel next to me is going to be a pixel of cat fur. It can also be grass.
55:00 Humans can't do that either. Human eyes are not rotation invariant.
58:30 The video analogy is excellent!
I think an easier way to train this model would be to take ImageNet, make every class one orthogonal vector with the embedding length, and then calculate the MSE loss where all the vectors in the last layer should be equal to the orthogonal vector representing the class. Basically loss = sum(mse(Y, Y_[i]), 0, n), where Y is the orthogonal vector corresponding to that image's ground-truth class and Y_[i] is the activation of the last layer of the i-th column.
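To make that last training suggestion concrete, here is a minimal sketch of the proposed loss under the stated assumptions (one fixed orthogonal vector per class, one top-level embedding per column). All names and sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

def column_consensus_loss(top_level, class_idx, class_vectors):
    # top_level:     (num_columns, dim) top-level embedding of every column
    # class_idx:     int, ground-truth class of the image
    # class_vectors: (num_classes, dim) one fixed orthogonal vector per class
    target = class_vectors[class_idx]                      # Y in the comment above
    return F.mse_loss(top_level, target.expand_as(top_level))

num_classes, dim, num_columns = 10, 16, 49
# rows of a random orthogonal matrix give mutually orthogonal class vectors
class_vectors = torch.linalg.qr(torch.randn(dim, dim))[0][:num_classes]
top = torch.randn(num_columns, dim)                        # Y_[i] for all columns i
loss = column_consensus_loss(top, class_idx=3, class_vectors=class_vectors)
```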
@frankjin7086 · 3 years ago
Your idea is so brilliant, thanks for sharing. I am searching for methodologies for hierarchical structure in NNs, and your idea is the smartest.
@DanFrederiksen · 3 years ago
The video might be a little long. I've only followed Hinton a little, but I get the impression that he might publish an idea that might not work just in case something along those lines turns out to work, as a way to claim it for himself - even if the idea isn't originally his either.
@arnebinder1406 · 3 years ago
Implementation How-To for the text domain (approx. < 1 h when adapting huggingface code):
* take the ALBERT model (aka weight sharing across layers), see paper (arxiv.org/abs/1909.11942) and code (github.com/huggingface/transformers/blob/master/src/transformers/models/albert/modeling_albert.py)
* use t layers (t = number of time steps you want to model)
* use L heads (L = number of GLOM layers you want to model)
* do these small modifications to the ALBERT model:
1) remove the linear projections for query, key, value (just pass through [(d/L)*i..(d/L)*(i+1)] to the i-th head; d is the embedding dimensionality)
2) modify/constrain the dense layer that follows the attention in a way that each partition [(d/L)*i..(d/L)*(i+1)] of its output is only constructed from the output of the (i-1)-th head, the i-th head, and the (i+1)-th head
3) remove the skip connection(s) and the MLP that sits on top of the attention layer
Maybe this needs some minor tweaks, but you should get the idea. EDIT: Took a bit longer, but here you are: github.com/ArneBinder/GlomImpl
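For step 2), one way to express the constraint is a dense layer whose weight matrix is masked block-bandedly, so the output partition for level i only reads from heads i-1, i and i+1. This is a rough, unofficial sketch with made-up names and sizes, not code from the linked GlomImpl repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BandedLevelMixer(nn.Module):
    # Post-attention dense layer where output partition i may only be built
    # from the outputs of heads i-1, i and i+1 (step 2 in the recipe above).
    def __init__(self, num_levels, level_dim):
        super().__init__()
        d = num_levels * level_dim
        self.linear = nn.Linear(d, d)
        mask = torch.zeros(d, d)
        for i in range(num_levels):
            rows = slice(i * level_dim, (i + 1) * level_dim)
            for j in (i - 1, i, i + 1):
                if 0 <= j < num_levels:
                    cols = slice(j * level_dim, (j + 1) * level_dim)
                    mask[rows, cols] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # masking the weight enforces the "adjacent levels only" constraint
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)

mixer = BandedLevelMixer(num_levels=5, level_dim=64)
out = mixer(torch.randn(2, 10, 5 * 64))   # (batch, tokens, d)
```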
@zhicongxian1582 · 3 years ago
Thank you for the great video. I have one question about section 7, "Learning Islands". To avoid collapse of the latent variables, Hinton proposes that one obvious solution is to regularize the bottom-up and top-down neural networks by encouraging each of them to predict the consensus opinion. Is the following interpretation correct? The bottom-up network learns a part-whole relationship, e.g., a cat's ear and a cat's neck suggest the presence of a cat's head. The top-down network learns a whole-part relationship, e.g., if there is a cat's head in the area, then there must be a cat's ear and a cat's neck. The presence probability of parts in the bottom-up network should be the same as the one inferred by the top-down network.
@zhicongxian1582 · 3 years ago
After reading that paragraph again, I assume it maybe means that the way GLOM updates the weights using consensus agreement can avoid collapsing latent variables. No regularization technique is mentioned, correct?
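One way to picture the regularizer being discussed, as a rough illustration rather than anything the paper pins down (all names are invented): both the bottom-up and the top-down prediction for a level are pulled towards the consensus embedding the column settled on.

```python
import torch
import torch.nn.functional as F

def consensus_regularizer(bottom_up_pred, top_down_pred, consensus):
    # Encourage both networks to predict the agreed-upon embedding.
    consensus = consensus.detach()          # treat the consensus as a fixed target
    return (F.mse_loss(bottom_up_pred, consensus)
            + F.mse_loss(top_down_pred, consensus))

bu = torch.randn(49, 64, requires_grad=True)   # bottom-up predictions, per column
td = torch.randn(49, 64, requires_grad=True)   # top-down predictions, per column
agreed = torch.randn(49, 64)                   # consensus after averaging/attention
loss = consensus_regularizer(bu, td, agreed)
```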
@mdmishfaqahmed5523 · 3 years ago
10:04 that's one adorable cat :p
@veedrac · 3 years ago
Unless I'm misunderstanding something major, transformers can already express most of these computations (including Yannic's proposed improvement) through attention, and in cases do it better. It doesn't do the iteration method shown, but I think that's the only major missing part (and IMO seems kind'a sketchy). It seems to me you'd be better off trying to augment training of a more traditional transformer to encourage these structures, rather than hard-coding the bias in the architecture.
@ryanalvarez2926 · 3 years ago
I think this could be implemented with a neural cellular automaton. Every pixel gets a column of embeddings that are updated iteratively. It's already so close.
@charlesfoster6326 · 3 years ago
I agree! Try it out :)
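A tiny, hypothetical sketch of that cellular-automaton framing: every pixel carries a stacked column of level embeddings, and a shared local update rule is applied for several steps. The sizes and the convolutional update rule are made up for illustration.

```python
import torch
import torch.nn as nn

class ColumnCA(nn.Module):
    def __init__(self, levels=5, dim=16):
        super().__init__()
        # shared local rule: each pixel's column is updated from its 3x3 neighbourhood
        self.update = nn.Conv2d(levels * dim, levels * dim, kernel_size=3, padding=1)

    def forward(self, state, steps=10):
        # state: (batch, levels*dim, height, width) -- one embedding column per pixel
        for _ in range(steps):
            state = state + torch.tanh(self.update(state))   # residual iterative update
        return state

ca = ColumnCA()
columns = torch.randn(1, 5 * 16, 28, 28)
settled = ca(columns, steps=5)
```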
@SudhirPratapYadav · 3 years ago
Which software are you using for displaying the paper, with so much margin to draw on?
@jamiekawabata7101 · 3 years ago
Love the video, thank you.
@JamesAwokeKnowing · 3 years ago
@yannic can you speculate how well this architecture maps to spiking networks (e.g. on neuromorphic chips)? Because of the iterative and time-based nature, it seems it could map nicely.
@patf9770 · 3 years ago
I'm probably misunderstanding something, but isn't the Feedback Transformer essentially implementing this in an efficient way?
@bertchristiaens6355 · 2 years ago
Is this similar to the cortical columns of the thousand brains theory?
@nikhilm4418 · 3 years ago
I'm wondering if the UNet++ architecture has an idea similar to this as far as the information sharing across levels of a column is concerned. GLOM is way more sophisticated, of course, w.r.t. attention-based inter-column representation sharing etc.
@jrkirby93 · 3 years ago
I'm really curious why Hinton wrote this paper... instead of just building the thing? He has experience, access to data, access to compute, grad students to help, and time to focus on it. Was he afraid someone else was doing something similar and he wants to publish first? Does he need help figuring out design? Is there something else he's missing?
@wenxue8155 · 3 years ago
My guess is that he just wants to take credit for every advance in the AI field. It's like you read some papers and sense that this could be a breakthrough, so you write the idea into a paper. This is not fair to people who have been working on this idea. If these people succeed, i.e. they actually build something that works, in the future people would say, oh, Geoffrey Hinton had this idea before you, so it was he who invented it.
@jsmdnq · 2 years ago
This seems like it would just be equivalent to a Fourier spectrogram. You will have "noise" at the highest levels of detail representing all the various info, and as you go up the abstraction you will filter out that noise. The result will simply be that of a 3D Fourier transform with a progressive low-pass filter which filters more and more data. At the very top you have a constant which "represents" the scene in its most abstracted form. Without training the algorithm to know how to filter towards some class, you won't be able to interpret the results with any meaning beyond the inherent classification (which is just abstract bits and so no classification).
@shengyaozhuang3748 · 3 years ago
Can you have a look at OpenAI's latest work, "Multimodal Neurons in Artificial Neural Networks"?
@arnavdas3139 · 3 years ago
Lockdown 2.0 run restarted....😭😭😭😭...how to keep up with your videos
@charliesteiner2334 · 3 years ago
It seems like this is missing scale invariance - if you have a cat and a zoomed in cat, you end up with the same parts of the cat being processed at different layers.
@abhishekaggarwal2712 · 3 years ago
I think you are confusing levels and layers. Levels are within embeddings and represent levels of abstraction. Each layer will have all (say 5) levels in an embedding. Layers are meant to provide progressive temporal resolution of these levels during the forward pass. The lateral computation between the same levels across different locations is a conventional self-attention-like computation, with all the keys, values and queries being identical.
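A small illustrative sketch of that layout (toy sizes, not from the paper): every location holds one embedding per level, and lateral attention is computed per level across locations, with keys, queries and values all being the same embeddings.

```python
import torch
import torch.nn.functional as F

num_locations, num_levels, dim = 49, 5, 64
columns = torch.randn(num_locations, num_levels, dim)   # one column per location

def lateral_attention(columns):
    out = torch.empty_like(columns)
    for level in range(columns.shape[1]):
        x = columns[:, level]                            # (locations, dim) at this level
        w = F.softmax(x @ x.t() / dim ** 0.5, dim=-1)    # keys = queries = values = x
        out[:, level] = w @ x
    return out

updated = lateral_attention(columns)
```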
@dr.mikeybee · 3 years ago
I think capsules are the wrong direction. What we've seen over and over is that end-to-end ANNs eventually outperform what humans engineer. I believe that when the models get large enough, and when we feed in the right training data in the right order, we will get truly general models. Are there systems that choose and order training data?
@NM-jq3sv · 3 years ago
Lambda should be applied "shortest distance between two nodes in a graph" times, but we don't know the graph. I couldn't understand your math in the attention modification :(
@sfarmapietre · 3 years ago
Nice!!
@alexanderkyte4675 · 3 years ago
You should make merch related to NN jokes
@Mordenor · 3 years ago
Neural Network November
@shaypatrickcormac2765 · 3 years ago
It looks like Hypercolumns for Object Segmentation.
@andres_pq · 3 years ago
1:00:00 it is avocado-shaped
@eelcohoogendoorn8044 · 3 years ago
I guess I'm all for people publishing whatever is on their mind, without too much regard for conventions, so good on him. But I don't see much value in such 'idea' papers in ML. I don't think there is a shortage of ideas; there is a shortage of ideas you can get to work and do something useful. If the field were in a state where we had a deep understanding of why the things we do work in the first place, such theoretical leaps might pay off. As it stands, the question of 'but is this an idea with a loss function that will converge using stochastic gradient descent?' is one you ignore at your own peril.
@pensiveintrovert4318 · 3 years ago
Large networks already do this implicitly. A neural net is an ensemble of neural nets, each specializing in different images.
@youngjin8300 · 3 years ago
Another unreasonable effectiveness of Yannic just hit home.
@noddu · 3 years ago
59:40 interesting
@boss91ssod · 3 years ago
OneNote is not good for annotating PDFs; please use something with sharper quality (e.g. GoodNotes or LiquidText). It is uncomfortable to read the text. As an alternative, maybe import the PDF pages as high-res images...
@zeyuyun6605 · 3 years ago
10:00 "a cat" lol
@user-tm9fh5rb5y · 3 years ago
The avocado is not a joke.
@nicolagarau9763 · 3 years ago
I don't get all the hate towards submitting an arXiv paper targeting theoretical ideas. I mean, even if Hinton is the author, even if some concepts of the paper are not that new, in my opinion the idea is expressed very well and could be revolutionary if implemented correctly. It's pretty refreshing to see new theoretical papers in ML which are not targeted towards pure benchmarking. To see it as an attempt to be the first to invent it is quite silly in my view, since research should not be a competition. The point is, we need less benchmarking and fewer 0.001% improvements in ML, and more unsupervised and interpretable models, possibly based on biologically plausible concepts.
@nicolagarau9763 · 3 years ago
Also, thank you very much Yannic for the inspiring video ❤
@kartikeyshaurya1827 · 3 years ago
Bro, is it OK to use the knowledge from your video in my own video??? Of course I will be giving credit to you.....
@jeffhow_alboran · 3 years ago
Confirmed. Yannic is an alien creature and does not need sleep.
@wahabfiles6260 · 3 years ago
what are you saying?
@SudhirPratapYadav · 3 years ago
This looks very similar to *Jeff Hinton's* Thousand Brains Theory - cortical columns in the neocortex with a voting system. edit1: Jeff Hawkins', not Jeff Hinton's
@bertchristiaens6355 · 2 years ago
Jeff Hawkins*, and I thought the same! I'm curious which architecture will implement it most accurately: transformers, graph NNs, ... When looking at the Tesla Day video, it seems that a combination could be the solution: encoding with CNNs, fusion of inputs with attention, feature pyramids, and temporal and spatial predictions.
@SudhirPratapYadav · 2 years ago
@@bertchristiaens6355 I don't know how I could type Jeff Hinton after reading Jeff Hawkins' book and watching many videos. The mind is quirky in its own way.
@SudhirPratapYadav · 2 years ago
@@bertchristiaens6355 Yes, Tesla took an engineering approach. I think Tesla's self-driving car is one of the first real-world systems with deep learning networks used as 'modules' the way we use other software. It truly is Software 2.0.
@bublylybub8743 · 3 years ago
I appreciate the video walkthrough, but I have to rant a bit... What bothers me the most is that the reference/related work is severely lacking/missing in this manuscript. The first sentence in the introduction, quote: "There is strong psychological evidence that people parse visual scenes into part-whole hierarchies and model the viewpoint-invariant spatial relationship between a part and a whole as the coordinate transformation between intrinsic coordinate frames that they assign to the part and the whole [Hinton, 1979]." Come on! Only one paper is cited and it's his own paper... I don't know how psychology or neuroscience people feel about this. I mean, at least pay some effort to cite some biology/psychology/neuroscience papers here. Back to the idea... I think the idea presented in this paper is... not new. Hierarchical structure in vision modeling, message passing, attention are all well studied... I am sorry, but I don't see any REAL NEW stuff in this paper. My impression is that the things described in this paper might just be some less expressive transformers... with some inductive bias (like hierarchy) baked in.
@nikronic · 3 years ago
You are right, but the point is that it is an idea paper just to share his views, and in the abstract he says that it's a mechanism that combines well-known mechanisms in a specific way. Also, at the end of the paper, he acknowledges that he should have read many more papers and wants other people's views on this topic.
@eduarddronnik5155 · 3 years ago
So he proposed a transformer?
@silvercat4 · 3 years ago
Hey Yannic, please share a picture of your cat! I suspect she's quite a beauty
@wenxue8155 · 3 years ago
This is not fair to people who have been working on this idea. If these people succeed, i.e. they actually build something that works, in the future people would say, oh, Geoffrey Hinton had this idea before you, so it was he who invented it.
@walterwhite4234 · 3 years ago
Dude, Jürgen Schmidhuber invented it in the 90s, so shut the f**** up
@RoboticusMusic · 3 years ago
Is this because many or most of our brain cells are actually themselves an "expert" (for example, in rats there is just one neuron to signal that an image is moving up, and it can be stimulated to make the rat press a button it was trained to press even if the image is moving down), so this is an efficient method to "find that expert neuron"?
@aloshalaa1992 · 3 years ago
Do you want to collaborate and do your commentary on that with a few additions of mine? :)
@michaelnurse9089 · 3 years ago
Can we hear more from Future Yannic please - maybe a little bit from Yannic 2028 next time...
@marc-andrepiche1809 · 3 years ago
This is delicious
@tensorstrings · 3 years ago
"About 5" hahaha