
Hopfield Networks is All You Need (Paper Explained) 

Yannic Kilcher
266K subscribers
97K views

Published: 27 Oct 2024

Comments: 86
@thomasmuller7001 4 years ago
Yannic Kilcher is All You Need
@quebono100 4 years ago
That's a good one :)
@tarmiziizzuddin337 4 years ago
haha, yeah man
@kimchi_taco 4 years ago
Such information transformer
@rockapedra1130 3 years ago
Yannic is awesome at “bottom-lining” things. Cuts through the abstruse mathematical fog and says “this is all it’s REALLY doing”. This channel is HUGELY valuable to me. There are too many papers that veer off into the implementation maths, IMHO. Yannic helps you filter out all the irrelevancies.
@good_user35 4 years ago
It's great to learn why the transformer works so well (Theorem 4) and how the three vectors (K, Q and V) can be translated into Hopfield networks. The analysis of layers for patterns reminds me of many studies in BERTology in NLP. In one of the papers, I remember it was reported that most syntactic processing seems to occur by the middle of the 12 layers. It's interesting, and it seems there are still many things to be learned in the future. Thanks!
@samanthaqiu3416 3 years ago
I'm still confused by this paper: the original Krotov energy for binary pattern retrieval keeps weights as *sums* over all stored patterns, which means constant storage... this LSE energy and update rule seem to keep the entire list of stored patterns around... that looks like cheating to me... I am probably missing something
@maxkho00 1 year ago
@@samanthaqiu3416 I kept asking myself this throughout the entire video. Surely I must be missing something? The version of Hopfield "network" described by Yannic in this video just seems like regular CPU storage with a slightly more intelligent retrieval system.
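For reference, the update rule being discussed (a sketch of the paper's setup, with X = (x_1, ..., x_N) the matrix of stored patterns and β the inverse temperature) is softmax attention over the stored patterns, so the patterns are indeed kept around explicitly as the "keys/values"; that is exactly what makes the equivalence to transformer attention possible:

$$\xi^{\text{new}} = X\,\operatorname{softmax}\!\big(\beta\, X^{\top}\xi\big), \qquad E(\xi) = -\operatorname{lse}\!\big(\beta,\, X^{\top}\xi\big) + \tfrac{1}{2}\,\xi^{\top}\xi + \text{const}.$$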
@centar1595 4 years ago
THANK YOU! I actually asked specifically for this one - and man that was fast :)
@samanthaqiu3416 4 years ago
Regarding Theorem 3: c has a lower bound that is exponential in d^-1, hence the guarantee that N will grow exponentially seems optimistic. If you include the lower bound on c, it seems that the lower bound on N has no exponential dependence on d at all.
@Deathflyer 2 years ago
If I understand the proof in the appendix correctly, this is just phrased weirdly. Looking at the actual formula for c, its asymptotic behaviour as d \to \infty is just a constant.
@mgostIH 4 years ago
Wow, I found out about your channel a few days ago, today I saw this paper and got interested in it, and now I see you just uploaded! Your channel has been very informative and detailed, quite rare compared to many others which just gloss over details.
@chochona019 4 years ago
Damn man, the amount of great papers you review is amazing. Great work.
@jesseshakarji9241 4 years ago
I loved how he drew a pentagram
@emuccino 4 years ago
Linear algebra is all you need
@lucasl1047 20 days ago
This video is soon gonna boom lol
@rock_sheep4241 4 years ago
You are indeed the most amazing neural network ever :))
@rock_sheep4241 4 years ago
A quick Sunday night film :))
@0MVR_0 4 years ago
Personhood goes well beyond stimulated predictions with evaluatory mechanics.
@konghong3885 4 years ago
Not gonna lie, I have been waiting for this video so I don't have to read it myself :D
@Irbdmakrtb 4 years ago
Great video Yannic!
@dylanmenzies3973 12 days ago
I read about Hopfield nets, thought "why can't they be continuous?", and bang, straight into the cutting edge.
@umutcoskun4247 4 years ago
Lol, I was looking for a YouTube video about this paper just 30 min ago and was sad to see that you had not uploaded a video about it yet... I was 15 min too early, I guess :D
@woooka 3 years ago
Cool work, great to get more insights about Transformer attention!
@rockapedra1130 4 years ago
Very clear! Great job!
@Imboredas 2 years ago
I think this paper is pretty solid, just wondering why it was not accepted in any of the major conferences.
@TheGroundskeeper 4 years ago
Hey man. I literally sit and argue AI for a job, and I often find myself relying on info or ideas either fully explained or at least lightly touched on by you. This is a great example. It'd be a sin to ever stop. It's obvious to me that training was in no way done, and the constant activity in the middle does not indicate the same items are going back and forth about the same things.
@0MVR_0 4 years ago
It is time to stop giving academia the 'all you need' superlative.
@seraphim9723 4 years ago
Modesty is all you need!
@alvarofrancescbudriafernan2005 2 years ago
Can you train Hopfield networks via gradient descent? Can you integrate a Hopfield module inside a typical backprop-trained network?
@revimfadli4666 1 year ago
I guess fast weights can do those.
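On the gradient-descent question above: the retrieval step of the modern Hopfield network is just matrix products and a softmax, so it is differentiable end to end and can sit inside any backprop-trained model (the authors also provide a PyTorch Hopfield layer alongside the paper). Below is only a minimal sketch of the idea, not the authors' API; the module name and the single learned query are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HopfieldReadout(nn.Module):
    """Toy Hopfield-style readout: a learned query retrieves from a set of
    stored patterns via the softmax update, so gradients flow through it."""
    def __init__(self, dim: int, beta: float = 1.0):
        super().__init__()
        self.beta = beta
        self.query = nn.Parameter(torch.randn(1, dim))    # learned state / query

    def forward(self, stored: torch.Tensor) -> torch.Tensor:
        # stored: (N, dim) patterns, e.g. elements of a set or a sequence
        scores = self.beta * self.query @ stored.T        # (1, N) similarities
        weights = torch.softmax(scores, dim=-1)           # retrieval weights
        return weights @ stored                           # (1, dim) retrieved pattern

# Usage: the layer trains like any other module.
layer = HopfieldReadout(dim=16)
out = layer(torch.randn(10, 16))
out.sum().backward()                                      # gradients reach layer.query
```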
@cptechno 4 years ago
Love your work! I'm interested in the research journals that you regularly scan. Can you give a list of these journals? Maybe you can classify them as 1) very often cited, 2) less often cited, ...
@YannicKilcher 4 years ago
There's not really a system to this
@revimfadli4666 1 year ago
​@@YannicKilcher so just the PhD "one paper per day" stuff?
@sthk1998 4 years ago
If so much information can be embedded exponentially within these Hopfield networks, does that mean this is a good architecture type to use in a reinforcement learning task?
@YannicKilcher 4 years ago
possibly yes
@sthk1998 4 years ago
@@YannicKilcher How would one transfer the model representation of, e.g., BERT or some other transformer model to an RL framework?
@jaakjpn 4 years ago
@@sthk1998 You can use Hopfield networks (and transformers) for the episodic memory of the agent. DeepMind has used similar transformer-like attention mechanisms in their latest RL methods, e.g., Agent57.
@revimfadli4666 1 year ago
Also, how resistant would it be to catastrophic forgetting?
@revimfadli4666 1 year ago
@@jaakjpn I wonder if the ontogenic equivalent of the Baldwin effect played a part.
@AbgezocktXD 4 years ago
These spheres (32:00) are just like in coding theory. Very cool
@Xiineet 4 years ago
"it's not also higher, it is also wider" LMAO
@valthorhalldorsson9300 4 years ago
Fascinating paper, fantastic video.
@sergiomanuel2206 4 years ago
You are a genius man!
@siuuuuuuuuuuuu12 16 days ago
Doesn't this network take the form of a hub?
@gamefaq 4 years ago
Great overview! Definition 1 for stored and retrieved patterns was a little confusing to me. I'm not sure if they meant that the patterns are "on" the surface of the sphere or if they were "inside" the actual sphere. Usually in mathematics, when we say "sphere" we mean just the surface of the sphere and when we say "ball" we mean all points inside the volume that the sphere surrounds. Since they said "sphere" and they used the "element of" symbol, I assume they meant that the patterns should exist on the surface of the sphere itself and not in the volume inside the sphere. They also use the wording "on the sphere" in the text following the definition and in Theorem 3. Assuming that's the intended interpretation, I think the pictures drawn at 33:42 are a bit misleading.
@YannicKilcher 4 years ago
I think I even mention that my pictures are not exactly correct when I draw them :)
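For readers following along, the convention the comment refers to is the standard one (nothing specific to this paper): a "sphere" is the surface only, while a "ball" includes the interior,

$$S^{d-1}_r = \{x \in \mathbb{R}^d : \|x\|_2 = r\} \quad\text{(sphere)}, \qquad B^{d}_r = \{x \in \mathbb{R}^d : \|x\|_2 \le r\} \quad\text{(ball)}.$$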
@DamianReloaded 4 years ago
It'd be cool to see the code running on some data set.
@bzqp2 18 days ago
Hopfield Networks is All You Need To Get A Nobel Prize in Physics.
@burntroses1 2 years ago
It is a breakthrough in understanding immunity and cancer.
@dylanmenzies3973 12 days ago
Hang on a sec: n nodes, therefore n^2 weights (ish). The weights contain the information for the stored patterns; that's not exponential in size n, more like storage of n patterns of n bits at best. Continuous is different... each real number can contain infinite information, depending on the accuracy of output required.
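For context on the capacity point (numbers hedged from memory, not quoted from the paper): the classical binary network with its roughly n^2 Hebbian weights does store only on the order of 0.14·n patterns, as suggested above. The exponential claim concerns the continuous case with well-separated patterns on a sphere in d dimensions, where the bound grows roughly like

$$N \;\propto\; c^{\frac{d-1}{4}}, \qquad c > 1,$$

which is also why, as discussed further down in the comments, adding one dimension multiplies the capacity by about c^{1/4}.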
@nitsanbh 1 year ago
Would love some pseudocode! Both for training and for retrieval.
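Not the authors' code, but here is a minimal numpy sketch of what "training" and retrieval boil down to in the continuous formulation (function names and the value of beta are my own choices): storage is literally keeping the patterns as columns of X, and retrieval iterates the softmax update.

```python
import numpy as np

def store(patterns):
    # "Training" in the modern continuous Hopfield network: keep the
    # stored patterns as the columns of a matrix X of shape (d, N).
    return np.asarray(patterns, dtype=float).T

def retrieve(X, xi, beta=8.0, n_steps=3):
    # Iterate the update xi <- X softmax(beta * X^T xi);
    # in practice it usually converges after a single step.
    for _ in range(n_steps):
        scores = beta * (X.T @ xi)             # (N,) similarities to stored patterns
        p = np.exp(scores - scores.max())
        p /= p.sum()                           # softmax over stored patterns
        xi = X @ p                             # new state: weighted mix of patterns
    return xi

# Example: recover a stored +/-1 pattern from a noisy cue.
rng = np.random.default_rng(0)
X = store(rng.choice([-1.0, 1.0], size=(5, 64)))           # 5 patterns, d = 64
noisy = X[:, 2] + 0.3 * rng.normal(size=64)
print(np.allclose(np.sign(retrieve(X, noisy)), X[:, 2]))   # True for well-separated patterns
```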
@ChocolateMilkCultLeader 4 years ago
Thanks for sharing. This is very interesting
@luke.perkin.inventor 4 years ago
It looks great, but equivalently expressive networks aren't always equally trainable? Can anyone recommend a paper that tackles measuring learnability of data and trainability of networks, maybe linking P=NP and computational complexity? I understand ill-posed problems, but for example, for cracking encryption no size of network or quantity of training data will help... because the patterns are too recursive, too deeply buried, and so unlearnable? How is this measured?
@mathmagic9333 4 years ago
At this point in the video ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nv6oFDp6rNQ.html you state that if you increase the dimension by 1, the storage capacity increases by a factor of 3. However, it increases by c^{1/4}, so by about 1.316 and not 3, correct?
@YannicKilcher 4 years ago
True.
@nicolasPi_ 3 years ago
@@YannicKilcher It seems that c is not a constant and depends on d. Given their examples with d=20 and d=75, we get N>7 and N>10 respectively, which looks like quite a slow capacity increase, or did I miss something?
@davidhsv2 4 years ago
So, can the ALBERT architecture, with its parameter sharing, be described as a Hopfield network with 12 iterations? ALBERT is a single transformer encoder iterated 12 times.
@YannicKilcher 4 years ago
It's probably more complicated, because transformer layers contain more than just the attention mechanism
@jeanphilippe9141 2 years ago
Hey! Amazing video, love your work. I'm a beginner in all of this, but I have this question: can bringing up the number of dimensions of the problem lower the "perplexity" of the problem? Higher dimensions mean more information, meaning tighter or more specific "spheres" around a pattern. My guess is "yes", but sometimes the dimensions are fixed in a problem, so this way of lowering perplexity is impossible. Does the paper say anything about that, or do you have an educated guess on what the answer could be? :) If my question is stupid, just say so, I really don't mind! Thanks for any answer and thank you for your videos. I'm hoping to make this an activity for high school students to promote science, so thanks a lot!
@martinrenaudin7415 4 years ago
If queries, keys and values are of the same embedding size, how do you retrieve a pattern of a bigger size in your introduction?
@YannicKilcher 3 years ago
Good point. You'd need to change the architecture in that case.
@sacramentofwilderness6656 4 years ago
Concerning these spheres: do they span the whole parameter space, or are there some regions not belonging to any particular pattern? There were theorems claiming that the algorithm has to converge; in that case, does getting caught by a particular cluster depend on the initialisation of weights?
@YannicKilcher 4 years ago
Yes, they are only around the patterns. Each pattern has a sphere.
@pastrop2003 4 years ago
Isn't it fair to say that if we have one sentence in the attention mechanism, meaning that each word in the sentence is attending to the words from the same sentence, the strongest signal will always be from a word attending to itself, because in this case the query is identical to the key? Am I missing something here?
@charlesfoster6326 4 years ago
Not necessarily, in the case of the Transformer: for example, if the K matrix is the negative of the Q matrix, then the attention will be lowest for a position onto itself.
@pastrop2003 4 years ago
@@charlesfoster6326 True, although based on what I read on transformers, in the case of a single sentence K == Q. If so, we are multiplying a vector by itself. This is not the case when there are 2 sentences (a translation task is a good example of that). I haven't seen the case where K == -Q.
@charlesfoster6326 4 years ago
@@pastrop2003 I don't know why that would be. To clarify, what I'm calling Q and K are the linear transforms you multiply the token embeddings with prior to performing attention. So q_i = tok_i * Q and k_i = tok_i * K. Then q_i and k_i will only be equal if Q and K are equal. But these are two different matrices, which will get different gradient updates during training.
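A tiny numpy illustration of the point above (toy numbers, nothing from the paper): because W_Q and W_K are separate learned matrices, q_i = x_i W_Q and k_i = x_i W_K differ even within a single sentence, so a token need not attend most strongly to itself.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))     # 4 token embeddings, dimension 8

W_Q = rng.normal(size=(8, 8))        # two independently learned projections
W_K = rng.normal(size=(8, 8))

Q, K = tokens @ W_Q, tokens @ W_K
scores = Q @ K.T                     # (4, 4) attention logits
print(np.argmax(scores, axis=1))     # winners are generally not the diagonal (self) positions
```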
@zerotwo7319 1 year ago
Why don't they say attractor? Much easier than 'circles'.
@shabamee9809 4 years ago
Maximum width achieved
@tripzero0 4 years ago
Want to see attentionGAN (or op-GAN). Does attention work the same way in GANs?
@ArrakisMusicOfficial 4 years ago
I am wondering: how many patterns does each transformer head actually store?
@YannicKilcher 4 years ago
Good point. It seems that depends on what exactly you mean by "pattern" and "store".
@josephchan9414 4 years ago
thx!!
@rameshravula8340 4 years ago
Lots of math in the paper. Got lost in the mathematics portion. Got the gist of it, however.
@ruffianeo3418 1 year ago
What really bugs me about all "modern AI" "explanations" is that they do not enable you to actually code it. If you refer to one source, e.g. this paper, you are none the wiser. If you refer to multiple sources, you end up confused because they do not appear to describe the same thing. So, it is not rocket science, but people seem to be fond of making it sound like rocket science, maybe to stop people from just implementing it? Here are a few points that are not clear (at least to me) at all:
1. Can a modern Hopfield network (the one with the exp) be trained step by step, without (externally) retaining the original patterns it learned?
2. Some sources say there are 2 (or more) layers (a feature layer and a memory layer). This paper says nothing about that.
3. What are the methods to artificially "enlarge" a network if a problem has more states to store than the natural encoding of a pattern allows (2^(number of nodes) < number of features to store)?
4. What is the actual algorithm to compute the weights if you want to teach the network a new feature vector?
Both the paper and the video seem to fall short on all those points.
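On point 4 above: for the classical binary network, the recipe is the Hebbian outer-product rule, and patterns can be added one at a time without keeping the old ones around; the continuous network in this paper instead keeps the pattern matrix explicitly and retrieval is the softmax update, so there is no separate weight-learning step there. A minimal sketch of the classical case (my own naming, not from the paper):

```python
import numpy as np

def hebbian_store(W, xi):
    # Add one +/-1 pattern: the weights accumulate outer products,
    # so the original patterns need not be retained.
    W = W + np.outer(xi, xi)
    np.fill_diagonal(W, 0)
    return W

def recall(W, state, n_steps=10):
    # Repeated sign updates; at low load this settles on a stored pattern.
    for _ in range(n_steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

# Store three random patterns incrementally, then recall from a corrupted cue.
rng = np.random.default_rng(1)
d = 100
patterns = rng.choice([-1, 1], size=(3, d))
W = np.zeros((d, d))
for xi in patterns:
    W = hebbian_store(W, xi)
cue = patterns[0].copy()
cue[:10] *= -1                                        # flip 10 bits
print(np.array_equal(recall(W, cue), patterns[0]))    # usually True at this low load
```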
@004307ec 4 years ago
As an ex-PhD student in neuroscience, I am quite interested in such research.
@FOLLOWJESUSJohn316-n8x 4 years ago
😀👍👊🏻🎉
@444haluk 3 years ago
This is dumb. Floating-point numbers are already represented with 32 bits. THEY ARE BITS! The beauty of Hopfield networks is that I can change every bit independently of the other bits to store a novel representation. If you multiply a floating-point number by 2, the bits all shift to the left; you just killed many types of operations/degrees of freedom due to linearity. With 10K bits I can represent many patterns, FAR more than the number of atoms in the universe. I can represent far more with 96 bits than with 3 floats. This paper's network is a very narrow-minded update to the original network.
@conduit242 3 years ago
So kNN is all you need
@quebono100 4 years ago
Subscribers to the moon!
@tanaykale1571 4 years ago
Hey, can you explain this research paper: CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning (arxiv.org/abs/1903.02351)? It is related to image segmentation. I am having a problem understanding this paper.
@HunteronX 4 years ago
Too fast, haha.