
The Forward-Forward Algorithm 

EscVM
In the paper "The Forward-Forward Algorithm: Some Preliminary Investigations", Geoffrey Hinton introduced a new approach to training neural networks that uses two forward passes instead of the traditional forward and backward passes of backpropagation. Let's see it together.
An episode of AIQuickie with PyTorch 1/2.x code.
▬ Contents of this video ▬▬▬▬▬▬▬▬▬▬ 👀
0:00 - Intro to "The Forward-Forward Algorithm"
1:37 - Backpropagation Algorithm
5:00 - The Forward-Forward Algorithm
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Link to the Notebook: github.com/EscVM/EscVM_YT/blo...
Link to the paper: www.cs.toronto.edu/~hinton/FF...
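Here is a minimal PyTorch sketch of the idea described above (an illustration only, not the linked notebook's code; layer sizes, threshold, and learning rate are placeholder values). Each layer runs two forward passes per step, one on positive data and one on negative data, and is trained with its own local objective that pushes the sum of squared activations above a threshold for positive samples and below it for negative ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One fully connected layer trained with its own local objective."""

    def __init__(self, in_features, out_features, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Pass on only the direction of the previous layer's activity,
        # then apply the usual affine map and ReLU.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activations, one scalar per sample.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push goodness above the threshold on positive data,
        # below the threshold on negative data.
        loss = (F.softplus(self.threshold - g_pos) +
                F.softplus(g_neg - self.threshold)).mean()
        self.opt.zero_grad()
        loss.backward()  # gradients never leave this layer
        self.opt.step()
        # Detach the outputs so the next layer trains on inputs only,
        # with no gradients flowing back through this layer.
        with torch.no_grad():
            return self.forward(x_pos), self.forward(x_neg), loss.item()

# Two forward passes per step, layer by layer: one on positive (real) data
# and one on negative (corrupted) data. No backward pass through the network.
layers = [FFLayer(784, 500), FFLayer(500, 500)]
x_pos = torch.rand(64, 784)  # placeholder: real samples (e.g. image + true label)
x_neg = torch.rand(64, 784)  # placeholder: samples paired with a wrong label
for layer in layers:
    x_pos, x_neg, layer_loss = layer.train_step(x_pos, x_neg)
```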

Published: 1 Aug 2024

Comments: 26
@williamrich3909 · 1 year ago
Great video, and thanks for the code!
@erickmarin6147 · 1 year ago
This is really concise and well put, thanks.
@onafets38 · 1 year ago
fast and clear, well done!
@escvm · 1 year ago
Thank you! As soon as I have time, I'll also try to add a video about the FF Alg PyTorch code.
@_stition9777 · 1 year ago
Hey, cool video. You should put a de-esser (type of compression) on your voice. It will make your "s" sounds more consistent with other consonants
@escvm · 1 year ago
Thank you for the tip! I'll try that with the next video.
@josephbrewington6729 · 1 year ago
Does the natural exponent in our cost function (theta - positive dot product) + (negative dot product - theta) not simplify to (negative dot product - positive dot product)? I am under the impression that theta is used as a threshold hyperparameter, and in this case I fail to see why theta is even necessary. Am I misunderstanding something here?
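For what it's worth, if the loss has the softplus form log(1 + exp(·)) that many Forward-Forward implementations use, the threshold does not drop out, because each term passes through a nonlinearity before the sum; it would only cancel if the two terms were added linearly. A quick numerical check, assuming that softplus form:

```python
import torch
import torch.nn.functional as F

g_pos = torch.tensor(5.0)   # goodness of a positive sample
g_neg = torch.tensor(1.0)   # goodness of a negative sample
for theta in (0.0, 2.0, 9.0):
    linear_sum = (theta - g_pos) + (g_neg - theta)                      # theta cancels here
    softplus_sum = F.softplus(theta - g_pos) + F.softplus(g_neg - theta)
    print(theta, linear_sum.item(), softplus_sum.item())
# The linear sum is identical for every theta; the softplus loss is not,
# so the threshold still matters once the nonlinearity is applied.
```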
@demonslayer1162 · 1 year ago
How is the predicted probability distribution of the final layer learned through this method for multi-class classification problems? e.g. for a three class problem, do you just do three inferences and see which one has the longest activation vector at the final layer?
@dansplain2393 · 1 year ago
I think just a linear classifier, of which there are many. It does feel like a bit of a cop-out to do all this work to learn a representation only to have to do some ML on top of it all to actually make predictions, but at least that model is a simple one that can fit without back propagation of gradients through a deep network.
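The paper also describes the scheme the question hints at: run one forward pass per candidate label, with that label embedded in the input, and pick the label whose passes accumulate the highest goodness. A rough sketch of that inference loop (the `layers` are per-layer modules like the ones sketched above; `embed_label` is a hypothetical helper, not from the notebook):

```python
import torch

def embed_label(x, label, num_classes=10):
    # Hypothetical helper: overwrite the first `num_classes` pixels of each
    # flattened image with a one-hot encoding of the candidate label.
    x = x.clone()
    x[:, :num_classes] = 0.0
    x[:, label] = 1.0
    return x

@torch.no_grad()
def predict(layers, x, num_classes=10):
    goodness_per_label = []
    for label in range(num_classes):
        h = embed_label(x, label)
        goodness = torch.zeros(x.shape[0])
        for layer in layers:
            h = layer(h)                      # forward pass only
            goodness += h.pow(2).sum(dim=1)   # accumulate goodness across layers
        goodness_per_label.append(goodness)
    # The label whose pass produced the highest total goodness wins.
    return torch.stack(goodness_per_label, dim=1).argmax(dim=1)
```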
@user-dc1np4hg2y · 1 year ago
Where did the cost function at 06:15 come from? I don't think I've seen it in the paper.
@escvm · 1 year ago
That's a great question! I started by implementing Hinton's one, but then, looking at other GitHub implementations, I found that one, which showed faster convergence. However, if you think about it, it's only a slight modification of Hinton's sum of the squared activities. I'll highlight the difference better in the upcoming video.
@user-ed5go9kq8e · 1 year ago
Thank you for your video! I have a question about a detail: what do you think about the concept of the loss function in this paper? I think this cost function lets the network learn a kind of relation between positive and negative samples w.r.t. a threshold, but I don't exactly understand the goal of this cost function.
@escvm · 1 year ago
Hi! The idea behind the loss proposed by Hinton is simple: make the neurons of a layer react to positive samples and stay neutral to negative ones, looking only at their inputs. If you think of binary classification, this is simpler to grasp. Indeed, if you have only two classes, a positive one and a negative one, and you can get the last layer of the DNN to "activate" on positive samples, you can easily classify your inputs. Is it clearer?
@user-ed5go9kq8e · 1 year ago
@escvm In other words, the loss was configured in a way that only positives are activated. Thank you, it's clearer now.
@vipulbhadani4554 · 1 year ago
Thank you, sir, for explaining it. I have read this paper and also went through your code. I want to apply FF to my current groundwater modelling research. Can you please help me?
@escvm · 1 year ago
Hi! Yes sure! What can I do for you?
@palashchhabra7230 · 1 year ago
Why did we take a threshold of 9 in the notebook?
@escvm · 1 year ago
Hi! That's a good question! I'll try to cover it in the coding session. Anyway, the threshold is a hyperparameter of the algorithm, so it needs to be tuned for each problem addressed. The threshold affects the loss and how large the sum of squared activations needs to be to minimize the overall loss. You can try to play with it, and you should notice that its most evident effect is on convergence speed.
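To make the convergence-speed point above concrete: with a softplus-style loss, the gradient pulling a positive sample's goodness upward has magnitude sigmoid(threshold - goodness), so a higher threshold keeps that pressure strong for longer. A toy illustration, assuming that loss form:

```python
import torch

# d/dg_pos softplus(theta - g_pos) = -sigmoid(theta - g_pos): the further the
# positive goodness sits below the threshold, the harder it is pushed up.
g_pos = 5.0
for theta in (2.0, 5.0, 9.0):
    push = torch.sigmoid(torch.tensor(theta - g_pos))
    print(f"threshold={theta}: upward pressure on positive goodness = {push.item():.3f}")
```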
@Hasstenichgesehn1 · 1 year ago
Why can merging a label into the image provide a positive or negative sample at all? Why should a layer care about an artificially created part of the image that isn't part of the actual image, how does it "know" it's supposed to be the label? I'll keep looking for an explanation in other places, but it would still be nice if you had an explanation for me :-)
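For context, in the paper's MNIST experiments the label is written into a reserved part of the input (the first 10 pixels become a one-hot label), so the first layer sees image and label through the same weights. A positive sample pairs an image with its true label, a negative sample pairs the same image with a wrong label, and only the correct pairings are pushed toward high goodness. A rough sketch of that pairing (helper and tensor names are placeholders, not the notebook's):

```python
import torch

def overlay_label(images, labels, num_classes=10):
    # Replace the first `num_classes` pixels of each flattened image
    # with a one-hot encoding of the given label.
    x = images.clone()
    x[:, :num_classes] = 0.0
    x[torch.arange(x.shape[0]), labels] = 1.0
    return x

images = torch.rand(64, 784)               # placeholder flattened MNIST batch
labels = torch.randint(0, 10, (64,))       # true labels
x_pos = overlay_label(images, labels)      # positive: image + its true label
wrong = (labels + torch.randint(1, 10, (64,))) % 10   # always a different label
x_neg = overlay_label(images, wrong)       # negative: same image + wrong label
```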
@timmygilbert4102 · 1 year ago
Sounds like a diffusion algorithm.
@escvm · 1 year ago
Hi! I'm curious: why are you suggesting that? :)
@timmygilbert4102 · 1 year ago
@escvm Have you seen the work of that guy who did an experiment in which he used false positives of pre-neural face detection, then averaged them over a million data points, which revealed an average human face? I have this inside joke that all neural nets are really averages, noise, and relaxation. Given that trained networks are sparse, we aren't correctly exploiting the actual capacity of the network. That experiment, and the relative efficiency of simple bag-of-words and ngram models at extracting some degree of high-level semantics, proves that there is something more fundamental in the frequency distribution of the dataset, something we can probably extract while bypassing hyperparameter guessing. That's the context of the joke; now, I said diffusion because, like with other training, you set the neural weights to random, then, using a training sample, you try to remove the noise by relaxing the weights around the sample activation. Given all samples, that kind of sounds like a convoluted way to do averages. Alex J Champagnard did some very early experiments with, if my memory is exact, what are called extreme neural networks, that is, using the network without training it; for image generation you could retrieve images with features similar to the original. The conclusion was that the architecture probably had a bigger impact than the training, in some way, and that's my other joke: network weights are really just attention mechanics, we just got better at designing the architecture. The last observation is that, when you look at transformer weight activations, they try really hard to undo some of the data we put in the embedding, like removing the offset of the positional encoding. Given that the network activations are already sparse, it's funny to see that most activations, at least in early layers, are all about finding data we already know but obfuscated, instead of passing it directly. I feel like there is a solution that would be simpler than training networks, closer to Huffman encoding but taking into account the insights we got from neural networks, that is, they're averages but also (hierarchical) differentiations of the statistical spread, which hold all the relevant information without training. After all, self-attention is a kind of ngram, and it's used to select the right average in the FF layer. We wouldn't need blind walks in hyperdimensional noise. The data is all you need. I don't know if that makes sense.
@adsick_ua · 1 year ago
backprop formulas at 4:00 are extremely confusing - a lot of (unexplained) terms and hard math notation. (I am a coder, not a PhD)
@torcher5023 · 1 year ago
Imagine doing machine learning and being confused by backprop.
@escvm · 1 year ago
Hi @adsick5014, I'm sorry to hear that. Yes, I agree that sometimes formulas and jargon can make simple concepts hard to grasp. Nevertheless, it is also hard to find the right compromise between being too superficial and being too rigorous. Anyway, I think that if there's anyone good at this, it's 3Blue1Brown, who has a super nice series on backprop. I suggest you take a look at it: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-aircAruvnKk.html
@dansplain2393 · 1 year ago
Andrej Karpathy has some extremely instructive videos on the nuts and bolts of backprop.