Week 7 - Lecture: Energy based models and self-supervised learning 

Alfredo Canziani
39K subscribers · 31K views
Published: 28 Aug 2024

Comments: 36
@TheRohit901 · 1 year ago
Thank you so much for making these videos publicly available. This is a wealth of knowledge here for free, I'm grateful to you. Prof Yann LeCun has really good clarity on these concepts.
@alfcnz · 1 year ago
❤️❤️❤️
@jonathansum9084 · 4 years ago
Your work is one of the greatest, because it shares these incredible tools with people. Thank you!
@alfcnz · 4 years ago
Glad you like them!
@jwc7663 · 3 years ago
Is there any relationship between energy-based models and reinforcement learning? According to "If you want to combine the output of an automated system with another one (for example, a human, or some other system), and these systems haven't been trained together, but rather they have been trained separately, then what you want are calibrated scores so that you can combine the scores of the two systems and make a good decision," it looks like energy-based models use more than one system to make a decision, whereas RL uses just one integrated system?
@anmolmonga1933 · 3 years ago
I really liked the model. I didn't get what Yann meant by energy-based models being Bayesian. It seems that even energy-based models are MLE; I don't see how they take the prior of the dataset into account when making a prediction.
@vijolagoga3401 · 3 years ago
At 58:50 the falling pen example is introduced. The dataset is just one x and possibly an infinite number of y. Of course this is not even a function, f(x) = {the whole y set}, so finding an approximator will fail. But if we live in classical mechanics there is also an infinite number of x: the x has (phi, theta), polar and azimuthal angles, and by knowing them we can make better predictions for y. The problem is now well posed; a unique solution exists. Even if it is a quantum-mechanical pen, we know that x has a quantum state vector (possibly/probably unknown to us), and by knowing it we can get the final state y after calculating its probability density. In both cases we are not representing the input (state) x well. The problem is begging for a z or two. So, from this comment and the lecture, I am coming to the conclusion that the hidden variable describes the state, and the neural network (or a complex chain of reasoning and functions) represents the rules of the system's evolution in time. Is this correct? Can this be a possible interpretation of energy-based models?
@alfcnz · 3 years ago
There is a single x, which corresponds to the initial condition of the system: a pendulum in its unstable equilibrium state, to which a small perturbation is applied in order to break the symmetry. The outcome, y, is unpredictable if the perturbation is not observable. Of course, if x were completely observable, the problem wouldn't occur.
@tominikolla2699 · 3 years ago
If we introduce a new variable z, then after training z would ideally be in {left, right}, if x is in the unstable equilibrium and z exerts no influence at all on the system. LeCun used the phrase "z would have the complementary variable". EBMs look very fascinating. I have another, rather funny, thought experiment: if you are the intelligent robot and, on the left, three students are running toward you shouting questions from a distance while, on the right, there is a big piece of your favourite cake, then this EBM-driven robot would get the minimum energy if you turn right. Turning left would also be a possibility, but a rather remote one (i.e. higher energy). So z would carry some complementary information about the environment (shouting students left, favourite cake right) but also some internal stuff (you like cake more than questions). Thank you for the videos. Very fascinating!
@alfcnz · 3 years ago
Keep in mind z is *not* trained / learnt! z is **inferred** once the outcome y is observed and it contributes to explaining how y arose from x.
@tominikolla2699 · 3 years ago
You are right. I forgot that z is not learned in this model. How unfortunate :p
@alfcnz · 3 years ago
😒😒😒
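To make the "z is inferred, not learnt" point concrete, here is a minimal sketch in PyTorch. The quadratic energy, the toy mapping from (x, z) to y, and all names are hypothetical illustrations of the idea discussed in this thread, not the lecture's actual model.

```python
import torch

# Hypothetical toy energy E(x, y, z): how well does (x, z) explain the observed y?
# The "decoder" mapping (x, z) -> y is a fixed toy function, not a trained network.
def energy(x, y, z):
    y_hat = x + z                     # toy model: outcome = initial condition + perturbation
    return ((y - y_hat) ** 2).sum()   # squared reconstruction error as the energy

x = torch.tensor([0.0])               # observed initial condition (pen upright)
y = torch.tensor([1.0])               # observed outcome (say, the pen fell to the right)

# Inference: z is *not* a learned parameter. For this one observation we search
# for the z that minimises the energy, i.e. the perturbation that best explains
# how y arose from x.
z = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([z], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    E = energy(x, y, z)
    E.backward()
    opt.step()

print(z.detach())   # ≈ 1.0: the inferred latent "explains" the observed outcome
```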
@mahdiamrollahi8456 · 1 year ago
What is the difference between a loss function and an energy function? With a loss function we try to reduce the distance between prediction and target. As I understood it, with an energy function we are trying to match x and y in E(x, y).
@alfcnz · 1 year ago
Nope. The loss tells you how bad a given parametrisation is. We minimise the loss to find good weights, which in turn give us a well-behaved energy function and/or predictability.
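A minimal sketch of that distinction, using a made-up regression energy E_w(x, y) = ‖y − f_w(x)‖² (the model, data, and names are illustrative only): the energy scores a specific (x, y) pair under the current weights, while the loss averages energies over the dataset and is what training minimises to find good weights.

```python
import torch

# Toy example: energy scores a (x, y) pair; loss scores the weights over a dataset.
model = torch.nn.Linear(1, 1)

def energy(x, y):
    # E_w(x, y): low when y is compatible with x under the current parametrisation
    return ((y - model(x)) ** 2).sum(dim=-1)

x = torch.randn(32, 1)
y = 3 * x + 0.1 * torch.randn(32, 1)     # synthetic data

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = energy(x, y).mean()           # loss: how bad are the weights, on average
    loss.backward()
    opt.step()

# Inference with the trained energy: pick the y with the lowest energy for a given x.
# For this quadratic energy the argmin over y is simply the model's prediction.
x_new = torch.tensor([[2.0]])
print(model(x_new).item())               # ≈ 6.0
```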
@JTMoustache · 2 years ago
Graphical models, Markov networks in particular, allow us to decompose the joint distribution as a product of energy factors... not a sum.
@alfcnz · 2 years ago
Or the sum for the joint log-distribution 😉
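A tiny numerical check of that exchange, using two hypothetical pairwise energy tables for a toy Markov network over two binary variables: the joint is a product of factors exp(−Eᵢ), which is the same thing as the exponential of the negative sum of the energies, so the log-joint decomposes as a sum.

```python
import numpy as np

# Hypothetical energies of two factors over (y1, y2), each variable binary.
E1 = np.array([[0.0, 1.0], [1.0, 0.0]])
E2 = np.array([[0.5, 0.0], [0.0, 0.5]])

unnormalised = np.exp(-E1) * np.exp(-E2)               # product of factors
assert np.allclose(unnormalised, np.exp(-(E1 + E2)))   # = exp(-sum of energies)

Z = unnormalised.sum()                  # partition function
log_p = -(E1 + E2) - np.log(Z)          # log-joint is a (negative) sum of energies
print(np.exp(log_p).sum())              # sums to 1
```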
@faizanshaikh5326 · 4 years ago
Thanks for uploading! There's a small spelling mistake in the video name - should be "self supervised"
@alfcnz · 4 years ago
Fixed, thanks!
@ChiTran-xd1pg · 3 years ago
In the k-means problem, as in coordinate gradient descent, we initialise the centroids, find the x's near each one, and update the centroids by computing the average. In the EBM formulation, I understood that the latent z denotes the classes, and I think the energy here is the distance from the points to the centroids. I assume W is a d × k matrix, where d is the dimension of the data (W is like the k centroids), and y is the data points? But it's still a little bit ambiguous. I hope you reply with a clear explanation: what is y (y1, y2)? Thanks.
@alfcnz · 3 years ago
I don't understand the question. Also, you need to add a timestamp, or I have no idea of what part of the video you're talking about.
@ChiTran-xd1pg · 3 years ago
@alfcnz 1:21:24 — can you explain the k-means inference in EBM form here?
@alfcnz · 3 years ago
@ChiTran-xd1pg As the formula states, you check which centroid is the closest and return the squared Euclidean distance to it. That's how inference is performed. Is this what you were asking? I don't really understand the first message. y is the data we want to learn. In this case y ∈ ℝ², so y = (y₁, y₂), and they form a spiral.
@ChiTran-xd1pg · 3 years ago
@alfcnz So y is the data and z denotes the k centroids; then what exactly is W (and how do we get it)?
@ChiTran-xd1pg · 3 years ago
@alfcnz Correct me if I'm wrong, please: as you said, y is the data and the latent variable z is a one-hot vector denoting the cluster, so the prototype W here is a 2 × k matrix, Wz is the current centroid, and the energy is a function of y − Wz?
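Putting the thread's notation together, here is a minimal sketch of k-means inference in EBM form, assuming 2-D data, a prototype matrix W of shape 2 × k whose columns are the centroids, and a one-hot latent z, so the free energy is min over z of ‖y − Wz‖², i.e. the squared distance to the closest centroid. W is random here purely for illustration; in practice it would come from the k-means training loop.

```python
import numpy as np

# k-means in EBM form, following the thread's notation:
# y ∈ R² is a data point, z is one-hot over k clusters, W is 2 × k with the
# centroids as columns, and E(y, z) = ||y - W z||².
rng = np.random.default_rng(0)
k = 5
W = rng.standard_normal((2, k))          # k centroids as columns (learned elsewhere)

def free_energy(y, W):
    # Inference: minimise over the latent z, i.e. pick the closest centroid
    # and return the squared Euclidean distance to it.
    dists = ((y[:, None] - W) ** 2).sum(axis=0)   # ||y - w_j||² for each column j
    z = np.eye(k)[dists.argmin()]                 # one-hot latent
    return dists.min(), z

y = np.array([0.3, -1.2])
E, z = free_energy(y, W)
print(E, z)     # energy is low when y sits near one of the prototypes
```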
@Epistemophilos · 2 years ago
40:52 A probabilistic model could have p(x, y) be, for example, a mixture of Gaussians. I don't see how the infinitely thin manifold comes into play then: for every value of (x, y) there would be a non-zero probability density. Even if we have a p(y|x) model, there are non-zero densities for any y, for any x, and p(y|x) could even be multi-modal. Again, where do the Dirac functions come into it? I'm obviously missing something.
@robmcadam9393 · 2 years ago
Because if we have an infinitely expressible function and are doing maximum likelihood inference from finite samples, we'll end up with a non-smooth manifold that spikes at the data points and is zero everywhere else.
@Epistemophilos · 2 years ago
@robmcadam9393 Thanks. Why would a probabilistic model necessarily need to be infinitely expressible?
@robmcadam9393 · 2 years ago
@Epistemophilos The problems arise in a non-parametric context or, in theory, with infinitely expressible parametric models. If you have a parametric model, you are maximizing the likelihoods *subject to the constraint* that the energy function maps onto some distribution of choice.
@Epistemophilos · 2 years ago
@robmcadam9393 Well, you would normally optimize or integrate over some hyperparameters as well, right? For example, a GP model would not normally assign infinite density to the training data.
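A small numerical illustration of the point being debated, assuming a Gaussian kernel density estimate stands in for the "infinitely expressible" model: as the bandwidth shrinks, the training log-likelihood grows without bound, so an unconstrained maximum-likelihood fit drifts towards a mixture of near-Dirac spikes at the data points.

```python
import numpy as np

# Toy illustration: an unconstrained (non-parametric) density fitted by maximum
# likelihood wants to pile all its mass on the training points. With a Gaussian
# KDE, shrinking the bandwidth h keeps increasing the training likelihood.
rng = np.random.default_rng(0)
data = rng.standard_normal(20)

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde_train_log_likelihood(data, h):
    # Density of each training point under the KDE built from the same data.
    dens = gaussian((data[:, None] - data[None, :]) / h).mean(axis=1) / h
    return np.log(dens).sum()

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, kde_train_log_likelihood(data, h))
# The training log-likelihood diverges as h -> 0: the "best" unconstrained density
# is spiky at the data points and (nearly) zero everywhere else.
```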
@faizanshaikh5326 · 4 years ago
The speaker mentions that transformer-like models fail to give a competitive performance on CV tasks. But then in the recent DETR model, what changed?
@alfcnz · 4 years ago
DETR was released on 26 May 2020 (see twitter.com/alfcnz/status/1265487916547084291 for more info); this video was recorded on 16 Mar 2020. Moreover, next time include the minute:second of where something is said, otherwise I cannot double-check.
@faizanshaikh5326 · 4 years ago
@alfcnz My bad, sorry about that. The sentence starts at 53:40, and around 54:02 the speaker mentions "the results have been extremely disappointing". I get that DETR is fairly recent; I just wanted to understand the idea/innovation in DETR that hasn't been done before (but maybe I should just read the paper then 😅).
@snippletrap · 4 years ago
He is talking about the "fill in the blanks" method of representation learning, not transformer models in particular. The dominant CV model has been the CNN, which aggregates local features. Transformers, by contrast, apply global attention. Aside from DETR there is also Image GPT.
@yassineah9874 · 4 years ago
Which course is this, and which year? Thanks a lot.
@alfcnz · 4 years ago
Deep Learning? We finished teaching it last week. So 2020, I'd say.