
How do you minimize a function when you can't take derivatives? CMA-ES and PSO 

Serrano.Academy
155K subscribers · 8K views
Published: 16 Sep 2024

Comments: 31
@chyldstudios 2 years ago
Very clear explanation of these two optimization algorithms. Well done!
@SerranoAcademy 2 years ago
Thank you, glad you like it! :)
@cesarkadirtorricovillanuev9761
Very good video, a very clear and intuitive explanation. Keep it up! Greetings from Bolivia.
@Todorkotev 2 years ago
Luis, yet another awesome video! Thanks to you, I've learned something new today! The step-by-step visualization with the Gaussian evolution of candidates is epic: super helpful and eye-opening! Thank you!
@SerranoAcademy 2 years ago
Thank you so much Boyko, I’m glad you enjoyed it! :)
@SerranoAcademy 2 years ago
And thank you so much for your contribution! It's very kind.
@luisvasquez5015 1 year ago
Fantastic video! Can't wait for the third part of the series!
@shazajmal9695 5 months ago
Thanks a lot for this super informative lecture! Could you please make one on Genetic Algorithms? 😅
@RK-TKLINK 8 months ago
The best explanation I've found, thank you!
@imadsaddik 2 years ago
Thanks, the explanation was crystal clear!
@omarmohy3975 1 year ago
Amazing video, really. I also wish you could explain it with more math and some coding; a second part would be amazing.
@ZaCharlemagne 8 months ago
CMA-ES is pretty similar to CEM (the Cross-Entropy Method).
@elie_ 1 year ago
Thanks a lot, very good introductory video! One question on PSO: is the size of the steps fixed, or proportional to (respectively) the current "speed", the distance to the personal best, and the distance to the group best? Thanks
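For context on the question above, here is a minimal sketch of the textbook PSO update (the inertia-weight form), assuming Python/NumPy; the coefficient values and names are illustrative defaults, not taken from the video. The steps are not fixed: the three terms are proportional to the current velocity, the distance to the personal best, and the distance to the group best, each scaled by a fixed coefficient and a fresh random factor.

```python
import numpy as np

def pso_step(x, v, personal_best, group_best, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One textbook (inertia-weight) PSO update for a single particle."""
    rng = rng if rng is not None else np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    # Three terms: inertia, pull toward the particle's own best position,
    # and pull toward the best position found by the whole swarm.
    v_new = w * v + c1 * r1 * (personal_best - x) + c2 * r2 * (group_best - x)
    return x + v_new, v_new

# Toy usage with made-up positions:
x, v = np.zeros(2), np.zeros(2)
x, v = pso_step(x, v, personal_best=np.array([1.0, 0.5]), group_best=np.array([2.0, -1.0]))
```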
@user-wr4yl7tx3w 2 years ago
So well explained, that is amazing. I wonder what the downsides of such a particle swarm could be.
@SerranoAcademy 2 years ago
Thanks! Great question. The exact same problems can happen with PSO: it can get stuck at a local minimum just like CMA-ES, and the ways to overcome that are the same.
@centscents 2 years ago
Do you have a citation for this method? Thanks.
@Mohammad-Arafah 2 years ago
I wish you would make an in-depth statistics course, or write a book on the subject.
@SerranoAcademy 2 years ago
Thank you! I'm building a course on that; hopefully it'll be out in the next few months. I'll announce it on the channel when it's ready! :)
@Mohammad-Arafah 2 years ago
@SerranoAcademy That's great news. I can help you with practical labs; I'm a PhD researcher in generative modelling.
@brunorcabral 2 years ago
Great explanation. Where can I get the math formulas?
@SerranoAcademy 2 years ago
Thanks! Here's a huge repository of CMA-ES info, code, tutorials, etc. For PSO I haven't found as much, so mostly Wikipedia.
@floydmaseda 2 years ago
How fast is it? If I train a neural net (which we know how to compute the gradient of) with CMA-ES or PSO, will it take longer to converge? I would imagine PSO in particular is pretty slow, maybe only slightly better than a Monte Carlo approach: you're basically doing a line-search algorithm without the advantage of knowing that the direction you're moving in is a descent direction. CMA-ES, on the other hand, might be reasonable?
@SerranoAcademy 2 years ago
Great question! I haven't used it for neural networks; I imagine gradient descent is better there. I've used CMA-ES and PSO for quantum neural networks, where derivatives are hard to get, and I've noticed that CMA-ES tends to work better. Not so much in speed, but in actually finding minima and not getting stuck. That's where I think the gains are: the randomness in CMA-ES lets it explore more of the space than a gradient-based algorithm that only takes small steps. I think a good mix of gradient-based and gradient-free methods is the best combination in the end.
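To illustrate the kind of derivative-free loop being discussed, here is a heavily simplified, CMA-ES-flavoured sketch in Python/NumPy: sample candidates from a Gaussian, keep the best few, and refit the mean and covariance from them. Real CMA-ES adds step-size control and rank-one/rank-mu covariance updates; the objective and parameter values below are stand-ins, not anything from the video.

```python
import numpy as np

def simplified_cmaes(f, mean, sigma=1.0, pop=20, elite=5, iters=100, seed=0):
    """Toy CMA-ES-flavoured loop: sample from a Gaussian, keep the best
    candidates, and refit the mean and covariance from them."""
    rng = np.random.default_rng(seed)
    dim = mean.size
    cov = np.eye(dim)
    for _ in range(iters):
        samples = rng.multivariate_normal(mean, sigma**2 * cov, size=pop)
        scores = np.array([f(x) for x in samples])
        best = samples[np.argsort(scores)[:elite]]               # lowest values (minimization)
        mean = best.mean(axis=0)                                  # move the search distribution
        cov = np.cov(best, rowvar=False) + 1e-6 * np.eye(dim)     # adapt its shape
    return mean

# Stand-in objective (not from the video): quadratic bowl with minimum at (3, -2).
f = lambda x: (x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2
print(simplified_cmaes(f, mean=np.zeros(2)))
```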
@floydmaseda 2 years ago
@SerranoAcademy A common strategy when training a model is to reduce the learning rate of gradient descent when the loss is no longer decreasing, to see if we're bouncing around inside a local minimum without descending further. I wonder if trying an iteration or two of CMA-ES at those times might sometimes let us jump to nearby local minima that are deeper but couldn't be reached with any gradient-based approach. Another use might be initialization, which is often just random: doing a few iterations of CMA-ES at the start of training and picking the best of, say, 5 candidates might shoehorn the network into a better minimum than a single random initialization.
@zyzhang1130 2 years ago
Question: you don't have a gradient, so how do you know if CMA-ES has reached a local minimum?
@SerranoAcademy 2 years ago
Great question! You can notice that after several iterations you keep getting generations that don't improve your minimum, or improve it only very slightly. Then you assume you're at a local minimum.
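That stopping rule can be written down in a few lines; the patience and tolerance values below are illustrative choices, not from the video.

```python
def has_stagnated(best_per_generation, patience=10, tol=1e-8):
    """True if the best objective value has not improved by more than tol
    over the last `patience` generations (a common derivative-free
    stopping heuristic)."""
    if len(best_per_generation) <= patience:
        return False
    return best_per_generation[-patience - 1] - min(best_per_generation[-patience:]) <= tol
```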
@zyzhang1130 2 years ago
@SerranoAcademy So it's similar to the convergence analysis of gradient descent. Thank you for your reply 😁
@java2379 1 year ago
Thanks. This is just a guess, but I doubt this method is efficient in higher dimensions, for the following reason. Suppose we had to take 5 points at random in 2D; taking the square root gives roughly how many points you need per dimension, about 2. So with 10 dimensions I would need 2^10 = 1024 points. But I think ML involves many more dimensions; it's basically the number of weights in a multi-layered network. Say 3 layers of fully connected neurons: 100*100, so about 2^10000. 1 gigabyte of RAM is 2^30, so there's no way to apply this. Am I wrong somewhere? :-)
@elie_ 1 year ago
The idea of such methods is to optimize functions when gradient descent isn't available because the objective isn't differentiable. In the case you mention, the network optimization is done in the usual way (SGD, Adam, ...), and what you look at is, say, the loss after a fixed number of epochs for a given set of hyperparameters (learning rate, beta coefficients, etc.), which are far less numerous. Then you reiterate the process, using CMA-ES/PSO solely on those hyperparameters.
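To make that recipe concrete, here is a small self-contained sketch: the search runs over just two hyperparameters (log learning rate and a beta coefficient), using a plain evolution strategy rather than full CMA-ES or PSO, and `train_and_validate` is a hypothetical placeholder whose body is a dummy loss surface instead of an actual training run.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_validate(log_lr, beta1):
    """Hypothetical placeholder: in practice this would train the network
    with SGD/Adam for a few epochs using these hyperparameters and return
    the validation loss. Here it is a dummy surface with its minimum near
    log_lr = -3, beta1 = 0.9."""
    return (log_lr + 3.0) ** 2 + 10.0 * (beta1 - 0.9) ** 2

# Plain evolution strategy over the two hyperparameters only; the network's
# weights would still be trained by gradient descent inside train_and_validate.
mean, sigma = np.array([0.0, 0.5]), 0.5
for _ in range(30):
    candidates = mean + sigma * rng.standard_normal((20, 2))
    losses = [train_and_validate(*c) for c in candidates]
    mean = candidates[np.argsort(losses)[:5]].mean(axis=0)  # keep the 5 best, recentre
print("best (log_lr, beta1) found:", mean)
```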
@dvunkannon 3 months ago
For a "guarantee" of convergence, yes, you need a lot of points and a lot of function evaluations. In practice, though, you can avoid scaling the number of points that strictly with the number of dimensions: ES algorithms can easily solve 30-dimensional problems with a population size of 100, not 2^30, for example.