Luis, yet another awesome video! Thanks to you, I've learned something new today! The step-by-step visualization with the Gaussian evolution of candidates is epic - super helpful and eye-opening! Thank you!
Thanks a lot, very good introductory video! One question on PSO: is the size of the steps fixed, or proportional to (respectively) the current "speed", the distance to the personal best, and the distance to the group best? Thanks!
Thanks! Great question. In standard PSO the steps are proportional: each particle's velocity update combines its current velocity, its distance to its personal best, and its distance to the group best, each scaled by a coefficient (with some randomness mixed in). And the exact same problems can happen with PSO as with CMA-ES: it can get stuck at a local minimum, and the ways to overcome that are the same.
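For reference, here's a minimal sketch of the standard PSO update on a toy 2-D objective. It shows that each step is proportional to the current velocity and to the two distances; the coefficient values (w, c1, c2) are common illustrative choices, not taken from the video:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                         # toy objective: 2-D sphere, minimum at 0
    return np.sum(x**2, axis=-1)

n, dim = 20, 2
x = rng.uniform(-5, 5, (n, dim))  # particle positions
v = np.zeros((n, dim))            # particle velocities ("speeds")
pbest = x.copy()                  # each particle's personal best
gbest = x[np.argmin(f(x))]        # group best

w, c1, c2 = 0.7, 1.5, 1.5         # inertia, cognitive, social coefficients
for _ in range(100):
    r1 = rng.random((n, dim))
    r2 = rng.random((n, dim))
    # Each step is proportional to: the current velocity (inertia term),
    # the distance to the personal best, and the distance to the group best.
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    improved = f(x) < f(pbest)
    pbest[improved] = x[improved]
    gbest = pbest[np.argmin(f(pbest))]

print(f(gbest))  # small, near the minimum at the origin
```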
How fast is it? If I train a neural net (which we know how to compute the gradient of) with CMA-ES or PSO, will it take longer to converge? I would imagine PSO in particular is pretty slow, maybe only slightly better than a Monte Carlo approach. You're basically doing a line-search algorithm without the advantage of knowing that the direction you're moving in is a descent direction. CMA-ES, on the other hand, might be reasonable?
Great question! I haven't used them on neural networks; I imagine that gradient descent is better there. I've used CMA-ES and PSO for quantum neural networks, where derivatives are hard to get, and I've noticed that CMA-ES tends to work better. Not so much in speed, but in finding minima without getting stuck. That's where I think the gains are: the randomness in CMA-ES lets it explore more of the space than a gradient-based algorithm that only takes small steps. In the end, I think a combination of gradient-based and gradient-free methods works best.
@@SerranoAcademy A common strategy when training a model is to reduce the learning rate of gradient descent when the loss is no longer decreasing, to see if we're bouncing around inside a local minimum. I wonder if trying an iteration or two of CMA-ES at these times might sometimes let us jump to nearby local minima that are deeper but couldn't be reached with any gradient-based approach. Another use might be during initialization, which is often just random: doing a few iterations of CMA-ES at the beginning of training and picking the best of, say, 5 candidates might shoehorn the network into a better minimum than a single initialization point.
Great question! You can tell if, after several iterations, you keep getting generations that don't improve your minimum, or improve it only very slightly. Then you can assume you're at a local minimum.
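That heuristic is easy to sketch in code: a hypothetical stagnation check that flags a local minimum when the best value per generation hasn't improved by more than a tolerance for some number of generations (the `patience` and `tol` names and values are my own, not from the video):

```python
def stagnated(history, patience=10, tol=1e-8):
    """Return True if the best-so-far value hasn't improved by more than
    `tol` over the last `patience` generations."""
    if len(history) <= patience:
        return False
    # improvement between `patience` generations ago and the best since then
    return history[-patience - 1] - min(history[-patience:]) < tol

best_per_generation = [5.0, 2.0, 1.0] + [0.5] * 15  # toy run that flattens out
print(stagnated(best_per_generation))  # True: no improvement in the last 10
```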
Thanks. That's just a guess, but I doubt this method would be efficient in higher dimensions, for the following reason. Say we take 5 points randomly in 2D. Take the square root to guess how many points per dimension you need: that's about 2. So with 10 dimensions I would need 2^10 = 1024 points. But ML involves many more dimensions; that's basically the number of weights in a multi-layer network. Say 3 fully connected layers of 100 neurons each: that's on the order of 2^10000 points. 1 gigabyte of RAM is 2^30 bytes, so there's no way to apply this. Am I wrong somewhere? :-)
The idea of such methods is to optimize functions when gradient descent isn't available because the function isn't differentiable. In the case you mention, the network itself is optimized in the usual way (SGD, Adam, …), and what you look at is, say, the loss after a fixed number of epochs for a given set of hyperparameters (learning rate, beta coefficients, etc.), which are far less numerous. Then you iterate the process, using CMA-ES/PSO solely on those hyperparameters.
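A toy sketch of that outer loop, under stated assumptions: `val_loss` is a hypothetical stand-in for "train the network with SGD/Adam for a fixed number of epochs and return the validation loss" as a function of just two hyperparameters, and the search itself is a plain Gaussian evolution strategy rather than full CMA-ES:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical proxy for the inner training run: a smooth function of
# (learning rate, momentum) whose minimum sits at lr = 10**-2.5, m = 0.9.
# A real version would run the actual training loop here.
def val_loss(lr, momentum):
    return (np.log10(lr) + 2.5) ** 2 + (momentum - 0.9) ** 2

# Simple Gaussian ES over the two hyperparameters only,
# searching in (log10 learning rate, momentum) space.
mean = np.array([-1.0, 0.5])
sigma = 0.5
for _ in range(60):
    pop = mean + sigma * rng.standard_normal((20, 2))
    scores = [val_loss(10 ** p[0], p[1]) for p in pop]
    mean = pop[np.argsort(scores)[:5]].mean(axis=0)  # recombine the top 5
    sigma *= 0.95                                    # shrink the search

print(mean)  # roughly [-2.5, 0.9], i.e. lr ≈ 3e-3, momentum ≈ 0.9
```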
For a "guarantee" of convergence, yes, you need a lot of points and a lot of function evaluations. But in practice you can avoid scaling the number of points exponentially with the number of dimensions: ES algorithms can easily solve 30-dimensional problems with a population size of 100, not 2^30, for example.
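Here's a quick demonstration of that claim on a 30-dimensional sphere function, using a plain (mu, lambda) evolution strategy with a population of 100. This is a simplified stand-in for CMA-ES (no covariance adaptation; the step size just decays on a fixed schedule), and the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):                        # 30-dimensional test objective
    return np.sum(x**2)

dim, popsize, nparents = 30, 100, 25
mean = rng.uniform(-5, 5, dim)        # center of the search distribution
sigma = 3.0                           # global step size

for _ in range(200):
    # Sample a Gaussian population of 100 candidates around the mean.
    pop = mean + sigma * rng.standard_normal((popsize, dim))
    fitness = np.array([sphere(p) for p in pop])
    # Recombine: new mean is the average of the 25 best candidates.
    mean = pop[np.argsort(fitness)[:nparents]].mean(axis=0)
    sigma *= 0.97                     # simple fixed step-size decay

print(sphere(mean))  # near 0 after 200 generations of 100 points each
```

Only 20,000 function evaluations total, nowhere near 2^30.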