At 16:44, why would we violate the maximum R (81)? Wouldn't we be taking n = 3 and r = 27? That doesn't violate the max R. In fact, as per your table, taking 6 * 27 = 162 > 81 violates this rule. I am lost. Can you please explain?
Good question! This observation is based on the original example given by the authors. Unfortunately, that example is wrong. To make sure you understand the method fully, you could try to follow their pseudocode (link to their paper in the description). You will end up with different numbers in the table.
Hyperband was like a headache before watching your video. Now it is clear. Thank you for such beautiful content and examples. You shouldn't stop making videos; it's very unfortunate that you have only a few subscribers.
Great video, I finally understood Hyperband thanks to you and was able to use it in Keras confidently. Thanks! Do you know other hyperparameter tuning approaches that may be better/worth exploring?
Glad to hear that it was helpful! :) Hyperband relies on a model-free approach (successive halving) that does not aim to learn a predictive model mapping a configuration to a predicted performance. The approaches that do this (called Bayesian optimization), like the Tree Parzen Estimator, can be more efficient and require less trial and error. It would even be possible to combine this with Hyperband or successive halving, making it more efficient still. If you are interested, there is also a video about the Tree Parzen Estimator.
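For intuition, here is a rough sketch of the Tree Parzen Estimator idea for a single hyperparameter in [0, 1]. This is toy code, not the algorithm from the TPE paper: the kernel bandwidth, the `gamma` split fraction, the candidate count, and the demo objective are all arbitrary illustration choices.

```python
import math
import random

def kde(samples, x, bw=0.1):
    """Simple Gaussian kernel density estimate at point x."""
    return sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples) / (len(samples) * bw)

def tpe_suggest(history, gamma=0.25, n_candidates=24):
    """Suggest the next config: split observed (config, loss) pairs into a
    'good' set (lowest-loss gamma fraction) and a 'bad' set, then return
    the random candidate that maximizes the density ratio l(x) / g(x)."""
    hist = sorted(history, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(hist)))
    good = [c for c, _ in hist[:n_good]]
    bad = [c for c, _ in hist[n_good:]] or good  # fall back if 'bad' is empty
    candidates = [random.random() for _ in range(n_candidates)]
    return max(candidates, key=lambda x: kde(good, x) / (kde(bad, x) + 1e-12))

# Toy demo: minimize (c - 0.7)^2 over configs c in [0, 1].
random.seed(1)
loss = lambda c: (c - 0.7) ** 2
history = [(c, loss(c)) for c in (random.random() for _ in range(10))]
for _ in range(20):
    c = tpe_suggest(history)
    history.append((c, loss(c)))
best = min(history, key=lambda t: t[1])[0]  # ends up near 0.7
```

The key contrast with successive halving: every past evaluation feeds a (crude) model of where good configurations live, so new suggestions are biased toward promising regions instead of being sampled uniformly.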
Amazing video! One question: for each bracket in Hyperband, a new set of configurations is chosen from the total hyperparameter space, correct? So there may be duplicate configurations, i.e. the same configuration may appear in brackets 1 and 2?
@@aixplained4763 I'm still unsure about Hyperband's benefit. For each consecutive bracket it samples a smaller set of hyperparameter configurations than the previous bracket. Since it is sampling randomly, the configurations in the final bracket aren't the best ones, yet they are trained for a long time. What exactly is the benefit over simple successive halving?
@@haneulkim4902 Good question! In regular successive halving, the halving can be too aggressive (prematurely discarding better configurations because they needed more time to yield good performance). Finding the right level of "aggression" is not easy. Hyperband basically runs multiple successive halving brackets with different levels of "aggression" to solve this. In the end, more training time sometimes does lead to better performance, but not always. Moreover, after running all brackets in Hyperband, you could post-process the results: e.g., select the best candidate from every bracket and train them all with the same budget.
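A minimal sketch of that idea, as toy code rather than the exact schedule from the Hyperband paper: the `eta` values, pool sizes, budgets, and the final equal-budget comparison below are all made-up illustration choices. Each bracket is one successive-halving run with a different halving rate, i.e. a different level of "aggression".

```python
import math
import random

def successive_halving(configs, evaluate, min_budget, eta):
    """One bracket: evaluate all configs on a small budget, keep the top
    1/eta fraction, multiply the budget by eta, repeat until one remains."""
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(((evaluate(c, budget), c) for c in configs), reverse=True)
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in scored[:keep]]
        budget *= eta
    return configs[0]

def hyperband_like(sample_config, evaluate, max_configs, min_budget):
    """Toy Hyperband flavor: run brackets with different halving rates and
    keep the overall best winner (the real algorithm uses a specific
    geometric schedule of bracket sizes and budgets)."""
    best, best_score = None, -math.inf
    for eta in (2, 3, 4):  # different halving rates = different aggression
        configs = [sample_config() for _ in range(max_configs // eta + 1)]
        winner = successive_halving(configs, evaluate, min_budget, eta)
        score = evaluate(winner, min_budget * 8)  # compare winners on an equal budget
        if score > best_score:
            best, best_score = winner, score
    return best

# Toy demo: configs are points in [0, 1]; the (budget-independent) score
# rewards closeness to 0.3, so the search should return a value near 0.3.
random.seed(0)
evaluate = lambda c, b: 1 - abs(c - 0.3)
best = hyperband_like(random.random, evaluate, max_configs=16, min_budget=1)
```

The final equal-budget comparison of bracket winners is exactly the post-processing idea mentioned above: an aggressive bracket may mistakenly drop a slow starter, but a gentler bracket gets a second chance at it.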
Happy to hear that you like the symbols! The slides are created in Google Slides and I copy/paste symbols from a latex2image generator such as latex2image.joeraut.com/