A suboptimal action is not the same as the worst action (unless there are only two actions). This causes several issues with the proof at 44:20. 1) The 2nd bullet does not follow from the 1st: R(a) >= R(a') for all a' means that action a is actually optimal, which contradicts bullet 1, which defines a as suboptimal. 2) Assuming that action a is suboptimal only implies that there merely *exists* an a' with R(a') > R(a). We cannot deduce the universal quantification, since it's possible that R(b) < R(a) < R(a') (here action a is suboptimal but not the worst). 3) Nor can we fix this by assuming we converged to the worst arm a for the proof by contradiction, because the negation of "converging to the best arm" is "not converging, or converging to an arm which is not the best".
To prove that P(a_t != a*) ~ epsilon, you just need to note that the epsilon-greedy algorithm explores with probability epsilon at each iteration. Hence, asymptotically, the probability that the chosen arm is not the optimal arm at any given time step is roughly epsilon.
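A quick simulation makes this concrete. The setup below is my own illustration (a 5-armed Bernoulli bandit with assumed means, constant epsilon = 0.1): once the value estimates settle, the greedy pick is almost always the optimal arm, so suboptimal pulls come only from exploration, giving P(a_t != a*) ~ epsilon * (k-1)/k, i.e. roughly epsilon.

```python
import random

random.seed(0)
means = [0.1, 0.3, 0.5, 0.7, 0.9]   # assumed arm means; arm 4 is optimal
k, eps, steps = len(means), 0.1, 200_000
counts = [0] * k
values = [0.0] * k
suboptimal_late = 0

for t in range(steps):
    if random.random() < eps:
        a = random.randrange(k)                      # explore uniformly
    else:
        a = max(range(k), key=lambda i: values[i])   # exploit current estimates
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]         # incremental sample mean
    if t >= steps // 2 and a != 4:                   # measure after burn-in
        suboptimal_late += 1

freq = suboptimal_late / (steps // 2)
print(f"empirical P(a_t != a*): {freq:.3f}, predicted ~ {eps * (k - 1) / k:.3f}")
```

The late-time frequency of suboptimal pulls lands close to 0.08 = epsilon * 4/5, which is "roughly epsilon" as claimed.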
The purpose of having epsilon is to increase exploration. If we try to reduce the probability of choosing a suboptimal action by shrinking epsilon, we fall back to the greedy strategy. Then what's the point?
You explore at the start to identify the best action. Once its value is estimated, you choose the best action (exploit) rather than suboptimal ones. You want your regret to be low.
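The "explore first, then commit" idea above can be sketched in a few lines. This is a hypothetical toy (assumed arm means and pull budget, not anything from the lecture): pull each arm a fixed number of times, then commit to the empirically best one.

```python
import random

random.seed(1)
means = [0.2, 0.5, 0.8]   # assumed Bernoulli arm means; arm 2 is best
m = 500                   # exploration pulls per arm

# Explore: estimate each arm's mean from m pulls.
estimates = []
for mu in means:
    pulls = [1.0 if random.random() < mu else 0.0 for _ in range(m)]
    estimates.append(sum(pulls) / m)

# Exploit: commit to the empirically best arm for all remaining steps.
best = max(range(len(means)), key=lambda i: estimates[i])
print("committed arm:", best)
```

With enough exploration the committed arm is the true best one, and all regret is incurred up front during the exploration phase.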
True, as epsilon decreases you become greedy, but before reaching that point you have explored a lot!! So the theoretical result is telling you: a.- decrease the rate of exploration over time, b.- but not too fast!! No faster than 1/t. This type of result is typical in the theory of stochastic processes; it is related to the way stochastic processes are constructed. The classical book by Loève, Vol. 2, has a beautiful chapter using this type of condition to prove that a "good process" can be constructed in a Kolmogorov space.
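The "not too fast" condition can be read as: the schedule must still explore infinitely often in total, i.e. the sum of epsilon_t over all t must diverge. A decay of 1/t satisfies this (the harmonic series diverges), while 1/t^2 does not (its sum converges to pi^2/6 ≈ 1.645), so the latter stops exploring too soon. A small numeric check of these two standard series facts:

```python
# Compare the partial sums of the two candidate schedules.
N = 1_000_000
harmonic = sum(1.0 / t for t in range(1, N + 1))       # eps_t = 1/t
p_series = sum(1.0 / t**2 for t in range(1, N + 1))    # eps_t = 1/t^2

# The harmonic sum grows like log(t), so total exploration is unbounded;
# the 1/t^2 sum stays below pi^2/6, so total exploration is finite.
print(f"sum 1/t   up to {N}: {harmonic:.2f}  (unbounded, grows like log t)")
print(f"sum 1/t^2 up to {N}: {p_series:.4f} (bounded by pi^2/6)")
```

A schedule whose epsilons sum to a finite value only spends a finite "budget" on exploration, so with positive probability it can lock onto a suboptimal arm forever.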
Just one more comment on the rate at which epsilon decreases: look at Kolmogorov's 0-1 law. This law tells you what happens if you keep sampling independently from a distribution.