
Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4 - Model Free Control 

Stanford Online

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: stanford.io/ai
Professor Emma Brunskill, Stanford University
onlinehub.stanford.edu/
Professor Emma Brunskill
Assistant Professor, Computer Science
Stanford AI for Human Impact Lab
Stanford Artificial Intelligence Lab
Statistical Machine Learning Group
To follow along with the course schedule and syllabus, visit: web.stanford.edu/class/cs234/i...

Published: 28 Mar 2019

Comments: 6
@odycaptain · A year ago
Thank you for the course
@zonghaoli4529 · A year ago
26:23 I think the reason people got a bit confused and obtained different answers is that they forgot the essence of MC policy evaluation: it only starts once a full episode has completed. In this case, therefore, G_{i,t} for all visited state-action pairs (s3,a1), (s2,a2), and (s1,a1) is 1, as gamma is zero. Then just follow the pseudocode for MC and you will get the right answer. If you were doing TD instead, where policy evaluation updates immediately without waiting for the entire episode to complete, I think the first student's answer would be correct.
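A minimal sketch (not from the lecture) of first-visit Monte Carlo evaluation of Q, illustrating the point above that returns G_t are only computed once the episode has terminated. The state/action names follow the comment; the per-step rewards and gamma = 0 are assumptions chosen so that every return comes out to 1:

```python
from collections import defaultdict

gamma = 0.0  # discount factor, per the comment's reading of the quiz

# Episode recorded to termination as (state, action, reward) tuples.
# The rewards here are assumed for illustration.
episode = [("s3", "a1", 1.0), ("s2", "a2", 1.0), ("s1", "a1", 1.0)]

returns = defaultdict(list)  # (s, a) -> list of sampled returns
Q = defaultdict(float)

# Walk the episode backwards so G_t = r_t + gamma * G_{t+1} accumulates easily.
G = 0.0
first_visit_G = {}
for s, a, r in reversed(episode):
    G = r + gamma * G
    # Overwriting on each pass means the stored value ends up being the
    # return from the *first* (earliest) visit to (s, a).
    first_visit_G[(s, a)] = G

# Only after the full episode is processed do we update the Q estimates.
for sa, g in first_visit_G.items():
    returns[sa].append(g)
    Q[sa] = sum(returns[sa]) / len(returns[sa])

print(dict(Q))  # every visited pair gets its first-visit return of 1.0
```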
@RUBOROBOT · A year ago
In the monotonic e-greedy policy improvement theorem, why do we add the (1-e)/(1-e) factor instead of just using the (1-e) that is already there? This step seems unnecessary and confusing, since the original (1-e) cancels with the new (1-e) denominator and is therefore never used.
@zonghaoli4529 · A year ago
This proof, around 33:03, is to a certain extent very cumbersome, since all these transformations do not really take you anywhere, to be honest. Essentially, what really matters is that V^{pi_{i+1}} takes a max of Q over actions, which is always greater than or equal to Q at any particular action, and the latter is what V^{pi_i} gives you from the previous iteration. On average, it is the greedy action that ensures the monotonic improvement.
@michaelbondarenko4650 · 8 months ago
Interestingly, they didn't fix the proof in the 2023 class either
@ZinzinsIA · 7 months ago
The (1 - eps) / (1 - eps) factor is just there to show that we did not change the value of the sum by multiplying by this quantity; it lets us write 1 - eps another way, namely as sum_a pi(a|s) - eps. Indeed, sum_a pi(a|s) - eps = [(1 - eps + eps/|A|) + ((|A| - 1)/|A|) * eps] - eps = 1 - eps, where the expression in brackets is the sum of the probabilities under the epsilon-soft policy (the greedy action gets 1 - eps + eps/|A|, and each of the other |A| - 1 actions gets eps/|A|). Be careful that they put the epsilon inside the sum over a, which is not correct; we do not sum epsilon over all possible actions. Then, with this simplification and the fact that max_a Q_pi(s, a) >= Q_pi(s, a), you get the result. What is interesting about this "cumbersome" writing is that the simplification yields exactly V_pi(s), so you have shown that there is a policy improvement. You can also check the Sutton & Barto book on RL; the proof is phrased a little differently, but it is the same idea (p. 101 of the 2018 edition).
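For reference, here is a worked version of the step the two comments above are discussing, following the standard argument (as in Sutton & Barto, Sec. 5.4). The notation |A| for the number of actions and pi' for the epsilon-greedy policy derived from Q^pi are mine:

```latex
% The weights (\pi(a\mid s)-\epsilon/|A|)/(1-\epsilon) are nonnegative and sum
% to 1 for an \epsilon-soft policy \pi, so the max dominates their weighted
% average. The (1-\epsilon)/(1-\epsilon) factor exists only to expose that
% convex combination; the factors cancel right afterwards.
\begin{align*}
\sum_a \pi'(a \mid s)\, Q^{\pi}(s,a)
  &= \frac{\epsilon}{|A|}\sum_a Q^{\pi}(s,a) + (1-\epsilon)\max_a Q^{\pi}(s,a) \\
  &\ge \frac{\epsilon}{|A|}\sum_a Q^{\pi}(s,a)
     + (1-\epsilon)\sum_a \frac{\pi(a \mid s)-\epsilon/|A|}{1-\epsilon}\, Q^{\pi}(s,a) \\
  &= \frac{\epsilon}{|A|}\sum_a Q^{\pi}(s,a)
     + \sum_a \pi(a \mid s)\, Q^{\pi}(s,a)
     - \frac{\epsilon}{|A|}\sum_a Q^{\pi}(s,a) \\
  &= \sum_a \pi(a \mid s)\, Q^{\pi}(s,a) \;=\; V^{\pi}(s).
\end{align*}
```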