
Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO 

AI Prism

Published: 27 Oct 2024

Comments: 22
@littlebigphil 3 years ago
This is really dense, but also clears a lot up. I'll have to watch a second time.
@ariel415el 4 years ago
This is much less comprehensible than the previous lectures.
@nissim2007 4 years ago
anyway
@gregh6586 4 years ago
Yeah, John's mind works on another level. Maybe that's how you invent not only TRPO (along with Pieter from lecture 4a) but also PPO.
@dewinmoonl 7 years ago
AWW YEAH TRUST REGION THIS IS WHAT I NEEDED THANKS!
@mfavaits 2 years ago
explanations are all over the place - put some structure in the way you explain things
@OliverZeigermann 3 years ago
Why calculate the ratio of the new and old policy when the log prob is good enough anyway? Is it because we want to use the ratio for clipping?
@ppstub 3 years ago
So we can be more data-efficient and re-sample trajectories generated under an old policy according to the ratio we compute
@abcborgess 2 years ago
It is due to the importance sampling used to calculate the surrogate function. The expectation in the surrogate uses samples from the distribution parametrized by theta_old. However, I think we are actually interested in the value of the surrogate at other parameters theta. The ratio corrects the probability of the samples from theta_old in the expectation.
@underlecht 1 year ago
It's because we'd like to keep the policy within a trust region, so that it doesn't diverge too much and become useless.
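To make the ratio discussion above concrete, here is a minimal sketch (not the lecture's code; all names are illustrative) of how the importance-sampling ratio pi_new/pi_old and clipping typically combine into a PPO-style clipped surrogate:

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective (to be maximized).

    The ratio pi_new(a|s) / pi_old(a|s) is the importance-sampling weight
    that corrects for the samples having been drawn under the old policy;
    clipping the ratio keeps the update inside a trust region.
    """
    ratio = torch.exp(logp_new - logp_old)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```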
@mansurZ01 5 years ago
I think the loss function at 7:30 should have the opposite sign, because the gradient is derived for gradient ascent. So for gradient descent we should pretend the gradient has the opposite sign, and if we derive the loss function for that, the "minus" sign will carry through. Am I right?
@gregh6586 4 years ago
No, it should not be negative: it is true that we want to use gradient ascent, i.e. to move in the direction that increases our "loss" the most. The advantage term is positive for those actions whose actual value (i.e. the "q-value") was better than the expected value (or simply the "value"). So what we want to do is find the direction that increases that advantage the most, i.e. find the gradient. You have to be careful when using frameworks, though (what John refers to as autodiff libraries), and double-check whether you get the positive or negative value (e.g. for categorical cross entropy, where we want to use the "positive loss" but some implementations might return the negative value).
@elliotwaite 4 years ago
He just has it written in terms of gradient ascent, which is common in RL where we are typically trying to maximize our objective of expected total reward. But you are correct in that if we want to do gradient descent, which is the default in PyTorch or TensorFlow, we'll want to use the negative of that loss, as can be seen in this implementation: github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L234
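As a hedged illustration of that sign convention (reusing the clipped_surrogate sketch above; the tiny policy, observations, and advantages below are placeholder data, not anything from the lecture or from Spinning Up):

```python
import torch

# Placeholder setup: a tiny linear policy over 4 discrete actions,
# with made-up old log-probs and advantages.
policy = torch.nn.Linear(8, 4)
obs = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
logp_old = torch.randn(32)        # stand-in for log pi_old(a|s)
advantages = torch.randn(32)

logp_new = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)

# The surrogate is an objective to maximize; PyTorch optimizers minimize,
# so the loss is its negative (the same convention as the linked ppo.py).
loss = -clipped_surrogate(logp_new, logp_old, advantages)

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```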
@yanwen3498 3 years ago
Who has the slides?
@kevinwu2040 3 years ago
You make these things soo easy
@shaz7163 6 years ago
At 14:55, when maximizing the objective function with a penalty obtained from the KL divergence, what if the expected values become negative?
@yoloswaggins2161 6 years ago
Then the action is discouraged.
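For the question above, a rough sketch of a KL-penalized surrogate (illustrative only; beta and the sample-based KL estimate are assumptions, not the lecture's exact formulation). A negative advantage makes the first term favor decreasing the ratio, which is the "discourage the action" effect mentioned in the reply:

```python
import torch

def kl_penalized_surrogate(logp_new, logp_old, advantages, beta=1.0):
    """Surrogate objective with a KL penalty (to be maximized).

    When an advantage is negative, maximizing ratio * advantage pushes
    the ratio (and hence the action's probability) down, so that action
    is discouraged rather than reinforced.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * advantages).mean()
    # Sample-based estimate of KL(pi_old || pi_new) over visited states.
    kl_estimate = (logp_old - logp_new).mean()
    return surrogate - beta * kl_estimate
```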
@pablodiaz1811 5 years ago
Thanks for sharing.
@ProfessionalTycoons 5 years ago
awwwe yeah