
Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO 

AI Prism

Published: 27 Oct 2024

Comments: 22
@littlebigphil 3 years ago
This is really dense, but also clears a lot up. I'll have to watch a second time.
@ariel415el 4 years ago
This is much less comprehensible than the previous lectures.
@nissim2007 4 years ago
anyway
@gregh6586 4 years ago
Yeah, John's mind works on another level. Maybe that's how you invent not only TRPO (along with Pieter from lecture 4a) but also PPO.
@dewinmoonl 7 years ago
AWW YEAH TRUST REGION THIS IS WHAT I NEEDED THANKS!
@mfavaits 2 years ago
explanations are all over the place - put some structure in the way you explain things
@OliverZeigermann 3 years ago
Why calculate the ratio of the new and old policy when the log prob is good enough anyway? Is it because we want to use the ratio for clipping?
@ppstub 3 years ago
So we can be more data-efficient and re-sample trajectories generated under an old policy according to the ratio we compute
@abcborgess 2 years ago
It is due to the importance sampling used to calculate the surrogate function. The expectation in the surrogate uses samples from the distribution parametrized by theta_old. However, I think we are actually interested in the value of the surrogate at other parameters theta. The ratio corrects the probability of the samples from theta_old in the expectation.
@underlecht 1 year ago
It's because we'd like to keep the policy within a trust region, so that it doesn't diverge too much and become useless.
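To make the ratio discussion above concrete, here is a minimal sketch (not the lecture's code; all names are illustrative) of how the importance-sampling ratio pi_new/pi_old and clipping typically combine into a PPO-style clipped surrogate:

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective (to be maximized).

    The ratio pi_new(a|s) / pi_old(a|s) is the importance-sampling weight
    that corrects for the samples having been drawn under the old policy;
    clipping the ratio keeps the update inside a trust region.
    """
    ratio = torch.exp(logp_new - logp_old)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```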
@mansurZ01 5 years ago
I think the loss function at 7:30 should have the opposite sign, because the gradient is derived for gradient ascent. So for gradient descent we should pretend the gradient has the opposite sign, and if we derive the loss function for that, the "minus" sign will carry through. Am I right?
@gregh6586 4 years ago
No, it should not be negative: it is true that we want to use gradient ascent, i.e. to move in the direction that increases our "loss" the most. The advantage term is positive for those actions whose actual value (i.e. the "q-value") was better than the expected value (or simply the "value"). So what we want to do is find the direction that increases that advantage the most, i.e. find the gradient. You have to be careful when using frameworks, though (what John refers to as autodiff libraries), and double-check whether you get the positive or negative value (e.g. for categorical cross entropy, where we want to use the "positive loss" but some implementations might return the negative value).
@elliotwaite 4 years ago
He just has it written in terms of gradient ascent, which is common in RL where we are typically trying to maximize our objective of expected total reward. But you are correct in that if we want to do gradient descent, which is the default in PyTorch or TensorFlow, we'll want to use the negative of that loss, as can be seen in this implementation: github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L234
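As a hedged illustration of that sign convention (reusing the clipped_surrogate sketch above; the tiny policy, observations, and advantages below are placeholder data, not anything from the lecture or from Spinning Up):

```python
import torch

# Placeholder setup: a tiny linear policy over 4 discrete actions,
# with made-up old log-probs and advantages.
policy = torch.nn.Linear(8, 4)
obs = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
logp_old = torch.randn(32)        # stand-in for log pi_old(a|s)
advantages = torch.randn(32)

logp_new = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)

# The surrogate is an objective to maximize; PyTorch optimizers minimize,
# so the loss is its negative (the same convention as the linked ppo.py).
loss = -clipped_surrogate(logp_new, logp_old, advantages)

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```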
@yanwen3498 3 years ago
Who has the slides?
@kevinwu2040 3 years ago
You make these things soo easy
@shaz7163 6 years ago
At 14:55, when maximizing the objective function with a penalty obtained from the KL divergence, what if the expected values become negative?
@yoloswaggins2161 6 years ago
Then the action is discouraged.
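For the question above, a rough sketch of a KL-penalized surrogate (illustrative only; beta and the sample-based KL estimate are assumptions, not the lecture's exact formulation). A negative advantage makes the first term favor decreasing the ratio, which is the "discourage the action" effect mentioned in the reply:

```python
import torch

def kl_penalized_surrogate(logp_new, logp_old, advantages, beta=1.0):
    """Surrogate objective with a KL penalty (to be maximized).

    When an advantage is negative, maximizing ratio * advantage pushes
    the ratio (and hence the action's probability) down, so that action
    is discouraged rather than reinforced.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * advantages).mean()
    # Sample-based estimate of KL(pi_old || pi_new) over visited states.
    kl_estimate = (logp_old - logp_new).mean()
    return surrogate - beta * kl_estimate
```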
@pablodiaz1811 5 years ago
Thanks for sharing.
@ProfessionalTycoons 5 years ago
awwwe yeah