
Fast reinforcement learning with generalized policy updates (Paper Explained) 

Yannic Kilcher
266K subscribers
11K views

Published: 27 Oct 2024

Comments: 33
@ericcodes 4 years ago
This channel is everything I wish education was. Thanks Yannic University.
@kanaadpathak 4 years ago
Yannic, I think it would be really cool if you explained some of your own papers as well :)
@herp_derpingson 4 years ago
Unless he is embarrassed by them. Heck, I am embarrassed about my first paper as well.
@YannicKilcher 4 years ago
yea they're kinda boring :D
@herp_derpingson 4 years ago
37:50 This is just an ensemble of policy networks, right? And even then it is a linear combination. Wouldn't it be better if we had a DNN at the end instead of a simple dot product?
39:25 It is transfer learning, but since all the policies are independent of one another, we can have many agents running in parallel without having to periodically sync their policy functions.
48:00 These graphs are a pretty cool idea. I think the phrase you are looking for is "conflicting policies". It is not necessary that a new policy would help the ensemble of existing policies, especially since we are taking a linear combination.
49:50 Seriously, at this point it is indistinguishable from an ensemble of DQN networks.
@dermitdembrot3091 4 years ago
37: I don't understand your questions fully or what they are referring to, but yes, of course they use an ensemble of policies. The regressed w seems to be found simply by linear regression (Equation 9), which gives the optimal linear solution we desire. The fact that the task reward must be linear in the features (so we cannot model it with a non-linear DNN) likely stems from the need to interchange the reward computation and the expected discounted sum:
E sum_i gamma^i phi_i^T w = ( E sum_i gamma^i phi_i )^T w
because, as far as I understand, the right-hand side is estimated with a network trained in DQN fashion by bootstrapping the target for Psi_i as phi_i + gamma Psi_{i+1}, so that we don't need to know or estimate transition/reward probabilities for the left-hand side. Is what I wrote clear to you?
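To restate the plaintext math above more cleanly (a paraphrase of the commenter's notation, not an equation quoted from the paper):

$$
Q^{\pi}(s,a) \;=\; \mathbb{E}\Big[\sum_{i} \gamma^{i}\,\phi_i^{\top} w\Big]
\;=\; \Big(\mathbb{E}\Big[\sum_{i} \gamma^{i}\,\phi_i\Big]\Big)^{\top} w
\;=\; \psi^{\pi}(s,a)^{\top} w,
\qquad
\text{with bootstrap target}\quad \psi_i \;\leftarrow\; \phi_i + \gamma\,\psi_{i+1}.
$$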
@YannicKilcher 4 years ago
Agree with Dermit, the linear combination is probably due to the fact that they want to stay close to / in accordance with the theory, but as you say, the more you let go of the constraints, the more it regresses to a plain ensemble. Nice observations.
@irfanrahadi7487 2 years ago
Interesting, thanks a lot Yannic! Please create more content like this ~
@dermitdembrot3091 4 years ago
Dear Yannic, thanks for bringing us this interesting paper and also for the other gold nuggets on your channel! I think you are a bit too critical of this paper. My understanding is limited by the fact that I haven't watched the precursor video, or at least don't remember doing so. But I partly read the original SR paper and believe I roughly understand the point of this idea. By my current understanding, the use of more than one reward function, as well as multiple policies, in order to extrapolate quickly to new tasks is a very nice extension of the SR idea, but not necessary for the SR idea to be useful.
Yes, psi_i^T w looks like the last layer of a Deep Q Network. But the way the parameters can be trained is the important difference, although I don't know whether everything is ever done as I describe below. A Q network is learned entirely by bootstrapping, where the target for the Q-value Q_i is r_i + gamma Q_{i+1}. Similarly, for the SR psi_i the target is phi_i + gamma psi_{i+1}. Intuitively, bootstrapping the richer representation of the last hidden layer may help learning since there are more predictions to be made (in multi-task learning, or in single-task learning where phi is incentivized to stay diverse, e.g. by an entropy or mutual-information-with-states objective, some of the features need to be predicted without even being needed for the reward prediction; an auxiliary task, so to speak).
Additionally, w (as well as phi) can be trained per timestep i with the loss (phi_i^T w - r_i)^2 (though of course the need to train phi was not there before, since it is a representation that is not needed in Deep Q Networks). Intuitively, w has a less difficult training objective because it does not need to predict an expected discounted sum of rewards, just a single reward at a time.
Put together: phi learns to be such that it can be linearly transformed into good reward predictions; w learns to do just that; psi predicts the discounted sum over phi. And Q can be computed as psi^T w due to the interchangeability of the expected discounted sum and the linear "last layer": Q = E sum_j gamma^j phi_j^T w = ( E sum_j gamma^j phi_j )^T w = psi^T w.
In fact, in a non-multitask setting, phi_i may even learn to predict r_i in one of its dimensions and make the others superfluous, rendering the setting rather close to DQN. And the goal we achieve with this SR training may be just the same as in normal DQN, that is, estimating a Q-value. But the way we train it, making use of our prior knowledge that Q is linear in r, can be hoped to fare better than a DQN black box. And that is without even going into how well this fits multi-task learning. (But yeah, then it has a transfer learning element.)
I hope that was understandable, Dermit
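For concreteness, here is a minimal NumPy sketch of the pieces described above: the per-step regression for w, the successor-feature bootstrap target, and Q = psi^T w combined across stored policies with a max (generalized policy improvement). Every name, shape, and placeholder value here is an illustrative assumption, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

d = 4           # dimensionality of the feature vector phi
n_actions = 3   # number of discrete actions
n_policies = 2  # previously learned policies, each with its own successor features psi

# Successor features of each stored policy at the current state:
# psi[p, a] approximates E[ sum_t gamma^t phi_t | s, a, policy p ]   (placeholder values)
psi = rng.normal(size=(n_policies, n_actions, d))

# Fit the task vector w by per-step regression (phi_t^T w - r_t)^2 over a batch of transitions.
phi_batch = rng.normal(size=(64, d))      # observed features phi_t
true_w = np.array([1.0, -1.0, 0.0, 0.5])  # hypothetical task weights, only used to synthesize rewards
r_batch = phi_batch @ true_w              # rewards, linear in phi here by construction
w, *_ = np.linalg.lstsq(phi_batch, r_batch, rcond=None)

# Interchange of expectation and the linear "last layer": Q = psi^T w for every stored policy,
# then generalized policy improvement takes the max over policies before acting greedily.
q = psi @ w               # shape (n_policies, n_actions)
q_gpi = q.max(axis=0)     # best value any stored policy promises for each action
action = int(q_gpi.argmax())

# Bootstrapped target for training psi itself (TD in feature space instead of reward space).
gamma = 0.99
phi_t = rng.normal(size=d)       # feature vector observed at time t
psi_next = psi[0, action]        # successor features of the next state-action under policy 0 (stand-in)
psi_target = phi_t + gamma * psi_next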
@YannicKilcher 4 years ago
True, I think you're absolutely right with your explanations, thank you.
@sacramentofwilderness6656 4 years ago
It is an interesting proposal to extend the classical Q-learning approach; however, it seems to me that in the most basic setting it would only work if the tasks are very similar to each other. If the tasks differ significantly from each other, we would need a rather complicated and general function \omega(s), as in the last example, and it would take a significant amount of time to adapt a new policy.
@alirezamogharabi8733 4 years ago
Thanks a lot, Please explain some papers about distributed multi-agent Reinforcement Learning 🙏❤️❤️
@utku_yucel 4 years ago
Awesome video. Thanks for sharing !!!!
@mrpocock 4 years ago
Thanks. Another interesting paper. I wonder if an alternative way to learn useful atomic strategies that you can then compose would be to put the underlying summary of the input through a softmax and get the different tasks to bid on them. Perhaps this would allow a high-level reward to split into sensible, composable subtasks.
@YannicKilcher 4 years ago
Nice idea!
@jonathanballoch 3 years ago
Shouldn't \gamma also be a vector? It seems highly unlikely to me that a mixture of tasks would all have the same discount...
@first-thoughtgiver-of-will2456 4 years ago
I've had the same idea (and some others pertaining to it) for policy generalization for a long time now.
@IgorAherne 4 years ago
yes!! thanks man :')
@Kram1032 4 years ago
This might be a stupid experiment, but I'm kinda curious whether directly rewarding avoidance and directly rewarding collection could be combined by this approach in order to find a new policy that's even better at avoiding/collecting. That's for simple cases like the task devised here. But more generally: ask it to optimize a task and its opposite as two separate policies, then mix and match both learned policies using this technique and see if it somehow does even better at each variant than after only having learned the first task. It'd be especially interesting if both tasks are about equally hard. (In the example task it seems clear that avoiding is easier than collecting.)
@Kram1032 4 years ago
Oh I guess that's more or less covered by the graphs around 40min in
@MoneyWiseGem 4 years ago
Great vid, any good code example for this ?
@rishikaushik8307 4 years ago
In the 'Zero-Shot Policy for New Tasks' section, does φ-sub-d mean the d-th element of the φ vector, or do we have d different successor features? The first one makes sense to me.
@YannicKilcher 4 years ago
I think it's the first one
@mT4945 4 years ago
You are amazing!
@crustysoda 4 years ago
I wonder if it also constrains the new task's score to be a linear combination of the learned tasks? If the new task's score is the number of squares picked multiplied by the number of triangles picked, could it generalize well?
@YannicKilcher 4 years ago
it could probably do ok because that still seems monotonic, but as I see it, this definitely violates their assumptions.
@2346Will 4 years ago
Isn't this what HER does? Please could someone explain why this proposed framework could be better than HER?
@herp_derpingson 4 years ago
It's just an ensemble bro...
@2346Will 4 years ago
@herp_derpingson I agree that this paper seems to just be an ensemble of DQNs. It's just unclear to me what contribution it makes.
@first-thoughtgiver-of-will2456 4 years ago
They both seem to take from the great Universal Value Function Approximators paper, but having glanced at HER: this algorithm doesn't rely strictly on a replay buffer for multi-goal learning, and it learns a policy ('feature') mixing vector w. Thinking analogously, this is a linear attempt to automate the hard-coded HER goals used in the replay buffer. It seems that in HER you can degrade learning if you have goals that are too different, whereas w here would allow the algorithm to tune non-applicable policies out. Also, HER randomly mixes objectives during training of a policy through the replay buffer, whereas this looks directly at prior policies (or rather their 'features') and attempts to mix in some policy. HER is a multi-objective learner, whereas this is more of a 'smart' transfer learner (an admittedly subtle difference); the difference to me is something like sequential vs. parallel learning across policies. I only glanced at the HER algorithm, just watched this video in the background, and haven't read the paper, so I am sure I missed something or am wrong about something, but I hope I lent some insight.