I swear some day your videos will get me a job in RL. Thank you for your work. I don't think anyone else is doing what you do on YT and it's VERY helpful.
@@birdmw We use RL all the time in business ML and DS in tech. These are powerful frameworks that don't have to be used just for solving trivial games - don't think so literally and rigidly. Ironically, the recommendation engine that likely recommended this video to you on YT is based on an RL framework...
This is the best hands-on RL channel on RU-vid. Learned a lot from you. I would like to point out a few things: 1. This is not a Monte Carlo variant but a TD variant, since Phil is using the estimated future reward rather than the actual return. 2. 12:50 This particular variant is sample-inefficient, not actor-critic in general. I copied your code and benchmarked single-step against batch learning and found batch learning to be superior. The initial performance of single-step learning is better than batch (I think due to the low variance of a single step compared to a batch), but batch learning catches up and surpasses it. Also, in the later stages of learning, the reward of single-step learning starts dropping (I think because the single-step model has much less experience than the batch model in the same timeframe).
Hi, I followed this tutorial and I am getting the following error when I fit the actor model: _SymbolicException: Inputs to eager execution function cannot be Keras symbolic tensors, but found []
Excellent video, thx. A few updates for tf.keras / TF2.2+ users: (i) Need to declare self.n_actions=n_actions in class Agent init; (ii) Model(input...should be Model(inputs...; (iii) in TF2.2 need to disable eager execution (using tf.python.framework_ops.disable_eager_execution() ); (iv) I also needed to specify gpu device and set_memory_growth(dev, True); (v) all keras modules updated to tensorflow.keras imports. I am hoping you have an A3C video on your channel!
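For anyone applying the fixes above, a minimal setup sketch (assuming TF 2.2+; the exact module path of `disable_eager_execution` has moved around between releases, so treat this as one working variant rather than the only one):

```python
import tensorflow as tf
# Graph mode is needed because the custom loss closes over a symbolic
# Keras tensor (the delta input), which eager execution rejects.
from tensorflow.python.framework.ops import disable_eager_execution

disable_eager_execution()

# Optional: let TF grow GPU memory instead of reserving all of it up front.
for dev in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(dev, True)
```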
Hi Phil, I really enjoyed your tutorial on the actor-critic model. Would you mind me asking: is the custom loss for gradient descent? But the actor requires gradient ascent, correct me if I'm wrong? BTW, is the delta actually -TD_error?
Been hunting for easy-to-understand content like this on keras RL and actor-critic methods for ages. Best I've found so far. Good stuff, man, thanks for the help.
Thanks for the really helpful video! I got it to work flawlessly on a Flappy Bird environment. I just wish I knew how to load the saved models without Keras freaking out about the custom loss function. Makes continuation of training kind of... impossible.
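In case it helps anyone hitting the same wall, Keras either needs to be told about the custom loss at load time, or told not to compile at all. A sketch with a hypothetical stand-in loss (the video's real loss also closes over the delta input tensor, which is why rebuilding the model and using save_weights/load_weights is often the more reliable route):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# Hypothetical stand-in for the tutorial's custom loss.
def custom_loss(y_true, y_pred):
    out = K.clip(y_pred, 1e-8, 1 - 1e-8)
    return K.sum(-y_true * K.log(out))

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(2, activation='softmax', input_shape=(4,))])
model.compile(optimizer='adam', loss=custom_loss)
model.save('actor.h5')

# Option 1: map the saved loss name back to the function.
loaded = tf.keras.models.load_model(
    'actor.h5', custom_objects={'custom_loss': custom_loss})

# Option 2: skip compilation entirely and re-compile yourself.
loaded = tf.keras.models.load_model('actor.h5', compile=False)
loaded.compile(optimizer='adam', loss=custom_loss)
```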
Hi Phil, thank you for your great explanation, it has really become more or less clear. I found the following error at line 66, where the actor is fitting: "TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model." Did this happen due to the Keras version (2.4.1) I use? I'd appreciate it if you could suggest how to fix it. Thank you in advance!!
Thanks for this. I'm hoping to construct a similar model where the input to the network is time series data. I hope it is as easy as it sounds.
Great content. This Actor-Critic video uses 3 NNs, while Dueling Deep Q Learning (another video of yours) uses 2 NNs - V and A. What is the difference between the two approaches, and which one is better?
How is the policy model even being affected by the training phase? I see no connection between fitting the actor/critic and the policy choosing actions.
Hey, Phil. Can I ask you your opinion? I've implemented AC exactly as you did, same parameters, same environment. However my plot seems to have way more variance. It reaches the peak, then breaks, then peaks again back and forth. Then I implemented PPO, with the exact same body algorithm as your AC, only changing the loss, and though it peaks sooner, it once again breaks and peaks back and forth (extremely high variance). Do you have an idea what causes this?
I'm getting back to the actor critic material in the coming weeks. Once I finish this deep q learning course I'm going to prep an actor critic course as well. I've got PPO on my list as well as A2C - I'll see if I can fit in SAC in a reasonable time frame as well.
Hi, I have a question at 21:35 about the "custom_loss": why is the loss function designed as "-log_lik * delta"? I know that -log_lik is like cross entropy, but what about the "*delta"?
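The intuition behind "-log_lik * delta" can be seen in a toy NumPy sketch (this is not the video's code; the single softmax policy, learning rate, and TD error value here are all made up): minimizing -log pi(a) * delta pushes the probability of the sampled action up when delta > 0 and down when delta < 0, which is exactly policy-gradient ascent done via loss descent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                 # logits of a 3-action policy
action, delta, lr = 1, 2.0, 0.1     # hypothetical positive TD error

for _ in range(50):
    pi = softmax(theta)
    # gradient of -delta * log(pi[action]) w.r.t. theta is
    # (pi - one_hot(action)) * delta
    grad = pi.copy()
    grad[action] -= 1.0
    theta -= lr * grad * delta      # gradient DESCENT on the loss

# with delta > 0, the probability of the chosen action has risen
```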
Thanks for your videos! I have 2 questions ^^ 1) If my rewards are between +1 and -1, is there a reason to use linear activation rather than tanh for the critic? 2) I coded something pretty similar 2 weeks ago, and in my equivalent of the choose_action function I do probabilities *= mask (mask is a one-hot array masking impossible moves in the Connect 4 board game) and then renormalize with probabilities /= np.sum(probabilities)... But sometimes I get an error: probabilities contains NaN... That's strange, because I can't divide by zero with the softmax activation... Have you seen something similar before? (I've edited my message to add this question: 3) I saw a comment under this video where you say it's not A2C, just AC, but isn't the delta you're calculating the advantage?)
Question 2 solved by writing my own softmax and clipping the logits between -20 and 20, because the exponential was going to infinite values, giving me [0, 0, 0, 1, 0, 0, 0]. If the middle column wasn't playable, I applied the mask [1, 1, 1, 0, 1, 1, 1], which left a zero vector, so masked_output / sum(masked_output) gave me NaN.
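For anyone else hitting this, here's roughly the shape of the fix in NumPy (a sketch; the clip bound of 20 and the uniform fallback over legal moves are my choices, not anything from the video):

```python
import numpy as np

def masked_softmax(logits, mask, clip=20.0):
    # Clip logits so exp() cannot overflow and collapse to a 0/1 vector.
    z = np.clip(logits, -clip, clip)
    e = np.exp(z - z.max()) * mask      # zero out illegal moves
    total = e.sum()
    if total == 0:                      # every legal move underflowed
        e = mask.astype(float)          # fall back to uniform over legal moves
        total = e.sum()
    return e / total
```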
Is using multiple models unavoidable as opposed to something like one model with multiple outputs? I've seen a few tutorials where the actor and critic losses are apparently just summed for the same model. Your method seems more correct than those because the losses are matched to the correct outputs rather than lumped together, but I'm wondering if there's a way to get it all in the one model. Perhaps the issue is having to get the delta/advantage produced by one loss function and taking it as an input to another loss function, which might be impossible in TF/Keras. Is my thinking correct?
Yeah, the framework plays a large part in the implementation. In PyTorch I sum the actor and critic losses and typically use a network with shared lower layers and 2 separate output layers. I tend to use PyTorch more, so I'm not sure if your suggestion is impossible or not.
@@MachineLearningwithPhil Thanks for your quick reply. I think I found the answer here: stackoverflow.com/questions/57559016/keras-built-in-mse-loss-on-2d-data-returns-2d-matrix-not-scalar-loss. Just playing around with it, the actor and critic output layers can be concatenated, which allows for different activations for the actor and critic while sharing a loss function (I think keeping the outputs separate prevents sharing values in the loss function in Keras). The actor and critic losses can then be separately calculated and tf.stack can be used to join them inside the loss function. With simple experimentation, I'm seeing that the weights do move in the right direction, and by splitting out crossentropy and MSE losses as metrics (as a simple example of combining probabilistic and regression outputs rather than full actor-critic), I can see that each does move smoothly in the right direction. In the link, I'm reading "per-sample" to mean unit- or node-specific, which means that the critic portion of the loss won't pollute the actor-specific weights and vice versa. That was my original concern with simply adding the losses - I would think this would cause the losses of one output of the model to affect the weights of the other output. Does all of this make sense?
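A minimal sketch of the single-model variant being discussed (hypothetical 8-dim state and 4 actions; the 1e-8 clip and layer sizes are assumptions, not from the video). Concatenating the two heads lets one loss function see both, then split them back apart:

```python
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

n_actions = 4

def combined_loss(y_true, y_pred):
    # split the concatenated outputs back into actor and critic parts
    probs, value = y_pred[:, :n_actions], y_pred[:, n_actions:]
    actions, target = y_true[:, :n_actions], y_true[:, n_actions:]
    actor_loss = -K.sum(actions * K.log(K.clip(probs, 1e-8, 1.0)), axis=1)
    critic_loss = K.squeeze(K.square(target - value), axis=1)
    return actor_loss + critic_loss

inp = Input(shape=(8,))
x = Dense(64, activation='relu')(inp)           # shared trunk
probs = Dense(n_actions, activation='softmax')(x)
value = Dense(1, activation='linear')(x)
out = Concatenate()([probs, value])             # one head for the loss to split
model = Model(inputs=inp, outputs=out)
model.compile(optimizer='adam', loss=combined_loss)
```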
Enjoying your examples Phil, thanks for creating them! I did find some of the explanations not always quite accurate though (at least as I'm interpreting them). For example, a trailing comma with some tuples is necessary simply so they are parsed as a tuple of one item rather than a parenthesised expression (i.e. while (2,) would be a tuple, (2) wouldn't be). The loss function also isn't returning "one function inside of another" as seemed to be described; it's just a regular function that will return a value from calling sum() if invoked. As Python has first-class functions that can be passed around as values, this is how the loss function is passed into compile(), which can accept a string or a function. It's true that one might want a closure to dynamically create and return a loss function with some values bound to it at runtime, but that's not what's happening in your example. The examples as a whole are great though.
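To illustrate the two points (these snippets are mine, not from the video): the trailing comma is what makes a one-element tuple, and a closure-based loss is a *factory* that returns a loss function with values bound in, as opposed to a plain function passed to compile().

```python
# The trailing comma makes the tuple; parentheses alone do not.
assert type((2,)) is tuple
assert (2) == 2 and type((2)) is int

# A closure-returning loss factory, for contrast with a plain function:
# make_loss binds `scale` into the returned loss at call time.
def make_loss(scale):
    def loss(y_true, y_pred):
        return scale * sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return loss

mse_times_two = make_loss(2.0)
```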
Thanks for the great video. Your custom loss function uses `log_lik = y_true*K.log(out)`, but I've generally seen `y_true` moved inside the log (`log_lik = K.log(y_true * out)`). Is there a mathematical reason for doing it the way you have?
Disregard the question. I misunderstood something during the video. Since y_true is a one-hot vector, the two equations are equivalent. Thanks again for posting this!
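For anyone else wondering, the equivalence (with the sum taken over the action dimension, which is the sensible reading of the second form) is easy to check numerically; the probabilities here are made up:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])      # softmax output over 3 actions
y_true = np.array([0.0, 1.0, 0.0])     # one-hot: action 1 was taken

# sum(y * log(p)) picks out log(p[1]) ...
a = np.sum(y_true * np.log(probs))
# ... and log(sum(y * p)) also reduces to log(p[1])
b = np.log(np.sum(y_true * probs))
```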
What's weird is that my score is constantly negative over all episodes, and never positive. Edit: I acted a fool here twice. First, I didn't call the "learn" function when I was following the tutorial, and didn't notice it. Second, I didn't pay attention to how many episodes this method needs before it starts to improve. My bad. Thanks to Phil for being so patient and for his feedback.
Firstly, really enjoyed your video and had fun running your sample. Now a totally newb question. Can this A2C model be applied on problems with continuous action space like Pendulum or Continuous MountainCar? Thanks again for the great video!!
Good question. It's not really an A2C algorithm so much as it is regular actor critic (we're not calculating an advantage function). It doesn't work very well on continuous action spaces, since we're approximating the policy, which is a probability distribution rather than an actual action (i.e. it's the probability of selecting a discrete action, instead of a continuous action itself). For continuous action spaces, check out my videos on Deep Deterministic Policy Gradients.
@@MachineLearningwithPhil I might be missing something, but isn't "delta" simply "advantage"? The advantage is equal to Q(s,a) - V(s), and Q(s,a) = reward + gamma * V(s'), so advantage is simply = reward + gamma * V(s') - V(s) which is what you set your delta to
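Spelled out as code, the quantity in question is the one-sample TD error (sketch only; the terminal-state mask via `done` is a common convention and an assumption here, not necessarily the video's exact line):

```python
def td_error(reward, v_next, v_curr, gamma=0.99, done=False):
    """One-sample estimate of Q(s,a) - V(s): r + gamma*V(s') - V(s)."""
    return reward + gamma * v_next * (1 - int(done)) - v_curr
```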
Great question. I've been pecking away at this stuff for about 3 years now, but that's built on a foundation of programming (off and on) for my entire life. Keep working at it, and you'll get there. But please note, this stuff is HARD. It's going to take time, so just enjoy the process of learning and becoming a better engineer.
Hi there, This is a really good and easy to understand tutorial, thanks for that. Do you have any kind of log file for this example where i can compare my GPU to your RTX 2080 in terms of computing time? Also it would be nice to know how the curve looks like.
If I try to run this code I run out of memory before episode 150. Is this a general problem or just a typo on my part? It also seems like the agent is not learning... I start with an avg_score of -178.657 and end up with -359.719 after 123 episodes...
OK, I fixed the error myself. I was already wondering why it took my PC so long to train an episode. I used the tensorflow.keras import instead of the keras import. I thought it would have no impact because it's basically the same thing. I was wrong... tensorflow.keras => ~2s/episode, keras => ~25s/episode.
@@MachineLearningwithPhil Hello Phil, here is why I think this video is an advantage actor-critic implementation: the calculation of delta is the calculation of the advantage.
I do have plans on doing GANs, but probably some time in October. I'm going to do some more basic stuff this upcoming month, and start to branch out into more ML topics. I've started to saturate the RL stuff.
@@MachineLearningwithPhil Very excited! One thing is for sure in your videos - the code works and does what it's supposed to do, and that means you understand it very well! A lot of 'smart' people know a lot conceptually, but when you start coding is when you really have to know it.
Thx Phil, great tutorial! I tried to run the agent on the CartPole gym. Even with several parameter configs it never learned the game. Maybe you could make a tutorial on adjusting the model to a different env. Would be very helpful, thx.
Same here with Lunar Lander, and his other tutorials - something's fishy here. Edit: what was fishy was my attention. Thanks to Phil for his patience and feedback on GitHub!
Thanks for the tutorial! I'm getting this error where the predicted probabilities go to NaN by the 2nd or 3rd episode, any idea why? I checked the custom loss function over and over again. Also, would this method benefit from experience replay?
Got a GitHub? It's almost certain that a zero is ending up in the log. It might help me help you debug. Also, it would benefit from experience replay. You should add it and see how much it improves.
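To make the failure mode concrete, a quick NumPy sketch (the 1e-8 clip bound is just a common choice, not a magic number):

```python
import numpy as np

probs = np.array([1.0, 0.0, 0.0])        # a collapsed softmax output
with np.errstate(divide='ignore'):
    unsafe = np.log(probs)               # -inf sneaks into the loss -> NaN
# Clipping the probabilities away from 0 and 1 keeps the log finite.
safe = np.log(np.clip(probs, 1e-8, 1 - 1e-8))
```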
@@MachineLearningwithPhil github.com/rockyyliang/RL I'll try adding replay and comment when I get results. I even tried copy and pasting your loss function directly, and playing with the exponent value in K.clip(), always leads to nan
Ooook, now this is weird. I can run for 100 games with no NaN or other issues. I just did a git clone of your repo and ran it with no changes. What type of setup are you running? CPU, GPU? What OS?
which i7, specifically? Is it skylake X? Are you running something like an i7 7820x or similar? If you're using a Skylake X have you updated your BIOS? There was a microcode error in those from the factory that had to be patched. I know because my PC would hard reboot when plotting data in matplotlib, before I patched the BIOS. Other thing to check, do a cd into your home directory then cd into .keras and then open up keras.json and see what the floatx is set to.
@@MachineLearningwithPhil Thank you, I was a bit confused about the definition of the layers in Keras, in particular because you are adopting this shared network, which is not very common. When you define the policy (the variable policy = Model...) you are referencing the same network as the actor but with a different input (which, as you explained, is used for prediction), thus fitting the actor is equal to fitting the policy because they are exactly the same network. Now I have completely understood. Good job!
@@manuelsperoni Wouldn't that mean that fitting the actor is the same as fitting the critic and the policy, since they are all defined as variable = Model? Is that all one big network? I am a bit confused...
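A minimal sketch of my understanding of the shared-network setup (the layer sizes, 1e-8 clip, and TF2 eager workaround are assumptions, not the video's exact values): the actor and policy are two Model objects wrapping the *same* layers, so fitting the actor updates the policy; the critic is a separate value network not shown here.

```python
import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

disable_eager_execution()   # TF2 needs graph mode for a loss using `delta`

input_dims, n_actions = 8, 4            # hypothetical sizes
state = Input(shape=(input_dims,))
delta = Input(shape=[1])                # TD error fed in at training time
dense = Dense(64, activation='relu')(state)
probs = Dense(n_actions, activation='softmax')(dense)

def custom_loss(y_true, y_pred):
    out = K.clip(y_pred, 1e-8, 1 - 1e-8)
    return K.sum(-y_true * K.log(out) * delta)

# Two Models over the SAME layers: the actor takes delta for training,
# the policy takes only the state for action selection.
actor = Model(inputs=[state, delta], outputs=[probs])
actor.compile(optimizer='adam', loss=custom_loss)
policy = Model(inputs=[state], outputs=[probs])
```

Because both Models reference the same layer objects, their predictions on the same state are identical, which is the whole point of the construction.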
Great job Phil! Btw, if anyone encounter Keras2 API warnings, you may refer to 'medium.com/@bramblexu/userwarning-update-your-model-call-to-the-keras-2-api-8a6a5955daac' to fix it.