I swear some day your videos will get me a job in RL. Thank you for your work. I don't think anyone else is doing what you do on YT and it's VERY helpful.
@@birdmw We use RL all the time in business ML and DS in tech. These are powerful frameworks that don't have to be used just for solving trivial games - don't think so literally and rigidly. Ironically, the recommendation engine that likely recommended this video to you on YT is based on an RL framework...
This is the best hands-on RL channel on RU-vid. Learned a lot from you. I would like to point out a few things: 1. This is not a Monte Carlo variant but a TD variant, since Phil is using the estimated future reward rather than the actual return. 2. 12:50 This particular variant is sample-inefficient, not actor-critic in general. I copied your code and benchmarked single-step against batch learning and found batch learning to be superior. The initial performance of single-step learning is better than batch (I think due to the low variance of a single step compared to a batch), but batch learning catches up and surpasses it. Also, in the later stages of learning, the reward of single-step learning starts dropping (I think because the single-step model has much less experience than the batch model in the same timeframe).
Hi, I followed this tutorial and I am getting the following error when I fit the actor model: _SymbolicException: Inputs to eager execution function cannot be Keras symbolic tensors, but found []
Excellent video, thx. A few updates for tf.keras / TF2.2+ users: (i) Need to declare self.n_actions=n_actions in class Agent init; (ii) Model(input...should be Model(inputs...; (iii) in TF2.2 need to disable eager execution (using tf.python.framework_ops.disable_eager_execution() ); (iv) I also needed to specify gpu device and set_memory_growth(dev, True); (v) all keras modules updated to tensorflow.keras imports. I am hoping you have an A3C video on your channel!
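For anyone applying the fixes above, a minimal setup sketch (assuming TF 2.2+; the exact module path of `disable_eager_execution` has moved around between releases, so treat this as one working variant rather than the only one):

```python
import tensorflow as tf
# Graph mode is needed because the custom loss closes over a symbolic
# Keras tensor (the delta input), which eager execution rejects.
from tensorflow.python.framework.ops import disable_eager_execution

disable_eager_execution()

# Optional: let TF grow GPU memory instead of reserving all of it up front.
for dev in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(dev, True)
```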
Hi Phil, I really enjoyed your tutorial on the actor-critic model. Would you mind me asking: is the custom loss for gradient descent? But the actor requires gradient ascent, correct me if I'm wrong? BTW, is the delta actually -TD_error?
Been hunting for easy-to-understand content like this on keras RL and actor-critic methods for ages. Best I've found so far. Good stuff, man, thanks for the help.
Thanks for the really helpful video! I got it to work flawlessly on a Flappy Bird environment. I just wish I knew how to load the saved models without Keras freaking out about the custom loss function. Makes continuation of training kind of... impossible.
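In case it helps anyone hitting the same wall, Keras either needs to be told about the custom loss at load time, or told not to compile at all. A sketch with a hypothetical stand-in loss (the video's real loss also closes over the delta input tensor, which is why rebuilding the model and using save_weights/load_weights is often the more reliable route):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# Hypothetical stand-in for the tutorial's custom loss.
def custom_loss(y_true, y_pred):
    out = K.clip(y_pred, 1e-8, 1 - 1e-8)
    return K.sum(-y_true * K.log(out))

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(2, activation='softmax', input_shape=(4,))])
model.compile(optimizer='adam', loss=custom_loss)
model.save('actor.h5')

# Option 1: map the saved loss name back to the function.
loaded = tf.keras.models.load_model(
    'actor.h5', custom_objects={'custom_loss': custom_loss})

# Option 2: skip compilation entirely and re-compile yourself.
loaded = tf.keras.models.load_model('actor.h5', compile=False)
loaded.compile(optimizer='adam', loss=custom_loss)
```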
Hi Phil, thank you for your great explanation, it has really become more or less clear. I found the following error at line 66, where the actor is fitting: "TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model." Did this happen due to the Keras version (2.4.1) I use? I'd appreciate it if you could suggest how to fix it. Thank you in advance!!
Thanks for this. I'm hoping to construct a similar model where the input to the network is time series data. I hope it is as easy as it sounds.
Great content. This Actor-Critic video uses 3 NNs, while Dueling Deep Q Learning (another video of yours) uses 2 NNs - V and A. What is the difference between the two approaches, and which one is better?
How is the policy model even being affected by the training phase? I see no connection between fitting the actor/critic and the policy choosing actions.
Hey, Phil. Can I ask you your opinion? I've implemented AC exactly as you did, same parameters, same environment. However my plot seems to have way more variance. It reaches the peak, then breaks, then peaks again back and forth. Then I implemented PPO, with the exact same body algorithm as your AC, only changing the loss, and though it peaks sooner, it once again breaks and peaks back and forth (extremely high variance). Do you have an idea what causes this?
I'm getting back to the actor critic material in the coming weeks. Once I finish this deep q learning course I'm going to prep an actor critic course as well. I've got PPO on my list as well as A2C - I'll see if I can fit in SAC in a reasonable time frame as well.
Hi, I have a question at 21:35 about the "custom_loss": why is the loss function designed as "-log_lik * delta"? I know that -log_lik is like cross entropy, but what about the "*delta"?
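The intuition behind "-log_lik * delta" can be seen in a toy NumPy sketch (this is not the video's code; the single softmax policy, learning rate, and TD error value here are all made up): minimizing -log pi(a) * delta pushes the probability of the sampled action up when delta > 0 and down when delta < 0, which is exactly policy-gradient ascent done via loss descent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                 # logits of a 3-action policy
action, delta, lr = 1, 2.0, 0.1     # hypothetical positive TD error

for _ in range(50):
    pi = softmax(theta)
    # gradient of -delta * log(pi[action]) w.r.t. theta is
    # (pi - one_hot(action)) * delta
    grad = pi.copy()
    grad[action] -= 1.0
    theta -= lr * grad * delta      # gradient DESCENT on the loss

# with delta > 0, the probability of the chosen action has risen
```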
Thanks for your videos! I have 2 questions ^^ 1) If my rewards are between +1 and -1, is there a reason to use linear activation rather than tanh for the critic? 2) I coded something pretty similar 2 weeks ago, and in my equivalent of the choose_action function I do probabilities *= mask (mask is a one-hot array masking impossible moves in the Connect 4 board game) and then renormalize with probabilities /= np.sum(probabilities)... But sometimes I get an error: probabilities contains NaN... That's strange, because I can't divide by zero with the softmax activation... Have you seen something similar before? (I've edited my message to add this question: 3) I saw a comment under this video where you say it's not A2C, just AC, but isn't the delta you're calculating the advantage?)
Question 2 solved by writing my own softmax and clipping the logits between -20 and 20, because the exponential was going to infinite values, giving me [0, 0, 0, 1, 0, 0, 0]. If the middle column wasn't playable, I applied the mask [1, 1, 1, 0, 1, 1, 1], which left a zero vector, so masked_output / sum(masked_output) gave me NaN.
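For anyone else hitting this, here's roughly the shape of the fix in NumPy (a sketch; the clip bound of 20 and the uniform fallback over legal moves are my choices, not anything from the video):

```python
import numpy as np

def masked_softmax(logits, mask, clip=20.0):
    # Clip logits so exp() cannot overflow and collapse to a 0/1 vector.
    z = np.clip(logits, -clip, clip)
    e = np.exp(z - z.max()) * mask      # zero out illegal moves
    total = e.sum()
    if total == 0:                      # every legal move underflowed
        e = mask.astype(float)          # fall back to uniform over legal moves
        total = e.sum()
    return e / total
```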
Is using multiple models unavoidable as opposed to something like one model with multiple outputs? I've seen a few tutorials where the actor and critic losses are apparently just summed for the same model. Your method seems more correct than those because the losses are matched to the correct outputs rather than lumped together, but I'm wondering if there's a way to get it all in the one model. Perhaps the issue is having to get the delta/advantage produced by one loss function and taking it as an input to another loss function, which might be impossible in TF/Keras. Is my thinking correct?
Yeah, the framework plays a large part in the implementation. In PyTorch I sum the actor and critic losses and typically use a network with shared lower layers and 2 separate output layers. I tend to use PyTorch more, so I'm not sure if your suggestion is impossible or not.
@@MachineLearningwithPhil Thanks for your quick reply. I think I found the answer here: stackoverflow.com/questions/57559016/keras-built-in-mse-loss-on-2d-data-returns-2d-matrix-not-scalar-loss. Just playing around with it, the actor and critic output layers can be concatenated, which allows for different activations for the actor and critic while sharing a loss function (I think keeping the outputs separate prevents sharing values in the loss function in Keras). The actor and critic losses can then be separately calculated and tf.stack can be used to join them inside the loss function. With simple experimentation, I'm seeing that the weights do move in the right direction, and by splitting out crossentropy and MSE losses as metrics (as a simple example of combining probabilistic and regression outputs rather than full actor-critic), I can see that each does move smoothly in the right direction. In the link, I'm reading "per-sample" to mean unit- or node-specific, which means that the critic portion of the loss won't pollute the actor-specific weights and vice versa. That was my original concern with simply adding the losses - I would think this would cause the losses of one output of the model to affect the weights of the other output. Does all of this make sense?
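A minimal sketch of the single-model variant being discussed (hypothetical 8-dim state and 4 actions; the 1e-8 clip and layer sizes are assumptions, not from the video). Concatenating the two heads lets one loss function see both, then split them back apart:

```python
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

n_actions = 4

def combined_loss(y_true, y_pred):
    # split the concatenated outputs back into actor and critic parts
    probs, value = y_pred[:, :n_actions], y_pred[:, n_actions:]
    actions, target = y_true[:, :n_actions], y_true[:, n_actions:]
    actor_loss = -K.sum(actions * K.log(K.clip(probs, 1e-8, 1.0)), axis=1)
    critic_loss = K.squeeze(K.square(target - value), axis=1)
    return actor_loss + critic_loss

inp = Input(shape=(8,))
x = Dense(64, activation='relu')(inp)           # shared trunk
probs = Dense(n_actions, activation='softmax')(x)
value = Dense(1, activation='linear')(x)
out = Concatenate()([probs, value])             # one head for the loss to split
model = Model(inputs=inp, outputs=out)
model.compile(optimizer='adam', loss=combined_loss)
```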
Enjoying your examples Phil, thanks for creating them! I did find some of the explanations not always quite accurate though (at least as I'm interpreting them). For example, a trailing comma with some tuples is necessary simply so they are parsed as a tuple of one item rather than a parenthesised expression (i.e. while (2,) would be a tuple, (2) wouldn't be). The loss function also isn't returning "one function inside of another" as seemed to be described; it's just a regular function that will return a value from calling sum() if invoked. As Python has first-class functions that can be passed around as values, this is how the loss function is passed into compile(), which can accept a string or a function. It's true that one might want a closure to dynamically create and return a loss function with some values bound to it at runtime, but that's not what's happening in your example. The examples as a whole are great though.
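To illustrate the two points (these snippets are mine, not from the video): the trailing comma is what makes a one-element tuple, and a closure-based loss is a *factory* that returns a loss function with values bound in, as opposed to a plain function passed to compile().

```python
# The trailing comma makes the tuple; parentheses alone do not.
assert type((2,)) is tuple
assert (2) == 2 and type((2)) is int

# A closure-returning loss factory, for contrast with a plain function:
# make_loss binds `scale` into the returned loss at call time.
def make_loss(scale):
    def loss(y_true, y_pred):
        return scale * sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return loss

mse_times_two = make_loss(2.0)
```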
Thanks for the great video. Your custom loss function uses `log_lik = y_true*K.log(out)`, but I've generally seen `y_true` moved inside the log (`log_lik = K.log(y_true * out)`). Is there a mathematical reason for doing it the way you have?
Disregard the question. I misunderstood something during the video. Since y_true is a one-hot vector, the two equations are equivalent. Thanks again for posting this!
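For anyone else wondering, the equivalence (with the sum taken over the action dimension, which is the sensible reading of the second form) is easy to check numerically; the probabilities here are made up:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])      # softmax output over 3 actions
y_true = np.array([0.0, 1.0, 0.0])     # one-hot: action 1 was taken

# sum(y * log(p)) picks out log(p[1]) ...
a = np.sum(y_true * np.log(probs))
# ... and log(sum(y * p)) also reduces to log(p[1])
b = np.log(np.sum(y_true * probs))
```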
What's weird is that my score is constantly negative over all episodes, and never positive. Edit: I acted a fool here twice. First, I didn't call the "learn" function when I was following the tutorial, and didn't notice it. Second, I didn't pay attention to how many episodes this method needs before it starts to improve. My bad. Thanks to Phil for being so patient and for his feedback.
Firstly, really enjoyed your video and had fun running your sample. Now a totally newb question. Can this A2C model be applied on problems with continuous action space like Pendulum or Continuous MountainCar? Thanks again for the great video!!
Good question. It's not really an A2C algorithm so much as it is regular actor critic (we're not calculating an advantage function). It doesn't work very well on continuous action spaces, since we're approximating the policy, which is a probability distribution rather than an actual action (i.e. it's the probability of selecting a discrete action, instead of a continuous action itself). For continuous action spaces, check out my videos on Deep Deterministic Policy Gradients.
@@MachineLearningwithPhil I might be missing something, but isn't "delta" simply "advantage"? The advantage is equal to Q(s,a) - V(s), and Q(s,a) = reward + gamma * V(s'), so advantage is simply = reward + gamma * V(s') - V(s) which is what you set your delta to
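Spelled out as code, the quantity in question is the one-sample TD error (sketch only; the terminal-state mask via `done` is a common convention and an assumption here, not necessarily the video's exact line):

```python
def td_error(reward, v_next, v_curr, gamma=0.99, done=False):
    """One-sample estimate of Q(s,a) - V(s): r + gamma*V(s') - V(s)."""
    return reward + gamma * v_next * (1 - int(done)) - v_curr
```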
Great question. I've been pecking away at this stuff for about 3 years now, but that's built on a foundation of programming (off and on) for my entire life. Keep working at it, and you'll get there. But please note, this stuff is HARD. It's going to take time, so just enjoy the process of learning and becoming a better engineer.
Hi there, This is a really good and easy to understand tutorial, thanks for that. Do you have any kind of log file for this example where i can compare my GPU to your RTX 2080 in terms of computing time? Also it would be nice to know how the curve looks like.
If I try to run this code I run out of memory before episode 150. Is this a general problem or just a typo on my part? It also seems like the agent is not learning... I start with an avg_score of -178.657 and end up with -359.719 after 123 episodes...
OK, I fixed the error myself. I was already wondering why it took my PC so long to train an episode. I used the tensorflow.keras import instead of the keras import. I thought it would have no impact because it's basically the same thing. I was wrong... tensorflow.keras => ~2s/episode, keras => ~25s/episode.
@@MachineLearningwithPhil Hello Phil, here is why I think this video is an advantage actor-critic implementation: the calculation of delta is the calculation of the advantage.
I do have plans on doing GANs, but probably some time in October. I'm going to do some more basic stuff this upcoming month, and start to branch out into more ML topics. I've started to saturate the RL stuff.
@@MachineLearningwithPhil Very excited! One thing is for sure in your videos - the code works and does what it's supposed to do, and that means you understand it very well! A lot of 'smart' people know a lot conceptually, but when you start coding is when you really have to know it.
Thx Phil, great tutorial! I tried to run the agent on the CartPole gym. Even with several parameter configs it never learned the game. Maybe you could make a tutorial on adjusting the model to a different env. Would be very helpful, thx.
Same here with Lunar Lander, and his other tutorials - something's fishy here. Edit: what was fishy was my attention. Thanks to Phil for his patience and feedback on GitHub!
Thanks for the tutorial! I'm getting this error where the predicted probabilities go to NaN by the 2nd or 3rd episode, any idea why? I checked the custom loss function over and over again. Also, would this method benefit from experience replay?
Got a GitHub? It's almost certain that a zero is ending up in the log. It might help me help you debug. Also, it would benefit from experience replay. You should add it and see how much it improves.
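To make the failure mode concrete, a quick NumPy sketch (the 1e-8 clip bound is just a common choice, not a magic number):

```python
import numpy as np

probs = np.array([1.0, 0.0, 0.0])        # a collapsed softmax output
with np.errstate(divide='ignore'):
    unsafe = np.log(probs)               # -inf sneaks into the loss -> NaN
# Clipping the probabilities away from 0 and 1 keeps the log finite.
safe = np.log(np.clip(probs, 1e-8, 1 - 1e-8))
```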
@@MachineLearningwithPhil github.com/rockyyliang/RL I'll try adding replay and comment when I get results. I even tried copy and pasting your loss function directly, and playing with the exponent value in K.clip(), always leads to nan
Ooook, now this is weird. I can run for 100 games with no NaN or other issues. I just did a git clone of your repo and ran it with no changes. What type of setup are you running? CPU, GPU? What OS?
which i7, specifically? Is it skylake X? Are you running something like an i7 7820x or similar? If you're using a Skylake X have you updated your BIOS? There was a microcode error in those from the factory that had to be patched. I know because my PC would hard reboot when plotting data in matplotlib, before I patched the BIOS. Other thing to check, do a cd into your home directory then cd into .keras and then open up keras.json and see what the floatx is set to.
@@MachineLearningwithPhil Thank you, I was a bit confused about the definition of the layers in Keras, in particular because you are adopting this shared network, which is not very common. When you define the policy (the variable policy = Model...) you are referencing the same network as the actor but with a different input (which, as you explained, is used for prediction), thus fitting the actor is equal to fitting the policy because they are exactly the same network. Now I have completely understood. Good job!
@@manuelsperoni Wouldn't that mean that fitting the actor is the same as fitting the critic and the policy, since they are all defined as variable = Model? Is that all one big network? I am a bit confused...
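A minimal sketch of my understanding of the shared-network setup (the layer sizes, 1e-8 clip, and TF2 eager workaround are assumptions, not the video's exact values): the actor and policy are two Model objects wrapping the *same* layers, so fitting the actor updates the policy; the critic is a separate value network not shown here.

```python
import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

disable_eager_execution()   # TF2 needs graph mode for a loss using `delta`

input_dims, n_actions = 8, 4            # hypothetical sizes
state = Input(shape=(input_dims,))
delta = Input(shape=[1])                # TD error fed in at training time
dense = Dense(64, activation='relu')(state)
probs = Dense(n_actions, activation='softmax')(dense)

def custom_loss(y_true, y_pred):
    out = K.clip(y_pred, 1e-8, 1 - 1e-8)
    return K.sum(-y_true * K.log(out) * delta)

# Two Models over the SAME layers: the actor takes delta for training,
# the policy takes only the state for action selection.
actor = Model(inputs=[state, delta], outputs=[probs])
actor.compile(optimizer='adam', loss=custom_loss)
policy = Model(inputs=[state], outputs=[probs])
```

Because both Models reference the same layer objects, their predictions on the same state are identical, which is the whole point of the construction.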
Great job Phil! Btw, if anyone encounter Keras2 API warnings, you may refer to 'medium.com/@bramblexu/userwarning-update-your-model-call-to-the-keras-2-api-8a6a5955daac' to fix it.