
Actor Critic Algorithms 

Siraj Raval
770K subscribers
94K views

Reinforcement learning is hot right now! Policy gradients and deep Q-learning can only get us so far, but what if we used two networks to help train an AI instead of one? That's the idea behind actor-critic algorithms. I'll explain how they work in this video using the 'Doom' shooting game as an example.
Code for this video:
github.com/llS...
i-Nickk's winning code:
github.com/I-N...
Vignesh's runner up code:
github.com/tj2...
Taryn's Twitter:
/ tarynsouthern
More learning resources:
papers.nips.cc...
rll.berkeley.ed...
web.mit.edu/jnt...
mlg.eng.cam.ac....
mi.eng.cam.ac.u...
Please Subscribe! And like. And comment. That's what keeps me going.
Want more inspiration & education? Connect with me:
Twitter: / sirajraval
Facebook: / sirajology
Join us in the Wizards Slack channel:
wizards.herokua...
And please support me on Patreon:
www.patreon.co...
Instagram: / sirajraval
Signup for my newsletter for exciting updates in the field of AI:
goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: chatgptschool.io/
Sign up for my AI sports betting bot, WagerGPT! (500 spots available):
www.wagergpt.co

Published: 29 Aug 2024

Comments: 114
@Ronnypetson · 6 years ago
Siraj is definitely very important for the dissemination of AI knowledge. I myself owe Siraj many thanks for this incredible channel!!
@robertotomas · 4 years ago
Wow, this is seriously a fantastic introduction motivating AC methods.
@sophieg.9272 · 3 years ago
You saved my life with this video. Thanks! I have to write a paper that includes this topic, and I struggled for so long to understand it, but now it seems so easy.
@adamduvick · 5 years ago
This video is just about this article: towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
@tomw4688 · 3 years ago
He goes so fast. It's like he's talking to someone who already understands it.
@timothyquill889 · 3 years ago
Think he's more interested in showing off his knowledge than actually helping anyone.
@chicken6180 · 6 years ago
Out of all the channels I'm subbed to, this is the only one I have notifications on for, because it's good.
@SirajRaval · 6 years ago
Thanks spark! Also tell me what video topic you'd love to see.
@jeffpeng1118 · 3 years ago
How does the critic know what the action score is?
@VigneshKumar-xd7xi · 6 years ago
Thanks for the recognition @Siraj. Looking forward to your upcoming works on the channel. A Halite 2 AI bot, perhaps?
@adrianjaoszewski2631 · 6 years ago
Did anybody actually try to run the source code? I've seen the same code snippet in two different places and neither of them worked. Frankly, not only does it not work, it also has a lot of redundancy (many unused variables and errors) and typos which make the code behave incorrectly but go unnoticed, because the update methods are actually dead code that is never called. Basically the whole example is doomed by the fact that it's just a single run through the environment, and it usually ends with the pendulum just hanging down. After fixing this it still does not work, because the update function is never called. If you call the update function at the end of the train method, it has runtime errors because of typos and wrong model use (trying to assign critic weights to the actor). And to be honest, even the neural nets are wrong: both have ReLUs as output layers, even though the outputs can be negative (impossible with ReLU) and the Q-values should be mostly negative (most of the rewards are negative).
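For reference, here is a minimal sketch of output layers that avoid the ReLU issue described above, assuming TensorFlow/Keras and Pendulum-style dimensions; this is illustrative only, not the repository's code:

```python
from tensorflow.keras.layers import Input, Dense, Concatenate, Lambda
from tensorflow.keras.models import Model

state_dim, action_dim, action_bound = 3, 1, 2.0  # Pendulum-style shapes (assumed)

# Actor: tanh output scaled to the action range, so negative torques are possible.
s_in = Input(shape=(state_dim,))
h = Dense(64, activation="relu")(s_in)
raw_action = Dense(action_dim, activation="tanh")(h)
action_out = Lambda(lambda x: x * action_bound)(raw_action)
actor = Model(s_in, action_out)

# Critic: linear output head, since Q-values can be (and here mostly are) negative.
s_in2 = Input(shape=(state_dim,))
a_in = Input(shape=(action_dim,))
h2 = Dense(64, activation="relu")(Concatenate()([s_in2, a_in]))
q_out = Dense(1, activation="linear")(h2)
critic = Model([s_in2, a_in], q_out)
critic.compile(optimizer="adam", loss="mse")
```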
@LunnarisLP · 6 years ago
It's usually just sample code, because going through the whole code would often require explaining the libraries used and so on. Google did the same with their policy gradient video with TensorFlow :D
@chaitanyayanamala845 · 6 years ago
My virtual teacher Siraj
@the007apocalypse · 3 years ago
Apparently code wasn't the only thing he plagiarised. "Imagine this as a playground with a kid (the “actor”) and her parent (the “critic”). The kid is looking around, exploring all the possible options in this environment, such as sliding up a slide, swinging on a swing, and pulling grass from the ground. The parent will look at the kid, and either criticize or compliment her based on what she did." towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
@kushalpatel8939 · 4 years ago
Amazing video. Nicely explained.
@luck3949 · 6 years ago
Hi Siraj! Can you please make a video on program synthesis? Please please please, I beg you! To me it seems like the straightest way to get a Skynet-level AI, but it is so underhyped that I didn't even know the word until I googled the idea behind it. I have no idea why nobody talks about that topic. I have no idea why they don't use neural networks. It seems that AlphaGo suits that task almost perfectly (it is also a search in a tree), but I haven't heard about any revolution in that area.
@dewinmoonl · 6 years ago
Program synthesis doesn't use AI because the patterns are too complicated and the data is too sparse. But if you want to watch synthesis, I stream on Twitch under "evanthebouncy".
@Belowzeroism · 5 years ago
Creating programs requires AGI, which is far, far beyond our reach for now.
@ionmosnoi · 6 years ago
The source code is not working, the target weights are not updated!
@davidm.johnston8994 · 6 years ago
Very interesting video as usual, thank you! :-)
@underlecht · 1 year ago
Most interactive and most unclear/inaccurate video on actor-critic. Thank you!
@larryteslaspacexboringlawr739 · 6 years ago
Thank you for the actor-critic video.
@NolePTR · 6 years ago
The way AlphaZero did it, if I understand right, is that it critiques the current state, not the future state given an action. So all you have to put in is S to receive the fitness (and policy vector). It's more of a fitness value than a reward, due to context. This is possible since chess has a finite number of positions the pieces can be in. The best output from the policy network is simulated and then passed back through the NN. State transition predictions are actually hardcoded (it always returns the ACTUAL state that would occur given an action, not a prediction of the state from a simulate_move function). So if my understanding is right, is this used so that instead of hardcoding the state transitions for simulation, it uses a NN to predict the outcome state?
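For illustration, here is a rough sketch of the state-only, two-headed network that comment describes, assuming TensorFlow/Keras; the board encoding and layer sizes are made up (4672 is the usual AlphaZero chess move count):

```python
from tensorflow.keras import layers, Model

board = layers.Input(shape=(8, 8, 12))                 # piece planes (illustrative encoding)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(board)
x = layers.Flatten()(x)
policy = layers.Dense(4672, activation="softmax", name="policy")(x)  # move probabilities
value = layers.Dense(1, activation="tanh", name="value")(x)          # in [-1, 1]: loss .. win
model = Model(board, [policy, value])   # critiques the *state* only, no action input
```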
@SLR_96 · 4 years ago
Suggestion: in videos where you're trying to explain an idea or a method in a general form, try to simplify it as much as possible and don't go into much detail... Also, definitely use examples and simple analogies as much as you can, because as we all know, learning works best with more examples.
@FabianAmran · 2 years ago
I agree
@tonycatman · 6 years ago
I watched a demo from NVIDIA this week in which they played a John Williams type of music score. It was unbelievably good. It'll be interesting to see what people come up with. A new Christmas carol?
@SirajRaval · 6 years ago
That’s dope! Hans Zimmer AI next.
@dshoulders · 6 years ago
Where can I find this demo?
@tonycatman · 6 years ago
Here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-egJ0PTKQp4U.html. Starts at about 02:00. I'm not sure how much licence the orchestra had.
@lordphu · 5 years ago
The correct term for finding the derivative is to "differentiate", not "derive".
@matthewdaly8879 · 6 years ago
So is the actor's predicted best choice then optimized with gradient ascent based on the critic's Q-values?
@vornamenachname906 · 3 years ago
no.
@deepaks.m.6709 · 6 years ago
Finally you've controlled your speed. Love you bro :)
@dustinandrews89019 · 6 years ago
Perfect timing. I am creating an AC on a toy grid-world problem and struggling with using the Q-value to update the actor (output softmax((4,))). I'll check out the code.
@dustinandrews89019 · 6 years ago
Siraj, it would be great if you could zoom in on how you use the gradients from the critic to update the actor. I know it's the chain rule, but a simplified example walk-through would be awesome.
@kaushikdr · 3 years ago
Great video! One question: why do we need a "model" to act as a critic? Don't we just need to maximize our reward? Also, how can we know if we have chosen the "best" action if we don't know all the rewards of an infinite input space? (Of course, in chess there is a finite input space.)
@chas7618 · 2 years ago
The actor-critic algorithm is a two-part algorithm: it has both a policy model, which takes the actual action, and a value function, which tells the policy model how good the action was. Improving RL and applying it to real-world problems means our action space is continuous. Value-based RL methods such as deep Q-learning simply cannot function in highly continuous action spaces, so we need policy-gradient-based approaches. To improve policy gradient methods further, we need value functions. Hence we need a combination of value iteration and policy-gradient-based approaches, and therefore actor-critic RL algorithms.
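A minimal sketch of those two pieces for a discrete-action task, assuming TensorFlow/Keras; the sizes are placeholders and this is not the video's code:

```python
from tensorflow.keras import layers, Model

state_dim, n_actions = 4, 2   # placeholder sizes

# Policy model ("actor"): outputs a probability distribution over actions.
s = layers.Input(shape=(state_dim,))
h = layers.Dense(32, activation="relu")(s)
actor = Model(s, layers.Dense(n_actions, activation="softmax")(h))

# Value function ("critic"): scores how good a state is under the current policy.
s2 = layers.Input(shape=(state_dim,))
h2 = layers.Dense(32, activation="relu")(s2)
critic = Model(s2, layers.Dense(1)(h2))
```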
@chas7618 · 2 years ago
In RL there is constant uncertainty about whether we have found the best action to take in a particular state; this is the problem of exploitation versus exploration. Optimizing the RL agent means tweaking the weights of the policy or value-function network until we converge on the best possible action to take in each state. Studying multi-armed bandit problems teaches the exploration/exploitation problem in great detail.
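A toy illustration of that exploration/exploitation trade-off: an epsilon-greedy agent on a hypothetical 3-armed bandit (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # unknown to the agent
q_est = np.zeros(3)                      # estimated value of each arm
counts = np.zeros(3)
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        a = rng.integers(3)              # explore: pick a random arm
    else:
        a = int(np.argmax(q_est))        # exploit: pick the best arm so far
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]   # incremental mean update

print(q_est)   # converges toward true_means given enough exploration
```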
@unicornAGI · 6 years ago
Hey Siraj! I got a chance to implement one of the NIPS 2017 papers, and I have selected the reinforcement learning field. How hard will it be, and what is the procedure for implementing a paper?
@davidmoser1103 · 6 years ago
The linked source is for playing a pendulum game, not Doom, which is much more complex. Honestly, I don't think you ever wrote a bot for playing Doom; that's why you only show 5 seconds of Doom being played. To prove me wrong, link the source code for the Doom bot.
@somekid338 · 6 years ago
Gebregl I believe Arthur Juliani made source code for a Doom bot using this method. I would recommend checking out his explanation instead.
@LunnarisLP · 6 years ago
GJ Sherlock. Since Siraj is mainly making YouTube tutorials for noobs like us, he probably doesn't code many major projects like the Doom one would be, which was probably created by a whole team, like most of those major projects. Not only that, but the Doom bot was probably trained for multiple days on really powerful machines. So GJ on spotting that he didn't code the Doom bot himself :D
@cryptomustache9921 · 6 years ago
Is it being applied to Doom for some particular reason, or, given time, would it work on any FPS game? Does it train with the game rendering on screen, or just the code running at super speed, able to play multiple games? Thanks for your videos.
@siriusblack9999 · 6 years ago
But... how does the critic network learn which actions/states to give high Q-values and which to give low ones?
@toxicdesire8811 · 6 years ago
Sirius Black I think it will depend on the boundary conditions of the actions taken by the actor.
@siriusblack9999 · 6 years ago
I meant more generally: what purpose does the critic have versus just rewarding the actor directly with whatever you would otherwise reward the critic with? Or is the critic's only purpose to "interpolate" intermittent rewards? I.e., you have one reward every 50 generations, and the critic attempts to learn how the other 49 generations should be rewarded to get to that final reward? And if that IS the purpose, why not just use synthetic gradients instead? Or is this just another case of "let's give the same thing two different names just to confuse people", just like how "perceptron" and "neural network layer" sound like completely unrelated topics but are actually the exact same thing, except you normally don't care about input gradients in a perceptron because it's only one layer (and you therefore normally don't implement them, even though you could and it would still be a perceptron, and then you could also use the exact same implementation as a hidden layer in a neural network)?
@neilslater8223 · 6 years ago
In simple policy gradient methods, you would train the actor to maximise total return. But without a critic you cannot predict the return; you have to run the actor to the end of each episode before you can train it a single step. The critic, by *predicting* the final return on each step, allows you to bootstrap and train the actor on each step. It is this bootstrapping process (from the temporal difference learning approach) that makes actor-critic a faster learner than, say, REINFORCE (a pure policy gradient method).
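A sketch of that bootstrapped one-step update, assuming a softmax actor and a state-value critic like the ones sketched earlier; names and hyperparameters are illustrative, not the video's code:

```python
import tensorflow as tf

gamma = 0.99
actor_opt = tf.keras.optimizers.Adam(1e-4)
critic_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(actor, critic, s, a, r, s_next, done):
    # s, s_next: numpy state vectors; a: int action index; r: float reward; done: bool
    s = tf.convert_to_tensor(s[None], tf.float32)
    s_next = tf.convert_to_tensor(s_next[None], tf.float32)
    with tf.GradientTape(persistent=True) as tape:
        v = critic(s)[0, 0]
        v_next = tf.stop_gradient(critic(s_next)[0, 0])
        # Bootstrapped target: a *predicted* return, so we can train every step
        # instead of waiting for the episode to finish.
        target = r + gamma * v_next * (1.0 - float(done))
        td_error = target - v                       # doubles as the advantage estimate
        critic_loss = tf.square(td_error)
        log_prob = tf.math.log(actor(s)[0, a] + 1e-8)
        actor_loss = -log_prob * tf.stop_gradient(td_error)
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    del tape
```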
@siriusblack9999 · 6 years ago
So it's the exact same thing as a synthetic gradient.
@sikor02 · 6 years ago
I'm wondering the same: how is the critic trained? I still can't figure it out. Looking at the code, it seems like the critic predicts the Q-value and then uses the fit function to ... fit what it predicted multiplied by the gamma factor? I can't understand this part.
@spenhouet · 6 years ago
Cool technique!
@SirajRaval · 6 years ago
thanks Sebastian!
@user-ll7mt9wx1i · 6 years ago
I love your videos; they are all helpful for me. But this video doesn't have subtitles, so it's difficult for me. T_T
@G12GilbertProduction · 6 years ago
But how this Q-net archie network goes spinal?
@alexlevine78 · 6 years ago
Is it possible to use multiple agents? My game is a first-person shooter, and multiple agents are allies going against an enemy. Is using the same critic neural net for all agents, but a separate actor for each agent, possible? I want to increase efficiency and make it decentralized. Feel free to PM me. A collaborator might be useful.
@diegoantoniorosariopalomin4977
So, learning from human preferences is an actor-critic model?
@dustinandrews89019 · 6 years ago
This method, "Q-Prop", from earlier this year seems like an improvement on this A-C method, but I don't see much about it online. arxiv.org/pdf/1611.02247.pdf Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine. Has it been overlooked or superseded?
@himanshujat3658 · 6 years ago
Wizard of the week, thank you Siraj!! 😇
@vladimirblagojevic1950 · 6 years ago
Can you please make a video about proximal policy optimization as well?
@Mirandorl · 6 years ago
0:07 How many people checked for Slack notifications?
@rishabhagarwal7540 · 6 years ago
It would be helpful to include the relevant blog post in the video description (in addition to GitHub): towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
@anteckningar · 6 years ago
It feels like he is ripping off that blog post a bit too much...
@andreasv9472 · 6 years ago
Richard Löwenström At least he should give it credit.
@davidmoser1103 · 6 years ago
Not only did he not give proper credit (mentioning it in the video and video description), the code linked is for a pendulum game. And people commenting on that code say it doesn't work, or not well. So, no Doom bot to be found anywhere. Such a pity; he explains things so well, but then lies about the results.
@LemurDrengene · 6 years ago
This is what he does in many of his videos. Sometimes I wonder if he even understands what he "teaches" or if he is just reading other people's work. It's down to the smallest detail, even the playground analogy and the controller with infinite buttons. It's disgusting to earn money like this off other people's work.
@rajroy2426 · 3 years ago
Just saying, in RL you need to reward it when it wins so it knows what winning means.
@chiragshahckshhh9696 · 6 years ago
Nice..!
@richardteubner7364 · 6 years ago
This code has nothing to do with Doom.
@cybrhckr · 6 years ago
Is it just oversimplification, or is this just Q-learning with multiprocessing?
@somekid338 · 6 years ago
No, it works by replacing the advantage in a policy gradient method with an estimate of future rewards, given by the critic network in the form of Q-values. Berkeley's Deep RL Bootcamp, lecture 4A, has a pretty good explanation of it.
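A minimal, hypothetical contrast of the two loss functions being discussed: REINFORCE weights log-probabilities by the full sampled return (so it must wait for the episode to end), while an actor-critic weights them by the critic's estimate, which is available at every step and has lower variance:

```python
import tensorflow as tf

def reinforce_loss(log_probs, returns):           # needs the whole episode's returns G_t
    return -tf.reduce_mean(log_probs * returns)

def actor_critic_loss(log_probs, critic_values):  # critic estimate, available every step
    return -tf.reduce_mean(log_probs * tf.stop_gradient(critic_values))
```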
@toxicdesire8811 · 6 years ago
Are you in India right now? Because the upload time is different this time.
@zakarie · 6 years ago
Great
@jinxblaze · 6 years ago
notification squad hit like
@fabdenur · 6 years ago
Hey Siraj, I'm a huge fan and watch the great majority of your videos. Having said that, let me repeat a bit of constructive criticism: you explain the concepts really well, but often only flash by the actual results. For instance, in this video there are only 5 seconds (from 8:01 to 8:05) of the Doom bot playing. It would be much more satisfying if you showed it playing for, let's say, 15 or 20 seconds. This would only add 10 to 15 seconds to the length of the whole video, but the audience would get to appreciate the results a lot better. Best, and keep up the great work! :)
@davidmoser1103 · 6 years ago
Yes, footage of the Doom bot initially, and after some learning, would be very interesting to see. But he didn't write a Doom bot; the code is for a simple pendulum game. Very disingenuous.
@fabdenur · 6 years ago
Wow. He didn't make that very clear, did he? Not cool.
@julienmercier7790 · 5 years ago
He didn't write the Doom bot. That's the hard truth.
@ishantpundir9747 · 6 years ago
Hey Siraj, I am Ishant. I am 16 and I have dropped out just to work on AI and robotics 24x7. You are a really big inspiration. When are you coming back to India? I would love to meet you.
@notaras1985 · 6 years ago
I have a question, please. When it learned to play chess by itself, was it given the piece and pawn movements? Or did it lack even that?
@onewhoraisesvoice · 6 years ago
Yay!
@Suro_One · 6 years ago
I can't form a better comment than "Amazing". Anyone agree with me? The high-level description of the model seems simple, but it's very complex if you dive deeper. What are your preferred methods of learning things like this?
@rajathshetty325 · 6 years ago
I understood some of those words..
@vornamenachname906 · 3 years ago
This is a bad explanation of why the critic model is important. The background of this second network was an issue: what if you get your reward only after a long series of steps, and you need to update all steps with this one reward? Maybe there were some good and some bad moves; you get a lot of noise if you apply, for example, only "win" and "lose" to all of those steps. The critic model helps you calculate the loss for every step, so you get a high loss on the really bad steps that led to a loss, and can say "okay, you lost, but that move was not that bad".
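A tiny numerical toy example of that credit-assignment point, with made-up values: giving every move the single win/lose outcome is noisy, while a critic's per-step value estimates give each move its own advantage:

```python
import numpy as np

final_reward = 1.0                                   # the game was won
monte_carlo_credit = np.full(5, final_reward)        # every move gets "+1", even blunders

values = np.array([0.1, 0.3, 0.2, 0.6, 0.9])         # hypothetical critic estimates V(s_t)
next_values = np.append(values[1:], 0.0)             # V(s_{t+1}); terminal state is 0
rewards = np.zeros(5); rewards[-1] = final_reward    # reward only at the end
advantages = rewards + next_values - values          # per-step TD errors (gamma = 1)

print(monte_carlo_credit)   # [1. 1. 1. 1. 1.] -- same credit for every move
print(advantages)           # move 3 (0.2 -> 0.6) looks good, move 2 (0.3 -> 0.2) looks bad
```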
@jra5434 · 6 years ago
I made some songs on Amper, but I suck at connecting APIs and other things to Python. I use Spyder and always get errors when I try to connect them together.
@KunwarPratapSingh41951 · 6 years ago
Zeroth comment.. btw love for Siraj, brother
@silentgrove7670 · 4 years ago
I am playing a game without a rule book or an end goal.
@MegaGippie · 4 years ago
Dude, the explanation is awesome. I learned a lot about the topic. But the sounds you lay over the animation of nearly every image are annoying... This distracts me a lot.
@vladomie · 6 years ago
Wow! It appears that AIs now have that critical voice in their head like the one described in Taryn's song ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-f-Tm1yX6-BY.html
@debadarshee · 3 years ago
Rewarding actors to create AI tasks..
@allenday4273 · 6 years ago
Good stuff!
@Kaixo · 6 years ago
Isn't it easier to learn chess by evolution instead of CNNs? I just made Snake with evolution and it works better than when I did it with a neural network. The only problem I need to fix is that in the end all snakes have the same tactic, but I think that'll be easily fixable. I'm now going to make a four-in-a-row with evolution; I hope it works out!
@dippatel1739 · 6 years ago
Kaixo Music Evolution is good, but it's a bit problematic depending on the fitness function.
@sarangs8441 · 6 years ago
Are you in Delhi?
@gautamjoshi3143 · 6 years ago
Are you in India?
@thunder852za · 6 years ago
A flow diagram to explain training rather than shit code.
@MrYashpaunikar · 6 years ago
Are you in Delhi?
@shivashishsingh5915 · 6 years ago
Yash Paunikar He was in Delhi in September.
@daksh6752 · 6 years ago
Very good explanation, but the code could really be improved.
@Iskuhsama · 6 years ago
hello, world
@Donaldo · 6 years ago
sfx :/
@user-hf3fu2xt2j · 4 years ago
Why is it that every time it comes to some RL algorithm, the concept is predictable af?
@danny-bw8tu · 6 years ago
damn, the girl is hot.
@meeravalinawab9372 · 6 years ago
First comment