
Navigating a Virtual World Using Dynamic Programming 

Siraj Raval
769K subscribers
28K views

Let's teach our AI how to get from point A to point B in the Frozen Lake environment in the most efficient way possible using dynamic programming. This is considered reinforcement learning, and we'll try two popular techniques (policy iteration and value iteration). We'll use OpenAI's Gym environment and pure Python to do this.
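
For readers who want to try this before opening the repo, here is a minimal value iteration sketch in the same spirit (not the video's exact code), assuming the classic Gym FrozenLake API where the unwrapped environment exposes nS (number of states), nA (number of actions) and the transition table P[s][a] = [(prob, next_state, reward, done), ...]:

import gym
import numpy as np

env = gym.make("FrozenLake-v0").unwrapped
gamma, theta = 0.99, 1e-8

V = np.zeros(env.nS)
while True:
    delta = 0.0
    for s in range(env.nS):
        # One-step lookahead: expected return of each action under the current V.
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in env.P[s][a])
             for a in range(env.nA)]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:   # stop once no state value changes by more than theta
        break

# Greedy policy with respect to the converged value function.
policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r, _ in env.P[s][a])
                         for a in range(env.nA)]))
          for s in range(env.nS)]
print(np.array(policy).reshape(4, 4))
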
Code for this video:
github.com/llSourcell/navigat...
Please Subscribe! And like. And comment. That's what keeps me going.
Want more inspiration & education? Connect with me:
Twitter: / sirajraval
Facebook: / sirajology
More learning resources:
ocw.mit.edu/courses/aeronauti...
uhaweb.hartford.edu/compsci/cc...
/ deep-reinforcement-lea...
www.cs.cmu.edu/afs/cs/project...
cs.stanford.edu/people/karpath...
www.quora.com/How-is-policy-i...
www0.cs.ucl.ac.uk/staff/d.silv...
Join us in the Wizards Slack channel:
wizards.herokuapp.com/
And please support me on Patreon:
www.patreon.com/user?u=3191693
Instagram: / sirajraval
Signup for my newsletter for exciting updates in the field of AI:
goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: chatgptschool.io/
Sign up for my AI Sports betting Bot, WagerGPT! (500 spots available):
www.wagergpt.co

Published: 17 Jul 2024

Comments: 86
@maxlee3838 6 years ago
I’ve been trying to wrap my head around this for a while and in a 20 minute presentation, it’s suddenly clear. Thanks Siraj for what you do. It is appreciated by many.
@SirajRaval 6 years ago
thanks max for the encouragement
@bauwndule 6 years ago
Great explanation Siraj, wish all my teachers could reinforce their teaching with your skills!
@pyserialkiller110 3 years ago
Hey man! Thank you for sharing your knowledge with us. I love your videos, they help me a lot to comprehend complex topics in my AI classes.
@ringostarkiller7097 6 years ago
Thank you Siraj! Awesome channel!
@robertholtz 6 years ago
Very well presented. Thank you, Siraj.
@SirajRaval 6 years ago
np thanks Robert
@RiteshKumarMaurya 6 years ago
Thanks a lot Siraj....
@OswaldoRGZ 6 years ago
I love these videos, Siraj! Hi from Colombia...
@NicolasPimprenelle 6 years ago
I like the effect with the green background and his hair!
@SirajRaval 6 years ago
haha thx
@i.C.Infinity 6 years ago
Good timing Siraj. This will be useful in my VR Aura.
@SirajRaval 6 years ago
good i want to see it
@i.C.Infinity 6 years ago
All in good time. I'm doing my best to get it finished before I go to India in late December on a 12-month multiple-entry business visa. Get me some of that smartphone market and then travel the world for a few years.
@BagoGarde 6 years ago
Man thanks for the good work! keep it up
@SirajRaval 6 years ago
np
@larryteslaspacexboringlawr739 6 years ago
thank you for dynamic coding video
@abhinav.sharma 6 years ago
Loved this video on RL. Please do more of these... and teach us how to make cool models using TensorFlow, Keras, Theano or whatever... One question: what is your favorite library to work with? Thanks... Loved it...
@SirajRaval 6 years ago
pytorch
@realcygnus 6 years ago
Siraj rocks.......& you just gotta dig that doo
@SirajRaval 6 years ago
thanks!
@elouioui7270 6 years ago
THIS IS SO GOOD
@dtkungfu 6 years ago
Hey Siraj, awesome video! Although I can't seem to wrap my head around something and I was wondering if you could help me out. In the policy iteration function you check whether the policy is stable with this line of code: "if current_action != best_action: stable_policy = True". If they don't equal each other, doesn't that mean the policy is unstable, since our new best_action is different from the previous best action under our policy?
@hugoreborn3702 6 years ago
I think this True should be changed to False.
@shalevlifshitz547 6 years ago
THANK YOU
@scarcade2070 6 years ago
I'm still a bit confused: at 20:20 the second-to-last line says "Final policy derived using Value Iteration", yet you still see a lot of redundant moves where the algorithm seems to move Mario down and then up again. You also see it moving left a lot of times, even though that is where it spawns, right? Seems to me like the optimal policy hasn't been found yet, or am I interpreting what's on the screen wrong? Or is it just that more iterations are needed to find the optimal path? I would think that 10k episodes would remove at least those kinds of repetitive moves, right?
@ricardoluna6106 6 years ago
I have the same question
@VojtechMach 6 years ago
He started off with a 4x4 grid world, but you can see in his terminal at the end that he is suddenly using an 8x8 grid world (he didn't say that). The policy assigns one action to every state. In the latter case, there are 8x8 = 64 states, and the number of arrows printed out in one line is 64. Therefore I assume it is a condensed output of the policy going state by state. In other words, the arrows represent the optimal action at each cell of the 8x8 grid, (probably) going from the top-left corner of the world to the bottom right. I guess you could check this in the code he provided.
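If it helps to visualize, here is a small illustrative snippet (not from the repo) that renders a flat 64-entry policy as the 8x8 grid of arrows that the one-line terminal output condenses, using FrozenLake's action encoding (0=left, 1=down, 2=right, 3=up):

import numpy as np

# Illustrative only: one action per state of the 8x8 FrozenLake, top-left state first.
arrows = np.array(["<", "v", ">", "^"])          # actions 0=left, 1=down, 2=right, 3=up
flat_policy = np.random.randint(0, 4, size=64)   # stand-in for a learned policy
print("\n".join(" ".join(row) for row in arrows[flat_policy].reshape(8, 8)))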
@kwinvdv 6 years ago
I was taught that DP starts at the end/goal and finds which neighbouring states have the smallest (expected) cost to get to it. Next you look at their neighbouring states and find, for each of them, the lowest cost-to-go (to the goal), and so on. Even though DP always finds the optimal path, it does not scale well computationally in higher dimensions.
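As a toy illustration of that backward view (my own example, not the video's code), here is cost-to-go computed from the goal backwards on a deterministic 1-D chain:

# States 0..4 on a line, goal at state 4, each rightward step costs 1.
n, goal = 5, 4
cost_to_go = [float("inf")] * n
cost_to_go[goal] = 0.0
# Sweep backwards from the goal: a state's cost is one step plus its successor's cost.
for s in range(goal - 1, -1, -1):
    cost_to_go[s] = 1.0 + cost_to_go[s + 1]
print(cost_to_go)   # [4.0, 3.0, 2.0, 1.0, 0.0]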
@nolan412 6 years ago
D* (d star)
@GKS225 6 years ago
So what's the difference between Dynamic Programming and Reinforcement Learning?
@andyrossignol16 6 years ago
Hi there... dynamic programming is basically breaking the problem down into subproblems and storing solutions for each state. RL is assigning the "good" or "bad" values so the program knows numerically if it is achieving the desired goal. So, for this example, the DP part is calculating the optimal policy or value function. The RL part is assigning -1 for falling into a hole and +1 for reaching the goal.
@GKS225 6 years ago
Thanks!
@nolan412 6 years ago
Back in the days of Lisp and AI, generating code was the way to optimize your state machines. 1/3 & no dynamic programming imo.
@SirajRaval 6 years ago
what andy said
@masdeval2 6 years ago
Have you shared the link to the material behind you in the notes?
@itsSKG 6 years ago
Siraj ♥
@SirajRaval 6 years ago
santosh
@lthh25 6 years ago
Hi Siraj, thank you so much for the awesome tutorial. I am digesting your code and I have some questions, could you please help me?
1. The number of possible states in an 8x8 frozen lake is 64, but do we always have to have the same number of steps as the number of states? I ask because it seems that the policy was set to have the same length as the environment's state space, but the problem was to get to "G" as fast as possible, wasn't it?
2. I looked up the gym documentation and tried it on my own, and it seemed to me that the attributes nA, nS and P are not standard attributes of an environment. I see that you used these attributes quite naturally without explaining what they are; could you explain why the code still runs, please? Thank you very much!
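For anyone with the same question about point 2: in older gym releases FrozenLake is a DiscreteEnv, so nS, nA and P live on the underlying environment, and the wrapper returned by gym.make forwards unknown attribute lookups to it, which is why the code still runs. A quick check, assuming such a release:

import gym

env = gym.make("FrozenLake8x8-v0")
inner = env.unwrapped        # the FrozenLakeEnv underneath the TimeLimit wrapper
print(inner.nS, inner.nA)    # 64 states, 4 actions
print(inner.P[0][1])         # [(prob, next_state, reward, done), ...] for state 0, action "down"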
@gorkemvids4839 6 years ago
"Just look at the screen for a few hours and boom you get it" he says, and he is right
@chicken6180 6 years ago
so good i feel bad for not watching these videos
@SirajRaval 6 years ago
watch them!!
@kishorkukreja7733 6 years ago
From what I understand, both policy and value iteration can help an agent find the optimal path for a given environment. However, we still go for other methods, Q-learning being one of them, or maybe Monte Carlo. Why? Am I missing some key concept here?
@UnchartedWorlds 6 years ago
Hello Siraj, its the World! What can you teach us today?
@petarking66 6 years ago
Shouldn't line 121 (policy iteration) have == instead of != ? If the policy isn't the same as the previous one, then it didn't converge.
@swazza9999 6 years ago
Sorry if this is late but I was looking at the same thing now. Even when you fix line 121 the value iteration and policy iteration don't give the same results. In fact, as the code is currently, it exits the loop the moment the action in *any* state does not change. But for convergence *all* actions for all states should not change during an iteration. You need to check for convergence *outside* of the loop which iterates through states (line 121 is *inside* this loop). You check for convergence by verifying that the previous policy and the new one are equal. Once you make this modification to the code you should get the same results for value iteration and policy iteration.
@rishikaushik8307 5 years ago
had the same doubt, line 122: stable_policy = False should work
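Along the lines the last few comments describe, here is a hedged sketch (not the repo's exact code; evaluate_policy is an assumed policy-evaluation helper) where the stability check sits outside the per-state loop, so the loop only terminates once no state's greedy action changes:

import numpy as np

def greedy_policy(env, V, gamma):
    # One-step lookahead: pick the action with the highest expected return in each state.
    return np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r, _ in env.P[s][a])
                       for a in range(env.nA)]))
        for s in range(env.nS)
    ])

def policy_iteration(env, evaluate_policy, gamma=0.99):
    policy = np.zeros(env.nS, dtype=int)
    while True:
        V = evaluate_policy(env, policy, gamma)     # policy evaluation step
        new_policy = greedy_policy(env, V, gamma)   # policy improvement step
        if np.array_equal(new_policy, policy):      # stable only if no action changed anywhere
            return policy, V
        policy = new_policy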
@GuillermoColmenero 6 years ago
What is the difference from the A* algorithm?
@jasneetsingh4018 6 years ago
How do you make these videos? You use OBS and green screen filtering, but how do you do it? Can you please make a video on your setup?
@SirajRaval 6 years ago
DSLR, green screen, Final Cut Pro X, that's it
@wizzardofwizzards 6 years ago
All this would apply to mapping for a driverless car scenario?
@andyrossignol16 6 years ago
i'm not an expert on driverless cars, but I'd say this is likely used in mapping, not driving.
@Zzznmop 6 years ago
Is value iteration similar to Newton's method for linear approximation?
@SirajRaval 6 years ago
never thought of that analogy, but I don't think so, since Newton's method uses a second-order derivative to update weight values and value iteration doesn't
@tunestar 6 years ago
What's with all the print(f...)???
@daephx 6 years ago
its fuckin' 2am here what you doin' up siraj?
@SirajRaval 6 years ago
+DaemonPhoenix42 overnight flight to Sri Lanka, late release
@AhmadM-on-Google 6 years ago
fly safe dud
@eddantes9625 6 years ago
Scott E. Page... Markov models
@AhmadM-on-Google 6 years ago
Wasn't this supposed to be in the Video Game AI video lecture series !?
@AhmadM-on-Google 6 years ago
I had to watch this sped up, my brain is used to processing Siraj in high speed.
@AhmadM-on-Google 6 years ago
@siraj roll back this video please, you know why i said this...its too rushed m8, bad for cred
@SirajRaval 6 years ago
it is, every video for the next 10 weeks will be a part of it. I'll also create a playlist
@hvr1996 6 years ago
Kinda similar to the Wumpus World problem
@jazz7946 6 years ago
Mate, I want to learn machine learning the best I can this semester. I have tried many methods; I tried reading the books and whatnot. My sister says there will always be dependencies. I realized soon that there is too much that I don't know, and the only way out seems to be that I start making projects, but I'm worried that I will not get the big picture that way. How wrong am I, mate, and will it take me 10k hours to see the big picture or what?
@artemdmytrenko7031 6 years ago
Just do it.
@andyrossignol16 6 years ago
Yes, just do small projects at first... follow along with tutorials. Just go in blindly and with each new small project you will grasp something different. Everyone learns differently, and maybe you just learn by doing. Also, I highly encourage you to take Linear Algebra, and if you haven't already, also watch a few lectures on Calculus, both via the MIT MOOC available for free on YouTube. Not because it's impressive, but because the MIT professors make it very clear.
@amandamate9117 6 years ago
I have the same problem. When I understand stuff I forget it a week later. I don't have the head for that, we don't have the IQ.
@andyrossignol16 6 years ago
Are you kidding me? I hope your profile is flagged. You're trying to make yourself look like a dumb bimbo to get people to subscribe to you. This account is probably run by a middle-aged Russian dude.
@sebastienjurkowski 6 years ago
don't panic, just do it.
@thedeliverguy879 6 years ago
And I thought dynamic programming means changing the actual code according to the result of some algorithm XD
@nolan412 6 years ago
Making me want to hack on Mr. Rob Oboto: battery low -> route back home.
@luihinwai1 6 years ago
Can't believe I have been spending hours getting python 3.6 and all the libraries just to print some f strings
@dogood8661 6 years ago
First to view, first to like and first to comment
@SirajRaval 6 years ago
congrats!
@hardiepienaar8787 6 years ago
Funny noise at 14:20 :-)
@rolandm.2276 6 years ago
sometimes I'm greedy... epsilon greedy.. just cuz I'm curious..
@blucenere 5 years ago
i have to watch you at 0.75 speed....
@jiucki 6 years ago
I see you more blurry than ever. You should rethink the lighting and the green screen!
@SirajRaval 6 years ago
def yes!! promise
@Zzznmop 6 years ago
Are your shirts getting smaller? lol
@SirajRaval 6 years ago
i am getting bigger