"Stop publishing in Nature" So fucking much, yes. Walling off academic literature is bad enough as it is; doing the same for cutting-edge comp-sci research is just ridiculous. Great video!
Nice video! 11:59 Smart to provide the value network with "hidden" knowledge (since it doesn't affect the actions anyway). Kind of like a human player re-watching a replay of a finished game without fog of war (i.e. with full visibility) in order to learn the opponent's strategy and exploit it in the next game, I guess.
Wonderful explanation of the network pipeline, thank you!! Just a note about the league training (as I got it from the paper): main exploiters play only against the main agents to find weaknesses (not against each other or the rest of the league). League exploiters play against past versions of everyone in the league; as they improve, they periodically put a frozen clone into the league, which eventually becomes an opponent of the mains (they don't play directly against the mains). They also reinitialize themselves to the supervised baseline.
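A toy sketch of those matchmaking rules as I understand them from the comment above. The `League` class, agent-type strings, and the uniform sampling are all my own illustration, not DeepMind's actual implementation:

```python
import random

class League:
    """Hypothetical league: holds frozen snapshots of past players."""

    def __init__(self):
        self.past_players = []  # frozen copies added over time

    def opponent_for(self, agent_type, main_agents):
        if agent_type == "main_exploiter":
            # Main exploiters only play the current main agents,
            # never each other or the rest of the league.
            return random.choice(main_agents)
        if agent_type == "league_exploiter":
            # League exploiters play past snapshots of everyone.
            return random.choice(self.past_players)
        # Main agents mix current mains and past snapshots.
        return random.choice(main_agents + self.past_players)
```

This is just to make the asymmetry concrete: each agent type samples opponents from a different pool.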
Thanks a lot. What's a little unclear to me is what exactly is trained in the big architecture we see. For AlphaZero, the net is trained for two things: the value head to predict the outcome as closely as possible to the actual outcome, and the policy head to predict a next-move distribution as close as possible to what the search concludes. In the big diagram here at 10:25, what exactly is trained, and for what?
The same things are trained: the value head to predict the outcome of a state, and the policy head to predict the action. The difference here is that the policy head is split into multiple individual units (everything at the top, to the right of the value head). At first, this is trained to imitate humans; then it is trained using reinforcement learning.
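A minimal sketch of the shared-trunk / two-head idea being discussed, in plain NumPy. All sizes, weights, and the four-action space are made up; AlphaStar's real architecture is vastly more complex:

```python
import numpy as np

rng = np.random.default_rng(0)
W_trunk = rng.normal(size=(16, 8))   # shared representation
W_value = rng.normal(size=(8, 1))    # value head
W_policy = rng.normal(size=(8, 4))   # policy head, 4 toy actions

def forward(state):
    h = np.tanh(state @ W_trunk)              # shared trunk
    value = np.tanh(h @ W_value)[0]           # predicted outcome in [-1, 1]
    logits = h @ W_policy
    policy = np.exp(logits) / np.exp(logits).sum()  # action distribution
    return value, policy
```

Both heads read the same trunk features, so training either head updates the shared representation, which is the sense in which the whole thing is "one big net".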
@@YannicKilcher So I can just see it as one big neural network that is trained end-to-end in the standard way, and that merely has a complex internal topology in which we can identify subnets, but without separate sub-training targets? (Some other DeepMind work has subparts with their own training targets, e.g. an auto-encoder that compresses the input and is trained for its own auto-encoding objective, entirely independently of the whole system, but that's not the case here? Just subparts of one big net, trained together with the big net as a whole?) By the way, the AlphaStar engine is supposed to be stochastic; do you know what the sources of randomness are here?
Hey Yannic! I know it's been a while since you posted this video, but you explain things very well. Thank you so much for making these. One question: for this study, what do you see the real-world opportunities being for this kind of learning? It seems like chaining many networks to do a more general task would apply to the real world in many different ways. What are your thoughts?
Thanks for this, but I wouldn't mind a longer, more in-depth explanation. It seemed like some spots were just scratching the surface of the concepts. But thank you, still a great primer.
From what I've heard, LSTMs can generally be replaced by 1D convnets. I'm not really sure why this is the case, but do you think it would be appropriate to replace the initial LSTM with a convnet in this case?
I guess essentially WaveNet is a 1D convnet for generating sound, which people previously attempted to do with LSTMs, therefore for sound generation you can perhaps say that they replaced an LSTM with a 1D convnet?
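The key trick that lets a 1D convnet stand in for recurrence is causal padding: each output only sees the current and past inputs, never the future. A toy sketch (the kernel and sequence are purely illustrative; WaveNet additionally uses dilation to reach long horizons):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D convolution: output[t] depends only on x[:t+1]."""
    k = len(kernel)
    # Left-pad with zeros so no future information leaks in.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

y = causal_conv1d(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5]))
# y[0] uses only x[0]; y[t] averages x[t-1] and x[t]
```

The trade-off versus an LSTM: the receptive field is fixed by the kernel size (and dilation), rather than being an unbounded learned memory.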
Nice to see an LSTM being used in an RL application. Markovian approaches seem obviously flawed for learning complicated tasks without excessive feature engineering.
One part that is missing here is the pseudo-rewards they talked about, which seem to have played a very big role. They don't give any examples, but pseudo-rewards are human-generated with the intent to guide agents towards common sense or micro-tactics. I personally wonder how the agents learned blink micro so well when they struggled with much easier tasks. You can easily hand-code some blink micro, but it would be very hard for a neural net to come up with it. I want to know whether this is the result of a pseudo-reward, such as limiting the available units to blink stalkers and pre-setting mandatory blink actions per unit of time. That would explain a lot.
I think actions like blink-micro could be picked up in the first stage, when the agent tries to imitate humans. Of course, you're right, there are pseudo-rewards given, especially when the agent sticks to some pre-defined build order (though that's not always the case). It looks like this is - above all - a giant engineering effort.
lol, it doesn't use hotkeys; instead it can select anything instantaneously and knows everything about friendly units and buildings instantaneously, yes, "compensation". For instance, if unit production ends, the program is informed instantly, so it doesn't need to check on production the way a human does. Humans deal with mouse input; the program can select any pattern of units in a single action after a 200 ms delay. Even at a 200 ms delay, it's incomparable to a human. Watching the state of the game is mostly a single-threaded activity. We have the concept of a "mental checklist" or "action anchoring": players monitor the game in a loop (supply -> production -> upgrades -> minimap -> creep, etc.), while the program has all of the information at all times. Anchoring one action to another to collect information in a timely fashion takes up the majority of a player's focus and APM. Reacting to something within 200 ms is godlike play for a human. This has never been a "fair" exercise: a player doesn't get to practice a thousand years' worth of games, and doesn't have pixel-perfect mind-control over the game. This is nothing but a proof of concept that a machine can learn some complicated tasks.
Great tutorial. Will they have a vanishing-gradient problem if they play for a long time, since they are using an LSTM? Or does it probably only remember the last few steps that count?
The long time-horizon is overcome by the actor-critic framework: The reward is densified using the value function. In addition, the agent sometimes gets an auxiliary reward for following a pre-specified build order. So the LSTM isn't really meant to remember things for that long - in fact, it is usually impossible to do backprop through more than a handful of steps, so it's more geared towards putting the current decision in context of what happened recently (the last few seconds).
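To make "densified using the value function" concrete, here's a toy sketch of an n-step bootstrapped return, the standard actor-critic trick (a generic illustration, not AlphaStar's exact objective): instead of waiting for the final win/loss signal, the target for each step folds in the critic's estimate of the state a few steps ahead.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted return over a short window, bootstrapped with the
    critic's value estimate of the state after the window."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

So even with a sparse terminal reward, every step gets a learning signal through `bootstrap_value`, which is why the LSTM doesn't need to carry gradients across the whole game.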
@@YannicKilcher If so, could this also be achieved by replacing the LSTM with history planes in the input volume, like they did in the game of Go (where the last seven time steps were stored)?
@@YannicKilcher Do you mean it's impossible because of the memory constraints of the GPU? I thought that in theory this could be remedied by using synthetic gradients or memory-efficient backpropagation through time, but maybe it's still difficult in practice for some reason?
I don't know much about AI, but I'm pretty good at StarCraft, and from watching some of the games, the AI definitely struggles to remember things it should already know. Some players were able to confuse the AI by hiding units in the fog of war, revealing them, then hiding them again; and in some games the AI would send flying units over anti-air structures repeatedly, even though it should know that those structures won't move and will kill its air units every time.
I am also curious about how the LSTM layer is used in RL agents to encode memory. Could you instead modify the input features to account for the history of the actions (making it Markovian)? E.g. whether it has started a building outside the camera view?
I was wondering about the same thing. For the game of Go, that is sort of what they did with their seven history planes, right? However, I think the problem is that in StarCraft the correlation length between the actions to be taken and the history varies greatly, so you'd have to store practically all of the history with its many timesteps. This would make for huge input volumes, leading to models that are not only impractical but probably also impossible to train. So it makes sense that they have to be smarter about it and dynamically encode such memory through LSTMs. The above is just a guess, and I'm happy to be corrected if wrong.
Yes, this is definitely a possibility and is done wherever possible. But usually, there's still some information that is too complicated or implicit to encode directly and that's where you want the model to learn what to remember. I guess this comes back to the old paradigm that deep learning is often a better feature engineer than a human 🤷♀️
I think in Go, the recent history is important because the legality of some moves depends on it, like some capture-recapture cycles. In Atari, the last 4 or so frames are included because the game engine sometimes flickers and doesn't show your avatar or some opponents. In both cases, this is enough to make the state fully Markovian. In StarCraft, you're absolutely right, the amount of history you'd need to include to achieve the same thing would be huge, so the solution is a mixture of explicit (as input) and implicit (lstm) history encoding.
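The explicit (as input) side of that mixture is just frame stacking: feed the last k observations together so recent context is part of the state, as with Atari's 4 frames or Go's history planes. A toy sketch (the observation format here is made up):

```python
from collections import deque

class FrameStack:
    """Keeps the last k observations; older ones fall off automatically."""

    def __init__(self, k, blank):
        # Start filled with a blank observation so the stack is always full.
        self.frames = deque([blank] * k, maxlen=k)

    def observe(self, frame):
        self.frames.append(frame)   # maxlen evicts the oldest frame
        return list(self.frames)    # stacked input for the policy
```

Anything the agent must remember beyond these k steps has to live in the LSTM state instead, which is exactly the implicit half of the mixture.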
Why would it be interesting to make it Markovian? What would be the benefit of doing so? (On the other hand, isn't an LSTM already Markovian, i.e. the next internal state of the LSTM is independent of any history given its current internal state? And if you conditioned it on former actions, wouldn't that effectively make it non-Markovian?)
@@YannicKilcher Yes, but you were saying that the bot can see units while the units are off-camera. I meant that perhaps this is because the bot would otherwise completely forget about those units' existence, unlike a human, who would remember they were there even without seeing them.
TBH, this looks like classic engineering with a little bit of deep learning thrown in. If league training is their main contribution, then the science here is meh. This work has also influenced how pro StarCraft players play: these days it is common not to strive for a perfect eco build but rather to oversaturate in anticipation of losses to harassment in the early and mid-game. This is something AlphaStar introduced; it was previously considered bad play. Wow. So this was a fantastic demonstration of modern technology and where we are heading: AI better than most humans at most things that were so far considered "cognitive".
This is just straight-up inefficient, though; it is literally more cost-effective to prepare the minimal defense required to fend off the harassment that would be killing off the workers in the first place. This can even be timed fairly accurately by scouting the enemy at certain intervals, based on how long the tech you are scouting for takes to build, and by checking the gas to see how much has been mined. The work they did here is like a baby version of the AI that could actually be. I can't wait until they learn to use logic concepts as part of their learning algorithm to determine which move is best and what yields the best results. I personally see the professionals using fairly aggressive strategies and threats, when this game has a strong defender's-advantage element to it. Even the AI doesn't know how to utilise it correctly.
AlphaStar was limited to just 200-300 APM... IMO that is too restrictive... This is supposed to be an AI that can do things faster than a human, so LET IT!
I think the goal is for the AI to develop strategy rather than brute-forcing its way to victory. Pro players have high APM but low EPM. AlphaStar has the same APM as EPM, so in reality it's not slow.
@@DanBurgaud Self driving cars are end products, so getting the AI to drive well by all means is the end goal. AlphaStar is just a research tool, and the researchers who built it aren't interested in AIs that crush human players using godlike micro.
That's a valid criticism, but I think DeepMind does this to make it more interesting. Given almost infinite APM, the AI could just out-micro any opponent. Because of that, it would also not need to develop particularly interesting strategic knowledge, which is a large part of what we're ultimately interested in.