
Superhuman AI for heads-up no-limit poker: Libratus beats top professionals 

Noam Brown

This talk gives a high-level explanation of Libratus, the first AI to defeat top humans in no-limit poker. A paper on the AI was published in Science in 2017.
No-limit Texas hold’em is the most popular form of poker. Despite AI successes in perfect-information games, the private information and massive game tree have made no-limit poker difficult to tackle. We present Libratus, an AI that, in a 120,000-hand competition, defeated four top human specialist professionals in heads-up no-limit Texas hold’em, the leading benchmark and long-standing challenge problem in imperfect-information game solving. Our game-theoretic approach features application-independent techniques: an algorithm for computing a blueprint for the overall strategy, an algorithm that fleshes out the details of the strategy for subgames that are reached during play, and a self-improver algorithm that fixes potential weaknesses that opponents have identified in the blueprint strategy.

Published: 15 Jul 2024

Comments: 80
@not_a_human_being 6 years ago
source code, please! ;)
@ursinbrunner2461 6 years ago
A great video: you explain the approach in great detail while still keeping the overall architecture in view. And I can only congratulate you all on the great research community you have managed to build up. Well done!
@anjankatta1864 6 years ago
More people should do this! Start a trend! Incredible you took the time to make this
@optimizedpran1247 4 years ago
Fantastic talk. Thanks so much.
@st1nos 6 years ago
absolutely fascinating!!!
@klawiehr 1 year ago
In that graphic of “solved games” around the 1 hour mark, whether Scrabble is a solved game is debatable. The current bots still have issues calculating equity, especially risk evaluation and strategy for the long game. Forcing Function has an interview on here with top player David Eldar that explores this topic. I also recommend Mack Meller’s new “Mack vs. Machine” series for analysis on human vs bot play.
@yoloswaggins2161 6 years ago
Good shit noam. Grats.
@drgooshgoosh2419 6 years ago
Very cool
@rockbill5982 6 years ago
Awesome video. Good job. How does your research work? I mean, what do you do exactly? Once you have learned all of this information, the models, how it all works, etc., how do you innovate something new?
@TiagoPereira-nf5jo 4 years ago
Calvinball as the most difficult game... I think I like you, bro. Your video was also very good.
@JKenny44 6 years ago
Very interesting video and field of study, thank you for the presentation. Watching this mostly as a poker player, some of it went over my head, but I do have a couple of questions to help me understand, if you see this. Does the bot play through a specific situation the same way every time? E.g., does it have a range of, say, 60% of hands to always raise preflop and 40% of hands to fold? In the situation where a new opponent goes all in on the first hand, the bot tightens its calling range to a point where it beats the opponent's bad hands; it loses to AA when it calls, but it doesn't give AA action often enough for the all-in to be worth it. But once a player knows the exact calling range of the bot, it seems exploitable in this way: if the calling range is (QQ+, AK), then the opponent can find hands to profitably shove, like AK, which block half the range and have lots of equity against the rest. My question is, can that strategy be effective against the entire range of the opponent? Or can the bot balance its range by only calling half of the time with QQ and half of the time with AK, and by doing this find a calling range which works against everything the opponent could shove with? It is very interesting to me that the bot does not seem to care about many of the things top-level poker is based on, like narrowing the opponent's range or protecting its own range, and still manages not to be exploitable.
@noambrown8362 6 years ago
Thanks! 1) Yes, the bot plays the same situation the same way every time. It might mix between different actions though, like 3-betting AA 95% of the time and flatting 5% of the time. But those probabilities don't change based on the opponent's behavior. There was a small exception to this. If the opponent consistently opened with the same size (like 3x) for ~10 hands in a row, then the bot would switch to a GTO strategy which assumes the opponent will either open with 3x or fold. But this ended up happening very rarely during the competition. 2) The AI would find a range where the behavior you described would not be profitable. It would take into account the fact that the opponent could conceivably shove with AK that blocks AA and KK, and the AI might then start calling with JJ to counter that. The AI doesn't intentionally do things like "protecting its range" but the algorithm ends up doing that unintentionally because it's the best way to be unexploitable.
@xuanattran3778 2 years ago
Hi Sir, would it be right to say this kind of structure is a kind of lambda architecture, with a batch layer and a serving layer?
@JarradSingly 2 months ago
I love poker so much.
@haochen9635 1 year ago
Love the talk. Will you share the source code?
@dhruvjoshi588 3 years ago
Around 59:00, when you talk about how in multiplayer games most players fold so the games become 2-player very quickly, wouldn't there be too many considerations about the other (now unreachable) subgames for it to be feasible to use this kind of algorithm? And if there were ways to reach them, do you have any predictions for how different the strategy would look, other than just expecting a stronger opponent hand on average?
@noambrown8362 3 years ago
The exponential increase in the number of states/subgames as more players are added is indeed a challenge. Fortunately, we address this with our follow-up work on Pluribus.
@xelorkrosmo4066 4 years ago
This could be an objective ranking, but it should be based on several factors, such as deviation of play from GTO, an exploitability index based on adaptation to opponents' deviations from GTO, and things like that.. but the game has to be solved for this! I mean, we need solvers!
@peterleggatt3642 5 years ago
This was fascinating; thank you for such a complete talk. What were the bot's % opening ranges from the SB and BB? Did they change according to the opponent's bet sizing/ranges or remain nearly the same throughout the match? Finally, did it help to provide or suggest any solutions as to the actual value of being in position?
@noambrown8362 5 years ago
Thanks! The opening ranges were pretty standard, similar to what humans would do. The bot essentially did not adjust to its opponent's strategy except by computing a closer approximation of GTO in the situations it found itself in. Providing or suggesting a value for different situations could in theory help the bot converge to a solution faster, but the speedup you would get would be too small to be worthwhile.
@peterleggatt3642 5 years ago
@@noambrown8362 Thanks for the reply! Very interesting; I've since watched Doug Polk's 'Libratus' module too. So cool. But I'm afraid I didn't express my last question clearly enough. Playing HUNL poker from the small blind vs an opponent in the big blind is a "game" of a particular value to the small blind player (and of a different value to the big blind player). If your strategy was always folding from the SB the value of the game would be -0.5bbs. Since you can always fold, and will only act if you can do better than folding, we know that the value of this game is at least -0.5bbs (of course the actual value is likely to be positive, since being in position is an advantage). Moreover, if you were always betting and the big blind was always folding you would win 1bb each hand, so the value of the game to you would be 1bb (easy game!). This is the maximum value of the game if both players are playing unexploitably, since the big blind will only do anything other than folding when the outcome is better than -1bb. So the value of the game to the small blind is somewhere between -0.5bbs and +1bb (and likely to be positive), whilst the maximum value of the game to the big blind is +0.5bbs (winning the small blind's blind if the small blind is always folding, which it won't), and the minimum is -1bb (BB always folding). -0.5
@WeiKaiLin-tw 5 years ago
The video jumped at 1:03:00?
@PunchTaker 6 years ago
Thank you for making this video, and I have some silly questions. Since Libratus trained for 15 million core hours to be good at heads-up, is it right to assume it'd need to train for at least 15^5 million core hours for a 6-player game? (assuming pesky problems like collusion don't exist) If so, does this mean that AIs that currently excel at 2-player games (like heads-up hold'em and 1-on-1 Dota) will probably train for decades to master the same game with 5+ players, or do you think that new breakthroughs in AI research will happen frequently enough to dramatically shorten this time? Thank you.
@noambrown8362 6 years ago
For starters, look at 55:50. Going from 10^161 to something like 10^171 does not make the game significantly harder for an AI and pro-level poker AIs likely already exist for all popular poker variants (or can now easily be made). However, poker and Dota2 are special cases. Poker for the reasons specified in the video and Dota2 because it is a two-team game, so a Nash equilibrium can be found in polynomial time. For a victory in a multiplayer game to be meaningful, there needs to be some opportunity for collaboration and negotiation between players. Diplomacy is the canonical example. We are a very, very long way off from that.
@PunchTaker 6 years ago
Thanks for taking the time to reply!
@simonkotchou9644 1 year ago
@@noambrown8362 this comment aged well 😂 your team solved Diplomacy in no time
@Jorbs 4 years ago
Can I play against this somehow?
@jCodingStuff 5 years ago
Do you also update the "strategy sum" for player 1 in iteration 1?
@noambrown8362 5 years ago
No, just for player 2.
@jCodingStuff 5 years ago
@@noambrown8362 Hi Noam, thanks for the reply! I've also taken a look at the other video where you show the code for Rock-Paper-Scissors MCCFR, where you update regrets and the strategy sum for both players every iteration (which makes sense since the game is simultaneous). I am struggling to understand why you don't update regrets and the strategy sum for both players in every iteration in this case as well.
@adamlindberg4862 5 years ago
You mention that sometimes Libratus will, for example, raise 95% of the time and fold 5% of the time. If it were only going on regret, the AI would always pick the action with the most positive value. How does it decide/figure out when to use a split strategy?
@noambrown8362 5 years ago
The AI doesn't pick the action with the highest regret. It picks actions proportional to the positive regret. So as long as the action has some positive regret, the AI will pick it sometimes in the iteration. Also, the actual strategy the AI played in the competition was the average of all the training iterations. The regrets can shift around during training, which results in the AI mixing its final strategy.
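To make "picking actions proportional to positive regret" concrete, here is a minimal regret-matching sketch in Python. It is illustrative only, not Libratus code; the regret values are invented and numpy is assumed.

import numpy as np

def regret_matching(cumulative_regrets):
    # Play each action with probability proportional to its positive regret;
    # if no action has positive regret, fall back to uniform random.
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

# Invented cumulative regrets for [fold, call, raise] at one decision point.
regrets = np.array([-120.0, 30.0, 270.0])
print(regret_matching(regrets))  # fold 0%, call 10%, raise 90%

# The strategy actually played at the end is the average over all training
# iterations, accumulated in a running strategy sum (in real training the
# regrets would change every iteration).
strategy_sum = np.zeros(3)
for _ in range(1000):
    strategy_sum += regret_matching(regrets)
average_strategy = strategy_sum / strategy_sum.sum()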
@adamlindberg4862 5 years ago
@@noambrown8362 Thanks for the very fast answer
@Corpsecreate 3 years ago
Absolutely amazing presentation. Loved it from start to end. Some questions though:
1. I'm just checking to see if my understanding is correct, but is every abstracted state of the game considered as a single 'category', so to speak? In other words, there is no sense of how 'close' each state is to another state, as they are considered to be entirely independent positions. From a Machine Learning perspective, this would mean the game state would essentially be represented by a very long one-hot encoded binary vector, where only one entry (the current state) is 1.
2. How are actions chosen in proportion to regret if some regrets are negative? Do you only select from the actions that have a non-negative regret? If so, then wouldn't you run into issues where, due to the variance of the game, you might calculate a negative regret from one sample even though the true EV is positive, causing you to ignore this action going forward?
3. Selecting actions proportionally to regret sounds extremely similar to policy gradients in reinforcement learning. Is there any fundamental difference in the learning algorithm between the approach you presented and training an AI through self-play and learning a stochastic policy?
4. With reinforcement learning above, would you expect to also arrive at a Nash equilibrium solution? With a continuous representation of the game state, and a neural network as a function approximator, would you also expect faster convergence?
Thanks in advance :)
@noambrown8362 3 years ago
Hi Sam,
1. Yes, that's correct. You could use function approximation instead (see Deep CFR, for example) but it is much more expensive. Fortunately, in poker we have these abstractions that are extremely effective.
2. When sampling, you choose a strategy based on non-negative regret. However, you still explore all actions for the traverser, even if the probability on that action is zero. So even if you sample negative regret for an action that has positive EV, you'll collect more samples and eventually determine it is positive EV.
3. There are a lot of connections between CFR and other convex optimization and policy gradient algorithms. There are definitely differences though. For example, we don't need any tuning parameters or learning rates.
4. Only certain RL algorithms provably converge to a Nash equilibrium (because you need to get the learning rate just right). However, most non-CFR RL algorithms converge quite slowly to a Nash equilibrium. If you want to use something like Deep RL to converge to a Nash equilibrium, you probably want to use a deep learning variant of CFR (of which there are many... Deep CFR, Double Neural CFR, DREAM, ARMAC). NFSP is another reasonable choice. That said, continuous action spaces are still quite challenging.
@Corpsecreate 3 years ago
@@noambrown8362 Thanks so much for your answers. One thing that I think would be super interesting to see is what would happen if the abstraction process were very aggressive to start with. You find an approximate solution to this very small game, then rerun with an abstraction that retains more information than the previous one. You could use the original solution as a seed for the new abstracted game, and do a weighted sum of all strategies that are 'close' to a given state in this second game. It would be interesting to see how this converges, but damn, all this CFR stuff is so tough to implement.
@fossana2927 5 years ago
So when you solve a subgame, because there is the limitation that you have to make sure your opponent can't exploit you by increasing the frequency that they reach the subgame with certain hands, does that mean that you don't necessarily find an equilibrium for the subgame, i.e., the subgame strategy is not necessarily a best response strategy or an equilibrium strategy? It seems counter intuitive to solve a subgame and have the resulting strategy not be doing the best possible in that subgame. Perhaps it is finding an equilibrium strategy, but if a hand is sometimes doing action x and sometimes doing action y, the percentages have to be fine tuned to achieve safe subgame solving. Also is a subgame any subtree of the entire game tree, or does the root of the subgame need to be an information set with a single node?
@noambrown8362 5 years ago
Good question. There are multiple optimal solutions to a subgame, but not all of them are robust to an opponent changing the frequency they reach the subgame. Safe subgame solving ensures that one of these robust optimal solutions is found. It is still an optimal solution though. A subgame is a subtree of the entire game tree. The root of the subgame is a set of nodes such that if A and B are nodes that a player cannot distinguish between and A is in the root set, then B is also in the root set.
@fossana2927 5 years ago
That makes sense. Thanks. One more question. So when the opponent takes an off-tree action, you try to make it so that they don't get a higher EV by taking the off-tree action than one of the actions in the abstraction, but in practice, how often is that achieved? Let's say that we have a game consisting of actions A, B, and C, our opponent can have either hand X or Y, and in a Nash equilibrium the EVs are as follows: hand X: A=$20, B=$50, C=$100; hand Y: A=$80, B=$120, C=$50. So if our abstraction only has actions A and C but then our opponent takes action B, I don't see how we can make hand Y have an EV of only $80 for taking action B when the equilibrium EV is $120, unless we are going out of our way to punish hand Y for taking action B, and we have room to do this because we can let hand X have an EV of $100 for taking action B instead of the equilibrium $50. Love your work by the way.
@noambrown8362 5 years ago
@@fossana2927 It's possible quite often in no-limit poker because the actions are along an (essentially) continuous action space. So action A might be something like "Bet $100", action C might be something like "Bet $200", and action B might be "Bet $150". In that case, action B probably isn't that much better than some mixture of actions A and C. But I agree that if the actions were very different then safe subgame solving would not work as well.
@fossana2927 5 years ago
@@noambrown8362 Thanks for answering all my questions so far. I think my understanding of Libratus is nearly complete, but I have a few more lingering questions:
1. The subgame solver is called whenever the opponent takes an action on the turn and river, even if that action is part of the abstraction. I don't see the point of calling the subgame solver repeatedly, unless you want to change the abstraction as you descend the game tree. So I'm guessing that as you get closer to showdown, the subgame solver uses a denser and denser action abstraction (i.e., more bet sizes).
2. When taking into account gifts, the EV of opting out of playing in the new, finer-grained abstraction effectively becomes the EV of the best alternative path?
3. Intuitively, by letting the opponent opt out of playing in the new, finer-grained abstraction with every hand, we are letting the opponent choose what range of hands they want to reach the current subgame with, so we don't make any assumptions about our opponent's range and let our opponent control that; but because we let our opponent take the EV of the old abstraction with each hand, there is an opportunity cost associated with reaching the current subgame for each hand?
@noambrown8362 5 years ago
@@fossana2927
1. It is not necessary to run the subgame solver if the opponent chooses an action in the abstraction. However, in Libratus we ran the solver for any opponent bet, even if it was in the abstraction, to prevent the opponents from figuring out which actions were in the abstraction.
2. It is the EV of the best alternative path *if that path were chosen*, which is not the same thing as the blueprint EV of the path, because if the path were chosen then subgame solving would be applied. It is possible to cope with this distinction but the way to do it is complicated and is covered in detail in our NeurIPS-17 paper.
3. Yes, that's right.
@donniedarko479 5 years ago
Is there software or an app where I can play against a bot and learn from it, or that will teach this?
@nikoha1763 4 years ago
downloads 50TB app*
@regogu 3 years ago
Good day! I believe the algorithms you use are the ones used in poker solvers. I don't entirely understand the whole algorithm, but based on what I understood: at the start of the algorithm (iteration 1 for the blueprint strategy), it initializes both strategies to a uniform distribution; at each choice node, the strategies for player 1 and 2 will fold, call, and/or raise with equal probability. In most cases these probabilities are quite far from equilibrium, and the EV calculated is based on the accumulated averages from all iterations. From a solver perspective, wouldn't this make the calculated EV far from equilibrium, since the early iterations are still weighted heavily? Thanks in advance!
@noambrown8362 3 years ago
We do a lot of iterations, sometimes thousands, and weigh the later iterations more than the early ones, so the fact that it starts at uniform random isn't too much of a problem.
@regogu 3 years ago
@@noambrown8362 Thanks for the answer. In solvers, we don't usually run a high number of iterations since it would take a long time. Would you see this as a possible occurrence at low iteration counts? And could it explain why some solves take longer to converge than others, assuming a same-sized game tree? Love the articles, by the way. Also, correct me if I misunderstood the topic, but the near-Nash equilibrium strategy always has a value for exploitability, denoted by "epsilon". However, I have not found any related article on how this is measured. Would you be able to give some recommendations?
@noambrown8362 3 years ago
@@regogu This is a problem for any iterative algorithm if you don't run a lot of iterations. To compute epsilon, calculate the value of a best response for both players and add them together. If that value is above zero then the solver's strategy is not an exact Nash equilibrium.
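As a toy illustration of that epsilon computation, here is a sketch for rock-paper-scissors treated as a two-player zero-sum matrix game (assuming numpy). Poker-scale exploitability requires a full best-response traversal of the game tree rather than a matrix, but the idea is the same: add the two best-response values, and the sum is zero only at an exact Nash equilibrium.

import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def exploitability(x, y):
    # epsilon = (row's best-response value vs y) + (column's best-response value vs x).
    # Zero exactly at a Nash equilibrium of the zero-sum game, positive otherwise.
    br_row = np.max(A @ y)      # best pure response for the row player
    br_col = np.max(-(x @ A))   # best pure response for the column player
    return br_row + br_col

uniform = np.ones(3) / 3
skewed = np.array([0.4, 0.3, 0.3])

print(exploitability(uniform, uniform))  # 0.0  -> exact equilibrium
print(exploitability(skewed, skewed))    # ~0.2 -> each player can be exploited for ~0.1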
@zhengfengmao8281 6 years ago
At 31:30, how did you calculate the regret to be [0, -550]? Because if we sum the rewards up, we should have [0, -450], and if we take the negative of the left and right regrets, it'd be [0, -600]?
@noambrown8362 6 years ago
You don't sum up the rewards. You choose one action (or some distribution over actions) and receive that reward. You then compare it to the hypothetical EV of each individual action. In this case we received an actual reward of $50. The left action would also get us a reward of $50, so the regret for it is $50 - $50 = $0 on this iteration. The right action would have gotten us a reward of -$500, so the regret becomes -$500 - $50 = -$550.
@zhengfengmao8281 6 years ago
The right action would have gotten us a reward of -$500, right?
@zhengfengmao8281 6 years ago
Scrap what I just said. The right action would cause us to lose $500, and the left action would gain us $50. Based on the calculation of regret, (hypothetical reward) - (actual reward), we'd have -500 - 50 = -550. Thanks for the explanation.
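For reference, the single-iteration regret update discussed in this thread, written out as a sketch in Python (the numbers are the ones from the example at 31:30; this is not Libratus code):

actual_reward = 50                            # reward from the action actually taken
hypothetical = {"left": 50, "right": -500}    # what each action would have earned this iteration

# regret(a) = (hypothetical reward of a) - (actual reward received)
regret = {a: v - actual_reward for a, v in hypothetical.items()}
print(regret)  # {'left': 0, 'right': -550}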
@nandfednu3502 6 years ago
ya i was just stomping some bots and ended up here, smelt em
@AltumNovo 5 years ago
How many mbb/hand do you reckon Libratus would lose to the actual optimal strategy?
@noambrown8362 5 years ago
It's really hard to say, but my guess is something like 50 mbb/h (which is 5 bb/100).
@buiniclassen5736 6 years ago
How many players playing online do you think have access to similar technology that can beat good human players? And how long before this kind of technology is widespread enough that online poker is dead?
@noambrown8362 6 years ago
Libratus is far stronger than the best humans. You don't need something that powerful to beat good human players. I don't know how many online players have access to similar technology, but it's certainly possible for someone to implement these techniques and make a similar AI on consumer-grade hardware.
@koolcoy 3 years ago
Why do you use action translation instead of subgame solving for the first two rounds of play? Since all the subgame-solving algorithms are already implemented, it seems strange to me that the first two rounds are played differently.
@noambrown8362 3 years ago
Libratus did not use depth-limited subgame solving. All subgames extended to the end of the game. On the first two rounds, that made the subgames too large to solve in real time.
@nevokrien95 5 years ago
Weakness attacking could be achieved by combining this AI with a genetic AI which tells you whether you want to take advantage of a weakness, and how.
@ImTheMan725 4 years ago
give us the hands played
@BeePan2 5 years ago
The worst possible matchup heads-up is probably a 10 to 1 (let's say 88 vs 82 off)? If I actually have to play against an AI, and because I have no knowledge of poker, can't I just keep going all in without even caring about my cards? The AI will have to call at a certain point, assuming we have similar stack sizes, and I will always have at least a 10% chance of beating the AI? So is the point of the AI just to get as close to a 90% win chance as possible? Am I missing something?
@noambrown8362 5 years ago
Poker is partly a game of chance, so it is impossible that you will win 100% of the hands played. But the goal of the AI is to win money *on average* against any opponent.
@bp56789 3 years ago
This is why you don't play with your entire bankroll, even if you are the best in the world.
@heemlo649 3 years ago
Sounds like he misspeaks at 27:20. Doesn't the subgame solving use no abstraction?
@noambrown8362 3 years ago
It uses no card abstraction. It still discretizes the action space. But if an opponent uses a bet size that wasn't in the abstraction then the bot can add it to the action abstraction.
@heemlo649 3 years ago
@@noambrown8362 Oh ok. Thank you for sharing the knowledge. Your lectures have been very helpful in understanding the papers.
@ainativehealth 7 months ago
Omg, this is a video made by Noam Brown himself.
@williambennettofficial 6 years ago
7-2 offsuit is not the worst hand in heads-up NL (2-3 offsuit is), so that is perhaps a bad example to use in this serious analytical context.
@noambrown8362 6 years ago
One of the interesting things about Libratus is that it can report the EV for every hand in every situation. At 200BB 72o is the worst starting hand, though 23o is not far behind.
@bp56789 3 years ago
32o is only worse than 72o if you know the other player's hand is 72o. 32o vs AA has 12.79% equity. 72o vs AA has 11.80% equity.
@williambennettofficial 3 years ago
Again, you're undermining your own argument in your own example by assuming you know the opponent has AA; at an advanced level this is basic stuff.
@amrita49 3 years ago
Why don't you put it online and see how many people will beat it?? Tons. "Top humans" is a meaningless definition; someone will adapt very fast to the bot without the programmers re-tuning it continuously. It's not chess.
@lobsterworldwide 5 years ago
Thanks for ruining online poker for everyone, good job.