
Depth-Limited Solving in Imperfect-Information Games 

Noam Brown
Published: 27 Oct 2024

Comments: 22
@reinerzer5870 · 5 years ago
These videos are incredibly helpful towards understanding your work! Thank you very much for that! :-)
@jonathancg2161 · 6 years ago
You should have more views. Thanks for this.
@timtaylor5364 · 4 years ago
When doing Monte Carlo rollouts to estimate the values of playing the strategies after a subgame leaf node, are you sampling player strategies or just chance nodes? Does this need to be done on every iteration of CFR?

When the opponent chooses an action that is not in the blueprint, you add it to the subgame. To estimate the values of the strategies after this subgame leaf node, you need to map it to a node in the blueprint. The size of the pot at the new subgame node will be different from the size of the pot at the blueprint node it is mapped to. When evaluating the values of the strategies, is this new pot size "passed down" and used when calculating the payoffs at the terminal nodes of the blueprint?

For example, the pot is 4 big blinds and the blueprint has a 2-big-blind bet and a 4-big-blind bet. You want to add a 5-big-blind bet to your subgame. You'd map it to the 4-big-blind node in the blueprint, and the strategies after the subgame leaf node would be the same strategies as after the 4-big-blind bet. When calculating the values of the strategies, the 5-big-blind bet may have different values than the 4-big-blind bet, because the pot is larger at the terminal nodes. (Say after the bet there is one more bet of half pot and then showdown. The pot after the 5-big-blind bet would be 4 + 5 + 4.5, whereas the pot after the 4-big-blind bet would be 4 + 4 + 4.)

Edit: actually, I think I understand now. You calculate the values as a percentage of the pot in the blueprint and then multiply that by the size of the pot in the subgame.
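A minimal sketch of the pot-scaling idea in the edit above (the function and variable names are hypothetical, not from the Modicum code):

```python
def scaled_leaf_value(blueprint_value: float, blueprint_pot: float, subgame_pot: float) -> float:
    """Rescale a value estimated in the blueprint to a subgame node whose pot
    differs from the blueprint node it was mapped to: normalize the value to a
    fraction of the blueprint pot, then multiply by the subgame pot.
    All amounts are in big blinds."""
    return (blueprint_value / blueprint_pot) * subgame_pot

# Example: a leaf value of +3bb computed where the blueprint pot is 12bb
# (4bb pot plus a 4bb bet and call) is rescaled to a subgame where the
# mapped-to node actually has a 14bb pot (4bb pot plus a 5bb bet and call).
print(scaled_leaf_value(3.0, 12.0, 14.0))  # 3.5
```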
@heemlo649 · 3 years ago
I see in your paper you say Modicum used the end of the current round as the depth limit. Wouldn't this end up being a very shallow depth limit if you are already near the end of the current round, and therefore not use the full extent of your computational resources? Did you ever explore an iterative deepening sort of approach?
@noambrown8362 · 3 years ago
Good question. In poker there is a chance node with a high branching factor at the end of each betting round, and searching beyond that chance node is extremely expensive. You could probably get some benefit from searching beyond it when you're close to the end of the round, but the gain would be small considering how expensive it is, and it would make the code a lot more complicated, so we didn't bother.
@yengamaurice · 4 years ago
Just a few questions. I read the Pluribus paper and there is one thing I don't seem to understand. The blueprint strategy, I assume, used IR 169-200-200-200 KE-KO buckets for its abstraction according to the supplementary material. But it only gets used in the first round if the bet size is less than $100 removed from any action in the blueprint's action abstraction. However, real-time search uses 500 buckets for future rounds and lossless abstraction in the round it is in.

1) Does this mean the algorithm uses lossless abstraction first and then buckets information situations into 200 buckets if it's, for example, on the flop (round 2), and then uses 500 buckets for rounds 3 and 4 in the subgame it is in?
2) Given that the supplementary material states the algorithm only uses lossless abstraction in the first betting round, but real-time search uses lossless abstraction in the round it is in and is always used in rounds 2-4, could you explain how and when exactly lossless abstraction gets used?

I understand the two questions are sort of related.
@noambrown8362 · 4 years ago
1) The blueprint uses 200 buckets after the preflop. When doing search, there is lossless abstraction on the current round only. If the search subgame extends to the end of the game, then 500 buckets are used in the search subgame for rounds beyond the current round. If the search subgame does not extend to the end of the game (i.e., it's depth-limited search), then the search algorithm does rollouts according to the blueprint strategy beyond the depth limit of the subgame. That means those rollouts happen in the 200-bucket abstraction.
2) Lossless abstraction is used in the preflop and on the current round when doing search.
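A minimal sketch of how those abstraction choices fit together, as described in the answer above (hypothetical helper, not code from Pluribus or Modicum):

```python
def card_abstraction(betting_round: int, current_round: int,
                     doing_search: bool, subgame_reaches_end: bool):
    """Return the number of information buckets used, or "lossless" for no
    card abstraction. Rounds: 1 = preflop, 2 = flop, 3 = turn, 4 = river."""
    if betting_round == 1:
        return "lossless"                     # preflop is handled losslessly
    if doing_search and betting_round == current_round:
        return "lossless"                     # current round of a search subgame
    if doing_search and subgame_reaches_end:
        return 500                            # later rounds of a full-depth subgame
    return 200                                # blueprint, and rollouts past a depth limit
```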
@juhokim5140 · 4 years ago
Have you written about how you guys performed information abstraction in poker?
@missclik1419 · 5 years ago
I'd love to see and hear more about Modicum, and about AI in general. The more people know, the more value for your field.
@FirstLast-tx3yj · 4 years ago
I have watched the long presentation at Microsoft too, but I have some questions.
1) Practically, depth-limited means that you are playing the preflop and flop but not calculating the turn and river?
2) Is there another difference between Modicum and DeepStack besides Modicum calculating to a depth limit while DeepStack calculates to the end?
3) Is a new strategy for the opponent a new bet size or a new range that he plays?
@noambrown8362 · 4 years ago
1) Yes.
2) Do you mean the difference between Modicum and Libratus? Since Libratus always calculated to the end, it could only do search on the turn/river. On the preflop/flop it used a pre-computed strategy. There were also other differences, like the form of CFR used, but the depth-limited search was definitely the biggest difference.
3) Do you mean when adding new values for a leaf node? It's a new strategy that he plays beyond the leaf node, so either different actions/bets or different probabilities on actions/bets.
@FirstLast-tx3yj · 3 years ago
@noambrown8362 Since depth-limited solving is much less memory intensive, is it possible for me as a human poker player to adopt a similar approach when studying poker?
@fossana2927 · 5 years ago
Can P2 choose between a mix of his leaf-node policies?
@noambrown8362 · 5 years ago
Yup
@brianbrewer974 · 4 years ago
In poker, how do you estimate the values for bet sizes that aren't in the blueprint strategy? The paper gave a simple and clean example where we had the value vs a 0.5 psb and the value vs a 1 psb, and it took the average. Do you use linear regression for any size? So if the blueprint strategy has values for 0.25, 0.5, and 1, and we're trying to estimate the value of a 1.5 psb, do you just use linear regression over all of the sizes that we have? Is it important to have many sizes at each node to get accurate estimated values?
@noambrown8362 · 4 years ago
Good question. For this bot I just rounded it to the nearest size.
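A minimal sketch of that rounding step (hypothetical helper, not the actual Modicum code):

```python
def nearest_blueprint_size(bet: float, blueprint_sizes: list[float]) -> float:
    """Map an off-tree bet size (as a fraction of the pot) to the closest
    bet size that exists in the blueprint's action abstraction."""
    return min(blueprint_sizes, key=lambda s: abs(s - bet))

# Example from the question above: a 1.5-pot bet gets the values of the 1-pot bet.
print(nearest_blueprint_size(1.5, [0.25, 0.5, 1.0]))  # 1.0
```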
@brianbrewer974 · 4 years ago
@noambrown8362 Thanks, it took quite some time, but now I think I have a good understanding of how depth-limited search works. It wasn't clicking that the choice of actions for the remainder of the game is itself a decision that accumulates regrets, just like actions such as betting and checking.

In the Modicum paper and in Pluribus you used unsafe solving on earlier streets, which I think basically means solving from some point in the tree with the ranges the AI believes both players have at that point (similar to how commercial software is used to study the game). What if the opponent takes an action that the AI thinks it should take at 0% frequency for all hands? For example: player 1 raises, player 2 (the AI) calls, the flop comes, player 2 solves the flop using the ranges it believes both players have and concludes that player 1 should never check, player 2 checks, and player 1 then checks. How is the turn then solved with unsafe solving? Player 2 believes that player 1 has no hands left in his range.

I guess when you use CFR and take the average strategy, there won't be any zero-probability actions (since the first iteration is random). However, in the Pluribus paper I believe you said you used the current strategy when doing real-time solving. Was this ever an issue? It is possible that I have some fundamental misunderstandings about how unsafe solving works.
@noambrown8362 · 4 years ago
@brianbrewer974 In Pluribus we played according to the current strategy, but we passed down beliefs assuming both players played the average strategy, so there were never any zero-probability actions. Unsafe solving could still, in theory, be exploited this way though. We actually address this in our latest paper, which we just put online: arxiv.org/abs/2007.13544 (see the section on playing according to an equilibrium at test time).
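A minimal sketch of that belief pass-down, assuming a Bayesian update of hand probabilities under the average strategy (illustrative names only, not the Pluribus code):

```python
def pass_down_beliefs(beliefs, avg_strategy, observed_action):
    """Update a player's hand distribution after observing an action, using the
    *average* strategy rather than the current one.

    beliefs:       dict hand -> prior probability of holding that hand
    avg_strategy:  dict hand -> dict action -> probability, averaged over CFR iterations
    """
    posterior = {h: p * avg_strategy[h][observed_action] for h, p in beliefs.items()}
    total = sum(posterior.values())
    # The average strategy mixes in CFR's first iteration, so every action has
    # nonzero probability and the normalizer stays positive.
    return {h: p / total for h, p in posterior.items()}
```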
@brianbrewer974 · 4 years ago
@noambrown8362 Thanks. So to deal with issues in unsafe solving, ReBeL chooses a random iteration and passes the belief state of that iteration to the next subgame. The paper briefly addresses the issue of choosing an early iteration where the policy is still poor. My intuition is that the average strategy of CFR still converges over time even if you discard the strategies used in the first n iterations. If that is true, does that mean that during testing you could sample from a range of the later iterations? Instead of sampling from {0, T-1} in testing, could you sample from {30, T-1}? In training you could still sample from all iterations. I'm not sure if that is theoretically sound, and I may also be wrong about the average strategy still converging without the initial iterations' strategies.

It seems that if you get unlucky and hit iteration 0, you'll pass a random distribution as the belief state, and then the strategies in subsequent subgames will be extremely weak. Actually, I guess that having a policy network is important in these cases, as the initial strategy isn't random.
@noambrown8362 · 4 years ago
@brianbrewer974 Yes, during testing you could sample from {30, T-1} and it would still approximate a Nash equilibrium as T -> infinity. It's true that if you hit iteration 0 then you'll pass down bad beliefs, but as T -> infinity that probability goes to zero. All of these algorithms only compute an approximate Nash equilibrium, and as you run them for longer (or use a larger T) they get closer to an exact Nash equilibrium.
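A minimal sketch of the sampling tweak discussed above (hypothetical code, not from ReBeL):

```python
import random

def sample_belief_iteration(T: int, warmup: int = 30) -> int:
    """Sample which CFR iteration's beliefs get passed to the next subgame.
    Skipping the first `warmup` iterations avoids passing down the poor beliefs
    of early iterations; per the reply above, this still approximates a Nash
    equilibrium as T grows."""
    return random.randint(warmup, T - 1)  # inclusive on both ends, i.e. {30, ..., T-1}

# At test time with, say, T = 1000 CFR iterations:
t = sample_belief_iteration(1000)
```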