
Combining Deep Reinforcement Learning and Search for Imperfect-Information Games 

Noam Brown

Published: 21 Aug 2024

Comments: 89
@yangyin 3 years ago
This is the best ReBeL video I've watched. I'm very impressed by your insights into the method. Then I realized you are one of the authors ...
@zenlai1597 3 years ago
Lol
@quebono100 3 years ago
I like how you explain it concisely and precisely
@multigladiator384 3 years ago
I just read the "Superhuman AI for multiplayer poker" article by this Noam Brown guy and became interested in the "Supplementary Materials" he published for this project, but barely understood things. Then I opened up YouTube to introduce myself to Monte Carlo CFR and found this... Thank you in advance already!
@ryangardner3728 3 years ago
Amazing ideas, Noam, and great explanations. Thank you.
@cam41527 3 years ago
Great video Noam! I always love to see your videos in my feed. "With a ReBeL yell I want more... more... more" 😜
@andreasbacher9326 3 years ago
Extremely impressive, thank you for making the video (together with the paper). Congratulations! I wonder why you only submitted to a CS conference, and not to a real scientific journal like Nature or Science? I think the quality of the work certainly warrants a publication that is seen by a large number of scientists and that has gone through proper peer review (not only the CS conference reviews). I believe your methods could be very exciting for the interface between deterministic and indeterministic processes in physics (the classical and quantum mechanical boundary). They are in a way perfect information (classical physics) and imperfect information (quantum, see Bell's inequality).
@noambrown8362 3 years ago
Thanks! I think in order to turn it into a Nature or Science paper we would need experimental results on Go. That would take a lot of engineering work so we decided it wasn't quite worth it.
@hayatisschon 3 years ago
Great presentation, thanks for the effort. Now it is time to read the paper :)
@pranshubahadur 2 years ago
Ty for this!
@quebono100 3 years ago
Amazing, thank you for sharing
@juggernautuci8253 3 years ago
BRILLIANT IDEAS.
@bautistabaiocchi-lora1339 3 years ago
Amazing! thank you for this
@mitchconner1185 3 years ago
I have a few questions about the poker implementation of ReBeL described in the paper.

Question 1: How is each player's infostate belief distribution (that the value/policy network takes as input) represented? I can think of the following four options:
1.1) Each infostate belief is equal to what some papers call the player's contribution to the reach probability (i.e. the product of the player's action probabilities of the actions he's taken thus far, given the player's infostate and the current/average strategy profile).
1.2) The same as option 1.1 except that each belief of an infostate that is impossible due to the current public cards is set to 0.
1.3) The same as option 1.2 except that each infostate belief distribution (of a given player) is always normalized such that the sum of the player's infostate beliefs is 1.
1.4) The same as option 1.2 except that the card removal effect is applied to each distribution and then each distribution is normalized. By "applying the card removal effect" I mean taking into account the fact that the same card can't be dealt to different players, and making sure that each infostate belief is equal to the actual probability that the player is in that infostate given the PBS.
My understanding is that in this context the expected value (EV) of a player's infostate in a PBS is the EV given the infostate and the PBS, assuming that the player plays a best-response strategy against an opponent who plays the average strategy.

Question 2: How are the infostate EVs (that the value network outputs) weighted? I can think of the following three options:
2.1) Weighted by chance's contribution to the reach probability of the infostate, i.e. each infostate's value network output is equal to the infostate's EV multiplied by the probability (either 0 or 1) that the infostate is possible given the public cards the value network currently inputs (some infostates are impossible due to the card removal effect).
2.2) Weighted by the external reach probability of the infostate given the opponent infostate belief distribution that the value network currently inputs. The external reach probability of the infostate is equal to the reach probability of the infostate divided by the infostate player's contribution to the reach probability, or in other words, the product of chance's and the opponent's contributions to the reach probability of the infostate.
2.3) Weighted by the reach probability of the infostate (I'm quite confident that this is not the case).

Question 3: What becomes of a player's infostate belief distribution when the player makes an action that, according to the current strategy, he should never make? This can occur during training because of exploration (if the current strategy instead of the average strategy is used, and therefore the action probabilities can be dead zero) and during testing because the opponent can make whichever action (and then a random current strategy is used). I can think of the following three options:
3.1) The player's infostate belief distribution remains unchanged (the same as it was before this "impossible" action).
3.2) The player's infostate belief distribution is made uniform and normalized such that the sum of the player's infostate beliefs is 1.
3.3) The player's infostate belief distribution is made uniform and normalized such that the sum of the player's infostate beliefs is equal to whatever the sum was before this "impossible" action.
@noambrown8362 3 years ago
Nice questions! Question 1: We use 1.3. Card removal is accounted for when we're solving the subgame (when we sample hands for each player). When tracking beliefs, we don't correct for card removal (doing so would require tracking O(N^2) beliefs rather than O(N)). Question 2: I'm pretty sure we did 2.2. This is a very subtle aspect of ReBeL and it's very easy to introduce bugs by not handling the weights correctly. Fortunately, it's not an issue in games like Liar's Dice where the players' private information is uncorrelated. Question 3: When using CFR-D we did 3.2. If you use CFR-AVG then the belief distributions can never be undefined because the first iteration is uniform random. That's another perk of CFR-AVG.
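(A minimal sketch of the option-1.3 belief update discussed above, assuming NumPy arrays; the function and variable names are illustrative, not ReBeL's actual code.)

```python
import numpy as np

def update_beliefs(beliefs, action_probs, possible_mask):
    """Update one player's infostate beliefs after observing that player's action.

    beliefs:       (num_infostates,) current normalized beliefs.
    action_probs:  (num_infostates,) probability of the observed action from each
                   infostate under the assumed policy.
    possible_mask: (num_infostates,) 1.0 if the infostate is consistent with the
                   public cards, 0.0 otherwise.
    """
    new_beliefs = beliefs * action_probs * possible_mask
    total = new_beliefs.sum()
    if total > 0:
        return new_beliefs / total  # option 1.3: renormalize so beliefs sum to 1
    # All-zero ("undefined") distribution: fall back to uniform over the possible
    # infostates, as discussed for CFR-D later in this thread.
    return possible_mask / possible_mask.sum()
```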
@mitchconner1185 3 years ago
@@noambrown8362 I guessed that the belief distributions would have to be normalized because it considerably reduces the size of the PBS state space the network needs to learn. I believe the value network output values then need to be "re-weighted" to undo the normalization in order to get the correctly weighted counterfactual values that the CFR math requires. One more thing I wanted to make sure I understood correctly: at test time, when using CFR-AVG, the policies according to which everyone is assumed to be playing (and according to which all belief distributions are adjusted) are defined by the current policy profile at a random iteration t, and not by the average policy profile over iterations [1, t], right? Therefore it seems to me that at test time, even when using CFR-AVG, it is still possible that the opponent's belief distribution could become undefined (all zeroes) and would need to be made uniform.
@noambrown8362 3 years ago
@@mitchconner1185 No, in CFR-AVG the policies for both players are assumed to be the average over iterations [1, t], where t is a random iteration you stop on. So the beliefs can never be undefined.
@mitchconner1185 3 years ago
@@noambrown8362 Thank you very much for your time. I realized I don't completely understand the CFR-AVG algorithm though. Specifically, when updating the average action probabilities of the average strategy in each infostate of the depth-limited subgame G, I know that the current action probabilities (being included in the average) should be weighted by the infostate player's own contribution to the infostate's reach probability, but is that reach probability calculated using the current strategy on iteration t or the average strategy over iterations [1, t]?

I also think I've misunderstood how the undefined (all-zero) belief distributions are treated when using CFR-D. I thought (incorrectly?) that undefined belief distributions would be reset back to uniform immediately after the action that makes the distribution undefined. But I suppose the undefined belief distribution should be made uniform only right before each leaf node is evaluated using the value network, because if the player whose belief distribution becomes undefined is the one whose strategy is being updated during CFR-D, then all his average strategy updates in the infostates where his belief distribution is undefined would have to be weighted by his all-zero reach probability contributions (and not by whatever values resetting his beliefs would produce).

Update: Now I'm thinking the depth-limited, leaf-PBS-value-contingent CFR-D algorithm probably works by keeping track of both reach probabilities and PBSs separately, such that reach probabilities work exactly as they do in vanilla CFR, and PBSs are like reach probabilities except that the distributions are normalized and they're immediately reset to uniform once the distribution becomes undefined (all zeroes).

Update: For anyone else struggling to understand CFR-D, I finally found a simple description of it followed by pseudocode in this paper: arxiv.org/abs/1906.11110 on page 21. It appears that "average strategy" in CFR-D is defined quite simply as the arithmetic mean over the strategies, as opposed to the player-reach-probability-weighted average it is defined as in the original CFR paper: poker.cs.ualberta.ca/publications/NIPS07-cfr.pdf.
@notgenius 3 years ago
Hi Noam, I think this field of research is fascinating and it's very impressive to see the strides you took and are taking in the realm of imperfect-information games. Previously, Pluribus cost very little computationally, despite playing at a 6-person table. I found one of the most impressive stats about Pluribus to be that it only took ~$150 and relatively modest compute to train and execute (which is because of the lack of deep neural nets?). In comparison, with the implementation of deep reinforcement learning and self-play without information abstraction, how does ReBeL compare in terms of cost and compute? Thanks
@noambrown8362 3 years ago
Thanks! ReBeL is much more computationally expensive. The exact resources are mentioned in the paper, but it's definitely way more expensive than can be done on a home computer. However, it appears to scale well, and it's possible that future research will bring the computational cost down.
@mkrenn87 3 years ago
Very cool! Your pro was Dong Kim, who was able to play from home according to your SI. Does that mean you have an online platform where one could play? Is there any chance to get access? Do you need special hardware to run the final version or is a standard workstation sufficient? Would love to get beaten in chess and HU Texas hold'em :)
@noambrown8362 3 years ago
We do, but it's not publicly accessible unfortunately.
@pauljones9150 3 years ago
Wow! This is pretty cool! Was there any reason for not testing ReBeL in 6 player poker games?
@noambrown8362 3 years ago
I think it would work fine in 6-player and we thought about doing it, but there aren't many benchmark agents for 6-player so it would be hard to evaluate the bot. In theory public belief states (PBSs) might not have unique values outside of two-player zero-sum games but I don't think that would be an issue for 6-player poker.
@noambrown8362 3 years ago
@毛凡 it would probably work for any number of players but it becomes increasingly difficult to evaluate as you add more players.
@mingu600 3 years ago
Thank you for the video! I'm an undergraduate who wants to learn a lot more about RL, and this was really great. I was actually curious whether ReBeL could be directly applied to a game like Pokemon. I'm not sure how familiar you are with battling mechanics in Pokemon, but there is a lot of imperfect information (each opponent Pokemon has hidden moves, and hidden stats which can later be inferred based on observable damage numbers and moves used) as well as damage rolls, crits, etc. In particular, unlike poker or RPS where there is a set turn order (for RPS we reformulated it to be this way), in Pokemon move order is determined by the speed stat of the Pokemon in play, which is sometimes hidden to the players if not previously observed. Would you still be able to use ReBeL in this case, and would it have to be reformulated in any way?

Also, I just had a couple of quick clarifying questions about belief infostates. You mentioned in the paper that the input of the value network has the following size: 1 (agent index) + 1 (acting agent) + 1 (pot) + 5 (board) + 2 × 1326 (infostate beliefs). Why is there a separate value for "agent index" and "acting agent"? Also, are infostate beliefs just the set of probabilities for every possible player hand (which we update based on player actions)? Why don't we include information such as the amount of money each player put in the pot and bets made during the current round?

Thank you again so much for the paper and video, I appreciate your time to help people like me learn more!
@noambrown8362 3 years ago
I'm not that familiar with Pokemon so it's hard for me to say, but one issue might be the amount of hidden information. At the end of the talk I mention that the input to the NN is proportional to the number of infostates in a public state. It sounds like that can be quite large in Pokemon. There's a separate value for agent index vs. acting agent because we need to know whose turn it is and who we want the value for. Those could be different. We don't need to keep track of prior bets and other information like that for the same reason that in chess you don't need to keep track of all the prior moves; you just need the current board configuration (plus some additional info to prevent repeating the same position). The info we track is sufficient for computing the EV for each player. And yes, infostate beliefs are the probability for each hand.
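(A sketch of how the flat input described above could be packed, assuming NumPy; the encodings for cards and the pot are assumptions, not the paper's exact scheme.)

```python
import numpy as np

NUM_HANDS = 1326  # 52 choose 2 two-card holdings per player

def pack_value_net_input(agent_index, acting_agent, pot, board_cards,
                         beliefs_p0, beliefs_p1):
    """Pack a poker PBS into the flat vector described above:
    1 (agent index) + 1 (acting agent) + 1 (pot) + 5 (board) + 2 * 1326 (beliefs)."""
    board = np.full(5, -1.0)                # pad undealt board cards
    board[:len(board_cards)] = board_cards  # card indices in [0, 51]
    return np.concatenate([
        [float(agent_index)],    # player we want the values for
        [float(acting_agent)],   # player whose turn it is (may differ)
        [float(pot)],            # e.g. pot as a fraction of the starting stacks
        board,
        np.asarray(beliefs_p0),  # normalized beliefs over 1326 hands, player 0
        np.asarray(beliefs_p1),  # normalized beliefs over 1326 hands, player 1
    ]).astype(np.float32)
```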
@dhruvjoshi588 3 years ago
Hi Noam, I was wondering what the computational requirements of training ReBeL were as opposed to the Libratus model. I assume this is significantly more expensive considering the continuous nature of the game state space, and the fact that the game state now contains at least as much information as Libratus's game space. Could you improve my understanding of this? Thanks!
@noambrown8362 3 years ago
Yes, ReBeL is much more expensive than the Libratus approach. This is the cost of generality. We talk about compute usage a bit in the paper.
@brianbrewer974 3 years ago
The value vector computed with CFR is added as a training example. When solving the last round, these values are completely accurate if it was solved to 0% exploitability. So over time it will get better at solving the round before the last, then the round before that. Something I don't understand is that you don't solve until the start of the next round. How does the value network improve at estimating the leaf nodes when you are giving it training data that consists only of public belief states at the start of a betting round?
@noambrown8362 3 years ago
The values of belief states at the end of a round can be computed by averaging over the values of belief states at the start of the next round. As we collect more data for values at the start of rounds, we also train the network to predict the values at the end of rounds.
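(One way to read that reply as pseudocode, with hypothetical helpers `advance` and `value_net` passed in as parameters: the value of an end-of-round PBS is the chance-weighted average of the start-of-next-round PBS values.)

```python
def end_of_round_value(pbs, card_outcomes, advance, value_net):
    """Value of an end-of-round PBS as the chance-weighted average of the
    start-of-next-round PBS values.

    card_outcomes: iterable of (card, probability) pairs for the next public card.
    advance:       hypothetical helper that deals the card, zeros out impossible
                   hands, and renormalizes the beliefs.
    value_net:     the trained value network, mapping a PBS to a value.
    """
    return sum(prob * value_net(advance(pbs, card))
               for card, prob in card_outcomes)
```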
@Free-pp8mr 8 months ago
Great job, Noam! But what about the Prisoner's Dilemma? I see that you study non-cooperative zero-sum games, so maybe it is not your subject at all. But if all AI algorithms concentrate only on competing, not cooperating, will we have any win-win games in humanity's future? Are altruistic genes just rudiments for the future homo artifices?
@mitchconner1185 3 years ago
Talking specifically about playing poker at test time, it seems to me that you always solved each subgame rooted at the beginning of the current street, and if the opponent made an action that wasn't in the most recent subgame action abstraction, you added the action to the subgame abstraction and solved the subgame again from scratch. A couple of questions about this subgame solving (and re-solving) at test time:
1) Given a certain street that occurred during a given hand, did you always pick the same random CFR iteration to stop the subgame solving at?
2) Did you (similar to what you did with Pluribus) freeze the agent's actual hand's action probabilities for the actions he had already made earlier on the same street?
Do you think it would be mathematically correct to solve the subgames such that they were rooted not at the beginning of the current street but rather at the game state (or rather the PBS) that occurred right before the opponent's most recent action? I.e., you'd always pick a new random iteration to stop the subgame solving at, and right after the agent had chosen and made his action (based on the policy at that random iteration), update the PBS according to the observed actions' probabilities at that random iteration, and then use the updated PBS as the next subgame root.
@noambrown8362 3 years ago
Hi Mitch: 1) I think we always picked a different random iteration. 2) I think we did freeze the action probabilities for the agent's actual hand. I do think it would be mathematically sound to just start the subgame at the most recent PBS rather than the start of the betting round. However, there's in theory a risk that if you don't do that during training as well then you might encounter novel PBSs at test time. We didn't want to do that during training (didn't want to deal with the added complexity) so we didn't bother. That said, although in theory it could be a problem if you don't do it during training time, in practice I don't think it would matter.
@Ninjabrain 3 years ago
Hello! I am wondering how exactly a subgame is constructed from a PBS. At first I thought that you would just sample an infostate for each player using the infostate probabilities, and then construct a subgame in the discrete representation using those infostates, but then I realized this doesn't work because some infostates may be incompatible (e.g., both players cannot hold the same card in poker). Thanks in advance!
@noambrown8362 3 years ago
In the full game of poker, there is an initial chance node whose outcomes lead to players being dealt different poker hands (with each poker hand having equal probability). A subgame rooted at a PBS is constructed in a similar way, but with the initial chance node having outcome probabilities based on the probabilities of the PBS rather than being uniform. In other words, at the start of the subgame, each player is "dealt" a hand with probability equal to the belief for that hand (this gets a bit more complicated in poker compared to, say, Liar's Dice because players can't be dealt the same card, so you have to account for this "card removal", but it's not a very important conceptual detail).
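(An illustrative rejection-sampling sketch of that root chance node; the exact card-removal correction used in ReBeL may differ.)

```python
import random

def sample_root_hands(beliefs_p0, beliefs_p1, hand_to_cards):
    """Sample the root chance outcome of a subgame from the PBS beliefs,
    rejecting hand pairs that share a card ("card removal").

    hand_to_cards[i] gives the two card indices of hand i; this mapping and the
    rejection scheme are illustrative, not ReBeL's actual code.
    """
    hands = range(len(beliefs_p0))
    while True:
        h0 = random.choices(hands, weights=beliefs_p0)[0]
        h1 = random.choices(hands, weights=beliefs_p1)[0]
        if not set(hand_to_cards[h0]) & set(hand_to_cards[h1]):  # no shared cards
            return h0, h1
```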
@user-df4wh6dg1z 1 year ago
I have a question about Algorithm 1 on page 1 of the paper. As I understand it, the function COMPUTEEV computes the expected value of the PBS Br based on the output of the value neural network (computed in the SETLEAFVALUES function). However, you add (Br, v(Br)) to the value-net training data, which means using the output of a model to train it (did I understand that right?). The CFR-D in Section 5.1 also uses the output of the value network to compute the new policy for the next iterations. I think CFR-D has to use ground-truth values to compute the policy for a new iteration, but it seems to use the output of an untrained value network. So where does the ground truth for the value network at the leaf nodes come from? Thank you so much Noam.
@noambrown8362 1 year ago
I'm not sure I understood your question. For some subgames, there will be leaf nodes at the end of the game (e.g., if the leaf node is for a fold action, or if you are on the river and both players have called). In those cases, you will receive a ground truth reward from the environment. Otherwise, you use the value network at the leaf node. But eventually the ground truth rewards will "bubble up" and the value network will become increasingly accurate.
@user-df4wh6dg1z 1 year ago
@@noambrown8362 So could you check my understanding of your algorithm at each iteration t (from twarm -> T):
Step 1: UPDATEPOLICY(G, policy at t-1). This step computes the policy at time t by applying CFR-D. CFR-D computes regret matching with the value at each leaf node (ground truth if the node is at the end of the game, or the value network output at an intermediate node of the game). So it updates the new policy based on the regret values.
Step 2: Update the average policy. This step just computes the new average policy using the current average policy and the policy at time t computed in Step 1.
Step 3: G
@user-df4wh6dg1z 1 year ago
@@noambrown8362 I think I understand how to train the value network after reading the conversation between you and @brianbrewer974. Take a poker game for example: first we train the value network with ground-truth rewards (river round), so the network will learn to predict the river round well, almost exactly matching the ground truth. Then we can use this output to compute expected values for the turn round and run CFR-D for T iterations to get a proper value v(Br) at PBS Br. Now we add (Br, v(Br)) to train the value network further so that it can predict the turn round well (maybe that's why you call this a "bootstrapping" training strategy). Repeat this process for the flop round and the other earlier rounds, and the value network will be trained with every PBS Br in all rounds. ReBeL finishes and we have a fully trained value network. Is my understanding correct, Mr. Noam?
@BryanM61 2 years ago
As I watch this video, I'm also watching some online poker games. There is a player on one of the tables named 'NBrown'. Hmmmm......
@asaf92 3 years ago
33:30 Why is the input dimension of Texas Holdem 1326?
@noambrown8362 3 years ago
In Texas Holdem each player has two hidden cards, so there are 52 choose 2 = 1326 hand combos
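(The count can be checked directly:)

```python
from math import comb

assert comb(52, 2) == 1326  # unordered two-card holdings per player in Texas hold'em
```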
@grabaka884 3 years ago
Does the exploitability issue when playing humans that you mentioned at 26:19 also apply to Libratus, or only to this new algorithm? Or does this issue only apply to ReBeL because you are using value networks?
@noambrown8362 3 years ago
It was *mostly* not an issue in Libratus. Libratus used two kinds of search: upon reaching the turn for the first time, it used unsafe search. Thereafter, it used safe search. The exploitability issue was in theory relevant when using unsafe search, but in practice wasn't really an issue in poker. (The reason it wasn't an issue in practice is complicated. Basically because the search occurred immediately after the turn card came out, the unpredictability of the turn card made it difficult for a human to actually exploit the bot's strategy.) You might wonder why we used unsafe search at all if we had safe search. The safe search techniques that existed before ReBeL added extra constraints to the search strategy that in practice could make it do worse than unsafe search. The nice thing about ReBeL's search is that it's unexploitable just like past safe search algorithms, but it doesn't have their extra constraints so ReBeL's search in practice performs much better.
@grabaka884 3 years ago
​@@noambrown8362 Thank you for the response. Conceptually, is it correct to think of the goal of the technique you employed of randomly selecting iterations in the search as making the strategies robust against every type of potential counterstrategy (even bad ones) in terms of long-run expectation/unexploitability ?
@noambrown8362 3 years ago
@@grabaka884 yes that's a good intuition for it. The real reason you need to select a random iteration has to do with the details of CFR. But understanding it would require going into the details of the CFR decomposition algorithm, which would take a while.
@jonabirdd 3 years ago
While this work tries to incorporate techniques successful in deep RL, notably the use of value function approximation to truncate search, into solving imperfect info games, it's not clear that it can scale to scenarios where the underlying state space is already large. In the toy poker game, one has a state space of cardinality 104. I suppose this issue is addressed with the final comments on recon chess.
@noambrown8362 3 years ago
The state space of the whole game is extremely large: 10^161 decision points, which is (very roughly) comparable to the size of Go. When converted into the belief space representation, Texas holdem can be viewed as a continuous state space with more than 2,651 dimensions. Go has 0 dimensions in the belief representation. The issue with scaling is with increasing amounts of hidden information, *not* with an increasing size in the state space.
@jonabirdd 3 years ago
-Actually, since you don't know either your cards or your opponent's in the belief space representation (they are not common knowledge), isn't the public belief state of size 52 * 51 * 50 * 49 / 2 or similar?- Oh, I see, what we care about is the size of the info states, and not the PBS as a whole. I suppose contrary to your comment about 6 player poker games, this approach might already be intractable for 3 player games, since the info state for a single player would have > 10^6 dimensions. Unless you are suggesting one might be able to handle them adequately separately? (in general, I suppose factorising a state space is a common technique to reduce dimensionality, whether learned or manually)
@jonabirdd 3 years ago
I'm quite interested, with regard to these limitations, in exploring whether learned representations over infostates, possibly "common representations", the sense they are shared by all players, could be used to overcome them. I suppose that agents that have solved imperfect info games with high dimensional/continuous info states like dota 2/starcraft already do this (approximating over hidden knowledge) implicitly. There does seem to be some crossover point where humans also can't handle high-dimensional info states with high precision. Since developing specific techniques was not required* in those cases to beat humans. Imperfect info is however only one part, rather than being the whole point of the game. (*with the caveat that alphastar uses population based training with exploiter agents to reduce main agent exploitability, but this is not an algorithmic approach, and certainly not in same vein as work done in poker AI. Alphastar's approach addresses the problem of data distribution, rather than imperfect info per se).
@noambrown8362 3 years ago
@@jonabirdd 6-player poker shouldn't be much of a problem for ReBeL (except the fact that PBS values in theory might not be unique). It's the number of infostates in a public state that is the limiting factor, not the size of the infostates. In 3-player poker there are 3 * 1,326 infostates per public state, whereas the size of infostates is about 10^6. You can approximate the equilibrium of any game regardless of the amount of hidden information by not doing search at all. This is what AlphaStar does in StarCraft 2 and OpenAI Five does in Dota 2. But search makes a *huge* difference in games where it has been used. The fact that AlphaStar was not able to achieve superhuman performance in StarCraft 2 and is highly exploitable in the game despite using massive amounts of compute suggests that search might be important in that game as well. The problem is that nobody knows how to do search effectively in RTS games yet (which is why it wasn't used in StarCraft / Dota).
@mitchconner1185 10 months ago
Assuming CFR-AVG is used at test time such that the final CFR iteration t is a uniform random sample from interval [1, x], and then the average policy over iterations [1, t] is assumed for all players etc. It seems to me that as x approaches infinity, the overall policy that CFR-AVG produces approaches the policy that unsafe search would produce. Or am I missing something?
@noambrown8362 10 months ago
No, unsafe CFR does not converge to an equilibrium in the final iterate. It's only the average that converges. The final iterate can cycle within the support of the equilibrium.
@mitchconner1185 10 months ago
@@noambrown8362 If I understand correctly, what makes all the difference between plain old unsafe search and CFR-AVG is that, in any given PBS, unsafe search chooses a policy only for one player (the current one) at a time and then immediately after the current player's action updates the PBS according to the solved policy, whereas CFR-AVG chooses a policy for each player at the same time and updates the PBS according to the solved policies only after the agent has played at least one action using the solved policies.
@noambrown8362 10 months ago
@@mitchconner1185 I don't think that's correct
@mitchconner1185 10 months ago
@@noambrown8362 In that case I'm confused. The ReBeL paper gives an unsafe search example about a modified rock-paper-scissors game, where the first player's policy is solved close to optimal (but it chooses rock ever so slightly too often), after the first player's action the PBS is updated according to that policy, and then the second player solves his policy rooted at the updated PBS and ends up playing paper 100% of the time.

Assume CFR-AVG is used such that it does not always solve a subgame rooted at the beginning of the street, but rather a subgame rooted right before the opponent's most recent action, and that an iteration to stop on is chosen randomly from the interval [1, 100000000000]. This means that an iteration to stop on that is greater than 10000000 is going to be randomly chosen 99.99% of the time. My assumption is that the average policy on each iteration between iterations 10000000 and 100000000000 is going to be pretty much the same. Therefore we're effectively always choosing a policy deterministically, not randomly.

So, if it's not the act of stopping at a random CFR iteration that makes CFR-AVG "safe" compared to unsafe search, then what is it? Can you help me understand? I'm not good enough at math to follow the proofs in the ReBeL paper. As a side note, I do intuitively understand why choosing a random current policy (instead of a random average one) makes the algorithm "safe".
@davemateo1240 3 years ago
I think this technique would be applicable to trading card games like Magic or Yu-Gi-Oh, but would it struggle with different matchups or decks that are not equal, and if so, do you think there are ways of addressing it?
@noambrown8362 3 years ago
I don't see why it would struggle with unequal decks, especially if it's trained on all decks rather than on a specific pair of decks.
@davidadams7165 3 years ago
This seems like a cool idea outside the realm of poker. I'm gonna go implement it. Wish me luck
@davidadams7165 1 year ago
​@@noambrown8362 I am not sure if you are still looking at comments or not. But I gave it my best shot and got it published in AJCAI this year (yt won't let me link sadly) Thank you for being an awesome inspiration! I have learned so much.
@brianbrewer974 3 years ago
For the turn endgame holdem experiment where you trained with random values: was this something like each of the possible hands having a random probability between 0 and 1?

I was able to train a river net to a training loss of 0.013 and a validation loss of 0.018 with 500k samples (solved to an exploitability of 0.1% of the pot) using Deepstack's pseudorandom range-generating function and having the board be an input to the network (represented as 52 zeros or ones depending on whether the card is on the board). The inputs were also players' ranges rather than bucket probabilities. I also tested the network in different situations where both players had 100% of possible hands (a distribution not represented in the data), and it had a loss of about 0.003 in those cases (albeit a small sample). For some reason, that is an easier distribution than the pseudorandom distribution.

I'm guessing that if you assign random probabilities to every hand, each hand strength is going to have a similar number of combos in most samples. For instance, there may be 100 hands that make top pair; most samples on that board are going to have around 50 combinations of top pair. It may not be able to generalize in situations where a hand strength is very unlikely or very likely. I've come to the conclusion that buckets were not necessary in Deepstack or Supremus, but the algorithm for generating players' ranges is.
@noambrown8362 3 years ago
Yes, it was each possible hand with a random probability between 0 and 1 (and then normalized). What you're describing is an interesting result. I'm guessing the validation set was drawn from the same pseudorandom distribution used for training? What if you try solving the turn endgame that we tested in the ReBeL paper (where we assume uniform ranges at the start of the turn)? I'd be interested to see what kind of exploitability numbers you get. That would give you a sense of how good the network is outside of its training distribution.
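(A sketch of that belief-generation scheme, assuming NumPy; purely illustrative.)

```python
import numpy as np

def random_training_beliefs(num_hands=1326, rng=None):
    """Random belief vector as described above: an independent uniform value in
    [0, 1] per hand, then normalized to sum to 1."""
    if rng is None:
        rng = np.random.default_rng()
    b = rng.uniform(0.0, 1.0, size=num_hands)
    return b / b.sum()
```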
@brianbrewer974 3 years ago
@@noambrown8362 Yes, the validation set was created identically to the training set. It's not clear to me how to measure exploitability when using a value net. When solving to the end of the game, finding a best response is fairly straightforward and similar to a cfr iteration. When using a value net at the leaf nodes of the subgame, both players ranges will impact the result of the network (adding strong hands to your own range will change the expectation of your weak hands, this isn't true at terminal nodes at the end of the game). Do you measure exploitability in just the first round, or do you solve the second round (by passing the reach probabilities from the solution of the first round?) and measure exploitability in the full game?
@noambrown8362 3 years ago
@@brianbrewer974 You should first solve the turn using the river value network, and then "freeze" that turn strategy for each player. To calculate the best response value against P1, you run CFR on the *full* turn endgame (both turn and river without any neural value network) where P1's turn strategy is fixed at the frozen strategy (but not P1's river strategy, and not P2's turn/river strategy). You can do the same thing but mirrored to get the best response value against P2. Add these up to get exploitability.
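(Schematically, with `best_response_value` standing in for a hypothetical full turn-plus-river best-response solver that fixes only the given player's turn strategy:)

```python
def turn_endgame_exploitability(frozen_turn_policy_p1, frozen_turn_policy_p2,
                                best_response_value):
    """Exploitability estimate following the procedure described above: freeze each
    player's turn strategy in turn and compute the opponent's best-response value
    on the full (turn + river, no value net) endgame, then add the two values."""
    br_vs_p1 = best_response_value(frozen_player=1, frozen_policy=frozen_turn_policy_p1)
    br_vs_p2 = best_response_value(frozen_player=2, frozen_policy=frozen_turn_policy_p2)
    return br_vs_p1 + br_vs_p2
```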
@brianbrewer974 3 years ago
@@noambrown8362 Sorry for the slow response. I created an end of turn network with 1m subgames and its huber loss was 0.00239. I solved the turn endgame holdem subgame described in the rebel paper. Locking the oop player strategy in a commercial solver caused him to lose about 50 chips, or 2.5% the pot. I haven't tested the ip player as copying the strategies was quite time consuming. I've noticed that the combo selection in the strategy (from the neural net) is super specific. It's much more likely to play pure strategies with specific hands than the solution from solving to the end of the game. This may be a weakness from the outputs not being bucketed.
@noambrown8362 3 years ago
@@brianbrewer974 It's not really possible to say what the exploitability is without measuring how much both players lose. The oop player might be losing 2.5% simply because the oop player might be at a disadvantage given the ranges. Can you let me know what the numbers look like for ip too? For comparison, ReBeL is exploitable by about 0.25% the size of the pot.
@user-xs9ey2rd5h 1 year ago
In the paper the following is said: "At the start of the game, a depth-limited subgame rooted at the initial PBS βr is generated. This subgame is solved (i.e., a Nash equilibrium is approximated) by running T iterations of an iterative equilibrium-finding algorithm in the discrete representation of the game," but I'm a bit confused as to how you would solve the discrete representation of the game when you start in the public belief state. Do you sample an infostate from the public belief state and run CFR on it, or do you do something else?
@noambrown8362 1 year ago
This is a bit tricky. You can view it as sampling a history at the root of the subgame on each iteration (you don't just sample an infostate, because an infostate only describes one player's information). Now you might ask, how do you sample a history if you only have the probability distribution over infostates for each player? It turns out the probability distribution over infostates for each player uniquely defines a probability distribution over histories. You could alternatively just always view a PBS as a probability distribution over histories, but if each player has 1,000 possible hands then there are 2*1,000 infostates, whereas there are 1,000^2 histories, so representing a PBS using probabilities over infostates is more compact.
@user-xs9ey2rd5h 1 year ago
@@noambrown8362 I see but then how would you apply CFR to a PBS, since normally you feed CFR a history and reach probabilities. Do you sample/keep track of a history then? It's nice that a probability distribution over infostates uniquely defines a probability distribution over histories but I presume a mapping from one to the other can't be explicitly written down.
@noambrown8362 1 year ago
​@@user-xs9ey2rd5h Yes you sample a history from the PBS and then solve the subgame just as you normally would solve an imperfect-information sequential game in the discrete representation.
@quebono100 3 years ago
What is the difference between this and MuZero? I'm not an engineer, just a programmer learning ML on my own. In my view you're doing kind of the same thing MuZero does: trying to come up with a policy model that the agents use. Sorry for asking. I should have just read the paper :D
@noambrown8362 3 years ago
MuZero also can't play poker. It can play perfect-information games without having to know the rules. That's impressive, but a separate contribution from ReBeL.
@quebono100 3 years ago
@@noambrown8362 Thank you for your answer
@oed572 2 years ago
Why didn’t you mention AlphaStar?
@pavelk4524 3 years ago
Is there an implementation of ReBeL/DREAM on GitHub?
@noambrown8362 3 years ago
Yes! Though not for Texas holdem specifically. ReBeL: github.com/facebookresearch/rebel DREAM: github.com/EricSteinberger/DREAM
@DhruvaKartik 3 years ago
The proof of Theorem 3 is not very convincing. It looks like you are proving that the strategy obtained by Algorithm 1 is an equilibrium in a game where players know the past strategies and can compute beliefs. This is because the value at leaf nodes is obtained as a function of beta^t which in turn is a function of the strategies. Furthermore, according to the proof, the policy averaged over all iterations and the policy that stops at a random iteration are equivalent. However, you seem to imply that one of them works while the other does not. There seems to be a discrepancy here, it would be great if you can clarify this. In zero-sum games, going from the case where the strategies are common knowledge to the case where they are not common knowledge is a tricky long-standing problem that has been addressed only in some special cases. That is why I feel skeptical about Theorem 3 and I think it should be handled more carefully.
@noambrown8362 3 years ago
Hi Dhruva, Theorem 3 (and all of Section 6) is in my opinion the most interesting, counter-intuitive, and under-appreciated parts of the paper, so I'm glad you highlighted it! Theorem 3 doesn't say that we need to know the opponent's actual strategy. Instead, it says that if we assume that the opponent's strategy is sampled from an iteration of CFR, then the strategy that we play is (in expectation) a Nash equilibrium strategy. It's still a Nash equilibrium strategy regardless of what the opponent actually plays (and indeed the opponent will almost certainly play a strategy different from what we assume). To be clear, it is *not* theoretically sound to run CFR in a subgame and then assume the opponent played that average strategy. However, it *is* theoretically sound to sample a random iteration and assume the opponent played the strategy from that random iteration. The policy you play (in expectation) in future subgames will be different in the latter case than in the former case.
@DhruvaKartik 3 years ago
Hi Noam, thanks a lot for clarifying the statement of the theorem. That is more or less how I understood it and I am inclined to agree with it. My confusion however is with the proof of the theorem. If the proof of this theorem is correct, then it is indeed a major development in game theory. It would be great if you can make the proof more mathematically rigorous. Specifically, these are my concerns: 1) As you just said, assuming the opponent played the average strategy is not sound while sampling a random iteration and assuming the strategy from this iteration is sound. If you look at the arguments used in the proof of theorem 3, there is nothing that distinguishes these two methods. In fact, the arguments clearly say that both of them are equivalent ("one can equivalently pick a random iteration..." in the fourth paragraph for instance). Which part of the proof suggests that randomly choosing iteration t is sound whereas the average strategy is not? In the proof arguments, where is the significance of this added layer of randomizing over iterations? 2) In the last line of paragraph 3, you state "the average policy over all iterations is (approximate) Nash equilibrium." I think one needs to be more specific about the game in which this is a Nash equilibrium. Is it the game in which players know the strategies or they do not know the strategies? To prove the theorem, we would like this to be the game in which players do not know the strategies. But since the beliefs beta^t (which depend on strategies) are being used at the leaf nodes, I am not sure if that is the case.
@noambrown8362 3 years ago
Hi @@DhruvaKartik, the key is that in paragraph 3 we say that the subgame is solved using CFR-D ("Solving Imperfect Information Games Using Decomposition" by Burch et al. AAAI-14). The theorem makes a lot more sense once you understand the CFR-D algorithm. Once you understand CFR-D, the proof can be summarized as follows: 1) CFR-D computes an average policy that is a Nash equilibrium for the full game. 2) Playing the average of T policies is equivalent to sampling one of the T policies and playing that policy. 3) Thus, rather than play the average policy produced by CFR-D, one could equivalently sample a random iteration and play that policy (and this sampling is done for every recursive call of CFR-D). 4) That is exactly what ReBeL does, but it also uses a value network to estimate the value of some subgames rather than solving them recursively. Feel free to email me if you'd like to discuss this further. Understanding CFR-D can be tricky but the proof will make a lot more sense then.
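(A toy illustration of step 2 at a single decision point, assuming NumPy; ReBeL repeats the sampling for every recursive subgame, which this one-shot example does not capture.)

```python
import random
import numpy as np

def act_from_average(policies):
    """Act once from the average of T per-iteration policies at a single decision point."""
    avg = np.mean(policies, axis=0)          # mean of valid distributions is a distribution
    return np.random.choice(len(avg), p=avg)

def act_from_sampled_iteration(policies):
    """Equivalently in expectation: sample a random iteration t and act from that
    iteration's policy (steps 2-3 of the summary above)."""
    t = random.randrange(len(policies))
    return np.random.choice(len(policies[t]), p=policies[t])
```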
@DhruvaKartik 3 years ago
Thanks a lot for your time, @@noambrown8362. I will read more about CFR-D and send you an email if necessary. Looking forward to seeing more results on ReBeL in future conferences.
@brianbrewer974 3 years ago
@@noambrown8362 Could Theorem 3 be applied to MCCFR? 1) Could you solve a subgame with MCCFR and sample a random iteration? 2) If so, due to the external sampling of MCCFR, could you simply use the average strategy of the final iteration (or even the current strategy of the final iteration), rather than sampling a random iteration? My intuition is that the last iteration of MCCFR is really similar to doing CFR and sampling among the last few iterations. I'm not really sure how to prove it, and it may be wrong.
@UncoveredTruths 3 years ago
cool i guess ... jk :P