Noam Brown
Comments
@JarradSingly 3 months ago
I love poker so much.
@mitchconner1185 8 months ago
I'm unable to solve Kuhn Poker (the version with 2 players and 3 cards: en.wikipedia.org/wiki/Kuhn_poker) by using the theoretically sound version of CFR-AVG. If the Kuhn Poker terminal nodes are evaluated assuming both players play the average policy, then the average policy doesn't converge to optimal. For the heck of it, I also tried if the average over all iterations' average policies would converge to optimal, and found that it doesn't. When I change the code to evaluate the Kuhn Poker terminal nodes assuming both players play the current policy, then the average policy converges to optimal.
@Free-pp8mr 8 months ago
Great job, Noam! But what about the Prisoner's Dilemma? I see that you study non-cooperative zero-sum games, so maybe it is not your subject at all. But if all AI algorithms concentrate only on competing, not on cooperating, will we have any win-win games in humanity's future? Are altruistic genes mere vestiges for the future homo artifices?
@ainativehealth 9 months ago
Omg, this is the video made by Noam Brown himself.
@ytpah9823 10 months ago
🎯 Key Takeaways for quick navigation:
00:00 📖 The talk is about combining deep reinforcement learning and search for imperfect-information games.
00:14 🎮 AlphaGo was a breakthrough in AI but was specifically designed for the game of Go.
00:43 🔄 Alpha Zero is a generalized version that can play Chess, Go, and Shogi, but only perfect-information games.
00:58 🕵️ Most real-world situations involve hidden information, termed imperfect-information games.
01:26 🎲 The main challenge in imperfect-information games is poker, especially No Limit Texas Hold'em.
01:55 ♠️ In 2019, Pluribus AI surpassed human performance in six-player No Limit Texas Hold'em.
02:22 🤝 ReBeL (Recursive Belief-Based Learning) aims to unify AI techniques for both perfect and imperfect information games.
03:06 ✅ ReBeL shows potential in achieving superhuman performance in two-player No Limit Texas Hold'em with less domain knowledge.
03:34 🌐 A fundamental concept in AI gaming is the 'value of a state,' determining the probable outcome of a game state.
05:39 🔎 Search in AI gaming helps approximate an optimal game policy without having to explore every possible game state.
07:36 🔄 Alpha Zero's learning process involves updating its value network based on wins and losses from its self-play.
08:38 ❓ Traditional state values don't translate directly to imperfect-information games.
08:52 🪨 An example game, Rock Paper Scissors Plus, illustrates the challenges of sequential imperfect-information games.
09:47 🎲 Players can't improve by shifting probabilities in certain situations.
10:04 🤷 In a limited-depth search, there's insufficient data to determine the optimal move.
10:32 🔄 Players' strategies are often based on assumptions about opponents' moves, which may not hold true.
11:00 📈 The value of a move, like "rock", can change based on the frequency it's selected.
11:14 🎯 Imperfect games like poker differ from perfect ones like chess where a move's value remains consistent.
11:41 🕵️ Assumption: The opponent knows our strategy, but not specific random outcomes.
12:52 🔄 One solution is to redefine the game state to be a probability distribution of moves.
13:33 💡 Key idea of "ReBeL": Change the state definition to be a probability distribution.
14:01 📚 The concept isn't new; similar strategies were used in AI and game theory previously.
14:30 🧸 To demonstrate the paper's ideas, a simple toy poker game is introduced.
15:12 🔄 The game has two representations, one with private info, the other without.
16:09 📜 Players announce their strategy based on each potential card they could hold.
16:38 🤝 The two game versions are strategically identical, even though one is perfect information and the other is imperfect.
18:03 🔑 ReBeL's key idea: Transform any imperfect game into a perfect information one.
18:31 📉 In the belief representation, a player updates their beliefs based on their and the opponent's moves.
19:28 📊 The state in belief representation is determined by the probability distribution over all potential cards.
19:57 📏 The game discussed uses a 104-dimensional continuous state space and a 156-dimensional action space, based on player decisions with cards.
20:41 ❓ It is debated whether AlphaZero could be used on the belief representation of the game.
21:09 🔍 Discretizing a 156-dimensional continuous action space doesn't yield quality solutions.
21:40 🔄 The public belief states are comparable to perfect information states, but they differ in dimensions based on the game's information perfection.
22:24 🌳 Monte Carlo Tree Search, as used in AlphaZero, becomes impractical for these high dimensions.
22:54 ⚙️ In poker AI, the Counterfactual Regret Minimization (CFR) algorithm outperforms fictitious play and is often used for better results.
23:22 📈 ReBeL is introduced as an extension of AlphaZero for imperfect information games. It uses subgames rooted in public belief states to find the optimal policy.
24:35 🔄 The process of generating subgames based on public belief states continues until the end of the game to improve the value network.
25:17 🔄 ReBeL chooses actions randomly to ensure proper exploration and, in theory, can converge to a Nash equilibrium.
26:00 💡 When playing against humans, the challenge is that they don't announce their entire policy.
27:54 ⚠️ Relying on deterministic algorithms can make a player vulnerable to exploitation by a knowledgeable opponent.
28:38 🔀 To be unpredictable, the agent stops CFR on a random iteration, making it hard for an opponent to exploit.
29:07 📄 Research validates that this random iteration approach leads to Nash equilibrium play.
29:33 🏆 ReBeL was tested on two-player no-limit Texas Hold'em poker against prior poker AIs and showed promising results.
30:16 📊 ReBeL achieved superhuman performance in poker, surpassing human players and benchmark bots.
30:30 🥈 ReBeL is not as strong as Libratus, though it did better against top humans.
30:44 🤖 ReBeL uses significantly less domain knowledge than prior poker bots.
30:59 🎲 In Liar's Dice, another imperfect-information game, ReBeL performed well without domain-specific knowledge.
31:27 🔄 ReBeL can be easily deployed in games without needing expert knowledge.
31:57 🎯 ReBeL consistently converges to a Nash equilibrium in two-player zero-sum games.
32:10 📌 ReBeL is a solution to extend AlphaZero strategies to imperfect information games.
32:38 🌲 Monte Carlo Tree Search is effective for perfect information games, but not as much for imperfect ones.
33:07 🔍 Future research aims for a universal algorithm for both perfect and imperfect information games.
33:22 ⚙️ ReBeL has limitations, including a reliance on common knowledge and a high-dimensional state and action space for its value functions.
Made with HARPA AI
@mitchconner1185 10 months ago
Assuming CFR-AVG is used at test time such that the final CFR iteration t is a uniform random sample from interval [1, x], and then the average policy over iterations [1, t] is assumed for all players etc. It seems to me that as x approaches infinity, the overall policy that CFR-AVG produces approaches the policy that unsafe search would produce. Or am I missing something?
@noambrown8362 10 months ago
No, unsafe CFR does not converge to an equilibrium in the final iterate. It's only the average that converges. The final iterate can cycle within the support of the equilibrium.
@mitchconner1185 10 months ago
@@noambrown8362 If I understand correctly, what makes all the difference between plain old unsafe search and CFR-AVG is that, in any given PBS, unsafe search chooses a policy only for one player (the current one) at a time and then immediately after the current player's action updates the PBS according to the solved policy, whereas CFR-AVG chooses a policy for each player at the same time and updates the PBS according to the solved policies only after the agent has played at least one action using the solved policies.
@noambrown8362 10 months ago
@@mitchconner1185 I don't think that's correct
@mitchconner1185 10 months ago
@@noambrown8362 In that case I'm confused. The ReBeL paper gives an unsafe search example about a modified rock-paper-scissors game, where the first player's policy is solved close to optimal (but it chooses rock ever so slightly too often); after the first player's action the PBS is updated according to that policy, and then the second player solves his policy rooted at the updated PBS and ends up playing paper 100% of the time. Now assume CFR-AVG is used such that it does not always solve a subgame rooted at the beginning of the street, but rather a subgame rooted right before the opponent's most recent action, and that a random iteration to stop on is chosen randomly from the interval [1, 100000000000]. This means that an iteration to stop on that is greater than 10000000 is going to be randomly chosen 99.99% of the time. My assumption is that the average policy on each iteration between iterations 10000000 and 100000000000 is going to be pretty much the same. Therefore we're effectively always choosing a policy deterministically, not randomly. So, if it's not the act of stopping at a random CFR iteration that makes CFR-AVG "safe" compared to unsafe search, then what is it? Can you help me understand? I'm not good enough at math to follow the proofs in the ReBeL paper. As a side note, I do intuitively understand why choosing a random current policy (instead of a random average one) makes the algorithm "safe".
@user-df4wh6dg1z 1 year ago
I have a question about Algorithm 1 on page 1 of the paper. In my understanding, the function COMPUTEEV computes the expected value of the PBS Br based on the output of the value neural network (computed in the SETLEAFVALUES function). However, you add (Br, v(Br)) to the value net data, which means using the output of a model to train it (did I understand right?). The CFR-D in Section 5.1 also uses the output of the value network to compute the new policy for the next iterations. I think CFR-D has to use ground-truth values to compute the policy for a new iteration, but it seems to use the output of an untrained value network. So where is the ground truth for the value network at the leaf nodes? Thank you so much Noam.
@noambrown8362 1 year ago
I'm not sure I understood your question. For some subgames, there will be leaf nodes at the end of the game (e.g., if the leaf node is for a fold action, or if you are on the river and both players have called). In those cases, you will receive a ground truth reward from the environment. Otherwise, you use the value network at the leaf node. But eventually the ground truth rewards will "bubble up" and the value network will become increasingly accurate.
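To make that leaf-evaluation rule concrete, here is a minimal sketch; the names (`node`, `pbs`, `value_net`) are illustrative assumptions, not the actual ReBeL code.

```python
def leaf_value(node, pbs, value_net):
    """Value used for a leaf of a depth-limited subgame (illustrative sketch)."""
    if node.is_terminal():
        # True end of the game (a fold, or a river showdown after both players
        # have called): the environment supplies a ground-truth payoff.
        return node.terminal_payoff(pbs)
    # Non-terminal leaf: fall back on the value network. Early in training this
    # estimate is poor, but ground-truth rewards gradually "bubble up" through
    # the training targets and the network becomes more accurate.
    return value_net(pbs.to_tensor())
```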
@user-df4wh6dg1z 1 year ago
@@noambrown8362 So could you check my understanding of your algorithm at each iteration t (from t_warm -> T):
Step 1: UPDATEPOLICY(G, policy at t-1). This step computes the policy at time t by applying CFR-D. CFR-D computes regret matching with the value at each leaf node (ground truth if the node is at the end of the game, or the value network output at an intermediate node of the game), so it updates the new policy based on the regret values.
Step 2: Update the average policy. This step just computes the new average policy from the current average policy and the policy at time t computed in step 1.
Step 3: G <- SETLEAFVALUES(Br, policy at time t, current value network parameters). Compute new leaf node values. These new leaf node values will be the new output of the value network computed at the time-t policy (if the leaf node is an intermediate node of the game) or the ground truth reward received from the environment (if the leaf node is an end node of the game).
Step 4: Compute the new expected value for the PBS Br, v(Br).
After T iterations, SOMEHOW the value v(Br) will be close to the ground truth reward of the leaf nodes (both at end nodes and intermediate nodes of the game) and we add (Br, v(Br)) to the value net training data. I cannot see any step that updates the parameters of the value network. Would you mind clarifying for me? Thank you so much Noam.
@user-df4wh6dg1z 1 year ago
@@noambrown8362 I think I understand how to train the value network after reading the conversation between you and @brianbrewer974. Take a poker game for example: first we train the value network with ground truth rewards (river round), so this network will predict well on the river, almost exactly matching the ground truth. So we can use this output to compute expected values for the turn round and run CFR-D for T iterations to get a proper value v(Br) at PBS Br. Now we add (Br, v(Br)) to train the value network further so that it can predict well on the turn (maybe that's why you call this a "bootstrapping" training strategy). Repeat this process for the flop round and the earlier rounds, and the value network will be trained with every PBS Br in all rounds. ReBeL finishes and we have a fully trained value network. Is my understanding correct, Mr. Noam?
@haochen9635 1 year ago
Love the talk. Will you share the source code?
@klawiehr 1 year ago
In that graphic of “solved games” around the 1 hour mark, whether Scrabble is a solved game is debatable. The current bots still have issues calculating equity, especially risk evaluation and strategy for the long game. Forcing Function has an interview on here with top player David Eldar that explores this topic. I also recommend Mack Meller’s new “Mack vs. Machine” series for analysis on human vs bot play.
@user-xs9ey2rd5h 1 year ago
In the paper the following is said: "At the start of the game, a depth-limited subgame rooted at the initial PBS βr is generated. This subgame is solved (i.e., a Nash equilibrium is approximated) by running T iterations of an iterative equilibrium-finding algorithm in the discrete representation of the game," but I'm a bit confused as to how you would solve the discrete representation of the game when you start in the public belief state. Do you sample an infostate from the public belief state and run CFR on this, or do you do something else?
@noambrown8362 1 year ago
This is a bit tricky. You can view it as sampling a history at the root of the subgame on each iteration (you don't just sample an infostate, because an infostate only describes one player's information). Now you might ask, how do you sample a history if you only have the probability distribution over infostates for each player? It turns out the probability distribution over infostates for each player uniquely defines a probability distribution over histories. You could alternatively just always view a PBS as a probability distribution over histories, but if each player has 1,000 possible hands then there are 2*1,000 infostates, whereas there are 1,000^2 histories, so representing a PBS using probabilities over infostates is more compact.
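To illustrate the sampling step, a small sketch (an assumption about the mechanics, not the actual implementation): draw one hand per player from the normalized per-player beliefs and reject draws that share a card.

```python
import random

def sample_history(p1_beliefs, p2_beliefs):
    """Each argument maps a hand (frozenset of two cards) to its belief probability."""
    while True:
        h1 = random.choices(list(p1_beliefs), weights=list(p1_beliefs.values()))[0]
        h2 = random.choices(list(p2_beliefs), weights=list(p2_beliefs.values()))[0]
        if not (h1 & h2):  # card removal: the two players cannot share a card
            return h1, h2

beliefs = {frozenset({"As", "Kd"}): 0.5, frozenset({"Qh", "Qc"}): 0.5}
print(sample_history(beliefs, beliefs))
```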
@user-xs9ey2rd5h 1 year ago
@@noambrown8362 I see but then how would you apply CFR to a PBS, since normally you feed CFR a history and reach probabilities. Do you sample/keep track of a history then? It's nice that a probability distribution over infostates uniquely defines a probability distribution over histories but I presume a mapping from one to the other can't be explicitly written down.
@noambrown8362 1 year ago
​@@user-xs9ey2rd5h Yes you sample a history from the PBS and then solve the subgame just as you normally would solve an imperfect-information sequential game in the discrete representation.
@oed572 2 years ago
Why didn’t you mention AlphaStar?
@xuanattran3778 2 years ago
Hi Sir, would it be right to say this kind of structure is a kind of lambda architecture, which has a batch layer and a serving layer?
@BryanM61 2 years ago
As I watch this video, I'm also watching some online poker games. There is a player on one of the tables named 'NBrown'. Hmmmm......
@pranshubahadur 2 years ago
Ty for this!
@bautistabaiocchi-lora1339 3 years ago
Amazing! thank you for this
@mitchconner1185 3 years ago
Talking specifically about playing poker at test time, it seems to me that you always solved each subgame rooted at the beginning of the current street, and if the opponent made an action that wasn't in the most recent subgame action abstraction, you added the action to the subgame abstraction and solved the subgame again from scratch. A couple of questions about this subgame solving (and re-solving) at test time: 1) Given a certain street that occurred during a given hand, did you always pick the same random CFR iteration to stop the subgame solving at? 2) Did you (similar to what you did with Pluribus) freeze the agent's actual hand's action probabilities for the actions he had already made earlier on the same street? Do you think it would be mathematically correct to solve the subgames such that they were rooted not at the beginning of the current street but rather at the game state (or rather PBS) that occurred right before the opponent's most recent action? I.e. you'd always pick a new random iteration to stop the subgame solving at, and right after the agent had chosen and made his action (based on the policy at that random iteration), update the PBS according to the observed actions' probabilities at that random iteration, and then use the updated PBS as the next subgame root.
@noambrown8362 3 years ago
Hi Mitch: 1) I think we always picked a different random iteration. 2) I think we did freeze the action probabilities for the agent's actual hand. I do think it would be mathematically sound to just start the subgame at the most recent PBS rather than the start of the betting round. However, there's in theory a risk that if you don't do that during training as well then you might encounter novel PBSs at test time. We didn't want to do that during training (didn't want to deal with the added complexity) so we didn't bother. That said, although in theory it could be a problem if you don't do it during training time, in practice I don't think it would matter.
@asaf92 3 years ago
33:30 Why is the input dimension of Texas Holdem 1326?
@noambrown8362 3 years ago
In Texas Holdem each player has two hidden cards, so there are 52 choose 2 = 1326 hand combos
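The arithmetic, for anyone who wants to check it (Python):

```python
import math
print(math.comb(52, 2))      # 1326 two-card combos per player
print(2 * math.comb(52, 2))  # 2652 infostate beliefs when both players are included
```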
@heemlo649 3 years ago
I see in your paper you say Modicum used the end of the current round as the depth limit. Wouldn't this end up being a very shallow depth limit if you are already near the end of the current round, and therefore not use the full extent of your computational resources? Did you ever explore an iterative deepening sort of approach?
@noambrown8362 3 years ago
Good question. In poker there is a chance node with a high branching factor at the end of each betting round. Searching beyond that chance node is extremely expensive. You probably could get some benefit for searching beyond it when you're close to the end of the round, but the gain would be small considering how expensive it is, and it would make the code a lot more complicated, so we didn't bother.
@mitchconner1185 3 years ago
I have a few questions about the poker implementation of ReBeL described in the paper.
Question 1: How is each player's infostate belief distribution (that the value/policy network takes as input) represented? I can think of the following four options:
1.1) Each infostate belief is equal to what some papers call the player's contribution to the reach probability (i.e. the product of the player's action probabilities of the actions he's taken thus far, given the player's infostate and the current/average strategy profile).
1.2) The same as option 1.1, except that each belief of an infostate that is impossible due to the current public cards is set to 0.
1.3) The same as option 1.2, except that each infostate belief distribution (of a given player) is always normalized such that the sum of the player's infostate beliefs is 1.
1.4) The same as option 1.2, except that the card removal effect is applied to each distribution and then each distribution is normalized. By "applying the card removal effect" I mean taking into account the fact that the same card can't be dealt to different players, and making sure that each infostate belief is equal to the actual probability that the player is in that infostate given the PBS.
My understanding is that in this context the expected value (EV) of a player's infostate in a PBS is the EV given the infostate and the PBS, assuming that the player plays a best-response strategy against an opponent who plays the average strategy.
Question 2: How are the infostate EVs (that the value network outputs) weighted? I can think of the following three options:
2.1) Weighted by chance's contribution to the reach probability of the infostate, i.e. each infostate's value network output is equal to the infostate's EV multiplied by the probability (either 0 or 1) that the infostate is possible given the public cards the value network currently inputs (some infostates are impossible due to the card removal effect).
2.2) Weighted by the external reach probability of the infostate given the opponent infostate belief distribution that the value network currently inputs. The external reach probability of the infostate is equal to the reach probability of the infostate divided by the infostate player's contribution to the reach probability, or in other words, the product of chance's and the opponent's contributions to the reach probability of the infostate.
2.3) Weighted by the reach probability of the infostate (I'm quite confident that this is not the case).
Question 3: What becomes of a player's infostate belief distribution when the player makes an action that, according to the current strategy, he should never make? This can occur during training because of exploration (if the current strategy instead of the average strategy is used and therefore the action probabilities can be dead zero) and during testing because then the opponent can make whichever action (and then a random current strategy is used). I can think of the following three options:
3.1) The player's infostate belief distribution remains unchanged (the same as it was before this "impossible" action).
3.2) The player's infostate belief distribution is made uniform and normalized such that the sum of the player's infostate beliefs is 1.
3.3) The player's infostate belief distribution is made uniform and normalized such that the sum of the player's infostate beliefs is equal to whatever the sum was before this "impossible" action.
@noambrown8362 3 years ago
Nice questions!
Question 1: We use 1.3. Card removal is accounted for when we're solving the subgame (when we sample hands for each player). When tracking beliefs, we don't correct for card removal (doing so would require tracking O(N^2) beliefs rather than O(N)).
Question 2: I'm pretty sure we did 2.2. This is a very subtle aspect of ReBeL and it's very easy to introduce bugs by not handling the weights correctly. Fortunately, it's not an issue in games like Liar's Dice where the players' private information is uncorrelated.
Question 3: When using CFR-D we did 3.2. If you use CFR-AVG then the belief distributions can never be undefined because the first iteration is uniform random. That's another perk of CFR-AVG.
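A minimal sketch of the belief tracking described in answers 1 and 3 (illustrative code, not the actual implementation): scale each infostate belief by the acting player's probability of the observed action, renormalize, and reset to uniform if the action had zero probability under the assumed policy.

```python
def update_beliefs(beliefs, policy, action):
    """beliefs: {infostate: prob}; policy: {infostate: {action: prob}}."""
    new = {s: p * policy[s].get(action, 0.0) for s, p in beliefs.items()}
    total = sum(new.values())
    if total == 0.0:
        # "Impossible" action observed: reset to uniform (option 3.2 above).
        return {s: 1.0 / len(beliefs) for s in beliefs}
    return {s: p / total for s, p in new.items()}  # option 1.3: keep beliefs normalized
```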
@mitchconner1185 3 years ago
​@@noambrown8362 I guessed that the belief distributions would have to be normalized because it considerably reduces the size of the PBS state space the network needs to learn. I believe the value network output values need to be then "re-weighted" to undo the normalization in order to get the correctly weighted counterfactual values that the CFR math requires. One more thing I wanted to make sure I understood correctly: At test time, when using CFR-AVG, the policies according which everyone is assumed to be playing (and according to which all belief distributions are adjusted) are defined by the current policy profile at a random iteration t, and not by the average policy profile over iterations [1, t], right? Therefore it seems to me that at test time, even when using CFR-AVG, it is still possible that the opponent's belief distribution could become undefined (all zeroes) and would need to be made uniform.
@noambrown8362 3 years ago
@@mitchconner1185 No, in CFR-AVG the policies for both players are assumed to be the average over iterations [1, t], where t is a random iteration you stop on. So the beliefs can never be undefined.
@mitchconner1185 3 years ago
@@noambrown8362 Thank you very much for your time. I realized I don't completely understand the CFR-AVG algorithm though. Specifically, when updating the average action probabilities of the average strategy in each infostate of the depth-limited subgame G, I know that the current action probabilities (that are being included to the average) should be weighted by the infostate player's own contribution to the infostate's reach probability, but is that reach probability calculated using the current strategy on iteration t or the average strategy over iterations [1, t]? I also think I've misunderstood how the undefined (all-zero) belief distributions are treated when using CFR-D. I thought (incorrectly?) that undefined belief distributions would be reset back to uniform immediately after the action that would make the distribution undefined. But I suppose that the undefined belief distribution should be made uniform only right before each leaf node is evaluated using the value network because if the player whose belief distribution becomes undefined is the one whose strategy is being updated during CFR-D, then all his average strategy updates in the infostates where his belief distribution is undefined would have to be weighted by his all-zero reach probability contributions (and not by whatever values resetting his beliefs would produce). Update: Now I'm thinking the depth limited, leaf-PBS-value contingent CFR-D algorithm probably works by keeping track of both reach probabilities and PBS's separately such that reach probabilities work exactly as they do in vanilla CFR, and PBS's are like reach probabilities except that the distributions are normalized and they're immediately reset to uniform once the distribution becomes undefined (all zeroes). Update: For anyone else struggling to understand CFR-D, I finally found a simple description of it followed by pseudocode in this paper: arxiv.org/abs/1906.11110 on page 21. It appears that "average strategy" in CFR-D is defined quite simply as the arithmetic mean over the strategies, as opposed to the player-reach-probability-weighted average it is defined as in the original CFR paper: poker.cs.ualberta.ca/publications/NIPS07-cfr.pdf.
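For reference, the reach-weighted average strategy from the original CFR paper that the comment above contrasts with CFR-D's plain arithmetic mean; here \(\pi_i^{\sigma^t}(I)\) is player \(i\)'s own contribution to the reach probability of infostate \(I\) under the iteration-\(t\) profile \(\sigma^t\):

```latex
\bar{\sigma}_i^{T}(I)(a) \;=\;
\frac{\sum_{t=1}^{T} \pi_i^{\sigma^{t}}(I)\,\sigma^{t}(I)(a)}
     {\sum_{t=1}^{T} \pi_i^{\sigma^{t}}(I)}
```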
@brianbrewer974 3 years ago
For the turn endgame holdem experiment where you trained with random values - Was this something like each of the possible hands has a random probability between 0 and 1? I was able to train a river net to a training loss of 0.013 and validation loss of 0.018 with 500k samples (solved to an exploitability of 0.1% of the pot) using Deepstack's pseudorandom range generating function and having the board be an input to the network (represented as 52 0's or 1s depending on if the card is on the board). The inputs were also players ranges rather than bucket probabilities. I also tested the network in different situations where both players had 100% of possible hands (a distribution not represented in the data), and it had a loss of about 0.003 in those cases (albeit a small sample). For some reason, that is an easier distribution than the pseudorandom distribution. I'm guessing that if you assign random probabilities to every hand - each hand strength is going to have a similar number of combos in most samples. For instance, there may be 100 hands that make top pair, most samples on that board are going to have around 50 combinations of top pair. It may not be able to generalize in situations where a hand strength is very unlikely or very likely. I've come to the conclusion that buckets were not necessary in Deepstack or Supremus, but the algorithm for generating players ranges is.
@noambrown8362 3 years ago
Yes, it was each possible hand with a random probability between 0 and 1 (and then normalized). What you're describing is an interesting result. I'm guessing the validation set was drawn from the same pseudorandom distribution used for training? What if you try solving the turn endgame that we tested in the ReBeL paper (where we assume uniform ranges at the start of the turn)? I'd be interested to see what kind of exploitability numbers you get. That would give you a sense of how good the network is outside of its training distribution.
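A minimal sketch of that range-generation step, under the assumption that it is exactly "independent uniform weight per hand, then normalize":

```python
import numpy as np

rng = np.random.default_rng(0)
range_vec = rng.uniform(0.0, 1.0, size=1326)  # one weight per possible hand
range_vec /= range_vec.sum()                  # normalize so the beliefs sum to 1
print(range_vec.shape, round(range_vec.sum(), 6))
```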
@brianbrewer974 3 years ago
@@noambrown8362 Yes, the validation set was created identically to the training set. It's not clear to me how to measure exploitability when using a value net. When solving to the end of the game, finding a best response is fairly straightforward and similar to a cfr iteration. When using a value net at the leaf nodes of the subgame, both players ranges will impact the result of the network (adding strong hands to your own range will change the expectation of your weak hands, this isn't true at terminal nodes at the end of the game). Do you measure exploitability in just the first round, or do you solve the second round (by passing the reach probabilities from the solution of the first round?) and measure exploitability in the full game?
@noambrown8362 3 years ago
@@brianbrewer974 You should first solve the turn using the river value network, and then "freeze" that turn strategy for each player. To calculate the best response value against P1, you run CFR on the *full* turn endgame (both turn and river without any neural value network) where P1's turn strategy is fixed at the frozen strategy (but not P1's river strategy, and not P2's turn/river strategy). You can do the same thing but mirrored to get the best response value against P2. Add these up to get exploitability.
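In outline, the evaluation described above could look like the sketch below; `solve_with_value_net` and `best_response_value` are hypothetical helpers standing in for a turn solver that uses a river value net and for a full-game CFR best-response computation.

```python
def exploitability(turn_endgame, river_value_net,
                   solve_with_value_net, best_response_value):
    # 1) Solve the turn with the river value network and freeze each player's
    #    turn strategy.
    frozen_p1, frozen_p2 = solve_with_value_net(turn_endgame, river_value_net)
    # 2) Best response against P1: solve the *full* turn+river endgame with no
    #    value net, holding only P1's turn strategy fixed.
    br_vs_p1 = best_response_value(turn_endgame, fixed={"p1_turn": frozen_p1})
    # 3) Mirror for P2, then add the two best-response values.
    br_vs_p2 = best_response_value(turn_endgame, fixed={"p2_turn": frozen_p2})
    return br_vs_p1 + br_vs_p2
```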
@brianbrewer974 3 years ago
@@noambrown8362 Sorry for the slow response. I created an end of turn network with 1m subgames and its huber loss was 0.00239. I solved the turn endgame holdem subgame described in the rebel paper. Locking the oop player strategy in a commercial solver caused him to lose about 50 chips, or 2.5% the pot. I haven't tested the ip player as copying the strategies was quite time consuming. I've noticed that the combo selection in the strategy (from the neural net) is super specific. It's much more likely to play pure strategies with specific hands than the solution from solving to the end of the game. This may be a weakness from the outputs not being bucketed.
@noambrown8362 3 years ago
@@brianbrewer974 It's not really possible to say what the exploitability is without measuring how much both players lose. The oop player might be losing 2.5% simply because the oop player might be at a disadvantage given the ranges. Can you let me know what the numbers look like for ip too? For comparison, ReBeL is exploitable by about 0.25% the size of the pot.
@multigladiator384 3 years ago
I just read "Superhuman AI for multiplayer poker" article by this Noam Brown guy and became interested in the "Supplementary Materials" he published for this project- but barely understood things. Then I open up youtube to introduce myself to Monte Carlo CFR and find this... Thank you in advance already!
@Ninjabrain 3 years ago
Hello! I am wondering how exactly a subgame is constructed from a PBS. At first I thought that you would just sample an infostate for each player using the infostate probabilities, and then construct a subgame in the discrete representation using those infostates, but then I realized this doesn't work because some infostates may be incompatible (e.g. both players cannot hold the same card in poker). Thanks in advance!
@noambrown8362 3 years ago
In the full game of poker, there is an initial chance node whose outcomes lead to players being dealt different poker hands (with each poker hand having equal probability). A subgame rooted at a PBS is constructed in a similar way, but with the initial chance node having outcome probabilities based on the probabilities of the PBS rather than being uniform. In other words, at the start of the subgame, each player is "dealt" a hand with probability equal to the belief for that hand (this gets a bit more complicated in poker compared to, say, Liar's Dice because players can't be dealt the same card, so you have to account for this "card removal", but it's not a very important conceptual detail).
@DhruvaKartik 3 years ago
The proof of Theorem 3 is not very convincing. It looks like you are proving that the strategy obtained by Algorithm 1 is an equilibrium in a game where players know the past strategies and can compute beliefs. This is because the value at leaf nodes is obtained as a function of beta^t which in turn is a function of the strategies. Furthermore, according to the proof, the policy averaged over all iterations and the policy that stops at a random iteration are equivalent. However, you seem to imply that one of them works while the other does not. There seems to be a discrepancy here, it would be great if you can clarify this. In zero-sum games, going from the case where the strategies are common knowledge to the case where they are not common knowledge is a tricky long-standing problem that has been addressed only in some special cases. That is why I feel skeptical about Theorem 3 and I think it should be handled more carefully.
@noambrown8362 3 years ago
Hi Dhruva, Theorem 3 (and all of Section 6) is in my opinion the most interesting, counter-intuitive, and under-appreciated parts of the paper, so I'm glad you highlighted it! Theorem 3 doesn't say that we need to know the opponent's actual strategy. Instead, it says that if we assume that the opponent's strategy is sampled from an iteration of CFR, then the strategy that we play is (in expectation) a Nash equilibrium strategy. It's still a Nash equilibrium strategy regardless of what the opponent actually plays (and indeed the opponent will almost certainly play a strategy different from what we assume). To be clear, it is *not* theoretically sound to run CFR in a subgame and then assume the opponent played that average strategy. However, it *is* theoretically sound to sample a random iteration and assume the opponent played the strategy from that random iteration. The policy you play (in expectation) in future subgames will be different in the latter case than in the former case.
@DhruvaKartik 3 years ago
Hi Noam, thanks a lot for clarifying the statement of the theorem. That is more or less how I understood it and I am inclined to agree with it. My confusion however is with the proof of the theorem. If the proof of this theorem is correct, then it is indeed a major development in game theory. It would be great if you can make the proof more mathematically rigorous. Specifically, these are my concerns: 1) As you just said, assuming the opponent played the average strategy is not sound while sampling a random iteration and assuming the strategy from this iteration is sound. If you look at the arguments used in the proof of theorem 3, there is nothing that distinguishes these two methods. In fact, the arguments clearly say that both of them are equivalent ("one can equivalently pick a random iteration..." in the fourth paragraph for instance). Which part of the proof suggests that randomly choosing iteration t is sound whereas the average strategy is not? In the proof arguments, where is the significance of this added layer of randomizing over iterations? 2) In the last line of paragraph 3, you state "the average policy over all iterations is (approximate) Nash equilibrium." I think one needs to be more specific about the game in which this is a Nash equilibrium. Is it the game in which players know the strategies or they do not know the strategies? To prove the theorem, we would like this to be the game in which players do not know the strategies. But since the beliefs beta^t (which depend on strategies) are being used at the leaf nodes, I am not sure if that is the case.
@noambrown8362 3 years ago
Hi @@DhruvaKartik, the key is that in paragraph 3 we say that the subgame is solved using CFR-D ("Solving Imperfect Information Games Using Decomposition" by Burch et al. AAAI-14). The theorem makes a lot more sense once you understand the CFR-D algorithm. Once you understand CFR-D, the proof can be summarized as follows: 1) CFR-D computes an average policy that is a Nash equilibrium for the full game. 2) Playing the average of T policies is equivalent to sampling one of the T policies and playing that policy. 3) Thus, rather than play the average policy produced by CFR-D, one could equivalently sample a random iteration and play that policy (and this sampling is done for every recursive call of CFR-D). 4) That is exactly what ReBeL does, but it also uses a value network to estimate the value of some subgames rather than solving them recursively. Feel free to email me if you'd like to discuss this further. Understanding CFR-D can be tricky but the proof will make a lot more sense then.
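Step 2 of that summary is just linearity of expectation: if \(t\) is drawn uniformly from \(\{1,\dots,T\}\) and CFR-D's unweighted average is used, then from an outside observer's point of view

```latex
\mathbb{E}_{t \sim \mathrm{Unif}\{1,\dots,T\}}\!\left[\sigma^{t}(I)(a)\right]
\;=\; \frac{1}{T}\sum_{t=1}^{T} \sigma^{t}(I)(a)
\;=\; \bar{\sigma}^{T}(I)(a).
```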
@DhruvaKartik 3 years ago
Thanks a lot for your time@@noambrown8362. I will read more about CFR-D and send you an email if necessary. Looking forward to seeing more results on ReBeL in future conferences.
@brianbrewer974 3 years ago
@@noambrown8362 Could theorem 3 be applied to mccfr? 1) Could you solve a subgame with mccfr and sample a random iteration? 2) If so, due to the external sampling of mccfr, could you simply use the average strategy of the final iteration (or even the current strategy of the final iteration), rather than sampling a random iteration? My intuition is that the last iteration of mccfr is really similar to doing cfr and sampling among the last few iterations. I'm not really sure how to prove it, and it may be wrong.
@dhruvjoshi588 3 years ago
Hi Noam, I was wondering what the computational requirements of training ReBeL were, as opposed to the Libratus model. I assume this is significantly more expensive considering the continuous nature of the game state space, and the fact that the game state now contains at least as much information as Libratus's game space. Could you improve my understanding of this? Thanks!
@noambrown8362 3 years ago
Yes, ReBeL is much more expensive than the Libratus approach. This is the cost of generality. We talk about compute usage a bit in the paper.
@davemateo1240 3 years ago
I think this technique would be applicable to trading card games like Magic or Yu-Gi-Oh, but would it struggle with different matchups or decks that are not equal? And if so, do you think there are ways of addressing it?
@noambrown8362 3 years ago
I don't see why it would struggle with unequal decks, especially if it's trained on all decks rather than on a specific pair of decks.
@davidadams7165 3 years ago
This seems like a cool idea outside the realm of poker. I'm gonna go implement it. Wish me luck
@davidadams7165 1 year ago
​@@noambrown8362 I am not sure if you are still looking at comments or not. But I gave it my best shot and got it published in AJCAI this year (yt won't let me link sadly) Thank you for being an awesome inspiration! I have learned so much.
@regogu 3 years ago
Good day! I believe the algorithms you use are the same algorithms used in poker solvers. I don't entirely understand the whole algorithm, but based on what I understood: at the start of the algorithm (iteration 1 for the blueprint strategy), it initializes both strategies to a uniform distribution; at each choice node, the strategies for players 1 and 2 will fold, call, and/or raise with equal probability. In most cases these probabilities are quite far from equilibrium. And the calculated EV is based on the accumulated averages from all iterations. From a solver perspective, wouldn't this make the calculated EV far from equilibrium, since the early iterations are still weighted heavily? Thanks in advance!
@noambrown8362 3 years ago
We do a lot of iterations, sometimes thousands, and weigh the later iterations more than the early ones, so the fact that it starts at uniform random isn't too much of a problem.
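One concrete weighting scheme of this kind is linear averaging (as in Linear CFR / Discounted CFR), which weights iteration \(t\)'s contribution proportionally to \(t\); this is an example of "weighing later iterations more," not necessarily the exact scheme used here:

```latex
\bar{\sigma}_i^{T}(I)(a) \;\propto\; \sum_{t=1}^{T} t \cdot \pi_i^{\sigma^{t}}(I)\,\sigma^{t}(I)(a)
```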
@regogu 3 years ago
@@noambrown8362 Thanks for the answer. In solvers, we don't usually run a large number of iterations since it would take a long time. Would you see this as a possible issue at low iteration counts? Could it also explain why some solves take longer to converge than others, assuming the same sized game tree? Love the articles, by the way. Also, correct me if I misunderstood the topic, but the near-Nash equilibrium strategy always has a value for exploitability denoted by "epsilon". However, I have not found any related article on how this is measured. Would you be able to give some recommendations?
@noambrown8362 3 years ago
@@regogu This is a problem for any iterative algorithm if you don't run a lot of iterations. To compute epsilon, calculate the value of a best response for both players and add them together. If that value is above zero then the solver's strategy is not an exact Nash equilibrium.
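Written out, that quantity is the standard exploitability (or epsilon) of a profile \((\sigma_1, \sigma_2)\) in a two-player zero-sum game; it is zero exactly at a Nash equilibrium and positive otherwise:

```latex
\varepsilon(\sigma_1, \sigma_2) \;=\;
\max_{\sigma_1'} u_1(\sigma_1', \sigma_2) \;+\; \max_{\sigma_2'} u_2(\sigma_1, \sigma_2')
```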
@notgenius 3 years ago
Hi Noam, I think this field of research is fascinating, and it's very impressive to see the strides you took and are taking in the realm of imperfect-information games. Previously, Pluribus cost very little computationally, despite playing at a 6-person table. I found one of the most impressive stats of Pluribus to be that it only took ~$150 and relatively little computation to train and execute (which is because of the lack of deep neural nets?). In comparison, with the implementation of deep reinforcement learning and self-play without information abstraction, how does ReBeL compare in terms of cost and compute? Thanks
@noambrown8362 3 years ago
Thanks! ReBeL is much more computationally expensive. The exact resources are mentioned in the paper, but it's definitely way more than expensive than can be done on a home computer. However, it appears to scale well, and it's possible that future research will bring the computational cost down.
@dhruvjoshi588 3 years ago
Around 59:00, when you talk about how in multiplayer games most players fold so the games become 2-player very quickly, wouldn't there be too many considerations about the other (now unattainable) subgames for it to be feasible to use this kind of algorithm? And if there were ways to reach them, do you have any predictions for how different the strategy would look, other than just expecting a stronger opponent hand on average?
@noambrown8362 3 years ago
The exponential increase in the number of states/subgames as more players are added is indeed a challenge. Fortunately we address this with our followup work on Pluribus.
@brianbrewer974 3 years ago
The value vector computed with CFR is added as a training example. When solving the last round, these values are completely accurate if it was solved to 0% exploitability. So over time it will get better at solving the round before the last, then the round before that. Something I don't understand is that you don't solve until the start of the next round. How does the value network improve at estimating the leaf nodes when you are giving it training data that consists only of public belief states at the start of a betting round?
@noambrown8362 3 years ago
The values of belief states at the end of a round can be computed by averaging over the values of belief states at the start of the next round. As we collect more data for values at the start of rounds, we also train the network to predict the values at the end of rounds.
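In symbols (notation assumed for illustration): the value of a belief state at the end of a betting round is the chance-weighted average of the values of the start-of-next-round belief states it can transition to, e.g. averaging over the possible turn or river cards \(c\):

```latex
v\!\left(\beta_{\text{end of round}}\right) \;=\; \sum_{c} \Pr(c)\; v\!\left(\beta_{\text{next round}}^{\,c}\right)
```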
@yangyin 3 years ago
This is the best ReBeL video I've watched. I'm very impressed by your insights into the method. Then I realized you are one of the authors ...
@zenlai1597 3 years ago
Lol
@randomBBL 3 years ago
I'm tired of the AI propaganda, until I can call a company and have to go through that AI and when it can actually understand when I say the word "agent" or "yes" then I'll consider AI being better than people, until then fuck off with this bullshit
@mingu600 3 years ago
Thank you for the video! I'm an undergraduate who wants to learn a lot more about RL, and this was really great. I was actually curious whether ReBeL could be directly applied to a game like Pokemon. I'm not sure how familiar you are with battling mechanics in Pokemon, but there is a lot of imperfect information (each opponent Pokemon has hidden moves, hidden stats which can later be inferred based on observable damage numbers and moves used) as well as damage rolls, crits, etc. In particular, unlike poker or RPS where there is a set turn order (for RPS we reformulated it to be this way), in Pokemon move order is determined by speed stat of the Pokemon in play, which is sometimes hidden to the players if not previously observed. Would you still be able to use ReBeL in this case, and would it have to be reformulated in any way? Also, I just had a couple quick clarifying questions about belief infostates. You mentioned in the paper that the input of the value network is the following size: 1(agent index) + 1(acting agent) + 1(pot) + 5(board) + 2 × 1326(infostate beliefs) Why is there a separate value for "agent index" and "acting agent"? Also, are infostate beliefs just the set of probabilities for every possible player hand (which we update based on player actions)? Why don't we include information such as amount of money each player put in the pot and bets made during that current round? Thank you again so much for the paper and video, I appreciate your time to help people like me learn more!
@noambrown8362 3 years ago
I'm not that familiar with Pokemon so it's hard for me to say, but one issue might be the amount of hidden information. At the end of the talk I mention that the input to the NN is proportional to the number of infostates in a public state. It sounds like that can be quite large in Pokemon. There's a separate value for agent index vs acting agent because we need to know whose turn it is and who we want the value for. Those could be different. We don't need to keep track of prior bets and other information like that for the same reason that in chess you don't need to keep track of all the prior moves; you just need the current board configuration (plus some additional info to prevent repeating the same position). The info we track is sufficient for computing the EV for each player. And yes, infostate beliefs are the probability for each hand.
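A small sketch of assembling that input vector (1 + 1 + 1 + 5 + 2 x 1326 = 2660 numbers); the exact encodings (card indices, pot normalization, padding value) are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def build_value_net_input(agent_index, acting_agent, pot, board_cards,
                          beliefs_p1, beliefs_p2):
    board = np.full(5, -1.0)                # pad board slots not yet dealt
    board[:len(board_cards)] = board_cards  # e.g. integer card indices
    x = np.concatenate([[agent_index, acting_agent, pot],
                        board, beliefs_p1, beliefs_p2])
    assert x.shape == (2660,)
    return x

uniform = np.full(1326, 1.0 / 1326)
x = build_value_net_input(0, 1, 0.25, [12, 33, 47], uniform, uniform)
print(x.shape)  # (2660,)
```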
@heemlo649 3 years ago
Sounds like he misspeaks at 27:20. Doesn't the subgame solving use no abstraction?
@noambrown8362 3 years ago
It uses no card abstraction. It still discretizes the action space. But if an opponent uses a bet size that wasn't in the abstraction then the bot can add it to the action abstraction.
@heemlo649 3 years ago
@@noambrown8362 Oh ok. Thank you for sharing the knowledge. Your lectures have been very helpful in understanding the papers.
@UncoveredTruths 3 years ago
cool i guess ... jk :P
@juggernautuci8253 3 years ago
BRILLIANT IDEAS.
@ryangardner3728 3 years ago
Amazing ideas, Noam, and great explanations. Thank you.
@jonabirdd 3 years ago
While this work tries to incorporate techniques successful in deep RL, notably the use of value function approximation to truncate search, into solving imperfect info games, it's not clear that it can scale to scenarios where the underlying state space is already large. In the toy poker game, one has a state space of cardinality 104. I suppose this issue is addressed with the final comments on recon chess.
@noambrown8362 3 years ago
The state space of the whole game is extremely large: 10^161 decision points, which is (very roughly) comparable to the size of Go. When converted into the belief space representation, Texas holdem can be viewed as a continuous state space with more than 2,651 dimensions. Go has 0 dimensions in the belief representation. The issue with scaling is with increasing amounts of hidden information, *not* with an increasing size in the state space.
@jonabirdd 3 years ago
-Actually, since you don't know either your cards or your opponent's in the belief space representation (they are not common knowledge), isn't the public belief state of size 52 * 51 * 50 * 49 / 2 or similar?- Oh, I see, what we care about is the size of the info states, and not the PBS as a whole. I suppose contrary to your comment about 6 player poker games, this approach might already be intractable for 3 player games, since the info state for a single player would have > 10^6 dimensions. Unless you are suggesting one might be able to handle them adequately separately? (in general, I suppose factorising a state space is a common technique to reduce dimensionality, whether learned or manually)
@jonabirdd 3 years ago
I'm quite interested, with regard to these limitations, in exploring whether learned representations over infostates, possibly "common representations", the sense they are shared by all players, could be used to overcome them. I suppose that agents that have solved imperfect info games with high dimensional/continuous info states like dota 2/starcraft already do this (approximating over hidden knowledge) implicitly. There does seem to be some crossover point where humans also can't handle high-dimensional info states with high precision. Since developing specific techniques was not required* in those cases to beat humans. Imperfect info is however only one part, rather than being the whole point of the game. (*with the caveat that alphastar uses population based training with exploiter agents to reduce main agent exploitability, but this is not an algorithmic approach, and certainly not in same vein as work done in poker AI. Alphastar's approach addresses the problem of data distribution, rather than imperfect info per se).
@noambrown8362 3 years ago
@@jonabirdd 6-player poker shouldn't be much of a problem for ReBeL (except the fact that PBS values in theory might not be unique). It's the number of infostates in a public state that is the limiting factor, not the size of the infostates. In 3-player poker there are 3 * 1,326 infostates per public state, whereas the size of infostates is about 10^6. You can approximate the equilibrium of any game regardless of the amount of hidden information by not doing search at all. This is what AlphaStar does in StarCraft 2 and OpenAI Five does in Dota 2. But search makes a *huge* difference in games where it has been used. The fact that AlphaStar was not able to achieve superhuman performance in StarCraft 2 and is highly exploitable in the game despite using massive amounts of compute suggests that search might be important in that game as well. The problem is that nobody knows how to do search effectively in RTS games yet (which is why it wasn't used in StarCraft / Dota).
@pavelk4524 3 years ago
Is there an implementation of ReBeL/DREAM on GitHub?
@noambrown8362 3 years ago
Yes! Though not for Texas holdem specifically. ReBeL: github.com/facebookresearch/rebel DREAM: github.com/EricSteinberger/DREAM
@quebono100 3 years ago
What is the difference from MuZero? I'm not an engineer, just a programmer learning ML on my own. In my view you're doing kind of the same thing MuZero is doing: trying to come up with a policy model which the other multi-agents are using. Sorry for asking. I should have just read the paper :D
@noambrown8362 3 years ago
MuZero also can't play poker. It can play perfect-information games without having to know the rules. That's impressive, but a separate contribution from ReBeL.
@quebono100 3 years ago
@@noambrown8362 Thank you for your answer
@quebono100 3 years ago
Amazing, thank you for sharing
@quebono100 3 years ago
I like how you explain it concisely and precisely.
@mkrenn87 3 years ago
Very cool! Your pro was Dong Kim, who was able to play from home according to your SI. Does that mean you have an online platform where one could play? Is there any chance to get access? Do you need special hardware to run the final version, or is a standard workstation sufficient? Would love to get beaten in chess and HU Texas Hold'em :)
@noambrown8362 3 years ago
We do, but it's not publicly accessible unfortunately.
@andreasbacher9326 3 years ago
Extremely impressive, thank you for making the video (together with the paper). Congratulations! I wonder why you only submitted to a CS conference, and not to a real scientific journal like Nature or Science? I think the quality of the work certainly warrants a publication that is seen by a large number of scientists, and that's gone through proper peer review (not only the CS conference reviews). I believe your methods could be very exciting for the interface between deterministic and indeterministic processes in physics (the classical and quantum mechanical boundary). They are in a way perfect information (classical physics) and imperfect information (quantum, see Bell's inequality).
@noambrown8362 3 years ago
Thanks! I think in order to turn it into a Nature or Science paper we would need experimental results on Go. That would take a lot of engineering work so we decided it wasn't quite worth it.
@grabaka884 3 years ago
Does the exploitability issue when playing humans you mentioned at 26:19 also apply to Libratus or only to this new algorithm? Or does this issue only apply to ReBel because you are using value networks?
@noambrown8362 3 years ago
It was *mostly* not an issue in Libratus. Libratus used two kinds of search: upon reaching the turn for the first time, it used unsafe search. Thereafter, it used safe search. The exploitability issue was in theory relevant when using unsafe search, but in practice wasn't really an issue in poker. (The reason it wasn't an issue in practice is complicated. Basically because the search occurred immediately after the turn card came out, the unpredictability of the turn card made it difficult for a human to actually exploit the bot's strategy.) You might wonder why we used unsafe search at all if we had safe search. The safe search techniques that existed before ReBeL added extra constraints to the search strategy that in practice could make it do worse than unsafe search. The nice thing about ReBeL's search is that it's unexploitable just like past safe search algorithms, but it doesn't have their extra constraints so ReBeL's search in practice performs much better.
@grabaka884 3 years ago
​@@noambrown8362 Thank you for the response. Conceptually, is it correct to think of the goal of the technique you employed of randomly selecting iterations in the search as making the strategies robust against every type of potential counterstrategy (even bad ones) in terms of long-run expectation/unexploitability ?
@noambrown8362 3 years ago
@@grabaka884 yes that's a good intuition for it. The real reason you need to select a random iteration has to do with the details of CFR. But understanding it would require going into the details of the CFR decomposition algorithm, which would take a while.
@hayatisschon 3 years ago
Great presentation, thanks for the effort. Now it is time to read the paper :)
@cam41527 3 years ago
Great video Noam! I always love to see you videos in my feed. "With a ReBeL yell I want more... more...more" 😜
@pauljones9150 3 years ago
Wow! This is pretty cool! Was there any reason for not testing ReBeL in 6 player poker games?
@noambrown8362 3 years ago
I think it would work fine in 6-player and we thought about doing it, but there aren't many benchmark agents for 6-player so it would be hard to evaluate the bot. In theory public belief states (PBSs) might not have unique values outside of two-player zero-sum games but I don't think that would be an issue for 6-player poker.
@noambrown8362 3 years ago
@毛凡 it would probably work for any number of players but it becomes increasingly difficult to evaluate as you add more players.
@Corpsecreate 3 years ago
Absolutely amazing presentation. Loved it from start to end. Some questions though:
1. I'm just checking to see if my understanding is correct, but is every abstracted state of the game considered as a single 'category', so to speak? In other words, there is no sense of how 'close' each state is to another state, as they are considered to be entirely independent positions. From a machine learning perspective, this would mean the game state would essentially be represented by a very long one-hot encoded binary vector, where only one entry (the current state) is 1.
2. How are actions chosen in proportion to regret if some regrets are negative? Do you only select from the actions that have a non-negative regret? If so, then wouldn't you run into issues where, due to the variance of the game, you might calculate a negative regret from one sample, even though the true EV is positive, causing you to ignore this action going forwards?
3. Selecting actions proportionally to regret sounds extremely similar to policy gradients in reinforcement learning. Is there any fundamental difference in the learning algorithm between the approach you presented and training an AI through self-play and learning a stochastic policy?
4. With reinforcement learning above, would you expect to also arrive at a Nash equilibrium solution? With a continuous representation of the game state, and a neural network as a function approximator, would you also expect faster convergence?
Thanks in advance :)
@noambrown8362 3 years ago
Hi Sam,
1. Yes that's correct. You could use function approximation instead (see Deep CFR, for example) but it is much more expensive. Fortunately, in poker we have these abstractions that are extremely effective.
2. When sampling, you choose a strategy based on non-negative regret. However, you still explore all actions for the traverser, even if the probability on that action is zero. So even if you sample negative regret for an action that has positive EV, you'll collect more samples and eventually determine it is positive EV.
3. There are a lot of connections between CFR and other convex optimization and policy gradient algorithms. There are definitely differences though. For example, we don't need any tuning parameters or learning rates.
4. Only certain RL algorithms provably converge to a Nash equilibrium (because you need to get the learning rate just right). However, most non-CFR RL algorithms converge quite slowly to a Nash equilibrium. If you want to use something like Deep RL to converge to a Nash equilibrium, you probably want to use a deep learning variant of CFR (of which there are many... Deep CFR, Double Neural CFR, DREAM, ARMAC). NFSP is another reasonable choice. That said, continuous action spaces are still quite challenging.
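For readers following point 2, the regret-matching rule in standard CFR notation (added for reference), where \(R^{t}_{+}(I,a)=\max(R^{t}(I,a),0)\):

```latex
\sigma^{t+1}(I)(a) =
\begin{cases}
\dfrac{R^{t}_{+}(I,a)}{\sum_{b \in A(I)} R^{t}_{+}(I,b)} & \text{if } \sum_{b \in A(I)} R^{t}_{+}(I,b) > 0,\\[1.5ex]
\dfrac{1}{|A(I)|} & \text{otherwise.}
\end{cases}
```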
@Corpsecreate 3 years ago
@@noambrown8362 Thanks so much for your answers. One thing that I think would be super interesting to see is what would happen if the abstraction process was very aggressive to start with. You find an approximate solution to this very small game, then rerun the abstraction with more retained information than the previous abstraction. You could use the original solution as a seed for the new abstracted game, and do a weighted sum of all strategies that are 'close' to a given state in this second game. It would be interesting to see how this converges, but damn, all this CFR stuff is so tough to implement.
@amrita49 3 years ago
Why don't you put it online and see how many people will beat it?? Tons. "Top humans" is a meaningless definition; someone will adapt very fast to the bot without the programmers re-tuning it continuously. It's not chess.
@yengamaurice 3 years ago
Just a few questions. I read the paper on Pluribus and there is one thing I don't seem to understand. The blueprint strategy, I assume, used IR 169-200-200-200 KE-KO buckets for its abstraction according to the supplementary material for Pluribus, but it only gets used in the first round, and only if the bet size is less than $100 removed from any action in the blueprint strategy's action abstraction. However, real-time search uses 500 buckets for future rounds and lossless abstraction in the round it is in.
1) Does this mean that the algorithm uses lossless abstraction first and then buckets information situations into 200 buckets if it's, for example, on the flop (round 2), and then uses 500 buckets for rounds 3 and 4 in the subgame it is in?
2) Given that the supplementary material states that the algorithm only uses lossless abstraction in the first betting round, but real-time search uses lossless abstraction in the round it is in and is always used in rounds 2-4, could you explain how and when exactly lossless abstraction gets used?
I understand that the two questions are sort of related.
@noambrown8362 3 years ago
1) The blueprint uses 200 buckets after the preflop. When doing search, there is lossless abstraction on the current round only. If the search subgame extends to the end of the game, then 500 buckets are used in the search subgame for rounds beyond the current round. If the search subgame does not extend to the end of the game (i.e., it's depth-limited search) then the search algorithm does rollouts according to the blueprint strategy beyond the depth limit of the subgame. That means those rollouts happen in the 200-bucket abstraction. 2) Lossless abstraction is used in the preflop and on the current round when doing search.