Hey man, I'm 30:13 in and I think giving a simple example of an MDP problem would be a huge help. They didn't even understand what you asked; some guy answered you "To maximize the reward" 😆
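For what it's worth, here's the kind of toy example I mean — a minimal sketch of a made-up two-state MDP in plain Python (all the state/action names and numbers here are invented for illustration, not from the talk):

```python
import random

# A toy MDP, purely for illustration: a "car" that can drive slow or fast.
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "cool": {
        "go_slow": [(1.0, "cool", 1.0)],
        "go_fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "go_slow": [(1.0, "cool", 1.0)],
        "go_fast": [(1.0, "broken", -10.0)],
    },
    "broken": {},  # terminal state: no actions available
}

def step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    r, cumulative = random.random(), 0.0
    for p, next_state, reward in transitions[state][action]:
        cumulative += p
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # fall through on floating-point edge cases

# Roll out one episode under a uniformly random policy.
state, total = "cool", 0.0
while transitions[state]:
    action = random.choice(list(transitions[state]))
    state, reward = step(state, action)
    total += reward
print("episode return:", total)
```

Something that small already shows states, actions, stochastic transitions, rewards, and a return — which I think is all the audience needed.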
They raised an important question that I feel wasn't discussed properly: "What makes a problem an RL problem?" The speaker's claim is that it's when the transition function is unknown. But then he was challenged, and it wasn't made clear exactly what the disagreement was. Was somebody able to catch it more precisely? Did both sides agree that it's when the transition function is unknown, and just disagree about whether chess is an RL problem by its nature, since you can enumerate all possible ways the opponent might react and thus the transition is "known"? But isn't the speaker right that we don't actually know which action the other player will take, and thus we need to estimate it from data (even if their set of possible moves is finite)?
Someone in the audience mentioned that they are familiar with methods for coming up with policies which don't use reinforcement learning algorithms to solve the problem, for example alpha-beta search (see this lecture for a nice overview: gki.informatik.uni-freiburg.de/teaching/ss14/gki/lectures/ai06.pdf). But note that alpha-beta search (and other search variants) make an assumption that turns an inherently RL problem into a search problem: they assume the transition function is known, in the sense that the other player(s) will play optimally in response to your move. By assuming a transition model you can plan ahead, simulating what will happen up to a (finite) horizon into the future if you played a particular move now. This search is still difficult due to the massive game tree you need to expand, and indeed, until recent advances, most of the SOTA agents in chess and other board games were variants of search algorithms. But the problem remains inherently an RL problem (the transitions are unknown). You can cast it down to a search/planning problem by assuming a transition model, but your mileage will vary with how realistic the assumption of the opposition playing optimally is.
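To make the "assume the opponent plays optimally" point concrete, here is a minimal alpha-beta sketch over a hand-made game tree (the tree and its payoff values are invented for illustration; the linked lecture notes have the proper treatment):

```python
import math

# A tiny hand-made game tree: internal nodes are lists of children,
# leaves are payoffs for the maximizing player.
tree = [[3, 5], [2, [9, 1]], [0, 4]]

def alphabeta(node, alpha, beta, maximizing):
    """Minimax value of `node`, pruning branches that cannot matter.

    The key assumption: the minimizing player responds optimally,
    which is exactly the "known transition" that turns the problem
    into deterministic search.
    """
    if not isinstance(node, list):  # leaf: known payoff
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:  # opponent would never let us reach this branch
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

print(alphabeta(tree, -math.inf, math.inf, True))  # 3, the value under optimal play
```

If the real opponent doesn't actually play the minimizing move, that returned value is no longer a guarantee — which is the mileage-may-vary caveat above.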
I didn't want to spend too much time on this aspect, as I wanted to get to the NAS content instead of being bogged down in RL basics. It is difficult to teach a NAS overview to such a broad audience (the set of all Microsoft employees) with widely varying backgrounds. I ended up spending too much time on the RL part just because some NAS algorithms use techniques from it (policy gradients and evolutionary methods). In later iterations I have abstracted the RL part away a bit more.
I don't think this question is as important. Having read the Sutton and Barto book, I would've said reward and value functions and then pointed at the Bellman equation. Finite MDPs are covered in Chapter 3 of that introductory RL book, so I don't really understand the original question, unless the expected answer was simply "finite". These are all, broadly, different models of the same set of problems. 'RL' is introduced when alpha-beta pruning and other more exhaustive searches are impractical due to the scale of the domain (or for sample-efficiency reasons) and a stronger heuristic must be introduced. This trend continues to this day with deep RL, where even more approximation is used to scale even further, to state-action spaces previously unsolvable (AlphaGo, OpenAI Five, AlphaStar). I'm sure theoretical models of computation would show that exhaustive search (setting aside observability) can do the same, but the problem, and the field of study around it, is premised on finite compute resources: hence heuristics.
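For reference, this is the Bellman optimality equation I would have pointed at, in the standard Sutton and Barto notation (where p is the joint transition/reward distribution and gamma the discount factor):

$$ v_*(s) = \max_a \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr] $$

When the sums are over spaces too large to enumerate (or p is unknown), you approximate — which is where value-function heuristics and deep RL come in.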