Just the fact that it can play chess is far more impressive than the fact that it didn't win against a level-5 trained chess engine. To me it shows these agents are perfectly capable of automating relatively simple tasks.
Recently I was thinking about chess, agents, and strategy games in general, and I realized that if I want to use an agent for chess, it should call a deep learning model that was trained on chess and then just handle the response, so the LLM is used for the user's input and output.
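That split (LLM for conversation, dedicated engine for move selection) could be sketched roughly like this. Everything here is illustrative: `query_engine`, the tiny canned opening book, and the reply format are stand-ins for a real engine call, not anything from the video.

```python
# Hypothetical sketch: the LLM layer handles user-facing input/output,
# while move selection is delegated to a trained chess model/engine.
# The opening book below is a toy stand-in for a real engine such as
# Stockfish or a chess-specific neural network.

OPENING_BOOK = {
    "": "e4",
    "e4 e5": "Nf3",
    "e4 e5 Nf3 Nc6": "Bb5",
}

def query_engine(move_history: list[str]) -> str:
    """Stand-in for the call out to the chess model."""
    return OPENING_BOOK.get(" ".join(move_history), "resign")

def agent_turn(move_history: list[str]) -> str:
    """The 'LLM layer': call the engine as a tool, then wrap the
    result in a natural-language reply for the user."""
    move = query_engine(move_history)
    if move == "resign":
        return "I'm out of book here, I resign."
    return f"I'll play {move}."

print(agent_turn([]))            # I'll play e4.
print(agent_turn(["e4", "e5"]))  # I'll play Nf3.
```

The point of the pattern is that the LLM never has to "know" chess; it only translates between the user and a tool that does.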
@@user-io4sr7vg1v to be fair, it would take a very good video to truly answer that question given the disparate audience. Perhaps “LLMs fight it out in a 64-square smackdown arena” is more accurate ;)
Or any research which is unusual; this can even include historical research where there are very limited and hard-to-find papers on very specific subjects. Also anything that basically falls into edge or outside cases. The same goes for code: when you are coding anything novel, the usefulness of LLM-based tools drops dramatically.
Hi! Thanks for the video and the code. Is there any reason you decided to separate the white and black moves in the prompt instead of using the "standard" format, e.g., 1. e4 e5 2. Nf3 Nf6, etc.? Since this is more common in books and websites, it might be easier for the models to parse. Just speculation; I may try this later if I find some time.
Thanks for watching. No particular reason; I doubt there would be much of a performance uplift from changing the representation of the board/moves. But let me know if you try it and do get an uplift.
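For anyone who wants to try the experiment above, merging separate white/black move lists back into the numbered format common in books is just string manipulation. This is a minimal sketch; the function name and list-based input format are my own, not from the video's code.

```python
def to_standard_notation(white_moves, black_moves):
    """Merge separate white/black move lists into the numbered
    format common in books and websites, e.g. '1. e4 e5 2. Nf3 Nf6'.
    Assumes black has at most as many moves as white."""
    parts = []
    for i, white in enumerate(white_moves, start=1):
        pair = f"{i}. {white}"
        if i - 1 < len(black_moves):  # black may be a move behind
            pair += f" {black_moves[i - 1]}"
        parts.append(pair)
    return " ".join(parts)

print(to_standard_notation(["e4", "Nf3"], ["e5", "Nf6"]))
# 1. e4 e5 2. Nf3 Nf6
```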
Can you give me a review of the Codestral LLM on Ollama? I use AI coding to build web applications. My RAM is a little low (32 GB); does Codestral run as smoothly as the others, like Llama 3 does? How much potential does Codestral have, and can it at least beat GPT-3.5?
If you think about it, I know very few chess players who could play a game without seeing the board after 8-10 moves. Would you? I wouldn't at all, but I would never make the same mistakes you demonstrated if I could see the board.
Yeah, I probably wouldn't be able to remember the board state after a few moves of blindfolded chess (unless the previous moves were all book moves). I wonder how the bot would fare if there were a second agent that summarized the board state and included that into the context.
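That second-agent idea could be prototyped without a chess library at all: keep the position in a simple data structure, and render it as text to inject into the agent's context each turn. Below is a minimal sketch under my own assumptions (moves given in coordinate form like "e2e4", no legality checking, no castling/en passant/captures tracking beyond overwriting the target square).

```python
def start_position():
    """Board as a dict mapping squares like 'e2' to piece letters
    (uppercase = white, lowercase = black)."""
    board = {}
    back_rank = "RNBQKBNR"
    for i, file in enumerate("abcdefgh"):
        board[file + "1"] = back_rank[i]        # white pieces
        board[file + "2"] = "P"                 # white pawns
        board[file + "7"] = "p"                 # black pawns
        board[file + "8"] = back_rank[i].lower()  # black pieces
    return board

def apply_moves(board, moves):
    """Apply moves in coordinate form, e.g. 'e2e4'. No validation."""
    for mv in moves:
        src, dst = mv[:2], mv[2:]
        board[dst] = board.pop(src)
    return board

def summarize(board):
    """Render the board as ASCII text suitable for an LLM context."""
    rows = []
    for rank in "87654321":
        rows.append(" ".join(board.get(f + rank, ".") for f in "abcdefgh"))
    return "\n".join(rows)

board = apply_moves(start_position(), ["e2e4", "e7e5", "g1f3"])
print(summarize(board))
```

Whether a fresh textual snapshot each turn actually helps the model more than the raw move history is exactly the open question; this just shows the plumbing is cheap to try.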
@@Data-Centric fair enough. I think I understand your argument in this video. However, are most agent tasks like chess, or are they like Minecraft? Remember how they got GPT-4 to learn how to play it by getting it to make its own tools and commands it could recall, which seemed to work? Maybe agents are more like that, since it seemed able to manage Minecraft, or perhaps somewhere in between Minecraft and chess.
Cool video, especially as someone who really enjoys chess. Obviously, chess is not an LLM's strong suit, but I was surprised just how poorly multiple agents did.
Won't the LLM's explanation just be pure hallucination to justify whatever move was played? Shouldn't it reanalyse the board and its plan to make it useful?
The end of the video convinced me that it would not work, because we would just be emulating a pseudo-search that could never compare with something like Monte Carlo tree search. But it was mostly a way to think about what could trigger hallucinations or not.
I love your content, and this video is no exception. That said, I think you are drawing overly broad conclusions about an LLM's ability to reason in the face of new circumstances/material (versus merely parrot back aspects of its training data) based on the very specific type of "reasoning" required for chess. There are lots of types of reasoning that LLMs are terrible at. Chess requires a very specific type of thinking/planning that an autoregressive model is simply not well equipped to do: namely, it must not only identify what seem to be the most promising possible next moves based on the current state and on what the model already knows (its training data, which informs its 'intuition'), but it must then explore all the possibilities from that hypothetical state, then repeat the same exercise with another potential state. This is a highly systematic type of exploration that algorithms like MCTS are designed to perform and autoregressive GPTs are not. With an infinite context window and infinite max_tokens, the model could perhaps talk through the possibilities, but that's not how people do it, and it would be hopelessly inefficient. People visualize the configurations to think through the implications; they don't verbalize it. More fundamentally, the addition of methodical, MCTS-like systematic exploratory thinking would address a big deficit that LLMs have. But this is only one form of reasoning. I don't think we can generalize from this that LLMs don't reason.
Thank you for your feedback. I found your thoughts engaging and I broadly agree with you. My aim with this video was to demonstrate how LLM capabilities break down when asked to reason. I believe that what LLMs currently do is not reasoning at all, though I admit I've used that word to describe agent behaviour (for convenience's sake). I chose chess specifically because I believe it's a good way to visualise this concept. The chess boards displayed alongside the agent's "reasoning" trace demonstrate this quite well. The game complexity of chess is so vast that we know many chess scenarios simply don't exist in the training data. If LLMs truly "understood" the chess scenarios they had been trained on, that understanding could be transferred to new board states. LLMs attempt this by predicting the next token based on what they've already encountered; as you quite rightly pointed out, this next-token prediction isn't sufficient to play chess competently. I find your point about infinite context interesting, but I still believe the model wouldn't "know" the best move to make even if it could walk through all chess scenarios from a given board state. Generating a set of possible moves is obviously within an LLM's capabilities, but knowing which is the best of that set would require an understanding of how each move brings you closer to the goal of checkmate. This isn't something that autoregressive next-token prediction is well-suited for. Then again, if all possible outcomes were in the training data, it could predict the best move, but this still isn't reasoning. Or is it?
This is a good demonstration of how not to use agents. As there is a practically infinite number of chess positions at any point, aren't we just asking the LLM for a random next move? Although LLMs can't do random; they should just return the closest similar example from their training data.
I think AI agents, like human coders, should write a test for the solution before generating it. They can test a solution using: a calculator, writing and running code, a custom function tool (e.g. "is this a valid chess move"), local RAG, web search from a quality source, simulation, Monte Carlo tree search (for chess, etc.), subdividing and testing, testing with a different LLM, or human verification.
Interesting solution regarding your chess approach; however, one might say there's no use for the LLM there at all, because the algorithm is doing 99% of the chess. I assume by "valid chess move" you mean "good" (correct me if I'm wrong). I think in this case, the LLM still wouldn't know what a valid chess move is.
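For what it's worth, a "is this a valid chess move" tool in the weaker sense (well-formed, not necessarily legal) is cheap to build with a regular expression over standard algebraic notation. This is only a sketch of the validation-tool pattern from the comment above: it checks syntax only, and full legality in a given position would require a real chess library (e.g. python-chess) or an engine.

```python
import re

# Matches basic SAN syntax: pawn moves (e4, exd5, e8=Q), piece moves
# (Nf3, Rxe1), castling (O-O, O-O-O), with optional check/mate suffix.
# Disambiguation squares (Nbd2, R1e2) are deliberately omitted to keep
# the sketch short. Syntax only; legality needs a real engine.
SAN_PATTERN = re.compile(
    r"^(O-O(-O)?"                      # castling
    r"|[KQRBN]x?[a-h][1-8]"            # piece move, optional capture
    r"|[a-h](x[a-h])?[1-8](=[QRBN])?"  # pawn move/capture, promotion
    r")[+#]?$"
)

def is_plausible_move(move: str) -> bool:
    """Tool an agent could call before committing to a move."""
    return bool(SAN_PATTERN.match(move))

print(is_plausible_move("Nf3"))   # True
print(is_plausible_move("exd5"))  # True
print(is_plausible_move("Zz9"))   # False
```

Even a syntax-only gate like this catches one whole class of agent hallucinations (malformed moves) before they reach the board.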
Your video is truly shocking. I never would have imagined a major LLM could so quickly make such trivial and direct reasoning errors worthy of a near-beginner. I actually think you just provided a clear demonstration that there is barely an ounce of general reasoning in an LLM. We think there is because the language is logical and our prompts are recurring, but this is wrong. In fact, it doesn't seem capable of isolating key pieces in a position and analyzing the impact of their movement. As soon as the game develops a little, it no longer understands anything. No chess player analyzes the potential movement of all pieces on the board. We know in a few seconds how to identify the main threats or opportunities, and we figure out the few resulting options. Maybe training the model on good moves/bad moves starting from a random position would help it isolate key pieces, but I'm not even sure about that.
It's like using a wrench to write a book. Makes no sense. Now ask Stockfish to produce a financial report from data you provide, then compare that with LLMs.