LLMs are always going to be fundamentally flawed because they're created by using a compute-heavy algorithm to train a network on a static dataset. They don't learn while inferring (i.e., learn from experience). While recycling their output back in as input makes it appear that they're capable of one-shot learning, they're not learning anything: the model is a fixed crystallization of knowledge gleaned from text on the internet, and varying your prompt is akin to shining a laser through this knowledge crystal at different angles to get it to shine something different out. I want a crystal that is changed by the lasers shining through it, so that what passes through now affects what it outputs in the future, just like a brain that learns from experience. Massive backprop-trained networks will one day be regarded as outdated, old-fashioned antiques once someone creates a proper dynamic online learning algorithm that is effectively digital cognition/consciousness. Nobody has created it yet, but after 20 years of staying on top of the neuroscience and artificial intelligence research coming down the pipe, I know we're close, and it's not going to involve backprop-training massive deep networks on static datasets.
So, the way I've understood grokking: when you train an AI model, you also have a regularization step (weight decay), which shrinks weights toward zero. By training long enough to grok, you're giving that regularization step a LOT of opportunities to prune weights and patterns that aren't contributing to model performance. Because, remember, the first thing we do with an AI model is initialize all the weights to random values, so there will be a lot of patterns that don't actually mean anything but happen to score well enough on the test output not to be overwritten by normal gradient updates. The Grokfast paper seems to imply this explanation is totally wrong and that grokking is just a fundamental property of gradient-descent backprop. Or is regularization so commonplace that it's simply assumed and nobody calls it out?
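A toy sketch of the regularization effect this comment describes: under plain SGD with L2 weight decay, any randomly initialized weight that receives no useful task gradient just decays toward zero over many extra steps. The hyperparameters here are hypothetical, not from any paper.

```python
import numpy as np

# Toy illustration: weights that get no task gradient are pruned by weight decay.
rng = np.random.default_rng(0)
w = rng.normal(size=5)            # randomly initialized weights
grad = np.zeros(5)                # pretend these weights don't affect the loss
lr, weight_decay = 0.1, 0.01      # hypothetical hyperparameters

for _ in range(10_000):           # many extra training steps, as in grokking
    w -= lr * (grad + weight_decay * w)   # SGD step with L2 weight decay

print(np.abs(w).max())            # all "useless" weights have shrunk toward zero
```

Each step multiplies a useless weight by (1 - lr * weight_decay), so over thousands of steps it is driven essentially to zero, which is the pruning opportunity described above.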
We need a foundation model that has been trained until Grokking a children's primary school syllabus before it ever sees an equation or chemical formula.
Wait, if I'm understanding Grokfast correctly, they analyze the rate of change of the weights in the frequency domain (via Fourier transform) and then amplify its slow components? That's insane; it has way more use cases for other neural network architectures beyond just transformers.
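For what it's worth, the filtering step itself is simple. Here is a minimal sketch of an EMA-style (low-pass) gradient filter in the spirit of Grokfast, applied to a toy one-dimensional quadratic loss rather than the paper's actual setup; `alpha`, `lam`, and `lr` are hypothetical hyperparameters.

```python
# Minimal sketch of Grokfast-style EMA gradient filtering: keep a low-pass
# (exponential moving average) copy of the gradient and add it back, which
# amplifies the slow-varying component of the updates.
def grokfast_ema_step(w, grad, ema, lr=0.01, alpha=0.98, lam=2.0):
    ema = alpha * ema + (1 - alpha) * grad   # low-pass filter over gradient history
    filtered = grad + lam * ema              # boost the slow component
    return w - lr * filtered, ema

# Toy quadratic loss L(w) = 0.5 * w**2, whose gradient is w itself.
w, ema = 5.0, 0.0
for _ in range(300):
    w, ema = grokfast_ema_step(w, w, ema)
print(w)  # has decayed toward the minimum at 0
```

Nothing in this sketch is transformer-specific, which is the commenter's point: the filter only needs a gradient stream, so it applies to any architecture trained by gradient descent.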
I struggle to comprehend how this process doesn't result in more hallucinations through syllogistic reasoning, given that the 'generalization' seems to be derived precisely from this inherent syllogism.
I love this because it's exactly the same as how human beings learn. Also very excited for that paper mentioned at the end. This type of generalization is a big smoking gun that will lead to AGI very quickly so speeding up the grok is incredibly hype news
I find that Alice-in-Wonderland-type responses can be significantly improved by system-prompting the model to build data structures from the known facts and then infer from that structure. Something like this (a minimal version):

```
You are tasked with solving complex relationship questions by first mapping all
known facts into a JSON structure and then using this structure to infer answers.
When given a question, follow these steps:
1. Extract all given facts.
2. Create a JSON structure to represent these facts.
3. Use the JSON structure to navigate and infer answers.
4. Provide clear and logically consistent responses based on the JSON structure.
```

I used this technique very successfully when working on gossip analysis (determining the source of gossip) and quickly realized its benefits in other logical fields.
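To show the idea is sound even outside an LLM, here is the same "facts → JSON → traversal" pipeline done by hand in Python, on a made-up family example of the Alice-in-Wonderland kind (all names and facts are hypothetical, standard library only):

```python
import json

# Step 2: all given facts mapped into a JSON structure.
facts = json.loads("""
{
  "alice": {"gender": "female", "siblings": ["bob", "carol", "dave"]},
  "bob":   {"gender": "male"},
  "carol": {"gender": "female"},
  "dave":  {"gender": "male"}
}
""")

# Steps 3-4: navigate the structure instead of free-associating.
# Question: how many sisters does Alice's brother Bob have?
alices_siblings = facts["alice"]["siblings"]
bobs_sisters = [s for s in alices_siblings if facts[s]["gender"] == "female"]
bobs_sisters.append("alice")   # Alice herself is also Bob's sister
print(len(bobs_sisters))       # 2
```

The trap in these questions is exactly the last line: the answer is the sister count plus Alice herself, which the explicit traversal makes hard to miss.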
I am not surprised by the failure of LLMs to do basic reasoning on problems that involve numbers; it is already known that language models don't understand basic math. The most successful strategy is, instead of asking the LLM to solve the problem directly, to have it translate the problem into a more explicit formal definition. That's how Google managed to solve some mathematical olympiad questions, by translating them into Lean, with the advantage that you can verify the answer automatically and reject unverifiable proofs. Another alternative is asking the model to solve the problem using a programming language; since the Python dataset is larger than the Lean dataset, it is easier to train a model or use a pretrained one.
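A tiny illustration of this "translate, execute, verify" strategy on a made-up toy problem: instead of trusting the model's arithmetic, have it emit a snippet like the one below, whose answer can then be checked mechanically.

```python
# Hypothetical toy problem: find the smallest positive integer that is
# divisible by 6 and by 10 but NOT divisible by 4.
# The search itself doubles as the verification: every condition is
# checked explicitly rather than "reasoned about" in natural language.
candidate = next(
    n for n in range(1, 1000)
    if n % 6 == 0 and n % 10 == 0 and n % 4 != 0
)
print(candidate)  # 30
```

This is the weak, general-purpose version of what Lean gives you rigorously: the answer comes with an executable check, so a wrong translation fails loudly instead of producing a confident hallucination.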
I wonder how long it will take before the pattern-recognition-and-replication machines are thought of as pattern-recognition-and-replication machines instead of magic. Magic certainly sells better, I guess, though it does come with a steep crash down the road.
Awesome video all in all, I have wanted to know a recap of what's happening in the last few months in text to video and this answered everything I had in mind! :) Thanks. ❤️
1. **Current State of LM Benchmarks** *(0:00:00)*
2. **Benchmark Performance Issues** *(0:00:03)*
3. **Implications of Reordering Questions** *(0:00:20)*
4. **Alice in Wonderland Paper Findings** *(0:01:57)*
5. **The Concept of Grokking** *(0:04:27)*
6. **Grokking vs. Double Descent** *(0:06:06)*
7. **Grokking in Transformers and New Research** *(0:08:12)*
8. **Potential Solutions for Improved Reasoning** *(0:09:02)*
9. **Grokfast Implementation** *(0:11:28)*

### Ads
1. **Hub SWAT AI Resources** *(0:01:13)*

### Funny Jokes
1. "Absolute dog water" *(0:00:03)*
2. "Kind of crazy from a more cynical and critical perspective" *(0:00:33)*
3. "Can you imagine an AI being able to do this? Only humans would be able to come up with something this random and absurdly funny" *(0:03:13)*
4. "If an AI can truly do this, it actually might be so over for us, so for the sake of burning down rainforests..." *(0:03:23)*
5. "Elon's Grok LLM is probably named after the book and not related to the ML concept that we are talking about today" *(0:05:43)*
6. "Mr. Zuck saying that Llama 3 70B never stopped learning even after they trained it three or four times past the Chinchilla optimum is not copium" *(0:10:03)*
First, we need to realize that there is no such thing as general intelligence. Second, these are language models: they represent language-based features with no notion of time or steps. Only then will we be able to mimic the human brain. Step-by-step reasoning is not a thing that actually happens. It is an interpretation of some people's externalization of their reconstruction of how they came to an answer. It's all post-hoc rationalization that doesn't actually tell you anything about the underlying system. Most of what is happening is simply a very complex combination of representations that are filtered through other representations and iteratively refined until it fits with an internal world model of all representations, without violating any of the ones it pertains to.
Nice video, well done! But wasn't the grokking paper about specific puzzles? In your statements it sounds like grokking could work magically for any task. Maybe I'm wrong, but I thought it was for a very specific subset of tasks, like addition and subtraction, where the weights can suddenly "click" at some point and just get the whole task right. This would never happen for a general-use LLM, right?
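For reference, the original grokking setups really were that narrow: small algorithmic datasets such as modular arithmetic, where the entire input space can be enumerated and split between train and test. A sketch of that kind of dataset (the modulus and the 50/50 split here are illustrative choices, not the paper's exact configuration):

```python
import itertools
import random

# Enumerate every (a, b) pair for addition modulo a small prime,
# then split the complete input space into train and held-out halves.
p = 97
pairs = list(itertools.product(range(p), repeat=2))
random.seed(0)
random.shuffle(pairs)

split = len(pairs) // 2
train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
test  = [((a, b), (a + b) % p) for a, b in pairs[split:]]
print(len(train), len(test))  # 4704 4705
```

A general-purpose LLM corpus has no such finite, fully enumerable input space, which is exactly why it's an open question how far the "click" generalizes, as the comment suggests.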
You don't know how right you are 😂 Grokking will be a super important step toward AGI. Essentially, you're training a model on data so much that it practically becomes an expert at it. At some point we will have the quantity of compute necessary to achieve this, and at that point we might as well take the brute-force road. Naturally, algorithmic breakthroughs are also incredibly important and essential to the improvement of LLMs, since they allow us to do more with less.
In the loss landscape, some minima are better than others. These minima are flat, i.e., the Hessian has (near-)zero eigenvalues along some directions. Due to the stochastic nature of the SGD algorithm, the gradient trajectory tends to jump out of sharp minima and get stuck in flat minima, resulting in a simpler parametrization of the function. Flat minima only lead to slightly better generalization and will certainly not lead to AGI. The fact that some ML researchers had to invent the term "grokking" only shows their lack of understanding of basic multivariable calculus.
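To make the flat-minimum claim concrete, here is a toy loss with an entire line of minima; a finite-difference Hessian computed at one of them has a zero eigenvalue along the flat direction and a positive one across the valley. This is a hand-built illustration, not anything from the video.

```python
import numpy as np

def loss(w):
    # Any point on the line w0 + w1 = 1 is a minimum: a flat valley.
    return (w[0] + w[1] - 1.0) ** 2

# Central finite-difference Hessian at one minimum on the flat line.
w = np.array([0.5, 0.5])
eps = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                   - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps ** 2)

# One eigenvalue is ~0 (the flat direction), the other is positive.
print(np.linalg.eigvalsh(H))
```

The zero eigenvalue is the "flatness" in the comment's sense: moving along that eigenvector leaves the loss unchanged, so the parametrization is effectively simpler than the nominal two degrees of freedom.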
How far are we from just having a virtual block of solid computronium with inference result simply being the exit points of virtual Lichtenberg figures forming thru it, with most of the volume of the block remaining dark?
As folks get past the initial "wow" moment of this new tech, we begin to see through the veil. An LLM is only as smart as its training data. It predicts tokens. It's a parlour trick, an amazingly powerful illusion of intelligence. People can already spot an AI piece of art from a mile off, and the same goes for text. It all smells of LLM, and the more we use LLMs for brainstorming, especially the heavily guardrailed models, the less interesting they become. Then there is the reliability problem: you can't really trust what it says the way you can with traditional search, which references actual human-produced content. Idk, there is value in LLMs, significant value, but we are sobering up and levelling our expectations rapidly.
By cheating like this, companies are only shooting themselves in the foot: once things get real and we move closer to AGI, they will be left behind with this approach.
I'm thinking: if you imagine a model divided into two halves, where one is the generalization part and the other is the overfitting part, it's still most beneficial to have the generalization half get as close to the right answer as possible, so as to lighten the load on the overfitting half. Put another way, you should devote as many parameters as you can to memorizing the corrections to the worst answers, and you can do that by minimizing the number of parameters needed to get to a generally fairly close answer.
Grokking is akin to the way language keeps evolving even after a population has become fully literate. Every once in a while, someone figures out a new connection between seemingly unrelated concepts, or uses a word in a new context by mistake or because they forgot the intended word, etc. This continuous increase in information entropy even after exhausting the parameter space reminds me a lot of what some scientists say about information in degenerate-era black holes.