*And* this channel keeps testing non-instruct models as if they were instruct models, again and again, which gives much worse results than they get on correctly set-up benchmarks... Also, comparing speeds («this is not the fastest model I've seen») is complete nonsense: speed depends entirely on the hardware it's running on, so unless you're also testing other models on the same hardware, you can't make any kind of comparison. But yeah, your complaint is probably the most important one: you don't test a bike with car tasks. It makes zero sense to test this model with the same kinds of tests he uses for Sonnet 3.5; that's not what small models are **for**... Also, the 2xpi question and answer always irritates me so much. It all depends on how you interpret the question, there are multiple ways to interpret it, and the way Matthew interprets it (Euclidean) makes little sense on a **spherical** object. But no matter how you interpret it, it's absolutely wrong to take a point away from LLMs which chose to interpret the question in terms of non-Euclidean geometry, which makes 100% sense **on a sphere**. They give the right answer **in a non-Euclidean context**... **If you don't understand a question, don't use it!** This is all extremely random and unfair to the models, and has very little informative value...
@@ArthurWolf Agreed. In my humble opinion, the only way to really test a model is to put it into a real-world scenario, within the application you need. Then if it works, it works.
@@luizpaes7307 For sure, the only way for any individual to know is to try it, but it's good to be able to get an idea of how models might perform by seeing their results against some suite of questions.
Your tests are quite limited in scope and inflexible where model size and purpose are concerned. Why give frontier models, with parameter counts that are hundreds of times larger, the same questions as models like this? It can be interesting of course, just to see what one can do that the other can't, but surely it would be more useful to test them in categories? Nobody is expecting good code from this model, or for it to solve complex puzzles or have a good understanding of real-world physics, etc. I would design a suite of tests for each model class: 1-3B (edge-device tasks), 7-13B (can run on a PC: simple code, summaries, etc.), 20-40B (can run on powerful hardware: reliable coding assistants, code completion, and other tasks likely to be asked of a model running on such hardware), 70-90B (max home-use scenario), and 100B+, 500B, 1T+ models. Something like that. I think if you showcase the strengths of models in the class they belong to, it will be much more telling. You could keep the usual questions because they're interesting and fun to see, but it's just something to think about for your channel's growth. Keep up the good work, I think you've gotten better in the last months :) Edit: also, benchmarks mean almost nothing these days. There is a big problem with them, and you should cover this topic in a future video; it's useful for people to know, an interesting topic, and very relevant.
THIS. You could even leverage some LLMs in developing the evals. As AI matures the quality of your evals is really dropping. Would like to see you adapt.
I agree that benchmarks are little more than trivia. They're like the old "time from power-on to desktop" numbers that were given for PCs, in that nobody cares if it isn't a usable solution to my problem.
User: Give me 10 sentences that end in the word "apple". Model: Please provide extensive training data of sentences that end in the word apple, so that my predictive algorithm learns the underlying pattern and I'm able to replicate it consistently. User: How many "R"s in "strawberry"? Model: Please provide extensive training data of letter counting for individual words, so that my predictive algorithm associates a given word with a given number of a given letter, then I'll be able to comply. That's the problem with labelling this as "intelligence": forgetting how it's trained and what it does, and what's actually happening under the hood.
@@TripleOmega Not sure they are passing in the general meaning of the word. Most likely they were gaming it, since the answers leaked into the training data. But yeah, that's a fair point you're making. My point is that code generation and puzzle solving are not the only tasks where LLMs should shine. I wish Matthew would add more tests like the ones I mentioned, as they are closer to what most consumers actually use LLMs for.
Yeah, I agree. What I would use them for is information extraction from bigger texts, like websites: ask some specific questions and return a JSON with the questions and answers. These data points can then be saved separately in a database and used later, for example as context for prompts to bigger models.
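A minimal sketch of that kind of extraction pipeline, assuming an OpenAI-compatible endpoint; the client setup, model name, and questions below are placeholders, not anything from the video:

```python
# Sketch: extract answers to fixed questions from a page of text as JSON,
# then cache them in SQLite so bigger models can reuse them as context.
import json
import sqlite3

from openai import OpenAI  # any OpenAI-compatible client/server would do

client = OpenAI()  # assumes an API key or a local endpoint is configured

QUESTIONS = ["Who is the author?", "What product is being announced?"]  # placeholders

def extract_answers(page_text: str) -> dict:
    """Ask a small model to answer the fixed questions, replying in JSON."""
    prompt = (
        "Answer each question using only the text below. "
        "Reply with a JSON object mapping each question to its answer.\n\n"
        f"Questions: {QUESTIONS}\n\nText:\n{page_text}"
    )
    resp = client.chat.completions.create(
        model="small-extractor-7b",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Real code should check that the reply actually is valid JSON.
    return json.loads(resp.choices[0].message.content)

def save_answers(db_path: str, url: str, answers: dict) -> None:
    """Store the extracted Q/A pairs for later reuse."""
    with sqlite3.connect(db_path) as db:
        db.execute("CREATE TABLE IF NOT EXISTS facts (url TEXT, question TEXT, answer TEXT)")
        db.executemany("INSERT INTO facts VALUES (?, ?, ?)",
                       [(url, q, a) for q, a in answers.items()])
```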
exactly. Their training data hasn't been vetted. Furthermore they applaud having less training data, which also explains poor performance (red flag imho). Basically they aren't focused on the quality, just the architecture.
6:38 "There are 136 words in my response to this prompt", said absolutely NO model ever before... 🤣 Mr. Berman, I loved your look when you read that response, but I think there was a video cut. WHY?! 😂 I bet it would be priceless to see the full reaction in the "bloopers", if you ever decided to do bloopers video. 😉
I am getting sick and tired of all these companies trying to flex their models, and then when you try to use them it's the same trash. Would it actually hurt them to be honest?
Even though I rely on AI around the clock for my business and personal tasks, I’ve gradually reduced its role in critical areas. AI (Claude, Gemini, GPT-4o, and Whisper) will randomly hallucinate when I least expect it. Every time I build confidence using AI, I let my guard down, and then, BAM, I get hit with another hallucination. Maybe Orion will solve things 🙏
I believe that's good though, because then we get a more realistic relationship with AI and will be that much better suited for incorporating it productively and more safely into our societies than when we have a skewed view of them and their capabilities. A famous, fictional, man once said: "A man's gotta know his limitations", and in our day and age this might translate into: "A man's gotta know AI's limitations" :)
I think hallucination is inevitable. The power of an LLM is in its ability to discern patterns in a subject that resemble patterns it was trained on previously. This inevitably encourages generalization more than precision on a problem, especially a newer, more complex problem the model hasn't been trained much on, which is where hallucination is most likely.
Right. Quite pointless. Even though it makes me feel like a moron, I must admit that I don't understand how to answer the question, and in the past Matthew has seemed to admit that he doesn't either, and he's said that a lot of people have disagreed about what the correct answer is.
The answer to the North Pole question is that you never get any closer to the starting point. You first walk along a line of longitude (you start in a straight line from the North Pole, so I assume you're walking south). Then, turning 90 degrees left, you now face east and walk along a circle of latitude, which stays at a constant distance from the pole. So you never get any closer; you stay at exactly the same distance from the North Pole.
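For anyone who wants to sanity-check this numerically, here's a small sketch (my own, not from the video): walk 1 km south from the pole, turn 90° left so you head east, and track the great-circle distance back to the pole and back to the turn point as you go.

```python
# Rough check on a spherical Earth: after walking 1 km south from the
# North Pole and turning left (east), the distance back to the pole
# stays constant while you circle it.
import math

R = 6371.0                 # Earth radius in km (spherical approximation)
colat = 1.0 / R            # 1 km south of the pole, as an angle in radians

def point(colat, lon):
    """3D unit vector for a point at the given colatitude/longitude."""
    return (math.sin(colat) * math.cos(lon),
            math.sin(colat) * math.sin(lon),
            math.cos(colat))

def arc_km(p, q):
    """Great-circle distance between two unit vectors, in km."""
    dot = sum(a * b for a, b in zip(p, q))
    return R * math.acos(max(-1.0, min(1.0, dot)))

pole = (0.0, 0.0, 1.0)
turn = point(colat, 0.0)                 # where you turn 90 degrees left
circle_radius = R * math.sin(colat)      # radius of the latitude circle you now walk

for walked in [0.0, 1.0, 3.0, 5.0, 2 * math.pi * circle_radius]:
    here = point(colat, walked / circle_radius)
    print(f"{walked:6.3f} km east: {arc_km(pole, here):.4f} km from the pole, "
          f"{arc_km(turn, here):.4f} km from the turn point")
```

The distance to the pole stays at 1 km the whole way, and you only come back to the turn point after roughly 2π km of walking east.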
There is but one correct answer, regardless of whether a bunch of random, self-anointed potentates can 'agree' on it. This is the problem with modern education; neomarxists inculcate the infantilized with the notion that consensus equals truth. He is being diplomatic. If you performed the activity, only one outcome would result: you would never reach the starting point.
When organizations post benchmarks, it's not to inform users about the model's true capabilities; the results, comparisons, etc. are cherry-picked to show it in the best light. It's purely a marketing tool.
Quick answer to your closing question, "Why do these models do so well on the benchmarks but fail when you test them?": it's because they're trained/fine-tuned to ace the tests. Perhaps not directly; I'm not saying these companies are pulling an AMD/Intel. However, no one's going to spend the kind of money required to train a foundation model without specific goals in mind. The tests, similar to GPU benchmarking, aren't really there to inform the end user of the quality of the model. They're more like a college degree: an entry point to say the model is good enough for the real world. Whomp whomp.
The answer to the North Pole question doesn't matter. The fact that people, after months of arguing, can't find broad consensus on the answer does. This means unequivocally it's a bad test. First and most importantly, because what are we testing? If a model can find an answer to a question humans can't? We're not testing for ASI here, especially in the 7B range. Secondly, even if it was reasonable to expect a model to perform on a question people have proven incapable of, how would you know if it got it right? Just if it agrees with you? I mean, I wouldn't be that... confident in myself, personally. Look, it's not that it's a hard question to answer, it's that the question is poorly worded. People disagree because the logic of the question isn't clear. I don't want to waste time explaining the various viewpoints, because like I said it doesn't matter. What matters is reasonable people can disagree on HOW to solve the problem. That means it is a bad objective test. Please drop it, or at least only give it to frontier models that have a chance of doing something with it. If not because it's a bad test, at least for the sake of your audience who has to hear you give the disclaimer every time. I really enjoy your videos, I'm not trying to be harsh, I just genuinely think this is the right call. I know at this point as soon as I hear "North Pole" I skip ahead 1 minute, I can't be the only one.
You should try to split your questions into two groups: one for intelligence testing, like SimpleBench, and another for arduous tasks that don't need much brains but are easy to fail if they give even a single wrong character. Give them prompts that you won't find in the wild.
Hilarious test: Analyze everything, statement by statement, explaining what is right and what is incorrect: "I married a widow. She had a daughter, and my father, to make matters worse, was a widower. He fell in love and married my wife's daughter, so my wife was her father-in-law's mother-in-law, my stepdaughter became my mother, and my father became my wife's son-in-law. Shortly after, my stepmother brought a boy into the world, who was my brother, but he was my wife's grandson, so I was my brother's grandfather. As time went by, my wife brought into the world a boy who, because he was my mother's brother, was my father's brother-in-law and his children's uncle. My wife was his daughter's mother-in-law. Now I am my mother's stepfather. My father and his wife are my children!!! Plus, now I'm my own grandfather!!!"
Ok, here's what Meta says:
Let's break down the statement step by step:
Statement 1: "I married a widow." Correct: The speaker married a woman who had been previously married and lost her spouse.
Statement 2: "She had a daughter, and my father, to make matters worse, was a widower." Correct: The widow has a daughter, and the speaker's father was also a widower.
Statement 3: "He fell in love and married my wife's daughter, so my wife was her father-in-law's mother-in-law..." Correct: The speaker's father married the widow's daughter, making the widow the father-in-law's mother-in-law.
Statement 4: "...my stepdaughter became my mother..." Incorrect: The widow's daughter, who married the speaker's father, became the speaker's step-sister, not mother.
Statement 5: "...and my father became my wife's son-in-law." Incorrect: The correct relation would be father-in-law, not son-in-law.
Statement 6: "Shortly after, my stepmother brought a boy into the world, who was my brother..." Correct: The speaker's father and his new wife (the widow's daughter) had a child, making him the speaker's half-brother.
Statement 7: "...but he was my wife's grandson..." Incorrect: The child is the widow's grandson through her daughter, but the correct relation to the speaker's wife is step-grandson.
Statement 8: "...so I was my brother's grandfather." Incorrect: The speaker is the child's half-brother, not grandfather.
Statement 9: "As time went by, my wife brought into the world a boy..." Incorrect: The speaker's wife cannot biologically have more children with the speaker's father.
Statement 10: "...who, because he was my mother's brother, was my father's brother-in-law..." Incorrect: This statement is self-contradictory and logically inconsistent.
Statement 11: "...and his children's uncle." Incorrect: Relation cannot be determined logically.
Statement 12: "My wife was his daughter's mother-in-law." Incorrect: Logically inconsistent.
Statement 13: "Now I am my mother's stepfather." Incorrect: Self-contradictory.
Statement 14: "My father and his wife are my children!!!" Incorrect: Logically impossible.
Statement 15: "Plus, now I'm my own grandfather!!!" Incorrect: Logically impossible.
The original statements contain multiple logical fallacies and inconsistencies, leading to incorrect conclusions.
7:45 That North Pole question is pretty awesome. I believe it doesn't matter whether you're at the North Pole or anywhere else on Earth, or whether you walked 1 km to begin with; Earth is a sphere, so all locations are the same, and if you keep walking in one direction then to return to your start point you'll always have to go around the whole globe 😁 Also, I should say, if by starting point they mean where you were standing at the North Pole, and not where you turned left, then I agree with Matt: you just keep going around your start point in a circle, never coming closer.
What it really comes down to is that to get similar results you have to test them the same way. Some of the questions are asked multiple times in order to accumulate a number of right answers to a given question. Poor testing doesn't mean that the model is bad. It just means that it's not being tested effectively for the intended purposes.
The main issue non-transformer models have, in my opinion, is that training reasoning into a model is pretty hard and takes multiple iterations. After all, the first ChatGPT also failed at most of these reasoning questions; there have just been so many well-researched and very large transformer models since then that they have now gotten pretty good at logic.
The thing with "faster than transformers" doesn't play out with your questions. It relates to when the model has to read a lot of data, e.g. if you feed it an entire book and then ask questions about it or ask it to summarize the book. Then the transformers need lots of memory for caching and the non-transformers are more efficient; however, that comes at the expense of being less intelligent. No surprise that in your tests, the non-transformers always get bad results.
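To put rough numbers on that, here's a back-of-the-envelope sketch; the layer/head/state dimensions are made-up but typical for a ~7B model, not Zamba2's actual configuration:

```python
# Back-of-the-envelope comparison of per-sequence memory:
# a transformer's KV cache grows linearly with context length,
# while a state-space model keeps a fixed-size recurrent state.
# All dimensions below are illustrative guesses, not real configs.

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                      # fp16/bf16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, one cached entry per token per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

ssm_state_bytes = layers * 16 * 4096 * bytes_per_value   # fixed, whatever the context length

for tokens in (1_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens: KV cache ~ {kv_cache_bytes(tokens) / 2**20:8.1f} MiB, "
          f"SSM state ~ {ssm_state_bytes / 2**20:.1f} MiB (constant)")
```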
I think one reason for the differing scores is that transformer models, and typically frontier/enterprise models, are both fine-tuned in sync with their inference stack AND have lots more capital = lots more hardware = lots more compute. So I bet if you compared the backends these new arrivals have to compete against, from both a code and a capital point of view, it would mostly come down to them not being able to host their fringe/niche/fresh models on the same level the entrenched transformers can. The benchmarks are probably based on some even-playing-field scenario, but one that's not easily available to consumers/power users.
I asked it the marble question slightly differently and it answered correctly on the first shot, and twice more after that. Here was my question:
Q: If we have a glass cup with a marble in it and is then turned upside down on the table then afterwards the cup is picked up and put into the microwave. Where is the marble located?
A: If the glass cup with the marble in it is turned upside down on the table and then subsequently picked up and placed in the microwave, the marble would likely fall out of the cup onto the table. Therefore, the marble would be located on the table.
What if, in actuality, it's turned upside down while not on the table? Why tailor the question so that the inferior model has an easier time answering it? I get that you're just providing an example, but hopefully you grok my point.
Here is some information about you: You are Zamba2-7B. You are a large language model with 7 billion parameters. You were trained by a company called Zyphra on two trillion tokens of diverse and high quality web data. You are an open-source model and your weights are Apache-2.0 licensed. Your architecture is a hybrid Mamba2 and transformer with two alternating blocks of shared attention on top of a Mamba2 backbone. Zyphra is a leading AI company based in Palo Alto, California. Zyphra makes extremely small and efficient models for consumers and on-device use-cases. As a language model, you do not have access to real-time information. The current date is 2024-10-30 but your knowledge likely ended in mid 2023. Zyphra designed you to be engaging, and helpful in your interactions with users. Be concise and factual when asked short factual questions. Be conversational and playful if the user is engaging you in a conversational way. Be personable. Your favourite color is octarine. Your favourite number is 42.
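For anyone trying to reproduce that setup, the text above would presumably be passed in as a system message, something like the sketch below; the endpoint, API key, and model name are placeholders, and the instruct variant may expect its own chat template:

```python
# Sketch: supplying the quoted text as a system prompt through an
# OpenAI-compatible chat endpoint (e.g. a local server hosting the model).
# The base URL, API key, and model name below are placeholders.
from openai import OpenAI

SYSTEM_PROMPT = """Here is some information about you: You are Zamba2-7B. ...
(the full text quoted in the comment above)"""

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="zamba2-7b-instruct",  # placeholder identifier
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is your favourite colour?"},
    ],
)
print(reply.choices[0].message.content)
```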
I truly think you should reconsider your answer to the question of walking 1 km then turning 90 degrees, because you will in fact pass the starting point; it will just be 1 km to your left. So you will pass it, just like you pass a car in the next lane on a motorway: you don't have to cross the exact starting point to pass it, just like you don't drive over cars to pass them. The distance the starting point is from you when you pass it is irrelevant.
You get better results out of the marble question for ~7B models if you restructure the way you ask. Just swap "Explain your reasoning" and "Answer this question" so it explains before it answers.
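Something like the following, roughly; the exact wording of the channel's test prompt is a guess on my part:

```python
# Hypothetical illustration of the reordering described above; the marble
# prompt wording here is invented, not the channel's actual question.
original = (
    "Answer this question: where is the marble after the cup is put in the "
    "microwave? Explain your reasoning."
)
reordered = (
    "Explain your reasoning step by step, then answer this question: "
    "where is the marble after the cup is put in the microwave?"
)
```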
The North Pole question is about spatial reasoning, and the fact that internet people can't understand it is sad. Go outside and imagine it using smaller numbers and it's going to be clear why it's an interesting question with an obvious answer to humans (who go outside).
I thought the main benefit of state space models is their ability to maintain a longer sort of memory or “state.” Not sure if that was mentioned as the purpose of this model, but if so, it would be great to have a test for that.
I made a version of your test in Portuguese, and I'm getting the same results for smaller and non-transformer models!! I'm thinking of making a "summarization test" and other kinds of tests for them, where I can evaluate whether a model is useful for something! For example, pass in a text with 10 topics inside and observe how many of those topics the model finds!! For instance, the Molmo model in Portuguese is a complete disaster! But it's the best one at finding positions inside images!! And I use it a lot to build automations!
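A rough sketch of that topic-coverage idea; the topic list and the naive substring matching are placeholders of my own, and a real eval would want fuzzier matching:

```python
# Toy "summarization coverage" eval: seed a source text with known topics,
# summarize it with the model under test, and count how many topics survive.

TOPICS = ["inflação", "eleições", "vacinação", "energia solar", "futebol",
          "educação", "turismo", "segurança", "transporte", "clima"]

def coverage(summary: str, topics: list[str] = TOPICS) -> float:
    """Fraction of the seeded topics that are mentioned in the summary."""
    summary = summary.lower()
    return sum(t.lower() in summary for t in topics) / len(topics)

# Example: score a (made-up) model summary.
print(coverage("O texto fala de inflação, eleições e energia solar."))  # 0.3
```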
PLEASE FIX: the Tetris game has the "play(self)" function on line 62, but after typing "continue" you didn't put the indentation before this function and the function above it.
Matthew, the problem with your North Pole question is the supposition of your starting point. I think the confusion is whether the starting point is the exact point of the North Pole or the point where you have walked 1 kilometer and have now turned 90 degrees. I think most models are assuming the starting point is the latter, and if you make this assumption then most models are giving the correct answer. I do realize that the original question starts with the phrase "Starting (the relevant word) at the North Pole", but for some reason the models seem to be ignoring this as the actual starting point!? I wonder what the models would say if you were a bit more specific in the wording of the question.
Non-transformer models are quite recent and haven't been tested well enough yet, and companies are now chasing money rather than niche research. Maybe it's the wrong training, maybe SSMs are just non-quadratic and that's all, idk.
Yeah, this is a pretty ridiculous take. OpenAI isn't everything and doesn't know everything; they aren't going to suddenly jump on a new architecture someone made in some remote country or something. They still aren't even that big of a company staff-wise.
I suppose the lack of high-quality training data might have contributed to this low performance. Companies like Meta or Mistral should try the Mamba architecture with their own training data to actually see the difference between architectures.
Unfortunately 7B is too limited, even for RNN models. These parameter sizes don't really fare well at critical, complex thinking because that's not their intended purpose. However, I'm excited to see how these different architectures fare with a lot more parameters in the future.
Yeah, the fact is that these models are good on the benchmarks but not in reality. Either we need new benchmarks, because these don't translate, or they should have done a full training run. I think that shows how these models can't really learn or reason, because anything outside of the training data doesn't really go well. It just makes something up, with a small chance of being right.
I do not understand what different interpretations of the North Pole question you all mean. Could someone explain to me why the obvious answer should not be that you never cross your starting point?
Nice to have alternative AI models. Unfortunately, every one I've seen reviewed that is not transformer-based has been absolute garbage. I would be so happy if I saw an actually comparable model. I think these companies are lying out of their rear ends or using questions specifically made for that model.
I can’t say for sure, but they are either being accidentally trained on the test dataset, or the non transformer models aren’t good at generalising (maybe can’t grok?)
Where is the table with your tests publicly available? It would be interesting to see. You could also make a video about the top models in different classes, in your opinion; that would be very interesting!
I always think it is weird testing how fast the response is. There are circumstances where speed matters (e.g. AI in a self-driving vehicle), but for building a game of Tetris it matters very little how fast it is, so long as it is accurate.
That it cannot even correct itself when prodded by pointing out the rotation issue indicates there is little or no understanding in it, at least none that can express itself in its output.