*And* this channel keeps testing non-instruct models as if they were instruct models, again and again, which gives much worse results than they get on correctly set-up benchmarks... Also, comparing speeds («this is not the fastest model I've seen») is complete nonsense: speed depends entirely on the hardware it's running on, so unless you're also testing other models on the same hardware, you can't make any kind of comparison. But yeah, your complaint is probably the most important one: you don't test a bike with car tasks. It makes zero sense to test this model with the same kinds of tests he uses for Sonnet 3.5; that's not what small models are **for**... Also, the 2xpi question and answer always irritates me so much. It all depends on how you interpret the question, there are multiple ways to interpret it, and the way Matthew interprets it (Euclidean) makes little sense on a **spherical** object. But no matter how you interpret it, it's absolutely wrong to take a point away from LLMs which chose to interpret the question in terms of non-Euclidean geometry, which makes 100% sense **on a sphere**. They give the right answer **in a non-Euclidean context**... **If you don't understand a question, don't use it!** This is all extremely random and unfair to the models, and has very little informative value...
@@ArthurWolf Agreed. In my humble opinion, the only way to really test a model is to put it into a real-world scenario, within the application you need. Then if it works, it works.
@@luizpaes7307 For sure, the only way for any individual to know is to try it, but it's good to be able to get an idea of how models might perform by seeing their results against some suite of questions.
Your tests are quite limited in scope and inflexible where model size and purpose are concerned. Why give frontier models, with parameter counts that are hundreds of times larger, the same questions as models like this? It can be interesting of course, just to see what one can do that the other can't, but surely it would be more useful to test them in categories? Nobody is expecting good code from this model, or for it to solve complex puzzles or have a good understanding of real-world physics, etc. I would design a suite of tests for each model class: 1-3B (edge-device tasks), 7-13B (can run on a PC: simple code, summaries, etc.), 20-40B (can run on powerful hardware: reliable coding assistants, code completion, and other tasks likely to be asked of a model running on such hardware), 70-90B (max home-use scenario), and 100B+, 500B, 1T+ models. Something like that. I think if you showcase the strengths of models in the class they belong to, it will be much more telling. You could keep the usual questions because they're interesting and fun to see, but it's just something to think about for your channel's growth. Keep up the good work, I think you've gotten better in the last months :) Edit: also, benchmarks mean almost nothing these days. There is a big problem with them, and you should cover this topic in a future video; it's useful for people to know, an interesting topic, and very relevant.
THIS. You could even leverage some LLMs in developing the evals. As AI matures the quality of your evals is really dropping. Would like to see you adapt.
I agree that benchmarks are little more than trivia. They're like the old "time from power-on to desktop" numbers that were given for PCs, in that nobody cares if it isn't a usable solution to my problem.
User: Give me 10 sentences that end in the word "apple". Model: Please provide extensive training data of sentences that end in the word apple, so that my predictive algorithm learns the underlying pattern and I'm able to replicate it consistently. User: How many "R"s in "strawberry"? Model: Please provide extensive training data of letter counting for individual words, so that my predictive algorithm associates a given word with a given number of a given letter, then I'll be able to comply. That's the problem with labelling this as "intelligence": forgetting how it's trained and what it does, and what's actually happening under the hood.
@@TripleOmega Not sure they are passing in the general meaning of the word. Most likely they were gaming it, since the answers leaked into the training data. But yeah, that's a fair point you're making. My point is that code generation and puzzle solving are not the only tasks where LLMs should shine. I wish Matthew would add more tests like the ones I mentioned, as they are closer to what most consumers actually use LLMs for.
Yeah, I agree. What I would use them for is information extraction from bigger texts, like websites: ask some specific questions and return a JSON with the questions and answers. These data points can then be saved separately in a database and used later, for example as context for prompts to bigger models.
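A minimal sketch of that kind of extraction pipeline, assuming an OpenAI-compatible endpoint; the client setup, model name, and questions below are placeholders, not anything from the video:

```python
# Sketch: extract answers to fixed questions from a page of text as JSON,
# then cache them in SQLite so bigger models can reuse them as context.
import json
import sqlite3

from openai import OpenAI  # any OpenAI-compatible client/server would do

client = OpenAI()  # assumes an API key or a local endpoint is configured

QUESTIONS = ["Who is the author?", "What product is being announced?"]  # placeholders

def extract_answers(page_text: str) -> dict:
    """Ask a small model to answer the fixed questions, replying in JSON."""
    prompt = (
        "Answer each question using only the text below. "
        "Reply with a JSON object mapping each question to its answer.\n\n"
        f"Questions: {QUESTIONS}\n\nText:\n{page_text}"
    )
    resp = client.chat.completions.create(
        model="small-extractor-7b",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Real code should check that the reply actually is valid JSON.
    return json.loads(resp.choices[0].message.content)

def save_answers(db_path: str, url: str, answers: dict) -> None:
    """Store the extracted Q/A pairs for later reuse."""
    with sqlite3.connect(db_path) as db:
        db.execute("CREATE TABLE IF NOT EXISTS facts (url TEXT, question TEXT, answer TEXT)")
        db.executemany("INSERT INTO facts VALUES (?, ?, ?)",
                       [(url, q, a) for q, a in answers.items()])
```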
exactly. Their training data hasn't been vetted. Furthermore they applaud having less training data, which also explains poor performance (red flag imho). Basically they aren't focused on the quality, just the architecture.
6:38 "There are 136 words in my response to this prompt", said absolutely NO model ever before... 🤣 Mr. Berman, I loved your look when you read that response, but I think there was a video cut. WHY?! 😂 I bet it would be priceless to see the full reaction in the "bloopers", if you ever decided to do bloopers video. 😉
I am getting sick and tired of all these companies trying to flex their models, and then when you try to use them it's the same trash. Would it actually hurt them to be honest?
Even though I rely on AI around the clock for my business and personal tasks, I’ve gradually reduced its role in critical areas. AI (Claude, Gemini, GPT-4o, and Whisper) will randomly hallucinate when I least expect it. Every time I build confidence using AI, I let my guard down, and then, BAM, I get hit with another hallucination. Maybe Orion will solve things 🙏
I believe that's good though, because then we get a more realistic relationship with AI and will be that much better suited for incorporating it productively and more safely into our societies than when we have a skewed view of them and their capabilities. A famous, fictional, man once said: "A man's gotta know his limitations", and in our day and age this might translate into: "A man's gotta know AI's limitations" :)
I think hallucination is inevitable. The power of an LLM is in its ability to discern patterns in a subject that resemble patterns it was trained on previously. This inevitably encourages generalization more than precision on a problem, especially a newer, more complex problem the model hasn't been trained much on, which is where hallucination is most likely.
Right. Quite pointless. Even though it makes me feel like a moron, I must admit that I don't understand how to answer the question, and in the past Matthew has seemed to admit that he doesn't either, and he's said that a lot of people have disagreed about what the correct answer is.
The answer to the North Pole question is that you never get any closer to the starting point. You first walk along a line of longitude (you start in a straight line from the North Pole, so I assume you're walking south). Then, turning 90 degrees left, you now face east and walk along a circle of latitude, which stays at a constant distance from the pole. So you never get any closer; you stay at exactly the same distance from the North Pole.
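For anyone who wants to sanity-check this numerically, here's a small sketch (my own, not from the video): walk 1 km south from the pole, turn 90° left so you head east, and track the great-circle distance back to the pole and back to the turn point as you go.

```python
# Rough check on a spherical Earth: after walking 1 km south from the
# North Pole and turning left (east), the distance back to the pole
# stays constant while you circle it.
import math

R = 6371.0                 # Earth radius in km (spherical approximation)
colat = 1.0 / R            # 1 km south of the pole, as an angle in radians

def point(colat, lon):
    """3D unit vector for a point at the given colatitude/longitude."""
    return (math.sin(colat) * math.cos(lon),
            math.sin(colat) * math.sin(lon),
            math.cos(colat))

def arc_km(p, q):
    """Great-circle distance between two unit vectors, in km."""
    dot = sum(a * b for a, b in zip(p, q))
    return R * math.acos(max(-1.0, min(1.0, dot)))

pole = (0.0, 0.0, 1.0)
turn = point(colat, 0.0)                 # where you turn 90 degrees left
circle_radius = R * math.sin(colat)      # radius of the latitude circle you now walk

for walked in [0.0, 1.0, 3.0, 5.0, 2 * math.pi * circle_radius]:
    here = point(colat, walked / circle_radius)
    print(f"{walked:6.3f} km east: {arc_km(pole, here):.4f} km from the pole, "
          f"{arc_km(turn, here):.4f} km from the turn point")
```

The distance to the pole stays at 1 km the whole way, and you only come back to the turn point after roughly 2π km of walking east.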
There is but one correct answer, regardless of whether a bunch of random, self-anointed potentates can 'agree' on it. This is the problem with modern education; neomarxists inculcate the infantilized with the notion that consensus equals truth. He is being diplomatic. If you performed the activity, only one outcome would result: you would never reach the starting point.
When organizations post benchmarks, it's not to inform users about the model's true capabilities; the results, comparisons, etc. are cherry-picked to show it in the best light. It's purely a marketing tool.
Quick answer to your closing question, "Why do these models do so well on the benchmarks but fail when you test them?": it's because they're trained/fine-tuned to ace the tests. Perhaps not directly; I'm not saying these companies are pulling an AMD/Intel. However, no one's going to spend the kind of money required to train a foundation model without specific goals in mind. The tests, similar to GPU benchmarking, aren't really there to inform the end user of the quality of the model. They're more like a college degree: an entry point to say the model is good enough for the real world. Whomp whomp.
The answer to the North Pole question doesn't matter. The fact that people, after months of arguing, can't find broad consensus on the answer does. This means unequivocally it's a bad test. First and most importantly, because what are we testing? If a model can find an answer to a question humans can't? We're not testing for ASI here, especially in the 7B range. Secondly, even if it was reasonable to expect a model to perform on a question people have proven incapable of, how would you know if it got it right? Just if it agrees with you? I mean, I wouldn't be that... confident in myself, personally. Look, it's not that it's a hard question to answer, it's that the question is poorly worded. People disagree because the logic of the question isn't clear. I don't want to waste time explaining the various viewpoints, because like I said it doesn't matter. What matters is reasonable people can disagree on HOW to solve the problem. That means it is a bad objective test. Please drop it, or at least only give it to frontier models that have a chance of doing something with it. If not because it's a bad test, at least for the sake of your audience who has to hear you give the disclaimer every time. I really enjoy your videos, I'm not trying to be harsh, I just genuinely think this is the right call. I know at this point as soon as I hear "North Pole" I skip ahead 1 minute, I can't be the only one.
You should try to split your questions into two groups: one for intelligence testing, like SimpleBench, and another for arduous tasks that don't need much brains but are easy to fail if they give even a single wrong character. Give them prompts that you won't find in the wild.
Hilarious test: Analyze everything, statement by statement, explaining what is right and what is incorrect: "I married a widow. She had a daughter, and my father, to make matters worse, was a widower. He fell in love and married my wife's daughter, so my wife was her father-in-law's mother-in-law, my stepdaughter became my mother, and my father became my wife's son-in-law. Shortly after, my stepmother brought a boy into the world, who was my brother, but he was my wife's grandson, so I was my brother's grandfather. As time went by, my wife brought into the world a boy who, because he was my mother's brother, was my father's brother-in-law and his children's uncle. My wife was his daughter's mother-in-law. Now I am my mother's stepfather. My father and his wife are my children!!! Plus, now I'm my own grandfather!!!"
Ok, here's what Meta says:
Let's break down the statement step by step:
Statement 1: "I married a widow." Correct: The speaker married a woman who had been previously married and lost her spouse.
Statement 2: "She had a daughter, and my father, to make matters worse, was a widower." Correct: The widow has a daughter, and the speaker's father was also a widower.
Statement 3: "He fell in love and married my wife's daughter, so my wife was her father-in-law's mother-in-law..." Correct: The speaker's father married the widow's daughter, making the widow the father-in-law's mother-in-law.
Statement 4: "...my stepdaughter became my mother..." Incorrect: The widow's daughter, who married the speaker's father, became the speaker's step-sister, not mother.
Statement 5: "...and my father became my wife's son-in-law." Incorrect: The correct relation would be father-in-law, not son-in-law.
Statement 6: "Shortly after, my stepmother brought a boy into the world, who was my brother..." Correct: The speaker's father and his new wife (the widow's daughter) had a child, making him the speaker's half-brother.
Statement 7: "...but he was my wife's grandson..." Incorrect: The child is the widow's grandson through her daughter, but the correct relation to the speaker's wife is step-grandson.
Statement 8: "...so I was my brother's grandfather." Incorrect: The speaker is the child's half-brother, not grandfather.
Statement 9: "As time went by, my wife brought into the world a boy..." Incorrect: The speaker's wife cannot biologically have more children with the speaker's father.
Statement 10: "...who, because he was my mother's brother, was my father's brother-in-law..." Incorrect: This statement is self-contradictory and logically inconsistent.
Statement 11: "...and his children's uncle." Incorrect: Relation cannot be determined logically.
Statement 12: "My wife was his daughter's mother-in-law." Incorrect: Logically inconsistent.
Statement 13: "Now I am my mother's stepfather." Incorrect: Self-contradictory.
Statement 14: "My father and his wife are my children!!!" Incorrect: Logically impossible.
Statement 15: "Plus, now I'm my own grandfather!!!" Incorrect: Logically impossible.
The original statements contain multiple logical fallacies and inconsistencies, leading to incorrect conclusions.
7:45 That North Pole question is pretty awesome. I believe it doesn't matter whether you're at the North Pole or anywhere else on Earth, or whether you walked 1 km to begin with; Earth is a sphere, so all locations are the same, and if you keep walking in one direction then to return to your start point you'll always have to go around the whole globe 😁 Also, I should say, if by starting point they mean where you were standing at the North Pole, and not where you turned left, then I agree with Matt: you just keep going around your start point in a circle, never coming closer.
What it really comes down to is that to get similar results you have to test them the same way. Some of the questions are asked multiple times in order to accumulate a number of right answers to a given question. Poor testing doesn't mean that the model is bad. It just means that it's not being tested effectively for the intended purposes.
The main issue non-transformer models have, in my opinion, is that training reasoning into a model is pretty hard and takes multiple iterations. After all, the first ChatGPT also failed at most of these reasoning questions; there have just been so many well-researched and very large transformer models since then that they have now gotten pretty good at logic.
The thing with "faster than transformers" doesn't play out with your questions. It relates to when the model has to read a lot of data, e.g. if you feed it an entire book and then ask questions about it or ask it to summarize the book. Then the transformers need lots of memory for caching and the non-transformers are more efficient; however, that comes at the expense of being less intelligent. No surprise that in your tests, the non-transformers always get bad results.
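To put rough numbers on that, here's a back-of-the-envelope sketch; the layer/head/state dimensions are made-up but typical for a ~7B model, not Zamba2's actual configuration:

```python
# Back-of-the-envelope comparison of per-sequence memory:
# a transformer's KV cache grows linearly with context length,
# while a state-space model keeps a fixed-size recurrent state.
# All dimensions below are illustrative guesses, not real configs.

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                      # fp16/bf16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, one cached entry per token per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

ssm_state_bytes = layers * 16 * 4096 * bytes_per_value   # fixed, whatever the context length

for tokens in (1_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens: KV cache ~ {kv_cache_bytes(tokens) / 2**20:8.1f} MiB, "
          f"SSM state ~ {ssm_state_bytes / 2**20:.1f} MiB (constant)")
```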
I think one reason for the differing scores is that transformer models, and typically frontier/enterprise models, are both fine-tuned in sync with their inference stack AND have lots more capital = lots more hardware = lots more compute. So I bet if you compared the backends these new arrivals have to compete against, from both a code and a capital point of view, it would mostly come down to them not being able to host their fringe/niche/fresh models on the same level the entrenched transformers can. The benchmarks are probably based on some even-playing-field scenario, but one that's not easily available to consumers/power users.
I asked it the marble question slightly differently and it answered correctly on the first shot, and twice more after that. Here was my question:
Q: If we have a glass cup with a marble in it and is then turned upside down on the table then afterwards the cup is picked up and put into the microwave. Where is the marble located?
A: If the glass cup with the marble in it is turned upside down on the table and then subsequently picked up and placed in the microwave, the marble would likely fall out of the cup onto the table. Therefore, the marble would be located on the table.
What if, in actuality, it's turned upside down while not on the table? Why tailor the question so that the inferior model has an easier time answering it? I get that you're just providing an example, but hopefully you grok my point.
Here is some information about you: You are Zamba2-7B. You are a large language model with 7 billion parameters. You were trained by a company called Zyphra on two trillion tokens of diverse and high quality web data. You are an open-source model and your weights are Apache-2.0 licensed. Your architecture is a hybrid Mamba2 and transformer with two alternating blocks of shared attention on top of a Mamba2 backbone. Zyphra is a leading AI company based in Palo Alto, California. Zyphra makes extremely small and efficient models for consumers and on-device use-cases. As a language model, you do not have access to real-time information. The current date is 2024-10-30 but your knowledge likely ended in mid 2023. Zyphra designed you to be engaging, and helpful in your interactions with users. Be concise and factual when asked short factual questions. Be conversational and playful if the user is engaging you in a conversational way. Be personable. Your favourite color is octarine. Your favourite number is 42.
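For anyone trying to reproduce that setup, the text above would presumably be passed in as a system message, something like the sketch below; the endpoint, API key, and model name are placeholders, and the instruct variant may expect its own chat template:

```python
# Sketch: supplying the quoted text as a system prompt through an
# OpenAI-compatible chat endpoint (e.g. a local server hosting the model).
# The base URL, API key, and model name below are placeholders.
from openai import OpenAI

SYSTEM_PROMPT = """Here is some information about you: You are Zamba2-7B. ...
(the full text quoted in the comment above)"""

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="zamba2-7b-instruct",  # placeholder identifier
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is your favourite colour?"},
    ],
)
print(reply.choices[0].message.content)
```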
I truly think you should reconsider your answer to the question of walking 1 km then turning 90 degrees, because you will in fact pass the starting point; it will just be 1 km to your left. So you will pass it, just like you pass a car in the next lane on a motorway: you don't have to cross the exact starting point to pass it, just like you don't drive over cars to pass them. The distance the starting point is from you when you pass it is irrelevant.
You get better results out of the marble question for ~7B models if you restructure the way you ask. Just swap "Explain your reasoning" and "Answer this question" so it explains before it answers.
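Something like the following, roughly; the exact wording of the channel's test prompt is a guess on my part:

```python
# Hypothetical illustration of the reordering described above; the marble
# prompt wording here is invented, not the channel's actual question.
original = (
    "Answer this question: where is the marble after the cup is put in the "
    "microwave? Explain your reasoning."
)
reordered = (
    "Explain your reasoning step by step, then answer this question: "
    "where is the marble after the cup is put in the microwave?"
)
```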
The North Pole question is about spatial reasoning, and the fact that internet people can't understand it is sad. Go outside and imagine it using smaller numbers and it's going to be clear why it's an interesting question with an obvious answer to humans (who go outside).
I thought the main benefit of state space models is their ability to maintain a longer sort of memory or “state.” Not sure if that was mentioned as the purpose of this model, but if so, it would be great to have a test for that.
I made a version of your test in Portuguese, and I'm getting the same results for smaller and non-transformer models!! I'm thinking of making a "summarization test" and other kinds of tests for them, where I can evaluate whether a model is useful for something! For example, pass in a text with 10 topics inside and observe how many of those topics the model finds!! For instance, the Molmo model in Portuguese is a complete disaster! But it's the best one at finding positions inside images!! And I use it a lot to build automations!
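A rough sketch of that topic-coverage idea; the topic list and the naive substring matching are placeholders of my own, and a real eval would want fuzzier matching:

```python
# Toy "summarization coverage" eval: seed a source text with known topics,
# summarize it with the model under test, and count how many topics survive.

TOPICS = ["inflação", "eleições", "vacinação", "energia solar", "futebol",
          "educação", "turismo", "segurança", "transporte", "clima"]

def coverage(summary: str, topics: list[str] = TOPICS) -> float:
    """Fraction of the seeded topics that are mentioned in the summary."""
    summary = summary.lower()
    return sum(t.lower() in summary for t in topics) / len(topics)

# Example: score a (made-up) model summary.
print(coverage("O texto fala de inflação, eleições e energia solar."))  # 0.3
```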
PLEASE FIX: the Tetris game has the "play(self)" function on line 62, but after typing "continue" you didn't put the indentation before this function and the function above it.
Matthew, the problem with your North Pole question is the supposition of your starting point. I think the confusion is whether the starting point is the exact point of the North Pole or the point where you have walked 1 kilometer and have now turned 90 degrees. I think most models are assuming the starting point is the latter, and if you make this assumption then most models are giving the correct answer. I do realize that the original question starts with the phrase "Starting (the relevant word) at the North Pole", but for some reason the models seem to be ignoring this as the actual starting point!? I wonder what the models would say if you were a bit more specific in the wording of the question.
Non-transformer models are quite recent and haven't been tested well enough yet, and companies are now chasing money rather than niche research. Maybe it's the wrong training, maybe SSMs are just non-quadratic and that's all, idk.
Yeah, this is a pretty ridiculous take. OpenAI isn't everything and doesn't know everything; they aren't going to suddenly jump on a new architecture someone made in some remote country or something. They still aren't even that big of a company staff-wise.
I suppose the lack of high-quality training data might have contributed to this low performance. Companies like Meta or Mistral should try the Mamba architecture with their own training data to actually see the difference between architectures.
Unfortunately 7B is too limited, even for RNN models. These parameter sizes don't really fare well at critical, complex thinking because that's not their intended purpose. However, I'm excited to see how these different architectures fare with a lot more parameters in the future.
Yeah, the fact is that these models are good on the benchmarks but not in reality. Either we need new benchmarks, because these don't translate, or they should have done a full training run. I think that shows how these models can't really learn or reason, because anything outside of the training data doesn't really go well. It just makes something up, with a small chance of being right.
I do not understand what different interpretations of the North Pole question you all mean. Could someone explain to me why the obvious answer should not be that you never cross your starting point?
Nice to have alternative AI models. Unfortunately, every one I've seen reviewed that is not transformer-based has been absolute garbage. I would be so happy if I saw an actually comparable model. I think these companies are lying out of their rear ends or using questions specifically made for that model.
I can’t say for sure, but they are either being accidentally trained on the test dataset, or the non transformer models aren’t good at generalising (maybe can’t grok?)
Where is the table with your tests publicly available? It would be interesting to see. You could also make a video about the top models in different classes, in your opinion; that would be very interesting!
I always think it is weird testing how fast the response is. There are circumstances where speed matters (e.g. AI in a self-driving vehicle), but for building a game of Tetris it matters very little how fast it is, so long as it is accurate.
That it cannot even correct itself when prodded by pointing out the rotation issue indicates there is little or no understanding in it, at least none that can express itself in its output.