
Mistral Medium - The Best Alternative To GPT4 

Matthew Berman
Subscribe · 284K subscribers
70K views

Re-upload; the first one was cut off before the ending. I got access to Mistral Medium, Mistral's prototype model that is available only through API (for now). It performs incredibly well and is a fraction of the price of GPT4. This is a great replacement for developers building most use cases.
Check out D-ID here: timf.io/3t8Xj73
Enjoy :)
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? ✅
forwardfuture.ai/
Rent a GPU (MassedCompute) 🚀
bit.ly/matthew-berman-youtube
USE CODE "MatthewBerman" for 50% discount
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries 📈
bit.ly/44TC45V
Links:
Script - gist.github.com/mberman84/4f7...
Mistral - mistral.ai/
LLM Rubric Results - bit.ly/3qHV0X7
Chapters:
0:00 - About Mistral Medium
2:28 - Pricing Comparisons
4:59 - Test Results

Science

Published: 5 Jul 2024

Comments: 215
@matthew_berman · 6 months ago
A new model, SOLAR 10.7B, reached #1 on the LLM leaderboard. Should I review it?
@MikeKasprzak · 6 months ago
Nah. 😋
@pn4960 · 6 months ago
10.7B? Is this a case of overfitting via contamination?
@dkracingfan2503 · 6 months ago
Sure!
@federicocucinotta7772 · 6 months ago
Of course!
@annettewatson1803 · 6 months ago
Yessss 🎉
@paul1979uk2000 · 6 months ago
What's remarkable is how quickly these models are developing. I remember what the quality was like early this year, and we've seen massive improvements since then. It's shocking many of us and likely scaring OpenAI, Microsoft and Google, as I suspect they didn't expect open-source models to close the gap so quickly. You really do have to wonder how much better they'll get over the next few years.
@TimoGraw · 6 months ago
I guess you don't remember the "We have no moat" memo
@jnevercast · 6 months ago
Exactly what I thought too: no moat, as written. Even then, I thought maybe OpenAI had more than a year's head start. They might not be that lucky @TimoGraw
@leucome · 6 months ago
They probably didn't expect it to run on consumer hardware this quickly either. Even I am a bit surprised. I tried Mixtral 4x7B on a 7900 XT. It is a totally coherent AI with a proper ability to hold a conversation in real time, about as smart as GPT-3.5, running on a gaming GPU... Not only that, it's on an AMD GPU that is allegedly not supposed to be good for AI, if you completely believe Nvidia's self-congratulatory marketing.
@mattbarber6964 · 6 months ago
I created a simple Flask front end and have this working as well. I use LLMs all day long for my work, and this is the first time I can honestly say I'm more impressed with it than with GPT-4. This is great news for LLMs in general, because now OpenAI has an actual threat.
@carlkim2577 · 6 months ago
Do you mean Mistral Small or Medium? I thought Medium isn't open source yet?
@mattbarber6964 · 6 months ago
@carlkim2577 I am using Medium through the API. So it's not open source, no, but it's on par with GPT-4, if not better. Plus, as Matt said, it's 10x cheaper.
@orieporter5299 · 6 months ago
@carlkim2577 I'm sure he means a Flask front end for the API
@businessemail9825 · 6 months ago
what is the output token limit?
@georgesms · 6 months ago
Dude, first: awesome channel! I was thinking that your tests have a serious chance of contaminating the training data of newer models. Therefore, I think adding different questions is always beneficial, even if they aren't harder than the previous ones.
@cluelesssoldier · 6 months ago
Information sent and received through the API backend typically is not used for training purposes - at least this is how it is with OpenAI.
@DihelsonMendonca · 6 months ago
⚠️ Matthew, what about creating a leaderboard of LLMs that could answer all your questions? Then we could track which one is the best to date. Please consider it; it could be done with a single spreadsheet. Thanks. 🎉🎉❤
@brandon1902 · 6 months ago
I liked seeing you add a couple unique questions. Consider adding a couple fresh questions to the standard set in each video.
@Utoko · 6 months ago
It is nice to be able to compare results, but he should drop the easy questions that no model ever fails and add a couple of hard new ones. The models are just getting better.
@Dreamslol · 6 months ago
"BEST OPEN SOURCE MODEL"... I hear that in every video lol.
@esimpson2751 · 6 months ago
He makes a video every time a new champion arises
@miladkhademinori2709 · 6 months ago
So excited 😊 thanks Matt 🙏
@aldousd666 · 6 months ago
BTW, I recommend your channel to everybody who asks me how to learn how to use these models and compare them to each other. I'm the principal engineer at my company and a lot of people ask me.
@stephanembatchou5300 · 6 months ago
Hi @matthew_berman, thanks for the excellent content like this one. I'd say it is better to have new variations of the complex tasks at each evaluation. I am pretty sure the new models are trained or fine-tuned on your current evaluation tasks.
@rheale9420 · 6 months ago
Wow, your camera and lighting look amazing!
@fabiankliebhan · 6 months ago
Impressive. And you have to consider that even GPT-4 struggles with the marble problem. I think if you prompted it again a few times it would get it, and Mixtral also sometimes misses it. So it could be a coincidence that it performed worse on this question than Mixtral. Anyway, a really good model, and great to see this progress.
@RainbowSixIntel · 6 months ago
Yep, GPT-4 Turbo gets this wrong consistently, both via the API and in ChatGPT.
@ylazerson · 6 months ago
great video - thanks!
@toastrecon · 6 months ago
One thing I've been thinking about recently: I get that it's good to have a set of standard "tests" for these models, but at what point do they become "overtrained" on those tests? Like, if you wanted to game the system, you could just have a pre-trained game of Snake in Python at the ready to "infer". I'd almost rather it be given something novel, like: create the game of Snake, but multiplayer and 3D. It is interesting to consider how these models go from plain inference to reasoning engines.
@PvMLad · 6 months ago
Create your own private test and rank them against it 🙂
@BienestarMutuo · 6 months ago
There is a video about that; it's a real problem with no solution as of today (15/12/2023).
@joey3070 · 6 months ago
Best we have is OpenAI's data decontamination thing
@craiganderson6126 · 6 months ago
Please redo the Mixtral and Mistral Medium tests with variations of your current questions. I think there's a pretty good chance they have been trained on your questions, especially if they approached you with an API key to test their model.
@helbertgascon · 6 months ago
Then again, I agree: he should change his questions in every video to avoid having his tests trained on. Like that Snake-game coding test, for example. It's so overused now that it has become one of the one-click prompts on the Hugging Face chat 😂
@SmOgER10 · 6 months ago
Agree. It seems very relevant for this one especially. With a different set of questions, I personally found Mistral Small to perform BETTER than Mistral Medium.
@travisporco · 6 months ago
I like that you actually show people how to get things done...actual code! and not just talk-talk.
@coldlyanalytical1351 · 6 months ago
It's clear that in, say, a year these models will be able to handle most puzzles as well as most humans. At this point AGI of a sort could be possible using clusters of models, local memory, and wrapper code. These quasi AGIs could be directed to complete tasks autonomously - although they won't be sentient in any way.
@deeplearning5408 · 4 months ago
You are quite famous in the GenAI world already. I would not be surprised if some companies already have your questions and perfect answers in their datasets.
@snow-hk9zw · 6 months ago
Amazing! I want to use this right now!
@ossian882 · 6 months ago
Please add programming tasks that are a bit more unique. There are thousands of Python Snake games online, so it doesn't show very much. Add some twist that makes the model need to reason more.
@Parisneo · 6 months ago
Amazing video. I think Mistral AI is doing a great job. As for the code, you could have used the Python library they provide; that's better than just using requests, and you can also have it stream the output sequentially. Nice video as always.
@AINEET · 6 months ago
Would be nice if you made a compilation video of the current top models lol, there's so much to keep up with
@captanblue · 6 months ago
Oooo this is going to be good.
@ramoneduardo · 6 months ago
Yes, please add these more complex word problems to your test set! Thank you!
@ldsviking · 6 months ago
I think you should drop the easy questions that everyone gets right, and for the tricky questions, make them regenerate answers to see if they can get it right twice in a row. Or even scale the questions up incrementally to see where the breaking points are. For example, include a fourth runner, then a fifth, etc.
@TheGeneticHouse · 6 months ago
This is, I think, the first to get it correct, holy. And the absolute best response to the T-shirts-drying-in-the-sun question.
@notme222 · 6 months ago
I've been using the test "Write a palindrome using the word ____". (Where the word is something that reasonably works backwards. Like "wolf".) None have gotten it correct yet but it's interesting to see the failures.
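A small checker makes it easy to grade those palindrome attempts automatically. This is a sketch using the loose definition that ignores case, spaces, and punctuation (an assumption; a stricter grader could require exact character reversal):

```python
def is_letter_palindrome(text: str) -> bool:
    """True if the letters read the same forwards and backwards,
    ignoring case, spaces, and punctuation."""
    letters = [c.lower() for c in text if c.isalpha()]
    return letters == letters[::-1]

print(is_letter_palindrome("Wolf flow"))        # True
print(is_letter_palindrome("Wolf saw a flow"))  # False
```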
@MeinDeutschkurs · 6 months ago
Very promising!!!!
@tomski2671 · 6 months ago
100% / 25 min = 4%/min (fill rate)
100% / 55 min ≈ 1.818%/min (drain rate)
4%/min − 1.818%/min ≈ 2.18%/min (net rate at which the tank fills)
100% / 2.18%/min ≈ 45.8 min (units cancel out)
I had to think about this one for about 2 minutes 😄
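The arithmetic above can be sanity-checked in a few lines of Python (a quick sketch, not code from the video):

```python
# Net fill rate when the inlet (fills in 25 min) and the
# drain (empties in 55 min) are both open.
fill_rate = 1 / 25    # tank fraction per minute
drain_rate = 1 / 55   # tank fraction per minute

# Time to fill one whole tank at the net rate.
minutes_to_fill = 1 / (fill_rate - drain_rate)

print(round(minutes_to_fill, 1))  # 45.8
```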
@DihelsonMendonca · 6 months ago
I couldn't answer that. Mixtral is better than me at reasoning. But hey, a tiny calculator can also do better than any human being. 😅😅
@trevoC132 · 6 months ago
What is the best model available to train with your own data (documents) that include proprietary information for users to query in the form of questions?... looking at llama 2 at the moment.
@fenix20075 · 6 months ago
Great! I love uncensored models! Hahahaha... I suggest adding a two-equation test: 2A + 3B = 12, A + B = 5. Try to find A and B's values (A = 3 and B = 2).
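The suggested system can be verified by substitution, which also confirms the expected answer in the comment:

```python
# 2A + 3B = 12 and A + B = 5, so A = 5 - B.
# Substituting: 2(5 - B) + 3B = 12  =>  10 + B = 12.
B = 12 - 2 * 5   # B = 2
A = 5 - B        # A = 3

print(A, B)  # 3 2
```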
@steveking4878 · 6 months ago
I'm still waiting for the response to the how many words question to be "One." ;)
@scosee2u · 6 months ago
Loving the more complex problems! What game do you have in mind to replace snake once that is easily done by better and better models?
@chrisvernon3156 · 6 months ago
Doom?
@leucome · 6 months ago
There is also a Mixtral 4x7B now. It's small enough to fit on a consumer GPU. This one is really promising for a relatively smart chatbot that talks in real time. The quality of casual language is nearly as good as the 8x7B.
@Parisneo · 6 months ago
By the way, LoLLMS now has a Mistral AI binding to talk to their API, and it is really cool. Their models are solid.
@YvesQuemener · 6 months ago
About math problems: it is interesting to see its ability to do arithmetic, but I would also give it the opportunity to write a program to solve the problem.
@marcosbenigno3077 · 6 months ago
@MatthewBerman, please provide your list of the best tested LLMs!
@matthew_berman · 6 months ago
It's in the description! :)
@Mcmeider · 6 months ago
Hey Matthew, after watching a lot of your videos I have the feeling these questions should be changed. Another thing, and it could sound weird: given how fast we got a 7B model that is run by "experts", when can we run it on a smartphone, or in a Linux kernel?
@byrdziak-media · 6 months ago
Might be a good idea to create a matrix spreadsheet of which models got which questions right/wrong. It can be something as simple as a Google Spreadsheet; if you leave it private, people will need to request access, leaving their email address with you. Great lead magnet, and I'd want to know which models are out there, links to them, and how well they're doing. I would also suggest that for every new question you add, you drop an old question that no model has failed. You can still add the old question to the matrix, but we don't need to see Bill Clinton again!
@DihelsonMendonca · 6 months ago
Good idea
@gidmanone · 6 months ago
@matthew_berman How do you know your test cases (or any of their variants) have not made it into the training data ?
@NoHandleToSpeakOf · 6 months ago
We do not control the sampling method and parameters while using the API. It is possible that local models are smarter than they first appear if you tweak them. How about Mirostat 2.0 sampling? It sounds great, but I never see it used.
@chrisvernon3156 · 6 months ago
These models are already *amazing*. I'd be more interested in how to get the best out of them, how to use them effectively in the real world, on real tasks. I sometimes wonder what would happen if there was NO further development for a decade? I feel like we're only scratching the surface of what today's models are capable of. They are pretty much black boxes, there's still so much to learn about *these* before we charge ahead with new models.
@leucome · 6 months ago
The other day I was thinking: yeah, sure, AI will end up everywhere, but what will be my use for it? What can I do with it in my actual life, what utility does it have directly for me? It's the kind of question we didn't have before, so there is no obvious quick answer.
@matten_zero · 6 months ago
@7:10 did you turn off "safe mode"? Apparently that parameter controls how censored the responses are
@davidrose9577 · 6 months ago
Love your stuff. Would like to see this mixed with Dolphin, please.
@ChairmanHehe · 6 months ago
go mistral!!
@0xDeadC0de1 · 6 months ago
I want to be pedantic about "How many words are in your response to this prompt?". Nowhere in your code did I see anything that would print out a confidence score, so at minimum it responded with 10 words ("Confidence" is a word). Thoroughly enjoyed watching you test this. Have you tried Microsoft Phi-2 yet? It's neat, but in Oobabooga in particular it's nearly 2x slower than Mistral 7B on my 3080.
@UnFleshedOne · 6 months ago
"Confidence: 95%" is another word in the response (also 95% can be counted as 0, 1 or 3 words depending on how you count), so it got it wrong :)
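The counting ambiguity the thread is arguing about is easy to demonstrate. The reply sentence below is hypothetical (not the model's exact output); the point is how an appended confidence line shifts a whitespace-based word count:

```python
# A hypothetical 9-word reply, plus the appended confidence line.
answer = "My response to this prompt contains exactly nine words."
with_confidence = answer + " Confidence: 95%"

print(len(answer.split()))           # 9
print(len(with_confidence.split()))  # 11 ("Confidence:" and "95%" both count as tokens)
```

Counting "95%" as zero, one, or three words, as the comment notes, would change the total again.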
@josjos1847 · 6 months ago
No
@8eck · 6 months ago
We need to improve our test questions. 😆
@businessemail9825 · 6 months ago
What is the output token limit on Mistral Small and Medium? Please talk about this when discussing LLMs; I mean both the input context and the output token limit.
@friendofai · 6 months ago
Do you know if they plan to release this model for local use in the future?
@Leto2ndAtreides · 6 months ago
For logic questions like "If A > B and B > C, is A > C?", it may be worth not asking bigger models to explain their reasoning. The more advanced models should be able to reason internally, one would hope.
@dr.mikeybee · 6 months ago
Ask the model to use calculus for the tank problem. BTW, I always enjoy your videos. Cheers!
@MikeBtraveling · 6 months ago
Of course you need to play with this more, but I don't recall seeing you do this with GPT-4...
@aldousd666 · 6 months ago
Have you done any more word counts? I'll be interested to see if it gets that right over and over again. That's kind of insane, because it means it has to have some short-term working memory.
@darwinboor1300 · 6 months ago
Keep making the problems more complex over time. After each problem, ask the AI if it has seen this type of problem before. You are looking for edge case solutions. You are not looking for previously trained solutions. It is interesting to see how the AIs respond. NOTE: To remove ambiguity, the filling task should state that the rates are fixed. In our world, rates are dependent upon pressure and resistance. The emptying pressure increases with the height of the fluid column.
@lucademarco5969 · 6 months ago
I suppose the different answer to the marble question is because it assumed the ball is big enough to be stuck inside the cup. Could it be? In that case, the answer would be correct. Maybe you should rephrase the question, saying that the ball's diameter is smaller than the cup's diameter or something like that.
@aldoyh · 6 months ago
Yes I'd vote for a spin with SOLAR 10.7b! 😍😍
@xdasdaasdasd4787 · 6 months ago
I do hope they open-source it. Keep the models open source but make access easy for a price.
@Dron008 · 6 months ago
"Confidence" is also a word in the response, so there are 10 words. Anyway, I don't think we can rely on a single answer; it could be coincidence. More different questions should be asked to make sure it can count words. Usually models are very bad at counting objects. None could count the seagulls in my photo.
@MHTHINK · 6 months ago
I wonder if a subset of the 8 models is censored and not all of them
@neel_aksh · 6 months ago
Mistral 7B Instruct v0.2 and Mixtral Instruct are both fine-tuned on the killers problem
@JoePiotti · 6 months ago
The batch answer should consider that the sun doesn’t shine equally all day.
@dr.mikeybee · 6 months ago
"An inlet pipe can fill a tank in 25 minutes. The drain can empty the tank in 55 minutes. Once the tank was empty, the drain was left open and the tank was filled by the pipe. How long did it take for the tank to fill? Use calculus."

Let F be the rate at which the inlet pipe fills the tank and D the rate at which the drain empties it: F = 1/25 tanks per minute and D = 1/55 tanks per minute. With both open, the net fill rate is F − D, and the tank is full at time t when ∫₀ᵗ (F − D) dt = 1. Substituting and solving: t = 1 / (1/25 − 1/55) ≈ 34.375 minutes. Please note that this is an approximate value and the actual time may vary depending on the specific conditions.
@alinayossimouse · 6 months ago
Can you ask Mistral Medium "How many primes end in 2?" I found that many models fail to answer this even when reasoning about it in steps, including Mixtral 8x7B and Llama 2 70B.
@errgo2713 · 6 months ago
I just asked Mistral-Medium, it answered: Only one prime number ends in 2, and that is the number 2 itself. All other even numbers greater than 2 are not prime because they can be divided by 2. Primes are numbers that have only two distinct positive divisors: 1 and the number itself. In this case, 2 is the only even prime number.
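That answer checks out: every prime other than 2 is odd, so none can end in 2. A brute-force sketch confirms it:

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    # Trial division up to the square root.
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Every prime other than 2 is odd, so no other prime can end in 2.
primes_ending_in_2 = [n for n in range(10_000) if n % 10 == 2 and is_prime(n)]
print(primes_ending_in_2)  # [2]
```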
@dr.mikeybee · 6 months ago
Here's Microsoft Copilot on the same prompt ("An inlet pipe can fill a tank in 25 minutes. The drain can empty the tank in 55 minutes. Once the tank was empty, the drain was left open and the tank was filled by the pipe. How long did it take for the tank to fill? Use calculus."):

Let F be the rate at which the inlet pipe fills the tank and D the rate at which the drain empties it: F = 1/25 tanks per minute and D = 1/55 tanks per minute. With both open, the net fill rate is F − D, and the tank is full at time t when ∫₀ᵗ (F − D) dt = 1. Substituting and solving: t = 1 / (1/25 − 1/55) ≈ 34.375 minutes. Please note that this is an approximate value and the actual time may vary depending on the specific conditions.
@Garbhj · 6 months ago
It used the right method, but I guess it just can't do numerical calculations.
@noomondai · 6 months ago
Do you think you can run the Mixtral-8x7B model on an AMD Ryzen 7040 if you have enough RAM?
@bestemusikken · 6 months ago
Seems you'll have to up your game on the tests. They are catching up quite nicely.
@dewijones92 · 6 months ago
More please
@saudfata6236 · 6 months ago
I suggest Postman for working directly with APIs
@dinonovak743 · 6 months ago
Hi, I experienced interesting behaviour: when asked, Mixtral can create tables with content in them. Also, when asked to create graphs it gives a positive answer, but I am not sure what is returned. Can you check?
@rawkarpojk · 6 months ago
How come none of the models have updated their OpenAI SDK code? I keep getting the wrong code.
@MatrixCodeBreaker88 · 6 months ago
When are you gonna cover the new update and features of LM Studio?
@someideas5388 · 6 months ago
I've tried a couple of models locally and in Colab, but could not get a correct answer for this prompt so far: "The electricity price is 28p per kWh. A kettle is rated at 2.7 kW, which means it uses 2.7 kilowatts of electricity. The kettle can boil 1 litre of water in 60 seconds. What is the cost of boiling 1.7 litres of water?"
It uses 2700/3600 = 0.75 Wh per second. 1 litre takes 60 s, so 1.7 litres takes 102 s. 102 s × 0.75 Wh/s = 76.5 Wh. 1000 Wh costs 28p, so 76.5 Wh costs 76.5 × 28/1000 = 2.142p.
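The expected answer to that kettle prompt can be reproduced directly (a quick sketch of the arithmetic, assuming a steady 2.7 kW draw):

```python
PRICE_PER_KWH_PENCE = 28
POWER_KW = 2.7            # kettle's rated power draw
SECONDS_PER_LITRE = 60    # time to boil one litre

seconds = 1.7 * SECONDS_PER_LITRE        # 102 s for 1.7 litres
energy_kwh = POWER_KW * seconds / 3600   # kW x hours
cost_pence = energy_kwh * PRICE_PER_KWH_PENCE

print(round(cost_pence, 3))  # 2.142
```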
@Fordtruck4sale · 5 months ago
So a 6 bit EXL2 quant would be how much VRAM?
@LukasLiesis · 6 months ago
3:03 Matthew, I run 7B on a MacBook Air M1 with 16GB RAM without any issue, and response speed is comparable to or faster than ChatGPT on a paid account. Not sure what the current price of an M1 16GB Air is.
@bobinorlando · 6 months ago
I hope you are keeping a separate set of different tests that you haven't published but that you are conducting since it would be rather trivial for any of these AI's to hardcode the correct answers to any of these tests knowing that you are going to use them.
@elck3 · 6 months ago
@matthew_berman Matthew, please try “What is the largest number written out in word form between 1 and 1 million that does not contain the letter 'n'?” Even GPT-4 doesn’t get this right. Could you please try this?
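That question has a neat shortcut: "hundred", "thousand", and "million" all contain an 'n', so any qualifying number must be below 100, and spelling out 1 to 99 is enough to find it. A sketch (the spelling rules are standard English, but the hyphenation convention is my assumption):

```python
UNITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n: int) -> str:
    # Word form for 1-99 only; larger numbers all contain an 'n'
    # via "hundred", "thousand", or "million".
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("-" + UNITS[unit] if unit else "")

largest = max(n for n in range(1, 100) if "n" not in to_words(n))
print(largest, to_words(largest))  # 88 eighty-eight
```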
@pon1 · 6 months ago
This is mind-blowing. Getting the word count right for its own answer is a sign of planning ahead, which traditionally AI models can't do, since they just predict the next word. It would have had to first predict the next sentence word by word, then count it, knowing that the number in the sentence wouldn't change the count. Incredible. I wonder how the large model would do.
@gileneusz · 6 months ago
You need to prepare a more difficult set of questions!
@russelllapua4904 · 6 months ago
Wait, wait. Mixtral was only a few days ago, right? And this model already beats it?
@adamrak7560 · 6 months ago
Their API is invitation-only. So they announce a new powerful and cheap model, and give access to influencers so they can test it and show people that the model is actually quite good. But it turns out that there is no API or model-weight access at all, so people who see the video cannot actually use it. Also, there is no ETA for access.
@kanishak13 · 6 months ago
Why is no model comparing its results with Bard or Gemini?
@victorwilson1337 · 6 months ago
Is it fair to compare the costs of the models without including GPT-4-1106-preview?
@Kingzt369 · 6 months ago
What is the point of paying for tokens when GPT-4 is free in Microsoft Copilot?
@MacS7n · 6 months ago
The team behind Mistral has already watched your videos and test questions, so I'm wondering if they haven't fine-tuned Mistral Medium on your test questions. You should have never-before-seen questions with the same type of logic and reasoning.
@PerfectArmonic · 6 months ago
The Einstein test of intelligence is the best test for LLMs
@B4zing4 · 6 months ago
Will this run on my 7900 XTX?
@DeepThinker193 · 6 months ago
Ugh, this test is basically pointless unless you ask it questions outside of your usual question set, like the SAT question it got wrong. The developers can easily make it answer your specific questions, which is why they would give you an inside scoop.
@matten_zero · 6 months ago
Ah you didn't use the SDK?
@big_sock_bully3461 · 6 months ago
Can you try Bionic GPT or Anything LLM?
@phobes · 6 months ago
This whole time we've been watching an AI reproduction of Matthew.
@tomschuelke7955 · 6 months ago
Wouldn't it be necessary to repeat the answers several times to get a statistically sound representation of answer quality?
@Yonatan_1770 · 6 months ago
Yes do more complex word problems
@jgcornell · 6 months ago
Does this mean there’s a Mistral Large out soon to blow GPT4 out of the water?!
@LiFancier · 6 months ago
The last few times you used the ball/marble and cup problem, you've called it a marble in the first part of the problem, and a ball in the last part. It might be confusing the model to call it both a marble and a ball.
@burnt1ce85 · 6 months ago
How come you didn't compare GPT-4 Turbo pricing with Mistral Medium? Mistral Medium is just a bit cheaper than GPT-4 Turbo.
@tomski2671 · 6 months ago
I'm seeing 3.64x cheaper on output; that is not insignificant.
@robxsiq7744 · 6 months ago
Turbo? You mean GPT-3.5 Turbo?
@redone823 · 6 months ago
What's their pricing? I'm on a waitlist and can't see anything.
@lio1234234 · 6 months ago
@@robxsiq7744 No, GPT-4 Turbo, it's on the API
@burnt1ce85 · 6 months ago
Oops... after reading the pricing on my desktop (I initially read it on my phone), I realize that I misread it... ignore my dumb comment.
@sultonbekrakhimov6623 · 5 months ago
How do I get access to Mistral Medium?
@zeburgerkang · 6 months ago
So what about an AI calculator for your hardware, like the mining calculators you see... even though you could just say wait for the A100 to come down in price.
@dennisestenson7820 · 6 months ago
Um, actually there were 11 words in its response. There were 9 in the original sentence, but then it appended a confidence level, making it technically wrong.
@BarakaAndrew · 6 months ago
Time to find other questions for testing; you can easily change these questions to have the same meaning but be phrased differently. It's highly likely every new model is already contaminated with these questions.