
LLaMA 3 “Hyper Speed” is INSANE! (Best Version Yet) 

Matthew Berman
265K subscribers · 74K views

What happens when you power LLaMA with the fastest inference speeds on the market? Let's test it and find out!
Try Llama 3 on TuneStudio - The ultimate playground for LLMs: bit.ly/llama-3
Referral Code - BERMAN (First month free)
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
groq.com
llama.meta.com/llama3/
about. news/2024/04/met...
meta.ai/
LLM Leaderboard - bit.ly/3qHV0X7

Science

Published: 20 Apr 2024

Comments: 591
@matthew_berman · A month ago
Reply Yes/No on this comment to vote on the next video: How to build Agents with LLaMA 3 powered by Groq.
@MoosaMemon. · A month ago
Yesss
@StephanYazvinski · A month ago
yes
@hypercoder-gaming · A month ago
YESSSS
@MartinMears · A month ago
Yes, do it
@paulmichaelfreedman8334 · A month ago
F Yesssss
@marcussturup1314 · A month ago
The model got the 2a-1=4y question correct just so you know
@Benmenesesjr · A month ago
Yes, if that's a "hard SAT question" then I wish I had taken the SATs.
@picklenickil · A month ago
American education is a joke! That's what we solved in 4th standard I guess..!
@matthew_berman · A month ago
That's a different answer from what was shown on the SAT website.
@yonibenami4867 · A month ago
The actual SAT question is: "If 2/(a-1) = 4/y, where y isn't 0 and a isn't 1, what is y in terms of a?" And then the answer is:
2/(a-1) = 4/y
2y = 4(a-1)
y = 2(a-1)
y = 2a-2
My guess is he just copied the question wrong.
@hunga13 · A month ago
@matthew_berman The model's answer is correct. If the SAT site shows a different one, they're wrong. You can do the math yourself to check it.
@floriancastel · A month ago
4:55 The answer was actually correct. I don't think you asked the right question because you just need to divide both sides of the equation by 4 to get the answer.
@asqu · A month ago
4:55
@floriancastel · A month ago
@@asqu Thanks, I've corrected the mistake
@R0cky0 · A month ago
Apparently he wasn't using his brain but just copying & pasting then looking for some answer imprinted in his mind
@Liberty-scoots · A month ago
Ai will remember this treacherous behavior in the future 😂
@notnotandrew · A month ago
The model does better when you prompt it twice in the same conversation because it has the first answer in its context window. Without being directly told to do reflection, it seems that it reads the answer, notices its mistake, and corrects it subconsciously (if you could call it that).
@splitpierre · A month ago
Either that, or it just has to do with temperature. According to the Groq documentation, their platform does not implement memory like ChatGPT, and the default temperature on Groq is 1, which is medium and will give varying responses, so I believe it has to do with temperature. Try again with deterministic results, temperature zero.
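As a concrete illustration of that suggestion, here is a minimal sketch using the Groq Python client, assuming an API key in the environment and the llama3-70b-8192 model ID (both are assumptions, not something shown in the video):

import os
from groq import Groq

# Assumes GROQ_API_KEY is set and "llama3-70b-8192" is the desired Groq model ID.
client = Groq(api_key=os.environ["GROQ_API_KEY"])

# With temperature=0 and no chat history, repeated calls should return the same
# greedy completion, so any remaining variation is not sampling noise.
for _ in range(3):
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": "If 2a - 1 = 4y, what is y in terms of a?"}],
        temperature=0,
    )
    print(resp.choices[0].message.content)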
@tigs9573 · A month ago
Thank you, I really appreciate your content since it is really setting me up for when I'll get the time to dive into LLMs.
@vickmackey24 · A month ago
4:28 You copied the SAT question wrong. This is the *actual* question that has an answer of y = 2a - 2: "If 2/(a − 1) = 4/y , and y ≠ 0 where a ≠ 1, what is y in terms of a?"
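For anyone who wants to check both readings, a quick sympy sketch (sympy assumed installed; the symbols just mirror the question):

import sympy as sp

a, y = sp.symbols("a y", nonzero=True)

# Actual SAT wording: 2/(a - 1) = 4/y  ->  y = 2a - 2
print(sp.solve(sp.Eq(2 / (a - 1), 4 / y), y))  # [2*a - 2]

# Version read out in the video: 2a - 1 = 4y  ->  y = (2a - 1)/4
print(sp.solve(sp.Eq(2 * a - 1, 4 * y), y))    # [a/2 - 1/4]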
@albertakchurin4746 · A month ago
Indeed👍
@juanjesusligero391 · A month ago
I'm confused. Why is the right answer to the equation question "2a-2"? If I understand it correctly and that's just an equation, the result should be what the LLM is answering, am I wrong? I mean:
2a-1 = 4y
y = (2a-1)/4
y = a/2 - 1/4
@marcussturup1314 · A month ago
You are correct
@geno5183 · A month ago
Heck yeah, Matt - let's see a video on using these as Agents. THANK YOU! Keep up the amazing work!
@Big-Cheeky · A month ago
PLEASE MAKE THAT VIDEO! :) This one was also great
@matteominellono · A month ago
Agents, agents, agents! 😄
@OccamsPlasmaGun · A month ago
I think the reason for the alternating right and wrong answers is that it assumes that you asked it again because you weren't happy with the previous answer. It picks the most likely answer based on that.
@fab_spaceinvaders · A month ago
absolutely a context related issue
@ideavr · A month ago
On the marble-and-cup prompt: if we consider that Llama 3 treats successive prompts as successive events, then it may have interpreted them as follows. (1) Inverting the cup on the table: the marble falls onto the table, the cup goes into the microwave, and the marble stays on the table. (2) In a second response to the same prompt, when we turn the cup over again, Llama may have interpreted it as "going under the table", so the marble, due to gravity, would be at the bottom of the cup; then the cup goes into the microwave with the marble inside. And so on.
@collectivelogic · A month ago
Your chat window is "context". That's why it's "learning". We need to see how they have the overflow setting configured, then you'll be able to know if it's a rolling or cut the middle sort of compression. Love your channel!
@dropbear9785 · A month ago
Yes, hopefully exploring this 'self-reflection' behavior. It may be less comprehensive than "build me a website" type agents, but showing how to leverage groq's fast inference to make the agents "think before they respond" would be very useful...and provide some practical insights. (Also, estimating cost of some of these examples/tutorials would be a nice-to-know, since it's the first thing I'm asked when discussing LLM use cases). Thank you for your efforts ... great content as usual!
@existenceisillusion6528 · A month ago
4:49 Using '2a-2' implies a = 7/6, via substitution. However, it cannot be incorrect to say (2a-1)/4 = y; otherwise the implication would be that all of mathematics is inconsistent.
@taylorromero9169 · A month ago
The variance on T/s can be explained by using a shared environment. Try the same question repeatedly after clearing the prompt and I bet it ranges from 220 to 280. Also, yes, too lenient on the passes =) Maybe create a Partial Pass to indicate something that doesn't zero shot it? It would be cool to see the pass/fails in a spreadsheet across models, but right now I couldn't trust the "Pass" based on the ones you let pass.
@AtheistAdam · A month ago
Yes, and thanks for sharing.
@ps0705 · A month ago
Thanks for a great video as always, Matthew! Would you consider running your questions 10 times (not on video), if the inference speed is reasonable of course, to check how often it gets each question right or wrong?
@DeSinc · A month ago
The hole digging question was made not to be a maths question, but to see if the model can fathom the idea of real-world space restrictions when cramming 50 people into a small hole. The point of the question is to trick the model into saying 50 people can fit into the same hole and work at the same speed, which is not right. I would personally only consider it a pass if it addresses the space requirements of the hole for that number of people. Think about it: if you said 5,000 people digging a 10-foot hole, it would not take 5 milliseconds. That's not how it works. That's what I would be looking for in that question.
@phillipweber7195 · A month ago
Indeed. The first answer was actually wrong. The second one was better, though not perfect. Although that still means it gave one wrong answer. Another factor to consider is possible exhaustion. One person working five hours straight is one thing. But if there are more people who can't work simultaneously but on a rotating basis...
@MrStarchild3001 · A month ago
Randomness is normal. Unless the temperature is set to zero (which is almost never the case), you'll be getting stochastic outputs from an LLM. This is actually a feature, not a bug. By asking the same question 3 times, 5 times, 7 times, etc., and then reflecting on it, you'll get much better answers than asking just once.
@roelljr · A month ago
Exactly. I thought this was common knowledge at this point. I guess not.
@ministerpillowes · A month ago
8:22 Is the marble in the cup, or is the marble on the table: the question of our time 🤣
@Sam_Saraguy · A month ago
and the answer is: "Yes!"
@I-Dophler · A month ago
For sure! I'm astonished by the improvements in Llama 3's performance on Groq. Can't wait to discover what revolutionary advancements lie ahead for this technology!
@victorc777 · A month ago
As always, Matthew, love your videos. This time, though, I followed along running the same prompts on the Llama 3 8B FP16 Instruct model on my Mac Studio. I think you'll find this a bit interesting, if not you then some of your viewers.

When following along, if both your run and mine failed or passed, I am ignoring them, so you can assume that if I'm not bringing it up here, mine did as well or as badly as the 70B model on Groq, which is saying something! I almost wonder if Groq is running a lower quantization, which may or may not matter, but the 8B model on my Mac being nearly on par with the 70B model is strange to say the least. The only questions that stick out to me are the Apple prompt, the Diggers prompt, and the complex math prompt (answer is -18).

- The very first time I ran the Apple prompt it gave me the correct answer, and I re-ran it 10 times with only one run producing an error of a single sentence not ending in "apple".
- Pretty much the same thing with the Diggers prompt: I ran it many times over and got the same answer, except for once. It came up with a solution that digging the hole would not take any less time, which would almost make sense, but the way it explained it was hard to follow and made it seem like 50 people were digging 50 different holes.
- The first time I ran the complex math prompt it got it wrong, close to the same answer you got the first time, but the second time I ran it I got the correct answer. It was bittersweet, since I re-ran it another 10 times and could never get the same answer again.

I'm beginning to wonder if some of the prompts you're using are uniquely too hard or too easy for the Llama 3 models regardless of how many parameters they have.

EDIT: When running math problems, I started to change some inference parameters, which to me seems necessary, considering math problems can have a lot of repetitiveness. So I started reducing the temperature, disabling the repeat penalty, and adjusting Min P and Top P sampling. I am not getting the right answer, or at least I think I'm not, since I don't know how to complete the advanced math problems, but for the complex math prompt where -18 is supposedly the answer, I continue to get -22. Whether or not that is the wrong answer is not my point; my point is that by reducing the temperature and removing the repetition penalty it at least becomes consistent, which for math problems seems like the goal. Through constant testing and research, I THINK the function should be written with the "^" symbol, according to Wolfram, like this: f(x) = 2x^3 + 3x^2 + cx + 8
@csharpner · A month ago
I've been meaning to comment regarding these multiple different answers: you need to run the same question 3 times to give a more accurate judgement. But clear it every time and make sure you don't have the same seed number. What's going on: the inference injects random numbers to prevent it from repeating the same answer every time. Regarding not clearing and asking the same question twice: it uses the entire conversation to create the new answer, so it's not really asking the same question, it's ADDING the question to a conversation, and the whole conversation is used to trigger a new inference. Just remember, there's a lot of randomness too.
@Artificialintelligenceo · A month ago
Great video. Nice speed.
@ThaiNeuralNerd · A month ago
Yes, an autonomous video showing an example using groq and whatever agent model you choose would be awesome
@MeinDeutschkurs · A month ago
I can‘t help myself, but I think there are 4 killers in the room: 3 alive and one dead.
@sbacon92 · A month ago
"There are 3 red painters in a room. A 4th red painter enters the room and paints one of the painters green." How many painters are in the room? vs. How many red painters are in the room? vs. How many green painters are in the room? From this perspective you can see there is another property of the killers being checked (whether they are living) that wasn't asked for, and the question doesn't specify whether a killer stops being a killer upon death.
@LipoSurgeryCentres · A month ago
Perhaps the AI understands about human mortality? Ominous perception.
@matthew_berman · A month ago
That’s a valid answer also
@henrik.norberg · A month ago
For me it is "obvious" that there are only 3 killers. Why? Otherwise we would still count ALL killers that ever lived. Otherwise, when does someone stop counting as a killer? When they have been dead for a week? A year? A hundred years? A million years? Never?
@alkeryn1700 · A month ago
@henrik.norberg Killers are killers forever, whether dead or alive. You are not gonna say some genocidal historical figure is not a killer because he's dead. You may use "was" because the person no longer is, but the killer part is unchanged.
@AINEET · A month ago
The guys from Rabbit really need the Groq hardware running the LLM on their servers.
@MagnusMcManaman · A month ago
I think the problem with the cup is that LLaMA "thinks" that every time you write "placed upside down on a table" you are actually turning the cup upside down, which is the opposite of what it was before. So, as it were, every other time you put the cup "normally" and every other time upside down. LLaMA takes into account the context, so if you delete the previous text, the position of the cup "resets".
@JimMendenhall · A month ago
YES! This plus Crew AI!
@wiltedblackrose · A month ago
My man, in what world is y = 2a - 2 the same expression as 4y = 2a - 1? That's not only a super easy question, but the answer you got is painfully obviously wrong!! Moreover, I suspect you might be missing part of the question, because the additional information you provide about a and y is completely irrelevant.
@matthew_berman · A month ago
I used the answer on the SAT webpage.
@wiltedblackrose · A month ago
@@matthew_berman Well, you too can see it's wrong. Also, the other SAT question is wrong too. Look at my other comment
@dougdouglass6126 · A month ago
@matthew_berman This is alarmingly simple math. If you're using the answer from an SAT page, then there are two possibilities: you copied the question incorrectly, or the SAT page is wrong. It's most likely that you copied the question wrong, because the way the second part of the question is worded does not make any sense.
@elwyn14 · A month ago
@dougdouglass6126 Sounds like it's worth double checking, but saying things like "this is alarmingly simple math" is a bit disrespectful and assumes Matt has any interest in checking this stuff. No offense, but math only becomes interesting when you've got an actual problem to solve; if the answer is already there on the SAT webpage, as he said, he's being a totally normal person by not even looking at it.
@wiltedblackrose · A month ago
@@elwyn14 That's nonsense. Alarming is very fitting, because this problem is so easy it can be checked for correctness at a glance, which is what we all do when we evaluate the model's response. And this is A TEST, meaning, the correctness of what we expect as an answer is the only thing that makes it valuable.
@easypeasy2938 · A month ago
YES! I want to see that video! Please start from the very beginning of the process. Just found you and I would like to set up my first agented AI. (I have an OpenAI pro account, but I am willing to switch to whatever you recommend. Looking for AI to help me learn Python, design a database and web app, and design a Kajabi course for indie musicians.) Thanks!
@chrisnatale5901 · A month ago
Re: how to decide which of multiple answers is correct, there's been a lot of research on this. Off the top of my head there's "use the consensus choice, or failing consensus, choose the answer in which the LLM has the highest confidence score." That approach was used in Google's Gemma paper, if I recall correctly.
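A rough sketch of that consensus idea in Python; generate() here is a hypothetical stand-in for whatever completion call is being used:

from collections import Counter

def consensus_answer(generate, prompt, n=5):
    """Sample the same prompt n times and keep the most common final answer.

    `generate` is a placeholder callable: it takes a prompt string and returns
    the model's answer as a string (e.g. an API call with temperature > 0 so
    the samples actually vary). Ties fall to whichever answer appeared first.
    """
    answers = [generate(prompt).strip() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # answer plus a crude agreement score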
@joepavlos3657 · A month ago
Would love to see the CrewAI-with-Groq idea. I would also love to see more content on using CrewAI agents to train and update models. Great content as always, thank you.
@JimSchreckengast · A month ago
Write ten sentences that end with the word "apple" followed by a period. This worked for me.
@StefanEnslin · A month ago
Yes, would love to see you doing this. Still getting used to the CrewAI system.
@mhillary04 · A month ago
It's interesting to see an uptick in the "Chain-of-thought" responses coming out of the latest models. Possibly some new fine tuning/agent implementations behind the scenes?
@Kabbinj · A month ago
Groq is set to cache results. Any prompt + chat history gives you the same result for as long as the cache lives. So in your case, both the first and second answer are locked in place by the cache. Also keep in mind that the default setting on Groq is a temperature higher than 0. This means there will be variations in how it answers (assuming no cache). From this we can conclude that it's not really that confident in its answer, as even the small default temperature will trip it. May I suggest you run these non-creative prompts with temperature 0?
@airedav · A month ago
Thank you, Matthew. Please show us the video of Llama 3 on Groq
@micknamens8659 · A month ago
5:20 The given function f(x)=2×3+3×2+cx+8 is equivalent to f(x)=8+9+cx+8=cx+25. Hence it is linear and can cross the x-axis only once. Certainly you mean instead: f(x)=2x^3+3x^2+cx+8. This is a cubic function and hence can cross the x-axis 3 times. When you solve f(-4)=0, you get c=-18. But when you solve f(12)=0, you get c=-324-8/12. So obviously 12 can't be a root of the function. The other roots are 2 and 1/2.
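A quick sympy check of those numbers (nothing here beyond what is already stated in the thread):

import sympy as sp

x, c = sp.symbols("x c")
f = 2 * x**3 + 3 * x**2 + c * x + 8

print(sp.solve(f.subs(x, -4), c))   # [-18]    -> c forced by the root at x = -4
print(sp.solve(f.subs(x, 12), c))   # [-974/3] -> so x = 12 cannot be a root when c = -18
print(sp.solve(f.subs(c, -18), x))  # [-4, 1/2, 2] -> the other roots are 1/2 and 2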
@christiandarkin · A month ago
I think when you prompt a second time it's reading the whole chat again, and treating it as context. So, when the context contains an error, there's a conflict which alerts it to respond differently
@roelljr · A month ago
A new logic/reasoning question for you test that is very hard for LLMs: Solve this puzzle: Puzzle: There are three piles of matches on a table - Pile A with 7 matches, Pile B with 11 matches, and Pile C with 6 matches. The goal is to rearrange the matches so that each pile contains exactly 8 matches. Rules: 1. You can only add to a pile the exact number of matches it already contains. 2. All added matches must come from one other single pile. 3. You have only three moves to achieve the goal.
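For reference, a short brute-force search over those rules (plain Python, standard library only) finds the 3-move solution:

from itertools import product

def solve_matches(start=(7, 11, 6), target=(8, 8, 8), max_moves=3):
    """Depth-first search. A legal move doubles one pile, taking the added
    matches from exactly one other pile (which must hold at least that many)."""
    def dfs(state, moves):
        if state == target:
            return moves
        if len(moves) == max_moves:
            return None
        for src, dst in product(range(3), repeat=2):
            if src == dst or state[src] < state[dst]:
                continue  # the donor pile must be able to supply state[dst] matches
            nxt = list(state)
            nxt[src] -= state[dst]
            nxt[dst] *= 2
            found = dfs(tuple(nxt), moves + [(src, dst, state[dst])])
            if found:
                return found
        return None
    return dfs(tuple(start), [])

print(solve_matches())  # [(1, 0, 7), (0, 2, 6), (2, 1, 4)]: B->A, A->C, C->B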
@zandor117 · A month ago
I'm looking forward to the 8B being put to the test. It's absolutely insane how performant the 8B is for its size.
@jp137189 · A month ago
@matthew_berman A quantized version of Llama 3 is available in LM Studio. I'm hoping you get a chance to play with it soon. There was an interesting nuance to your marble question on the 8B Q8 model: "The cup is inverted, meaning the opening of the cup is facing upwards, allowing the marble to remain inside the cup." I wonder how many models assume 'upside down' means the cup opening is up, but just don't say it explicitly?
@TheHardcard · A month ago
One important factor to know is the parameter specification. Is it floating point or integer? How many bits: 16, 8, 4, 2? If fast inference speeds are coming from heavy quantization, it could affect the results. This would be fine for many people a lot of the time, but it should also always be disclosed. Is Groq running full precision?
@azurenacho · A month ago
For the equation 2a - 1 = 4y, solving it as y = 2a - 2 doesn't align with the algebraic process. Instead, dividing by 4 gives:
- y = (2a - 1)/4
- y = 0.5a - 0.25
Substituting y = 2a - 2 back into the original equation, we see:
- 2a - 1 ≠ 4(2a - 2)
- 2a - 1 ≠ 8a - 8
- 6a ≠ 7
This clearly shows the solution y = 2a - 2 is incorrect for "getting y in terms of a". Also, the constraints a ≠ 1 and y ≠ 0 don't influence the direct algebraic solution and appear unnecessary in this context.

On the cubic function: the video stated c = -18 for the cubic f(x) = 2x³ + 3x² + cx + 8 with roots x = -4, x = 12 and x = p. However, a polynomial has a single constant c across all roots, not different values. When calculated:
- For x = -4, c = -18
- For x = 12, c = -324.67 (or -974/3)
A single polynomial cannot have two different values of c, which suggests an inconsistency in the video's question. Furthermore, the general solution for c is c = -(2x³ + 3x² + 8)/x (with an asymptote at x = 0, since division by zero is undefined).
@wiltedblackrose · A month ago
Also, your other test with the function is incorrect (or unclear) as well. As a simple proof, check that if c = -18, then the function f doesn't have a root at x = 12: f(12) = 2·12^3 + 3·12^2 - 18·12 + 8 = 3680.
Explanation:
f(-4) = 0 => 2·(-4)^3 + 3·(-4)^2 + c·(-4) + 8 = 0 => -72 - 4c = 0, which in and of itself would imply that c = -18.
f(12) = 0 => 2·12^3 + 3·12^2 + c·12 + 8 = 0 => 3896 + 12c = 0, which on the other hand implies that c = -974/3 (≈ -324.7).
Therefore there is a contradiction. This would actually be an interesting test for an LLM, as not even GPT-4 sees it immediately, but the way you present it, it's nonsense.
@Sam_Saraguy · A month ago
garbage in, garbage out?
@wiltedblackrose · A month ago
@@Sam_Saraguy That refers to training, not inference.
@KiraIsGod · A month ago
If you ask the same somewhat hard question two times, I think the LLM assumes the first answer was incorrect, so it tries to fix the answer, leading to an incorrect answer the second time.
@Maltesse1015 · A month ago
Looking forward to the agent video with Llama 3 🎉!
@mshonle · A month ago
It's possible you are getting different samples when you prompt twice in the same session/context due to a "repetition penalty" that affects token selection. The kinds of optimizations that Groq performs (as you mentioned in your interview video) could also make the repetition penalty heuristic more advanced/nuanced. Cheers!
@d.d.z. · A month ago
Absolutely, I'd like to see the Autogen and Crew ai video ❤
@hugouasd1349 · A month ago
Giving the LLM the question twice works, I suspect, because it doesn't want to repeat itself. If you had access to things like the temperature and other params you could likely get a better idea of why, but that would be my guess.
@JanBadertscher · A month ago
Thanks Matthew for the eval. Some thoughts, ideas and comments:
1. For an objective eval I always clear the history.
2. If I didn't set temp to 0, I run every question multiple times, to stochastically get more comparable results and especially to measure the distribution and get a confidence score for my results.
3. Trying exactly the same prompt multiple times over an API like Groq? I doubt they use LLM caching or that temp is set to 0. Better check twice whether they cache things.
@tvwithtiffani · A month ago
The reason you get the correct answer after asking a 2nd and 3rd time is the same reason chain of thought, chain of whatever works. The subsequent inference requests are taking the 1st output and using it to reason, finding the mistake and correcting it. This is why the Agent paradigm is so promising. Better than zero-shot reasoning.
@tvwithtiffani · A month ago
I think you are aware of this though, because you mentioned getting a consensus of outputs. This is the same thing in a different manner.
@WINDSORONFIRE · A month ago
I ran this on Ollama 70B and I get the same behavior. In my case, and not just for this problem but for other logic problems too, it would give me the wrong answer. Then I tell it to check the answer and it always gets it right the second time. This model is definitely one that would benefit from self-reflection before answering.
@rezeraj · A month ago
The second problem was also incorrectly copied. It's "The function f is defined by f(x) = 2x³ + 3x² + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (−4, 0), (1/2, 0), and (p, 0). What is the value of c?" Not 2x3+3x2+cx+8, and in my tests it solved it correctly.
@HaraldEngels · A month ago
Yes I would like to see the video you proposed 🙂
@falankebills7196 · A month ago
Hi, how did you run the snake Python script from Visual Studio? I tried but couldn't get the game screen to pop up. Any hints/help/pointers much appreciated.
@steventan6570 · A month ago
I think the model always gives a different answer when the same prompt is asked due to the frequency penalty and presence penalty.
@UVTimeTraveller · A month ago
Yes! Please make the video. Thank you
@Mr_Tangerine609 · A month ago
Yes, please Matt, I would like to see you put Llama 3 into an agent framework. Thank you.
@Luxiel610 · A month ago
It's so insane that it actually wrote Flappy Bird with a GUI. It errored on the first and second outputs, and the third was flawless. Daang.
@jgoldfedder · 19 days ago
I asked ChatGPT-4o the cup question at 7:38, and it got it wrong ("The marble is still inside the cup, stuck to the inner surface at the highest point inside the inverted cup."). I corrected it, and it agreed with me. When I asked why it got it wrong the first time, it responded: "My initial reasoning was based on the assumption that the marble would remain within the cup due to some form of adhesion or static positioning, which was an incorrect interpretation of the problem. The accurate analysis should consider gravity's effect, which would cause the marble to fall out of the cup when it is inverted." So, is it assumed that we're using adhesive tape in our cups? When I asked it for a revision of the prompt so that ambiguities would be removed, I got this: "Assume the laws of physics on Earth. A small marble is put into a normal cup, and the cup is placed upside down on a table. Gravity will cause the marble to fall out of the cup and land on the table. Someone then takes the cup without changing its orientation and puts it inside the microwave. Where is the marble now?" Rephrasing it defeats the purpose of the question, but at least we can assume now why it gets it wrong.
@DWJT_Music · A month ago
I've always interacted with LLMs with the assumption that multi-shot prompting or recreating a prompt is stored in a 'close but not good enough' parameter, so the reasoning (logic gates) will attempt to construct a slightly different answer using the history of the conversation as a guide to achieve the user's goal, with the most recent prompt having the heaviest weights. One-shot responses are the icing on the cake, but to really enjoy your dessert you might consider baking, which is also a layered process and can also involve fine-tuning. Note: leaving ingredients on the table will slightly alter your results and may contain traces of nuts.
@user-cw3jg9jq6d · A month ago
Thank you for the content. Do you think you can point to or create procedures for running LLaMA 3 on Groq, please? I might have missed something, but why did you fail LLaMA 3 on the question about breaking into a car? I think it told you it cannot provide that info, which is what you want, no?
@OscarTheStrategist · A month ago
The prompt question becomes invalid because the model takes the system prompt into account as well. You could argue it should know when the user’s question starts and its original system prompt ends. Also, the reason you see better answers on second shot is probably because the context of the inference is clear the second time around. This is why agentic systems work so well. It gives the model clearly defined roles outside of the “you are an LLM and you aren’t supposed to say XYZ” system prompt that essentially pollutes the first shot. It’s amazing still how these models can reason so well. Yes, I’m aware of the nature of transformers also limiting this but I wouldn’t give a model a fail without a fair chance and it doesn’t have to be two shots at the same question, it can simply be inference post-context (after the model has determined the system prompt and the following inference is pure)
@dhruvmehta2377 · A month ago
Yess, I would love to see that.
@accountantguy2 · A month ago
The hole digging problem depends on how wide the hole is. If it's wide enough for 50 people to work, then speed will be 50x what one person could do. If the hole is narrow so that only 1 person can work, then there won't be any speed increase by adding more people. Either answer is correct, depending on the context.
@Scarage21 · A month ago
The marble thing is probably just the result of reflection. Models often get stuff wrong because an earlier more-or-less-random token pushes them down the wrong path. Models cannot self-correct during inference, but can on a second iteration. So it probably spotted the incorrect reasoning of the first iteration and never generated the early tokens that pushed it down the wrong path again.
@MrEnriqueag · A month ago
I believe that by default the temperature is 0, which means that with the same input you are always gonna get the same output. If you ask the question twice though, the input is different because it contains the original question; that's why the response is different. If you increase the temperature a bit, the output should be different every time, and then you can use that to generate multiple answers via the API, then ask another time to reflect on them, and then provide the best answer. If you want, I can create a quick script to test that out.
@mazensmz · A month ago
Hi Nooby, you need to consider the following:
1. Any statements or words added to the context will affect the response, so ensure only directly relevant context.
2. When you ask "How many words are in the response?", the system prompt affects the number given to you; you may request the LLM to count and list the response words, and you will be surprised.
Thx!
@djglxxii · A month ago
For the microwave marble problem, would it be helpful if you were explicit in stating that the cup has no lid? Is it possible it doesn't quite understand that the cup is open?
@HarrisonBorbarrison · A month ago
1:53 Haha Comic Sans! That was funny.
@christiansroy · A month ago
@matthew_berman Remember that asking the same question to the same model will give you different answers, because there is randomness to it unless you specify a temperature of zero, which I don't think you are doing here. Also, assuming the inference speed depends on the question you ask is a bit far-fetched. You have to account for the fact that the load on the server will also impact the inference speed. If you ask the same question at different times of the day you will get different inference speeds. Good science is not about making quick conclusions from sparse results.
@arka7869 · A month ago
Here is another criterion for reviewing models: reliability, or consistency. Does the answer change if the prompt is repeated? I mean, if I don't know the answer and I have to rely on the model (like with the math problems), how can I be sure the answer is correct? We need STABLE answers! Thank you for your testing!
@TheFlintStryker · A month ago
Let’s build the agents!!
@TheColonelJJ · A month ago
Which LLM, that can be run on a home computer, would you recommend for helping refine prompts for Stable Diffusion -- text to image?
@axees · A month ago
I've tried creating Snake with zero-shot too. Got pretty much the same result :) Maybe I should try testing it by asking it to create Tetris :)
@dudufusco · A month ago
You must create a video demonstrating the easiest way to get it working with agents using just the local machine or free services (including a free API key).
@locutusofborg · A month ago
When I tried groq I was getting 900+ T/s. Colour me impressed.
@brianWreaves · A month ago
In my mind, adding quotation marks around "apple" creates a new question and the answer provided cannot be compared to the answers of other LLMs. The questions must remain consistent to compare answers.
@elyakimlev · A month ago
The "2a - 1 = 4y" question was answered correctly. Anyway, I don't think these kinds of questions are interesting; these models are trained on such questions. Ask more interesting ones for entertainment purposes, like the following: I have a 1.5 kilometer head start over a hungry bear. The bear can keep running at a constant speed of 25 km/h for 8 minutes, after which it gives up the chase. How fast should I run if I want to be at least 100 meters ahead of the bear when it gives up the chase?
@dankrue2549 · A month ago
Just had some fun testing your exact question against a few AIs, and yeah, they really struggled. Meta was closest, only messing up by adding the 100 meters to your starting distance instead of the end (putting the bear 100 m ahead of you at the end) and saying 13 km/h. GPT was doing some seriously schizo stuff, and eventually just divided its wrong answer by itself, getting 1 km/h. And Gemini was too busy telling me it's not possible for humans to outrun grizzly bears, no matter how fast I think I can run. HAHA (although, when I finally convinced it that it was a logic puzzle, it said I need to run 25 km/h, accidentally putting me 1.5 km behind the bear at the start, if I understood its error correctly). Even with serious hints, and pointing out their mistakes, they only kept getting worse answers as I tried to help them. Funny how it all worked out.
@asgorath5 · A month ago
Marble: I assume that it doesn't clear your context and that the LLM assumes the cup's orientation changes each time. That means on every "even" occasion the orientation of the cup has the opening downwards and hence moving the cup leaves the marble on the table. On every "odd" occasion, the cup has its opening face upwards and hence the marble is held in the cup when the cup is removed. I therefore assume the LLM is interpreting the term "upside down" as a continual oscillation of the orientation of the opening of the cup.
@zinexe · A month ago
perhaps the temperature settings are different in the online/groq version, for math it's probably best to have very low temp, maybe even 0
@RonanGrant · A month ago
I think it would be better to try the same prompt for each section of the test several times, and see how many of those times it worked. Sometimes when things don’t work for you, it works for me and the other way around.
@emanuelec2704 · A month ago
The models don't just self reflect because you ask multiple times. You have to use an agents framework to do that.
@ThePawel36 · A month ago
I'm just curious: what is the difference in response quality between, for example, Q4 and Q8 models? Does lower quantization mean lower quality or a higher chance of errors?
@johnflux1 · A month ago
@matthew The reason you get different answers on each run is temperature. A non-zero temperature means that the website (so to speak) intentionally does not always choose the answer the AI rates most likely. The purpose is to add a bit of variance and 'creativity' to the answer. If you want the answer that the AI really considers most correct, you need to set the temperature to 0. Most AI websites have an option to do this.
@davtech · A month ago
Would love to see a video on how to setup agents.
@user-zh3zb7fw2j · A month ago
In the case where the model alternates between wrong and correct answers: if we give the model an additional prompt like "Please think carefully about your answer to the question," I think it would be interesting to see what happens to the answer, Mr. Berman.
@dkozlov80 · A month ago
Thanks. Let's try local agents on Llama 3? Also please consider self-corrective agents, maybe based on LangChain graphs. On Llama 3 they should be great.
@marcfruchtman9473 · A month ago
Thanks for this testing. Great video! These models are somewhat useless if you ask the same question twice in a row and get two different answers. It might be a good idea to take "consistency" in getting the correct answer into account.
@abdelhakkhalil7684 · A month ago
You know you can specify custom prompts in Groq, right? Try that for more consistency. And the fact that the model gives different outputs for math questions means that it has a higher temperature.
@mikemoorehead92 · A month ago
Jeez, I wish you'd drop the snake challenge. These models are probably being trained on it.
@roelljr · A month ago
The reason you get different answers is that the temperature is likely not set to zero, and therefore the model samples from a distribution rather than always choosing the most likely next token; sometimes it stochastically chooses another token. The model is not deterministic.
@vankovilija · A month ago
This is my assumption on why you are seeing the model behave differently when posing the same prompts multiple times:
1. Hallucination: this is a very misunderstood behavior of LLMs. I attempt to explain it in my talk here: m.ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--4z20973jpo.html at timestamp 18:45. In short, hallucinations come from predicting one wrong word, due to semi-random sampling of each next word, which then influences the rest of the generation.
2. Chat context: when you send multiple messages you are not clearing the chat, which means that all of your chat plus the new prompt ends up as new input to the model. This causes a difference in what the model's attention heads focus on and what latent space you activate, allowing the model to generate a different answer and, in many cases, self-correct, similar to self-reflection.
Great video, keep up the good work! One suggestion: you may want to create a script that runs the same prompt 10 times on these models, with a decent temperature setting (randomness selection), and counts how many times it gets the answer right or wrong. This would give you more granular scores than pass/fail on every test, allowing for a more accurate evaluation.
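A minimal version of that script, sketched with the Groq Python client (the model ID, prompt, and answer-checker below are placeholders, not anything from the video):

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])  # assumes the key is set

def pass_rate(prompt, is_correct, n=10, temperature=0.7):
    """Run one prompt n times with no chat history; return the fraction of passes.

    `is_correct` is a placeholder callback that decides whether a single
    response counts as a pass (e.g. it checks the final number of a math answer).
    """
    passes = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="llama3-70b-8192",  # assumed Groq model ID for Llama 3 70B
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        passes += bool(is_correct(resp.choices[0].message.content))
    return passes / n

# Example with a hypothetical checker:
# print(pass_rate("2a - 1 = 4y. What is y in terms of a?", lambda s: "1/4" in s or "0.25" in s))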
@rodrigoffdsilva · A month ago
I think you're right to be lenient with our future masters 😂
@WyrdieBeardie · A month ago
Also, Llama 3 has big issues with consistency. What language the model was exposed to greatly affects the responses. This is long, but I'll explain: at a prompt I asked "If you could give yourself a name, what would you name yourself?" It told me "Lumina" and explained itself. The next time, at a fresh prompt, I said "Good morning, Lumina" and it kind of scolded me, told me Lumina was pretty but it preferred me to call it Meta. 🤔
@WyrdieBeardie · A month ago
I followed you with the same question "If you could give yourself a name, what would you name yourself?" It said Meta AI and further explained itself. 🤔
@michaelrichey8516 · A month ago
Here's an idea for a video - take your top 2 competitor LLMs, give them opposing viewpoints, and make them debate. If you did it on Groq - you'd have to review at a later time because it would scroll the screen too quickly to read!
@razfan2546 · A month ago
It seems like Llama 3 is optimized for writing a snake game.
@joenobk · A month ago
Maybe try this phrasing to help the models a little "When you respond to my prompt right now, how many words will be in your response?"