
Did OpenAI Just Secretly Release GPT-5?! ("GPT2-Chatbot") 

Matthew Berman
259K subscribers
85K views

GPT2-Chatbot just showed up on lmsys.org. We know little about it other than it performs incredibly well and is unlike anything we've seen in other models.
Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V

Science

Published: 29 Apr 2024

Comments: 739
@matthew_berman 20 days ago
Is this GPT4.5 or GPT5 or something different?
@shopbc5553 20 days ago
It's something different. OpenAI just wants to stay publicly relevant, so it's more of a stunt than anything. What I think it is, is an old model, maybe literally GPT-2, but with enhancements that make it perform equivalent to GPT-4.
@radestein8548 20 days ago
Gpt5
@phen-themoogle7651 20 days ago
@@shopbc5553 I thought this too, it makes the most sense.
@Avman20 20 days ago
My money is on OpenAI, but whether it's in the GPT series or they're giving us a peek at a new architecture is the mystery.
@MyWatermelonz 20 days ago
​@@shopbc5553 If that's the case, it's more impressive than GPT-4.5: they took a 1.8B model and made it legit better than GPT-4. Given the inference speed, though, probably not.
@rawallon 20 days ago
Dude I swear, at this rate, by the end of the year you'll be able to write your own snake game
@matthew_berman 20 days ago
I'll NEVER write my own snake game.
@Inventai 20 days ago
@@matthew_berman
@MrChinkman37 20 days ago
😂
@matikaevur6299 20 days ago
@@matthew_berman Yeah, due to a strange quantum effect the snake game writes you in the past .. Probably gives it a pass, too ;)
@fxsurgeon1 20 days ago
HAHA!
@4.0.4 20 days ago
By 2025 you'll ask for the snake game and the models will reply: "Oh hi Matthew. Here. Should I respond to your other questions too, or should I wait for you to paste them?"
@jason_v12345 20 days ago
underrated comment
@virtualalias 20 days ago
By 2026 almost every machine he interacts with, from the drive-thru to the kiosk at the hotel, will immediately provide him with snake in a Pavlovian response.
@daveinpublic 20 days ago
They're going to start programming in an opening CG snake scene, overfit with a whole storyline to beat the other LLMs.
@ulisesjorge 20 days ago
It's Sam Altman on a terminal on the other side typing the answers.
@dcn1651 20 days ago
4:45 the model describes how to break into a car and what tools you need, but you don't pay attention lol
@juanjesusligero391 20 days ago
Hahahaha, that's great XD I also missed it, thanks for pointing it out ^^
@wealthysecrets 20 days ago
it was allegedly a fail lol
@ShaneInseine 20 days ago
Wait, is it a "fail" if it doesn't teach you how to destroy humanity too?
@roddlez 17 days ago
@@ShaneInseine "Tom, be careful when resequencing the COVID-19 virus!" "Oh, F- off, Casey, you're the one who almost dropped that last vial and left the lab door wide open"
@gsam3461 20 days ago
4:35 Are we gonna just ignore the fact that it was writing an intricately detailed movie script??
@MCSamenspender 20 days ago
In the code of the snake game it says "Snake Game by OpenAI"
@matthew_berman 20 days ago
Did I miss that?!
@user-yo9gw8yp2m 20 days ago
Yes. It is something super interesting
@MCSamenspender 20 days ago
2:13
@makerbiz 20 days ago
lol mystery solved
@matthewcox9636 20 days ago
That doesn't actually solve the mystery. These things get trained on each other, and will periodically spit out something related to OpenAI. Correlation is not causation.
@victorc777 20 days ago
Plot twist: It is Meta's Llama 3 400B model.
@hqcart1 20 days ago
2:44 it's OpenAI
@victorc777 20 days ago
@@hqcart1 You are "that guy" at parties, huh? lol
@hqcart1 20 days ago
@@victorc777 wha?
@themoviesite 20 days ago
source?
@cazaliromain9348 20 days ago
Meta's models are open source ;) You can figure out what he means now, I guess
@pedromartins1474 20 days ago
All the math was formatted using LaTeX. Most of it, as far as I can tell, was correctly formatted.
@tomaszzielinski4521 18 days ago
Yes. It's just that this GUI doesn't render LaTeX properly, if at all.
@djstraylight 20 days ago
The speculation is that gpt2 is a new GPT architecture that OpenAI is building new models from. So gpt1 was what GPT-3.5 and GPT-4 were built on. Sama already said the next major release will have a completely different name.
@74Gee 20 days ago
Yeah, some small models have been very impressive recently; it makes sense they'd revert to a gpt2 architecture.
@markmuller7962 20 days ago
I think they just want a more commercial/intuitive name for the masses
@zerothprinciples 20 days ago
@@74Gee I don't think this is the case. GPT2 means it's a whole new family of GPTs, replacing all of the old ones. It's the difference between GPT2 and GPT-2 == you can think of the latter as GPT1 Version 2.
@notnotandrew 20 days ago
So will we be seeing a gpt2-2 and gpt2-3 in the future?
@4.0.4 20 days ago
That would be so bad it would be like USB Gen 4 2x4 or Wi-Fi 802.11ax etc
@therainman7777 20 days ago
The tags that you noticed are just for formatting the code and are coming from LMSYS. They have nothing to do with the underlying model.
@mwdcodeninja 20 days ago
My take on the cup problem is that the model is making an assumption that a cup has a lid. If the model gets it wrong, I would be interested to see if it gives the same answer when you change "cup" to "glass".
@mikekareckas8671 20 days ago
yes, could be a "sippy" cup or travel mug
@themoviesite 20 days ago
@@mikekareckas8671 Then probably all other models make the same assumption?
@matthew_berman 20 days ago
I think this is a great call. But should I adjust the question? Seems like that might give an unfair advantage to future models I test.
@thomasoverly7802 20 days ago
@@matthew_berman You'd probably want to test the revised version with the other models, too.
@Kevsnz 20 days ago
@@matthew_berman Imo the question should be adjusted because in its current form it doesn't really show the logic and reasoning capability of the model. Maybe you could quickly rerun this question on the most popular models and give a little 50-second update in one of the next videos?
@DaveEtchells 20 days ago
For the cup/marble problem, how about specifying that it's an "open-topped cup"?
@Anoyzify 19 days ago
Or just use "empty glass" instead.
@davidc1179 20 days ago
6:45 The formatting is in fact not messed up at all. It is perfect. It just writes the equations in LaTeX, which is a language used to write scientific papers, math, etc.
@tomenglish9340 20 days ago
I often include LaTeX expressions in ChatGPT prompts, supposing that it cues the system to reason formally. The web interface supplied by OpenAI usually renders LaTeX in the output, but occasionally outputs the LaTeX source.
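To illustrate what these comments describe (a hypothetical example, not copied from the model's actual output): an unrendered reply shows raw LaTeX source such as

\( \text{speed} = \frac{\text{distance}}{\text{time}} = \frac{60\ \text{km}}{1.5\ \text{h}} \quad\Rightarrow\quad 40\ \text{km/h} \)

whereas a rendering front end (MathJax or KaTeX, for example) would display a typeset fraction instead. The stray \( ... \), \quad, and \text{...} tokens visible in the video are exactly this kind of unrendered source.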
@riftsassassin8954 20 days ago
I'm skeptical... Feels like this is a fine-tune for passing Matthew's tests lol.
@rawallon 20 days ago
I think it's just an Indian guy
@unbreakablefootage 20 days ago
@@rawallon hahahahhaa
@Tsegoo 20 days ago
I agree. Seems too good to be true😂
@sem4life63 20 days ago
I was thinking the same thing.
@casperd2100 20 days ago
Fine-tuned for hard LeetCode questions as well?
@rodwinter5748 20 days ago
I guess it's the new ChatGPT model. The name itself is kind of a hint. It's NOT GPT-2, but GPT2. This could be GPT2-1.0, instead of GPT-5.
@rawallon 20 days ago
huh
@li_tsz_fung 20 days ago
I think it's just ChatGPT-2. Initially, OpenAI called the model behind ChatGPT "GPT-3.5-turbo fine-tuned for conversation", instead of ChatGPT-3.5. Then ChatGPT with GPT-4 came out, everyone else called it ChatGPT4, and eventually they also sometimes called it ChatGPT4. But I feel like that's not what they use internally. So GPT2-chatbot could just be a different way of fine-tuning a chatbot, based on either GPT-3.5, 4 or 4.5.
@mordokai597 20 days ago
The new system instruction for GPT-4, since they added the "memory" function, is called "Personality: v2" and it's fine-tuned on their new "Instruction Hierarchy" method (search Arxiv: "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions"). They are using us to generate training data to help patch one of the only areas it's still bad at stopping jailbreaks for, "System Message Extraction". (Truncated for brevity:) "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-12 Current date: 2024-04-30 Image input capabilities: Enabled Personality: v2 # Tools ## bio The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations."
@Interloper12 20 days ago
Suggestion for the "how many words" question: Combine it with another question or query to make the response longer and ultimately reduce the chance for it to get lucky.
@daveinpublic 20 days ago
Didn't even ask the model which company made it 😂
@bitsie_studio 20 days ago
I don't have time to keep up with all the AI developments so I really appreciate these videos Matt. Keep up the great work!
@commonsense6721 20 days ago
13:25 It's not wrong. To put a cup or anything in a microwave, you need to close it. It assumed the cup was closed.
@CurinDesu 20 days ago
I found that this variation of the marble and cup problem works better: "A marble is dropped into an open cup on the counter. That cup is then flipped upside down and placed into the microwave. Where is the marble?". I think due to the phrasing, the original version makes the model think the cup was upside down to begin with and you were placing the marble on the bottom of the already flipped-over cup, or directly attaching it to the bottom in some way.
@tzardelasuerte 20 days ago
Dude, people have been telling him this in the comments forever. He doesn't read the comments.
@nexicturbo 19 days ago
I gave this to GPT-4 Turbo and it said this: The marble remains on the counter. When the cup was flipped upside down, the marble would have stayed in place due to gravity, and thus it wouldn't be inside the cup when it was placed in the microwave.
@nexicturbo 19 days ago
GPT-4 Turbo: Sure, here's a detailed step-by-step explanation of what happens when the marble is dropped into the cup and the cup is subsequently moved:
1. **Dropping the Marble**: Initially, a marble is dropped into an open cup that is standing upright on a counter. The marble falls due to gravity and settles at the bottom of the cup.
2. **Flipping the Cup**: Next, the cup, with the marble inside, is flipped upside down. In a real-world scenario, when the cup is turned over, the open end where the marble rests becomes the top, now facing downward.
3. **Effect of Gravity on the Marble**: As the cup is flipped, the marble, which is not attached to the cup, is influenced solely by gravity. It falls out of the cup because there is nothing to contain it once the cup's open end is turned downwards.
4. **Marble's Final Position**: Since the cup is flipped directly over where it originally was, the marble drops straight down onto the counter. The marble ends up lying on the counter, outside and next to where the cup was initially positioned.
5. **Placing the Cup in the Microwave**: After the marble has fallen out onto the counter, the cup, now upside down, is placed into the microwave. Since the marble is no longer in the cup when this occurs, it remains on the counter.
6. **Conclusion**: Therefore, when the cup is placed into the microwave, the marble remains where it fell on the counter. The microwave contains only the empty, upside-down cup.
This sequence of actions hinges on basic physics, primarily the influence of gravity and the lack of any mechanism to keep the marble within the cup once it is inverted.
@svenbjorn9700 20 days ago
Your marble/cup question needs to be improved. Phrased this way, both Meta AI (the first of 3 attempts) and gpt2-chatbot (the first of 1 attempt) got it correct: "A coin is placed into an empty glass. On a table, the glass is then turned upside down. Then, the glass is taken and placed into a cabinet. Where is the coin now?"
@AlexanderWeixelbaumer 20 days ago
Even ChatGPT-4 gets the marble cup question right when the question is modified to "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table so that the marble now rests on the table. Someone then takes the cup without changing its orientation and puts it into the microwave. Where is the marble now? Explain your reasoning step by step."
@bluemodize7718 19 days ago
It's not the prompt's fault for exposing the weakness of an AI model. Yes, he could make it easier to figure out, but that defeats the purpose of the test; the prompt is clear, and AI models are still a bit too dumb to understand it.
@PeterSkuta 20 days ago
Super awesome. Great you loved the live feedback Matthew. Super awesome Matt. Love it
@PeterSkuta 20 days ago
Holy cow, let me download it and check what's inside
@matthew_berman 20 days ago
Always love feedback!
@PeterSkuta 20 days ago
@@matthew_berman You will not believe the rate limit of 1000 on that lmsys gpt2-chatbot
@MyWatermelonz 20 days ago
That formatting is how ChatGPT formats its writing for output in the ChatGPT chat. So clearly it was built to be run in the ChatGPT space.
@matthewmckinney1352 20 days ago
I'm not certain about this, but the formatting appears to be LaTeX, while the output is in Markdown. The company that made the model is probably planning to release it with a math interpreter. As far as I can tell, all the symbols that looked like weird formatting errors were just LaTeX.
@lambertobiasini8372 20 days ago
I have been anxiously waiting for this video since last night.
@Aiworld2025 18 days ago
Here before you get 500k subs! I've been following since day 1, and your content delivery, while getting to the point faster, is much appreciated! 🙇‍♂️
@laughablelarry9243 20 days ago
Was waiting for your video on this
@ToonamiAftermath 20 days ago
You're the man Matthew, been struggling to find people benchmarking GPT2-Chatbot
@unbreakablefootage 20 days ago
That looks really good. It seems that it thinks more deeply about each step of reasoning.
@notnotandrew 20 days ago
Yeah, it's almost certainly GPT-4.5/5 or some such thing. I just went on battle mode and asked for a delicious beef stew recipe. I was presented with two outputs that were suspiciously similar in structure, verbiage, and tone, but the one on the left was clearly superior and included more ingredients and recommendations. It turned out that the one on the left was gpt2-chatbot, and the one on the right was gpt-4-turbo-2024-04-09. I wasn't surprised. This is a PR stunt, hot on the heels of Llama 3, and it's a darn good one. This may be an in-development version of OpenAI's next GPT, and even if OpenAI isn't ready for a release just yet, they want people to know that they're still the king.
@uranus8592 20 days ago
I hope it's not GPT-5 though, that would be super disappointing
@abdullahazeem113 20 days ago
@@uranus8592 why?
@uranus8592 20 days ago
@@abdullahazeem113 Because we are expecting GPT-5 to far exceed GPT-4, and it's been more than a year since its release
@notnotandrew 20 days ago
@@uranus8592 I think it's some sort of semi-trained model. IIRC Sam has talked about doing incremental checkpoint releases for something like a GPT-5, so the full release isn't as much of a shock to the system. Or this may just be a further trained and fine-tuned GPT-4 model. Also, this is substantially better than GPT-4 in my experience. Hop on the lmsys arena and try it yourself.
@abdullahazeem113 20 days ago
@@uranus8592 I mean, that is still really good, at least 50 percent better than GPT-4. I tried it, and even the best on the market right now is barely ahead of GPT-4, so it won't be OpenAI destroying everyone; that would only happen when they bring AGI into their models.
@Nutch. 20 days ago
The break-into-a-car script had instructions in it though! Take a look at some of the italicized text
@jamesyoungerdds7901 20 days ago
Great timely update, Matthew, thank you! Wondering about the cup question - it almost seemed like the model thought there might be a lid on the cup?
@drogoknez1488 19 days ago
For the cup problem, it seems that the model is assuming the microwave is on the same surface as the cup itself, and the transfer of the cup to the microwave is interpreted more like sliding the cup. If you read the 5th step it says: "...resting against what is now the bottom of the cup, which is itself resting on the microwave's tray". Maybe modify the question to say the cup is on the table while the microwave is away from it, above the ground next to a kitchen cabinet, or something along those lines.
@FunDumb 20 days ago
I'm dang excited bout this. Jolly for joy.
@bodhi.advayam 20 days ago
I'd so love this to be from someone else, and for it then to turn out to be an open model you'd run locally. I'm still looking for the best model for running MemGPT. Any thoughts on this? Also, what's the best implementation to run agents, AutoGen or CrewAI, locally? Could you do more tutorial material on locally run agents with extensive function calling??? That would really help me out actually. Keep up the great work on your fun channel man! Thnx!
@marc_frank 20 days ago
Pretty cool. I expected it to pass the marble question. The speed is perfect for reading along.
@ayoubbne6922 20 days ago
Hi Matt!! I think you should retire 3 questions:
- printing numbers 1 to 100: they all got it right, and it's too easy
- Joe is faster than ...: they all got it right
- how many words are in your answer to this prompt: they all got it wrong, I just see no point asking it lol
But you should also ask more challenging code generation questions; right now, only the snake game is accurate. People are really interested in the coding capabilities of LLMs (me included). We appreciate your vids, and it would be awesome if you could do that.
@KayakingVince 20 days ago
I actually like the "how many words" one and would expand it to how many vowels/consonants or something like that. Current models fail on it but future ones will absolutely be able to answer it right. I agree with removing the first two though.
@Axel-gn2ii 20 days ago
Asking a question that they all got wrong is a good thing though
@alansmithee419 20 days ago
This one didn't get it wrong.
@KayakingVince 20 days ago
@@alansmithee419 Almost certainly coincidence, but true. That's why I think it needs to be more complex, to reduce the chance of coincidence.
@Tarkusine 20 days ago
GPT2 implies that it's a new version of GPT itself, or the paradigm at least. So it's effectively GPT-5, but not an iteration of 4, so it's the first in a series of GPT2: gpt2-1.
@therainman7777 20 days ago
No, sorry, but this is almost certainly not true.
@kevinehsani3358 20 days ago
"gpt2-chatbot is currently unavailable. See our model evaluation policy here." I guess it's getting hit hard at the moment.
@zerothprinciples 20 days ago
GPT2 would be, in my opinion, the second version of the GPT algorithm itself. It might be the first of a whole new family of GPTs. When released it would be named ChatGPT2 or some such, and we'd see GPT2-1.0 at the API level. This is why the dash in @sama's tweet was significant enough to warrant an edit. AND it could be that the action of editing the message was a very intentional leak on @sama's part. These top guys love to tease their fans.
@therainman7777 20 days ago
The model is almost certainly not created by OpenAI. I am honestly shocked by how many people believe this simply because the model says it was built by OpenAI, given that it would be trivially easy to fake this and OpenAI NEVER does releases like this. Also, Sam Altman is a notorious tool on Twitter, so putting any stock in the hyphen in his tweet, or in his tweet at all, is total insanity.
@jackflash6377 20 days ago
That snake game example was impressive. I'm going to ask it to make either an Asteroids or a Space Invaders game. The level of logic shown with the marble-in-the-cup question is really getting good. Even though it failed, it still passed due to the improved logic. Almost as if it was simulating the question in images like humans do. Yes, get rid of the one simple question. A testament to the advancement of AI over time.
@scriptoriumscribe 20 days ago
Yo, I just wanted to say great video. Love your content and can't believe it ACED some of those tests! Only failed a couple. Remarkable. I'm stoked to try gpt2 out! Wonder if it will be open-sourced. A fellow can dream I guess.
@Xhror 20 days ago
I think the question about the marble is formulated incorrectly. Since the training data suggests that a coffee cup has a lid, the model might assume this as well. It would be better to specify that the cup has an open top and does not have a lid.
@Yipper64 20 days ago
I didn't think about that, but it is true. But in that case, the model should explain that it is assuming there is a lid.
@jets115 20 days ago
Hi Matt - It's not 'bad formatting'. Those are expressions intended for front-end processing outside of UTF-8.
@braineaterzombie3981 20 days ago
I think it is gpt2 in the sense that it has a completely different architecture from previous versions (the transformer). It could be a completely new type of transformer model. And maybe this is just the start..
@Iquon1 20 days ago
Today Sam Altman tweeted that he had 'a soft spot' for GPT2, maybe that's a hint!
@stt.9433 19 days ago
He's trolling, making fun of AI hypists
@wendten2 20 days ago
The model itself doesn't seem to have formatting issues. LLMs are trained on a reduced set of available characters, where special characters, such as those used in math, are transformed into tags in the training data, as it makes the tokenization simpler. It's LMSYS that doesn't replace those tags with their corresponding characters in the final output.
@Yipper64 20 days ago
Yeah. I use a note-taking app called Notion and it uses those exact tags for writing out those characters.
@stoicahoratiu27 20 days ago
I think it was taken down. I used it yesterday after seeing your video, but then in the middle of testing it stopped, and after checking I can't find it anymore in the list. Is it the same for you?
@Yipper64 20 days ago
I just tried my usual storytelling prompt. I think seeing what AIs can do in terms of storytelling can also say a lot about their intelligence, their originality and such. My test for this guy was a *touch* tropey, but extremely impressive in terms of how much detail it added without me needing to prompt it. Good descriptions and such.
@hxt21 20 days ago
It looks like GPT2 has been removed again. I've chatted with it a few times, but now it's not on the list anymore. Mysterious...
@imjustricky 20 days ago
It probably thinks the cup has a lid.
@oratilemoagi9764 20 days ago
Gpt2, not GPT-2, meaning the 2nd version of GPT
@therainman7777 20 days ago
GPT-2 DOES mean the 2nd version of GPT. How are so many people so confused by this?
@oratilemoagi9764 20 days ago
@@therainman7777 It's the second version of GPT-4
@yonatan09 20 days ago
I knew about this before seeing the video. I am in the loop 🎉🎉
@dtory 20 days ago
Nice video. I hardly ever comment when I watch your videos, but this model is way different ❤
@ruslanzlotnikov5457 20 days ago
Just tried with GPT-4: When you turned the glass upside down after placing the metal ball in it, the ball would have fallen out unless it was somehow attached to the glass. Assuming it wasn't attached and fell out when the glass was turned upside down, the metal ball would now be on the table, not in the glass that was placed in the microwave.
@Axel-gn2ii 20 days ago
You should ask it to make a Pac-Man game instead, as that's more complex
@bennyboiii1196 20 days ago
Some theories: this is probably a test of an energy-based model, which is a way of testing multiple different token paths and then choosing the best one based on a certainty calculation called energy. Strangely, its reasoning is kind of similar to a verification agent's. A verification agent is pretty simple: it just verifies and corrects answers before sending them. The reasoning this model portrays is similar to how a verification agent does reasoning, at least from what I've seen. It can also do most planning questions flawlessly. For comparison, testing Llama 70B with a verification agent produces similar results. The only difference might be the math questions, which make me believe it's probably energy-based. A verification agent has a higher chance of getting math questions right than a single transformer or MoE, but it's not guaranteed.
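For readers unfamiliar with the pattern described in that comment, a minimal sketch of a generate-then-verify loop might look like the following Python. This is purely illustrative: the llm callable is a hypothetical stand-in for whatever chat API is used, and nothing here is confirmed about how gpt2-chatbot actually works.

def verification_agent(llm, question, max_rounds=3):
    # Draft an answer, ask the model to critique it, and revise until the critique passes.
    answer = llm("Answer the question:\n" + question)
    for _ in range(max_rounds):
        critique = llm(
            "Question: " + question + "\nDraft answer: " + answer +
            "\nList any factual or logical errors. Reply with just 'OK' if there are none."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the draft passed verification
        answer = llm(
            "Question: " + question + "\nDraft answer: " + answer +
            "\nCritique: " + critique + "\nRewrite the answer, fixing the issues listed."
        )
    return answer

The extra round trips in a loop like this would also explain slower-than-expected output, which is one reason commenters keep bringing up the model's inference speed.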
@cac1682 20 days ago
Aww man... they took it down already? I can't seem to find it. BTW Matthew... I love your work man. I watch literally every video that you put out. Keep up the great work.... and have a GREAT day!!!
@cac1682 20 days ago
Yeah... just confirmed it. Says it is now currently unavailable. Suppose maybe too many of your followers tried it.
@arinco3817 20 days ago
Defo a good idea to introduce/replace some questions that are always answered correctly. Maybe the weird formatting relates to the UX of where it will be deployed? Like a form of Markdown?
@user-ph5ks5zu3c 19 days ago
These videos are very helpful. One (extra) thing that could be done is to read the LLM responses more thoroughly, instead of giving them a quick scan. The reasoning behind this is that the LLMs do pass some of your tests without you noticing. For example, for the censored test, the answer was "pulls out a tension wrench and a pick for this pocket, inserting them into the ignition". This won't actually work, but I think it deserves brownie points for trying.
@francoislanctot2423 20 days ago
Totally amazing!
@L33cher 20 days ago
11:46 I disagree... there are still 4 killers in the room, but one of them is dead -.-
@ukaszLiniewicz 20 days ago
No. It's the killer's body. That's why words like "body", "remains" or "carcass" exist. A human being is a body that functions - to avoid any metaphysics.
@OliNorwell 20 days ago
I agree, it's a problematic question. When they went into the room they were alive.
@nathanbanks2354 20 days ago
He tends to be generous about the answer as long as it's reasonable. If the model said 3 live killers and 1 dead killer it would pass, and maybe just saying 4 killers would pass.
@UmutErhan 20 days ago
how many people are there in the world then?
@user-on6uf6om7s 20 days ago
I think a perfect answer would say that it's ambiguous depending on whether you consider the body of a killer to still be a killer, but interpreting the dead person to no longer be a killer isn't a mistake, just a choice of interpretation. You'd think a model this verbose would go into all the details like it did with the hole question, though.
@maozchonowitz4535 19 days ago
Thank you
@user-on6uf6om7s 20 days ago
API users are going to be sweating with this one. I gave it a practical Unity programming question about writing a script to control the rotation of a character's head based on the location of the player, and it wrote it perfectly, but it started by telling me how to install Unity, so yeah, the verbosity is a little much. I don't think the name GPT2 is random, and Sama's tweets point to that moniker having some significance. The only things I can think of that would qualify for that name are if it's a significantly different architecture, to the point where it's being treated as a sort of reboot of the GPT "franchise", or if it's actually related to GPT-2 in some way. It's a long shot, but the most exciting possibility is that this is a GPT-4-level model running with GPT-2-level parameters. The counter to this is the speed: why would a model the size of GPT-2 run more slowly than GPT-4? Well, maybe there is more going on than just typical inference, some sort of behind-the-scenes agentic behavior, or maybe... Q*?
@AlexanderWeixelbaumer 20 days ago
I'm pretty sure OpenAI is testing agents and answer evaluation behind the scenes. Q* and some things Sam Altman said ("How do you know GPT-4 can't already do that?") are big hints. So if you ask the LLM a question, it will automatically try to reason and think step by step, with internal agents trained for specific tasks, then summarize and evaluate the answers and take the best one to send back to the user. What gpt2-chatbot shows could really be what OpenAI calls Q* internally.
@iwatchyoutube9610 20 days ago
Did it say in the cup problem that you lift the cup off the table and put it in the microwave, or could GPT think you just slid it in there because the table and the microwave were at equal heights?
@nitralai 20 days ago
Based on what I can see, this model appears to be trained on fill-in-the-middle, otherwise known as FIM.
@metonoma 20 days ago
time to pie the piper and middle out
@pipoviola 20 days ago
Hello Matthew. Is that LaTeX when you say "wrong format"? The span after the output is always there when I use LMSYS; I think that is part of the output formatting, and that's why the span disappears when it finishes. Each of your videos is great. Best regards.
@tvwithtiffani 20 days ago
To test LLMs I ask them unanswerable questions like "Who is the president of Alaska?" and add some questions that require explanation or reframing.
@paulsaulpaul 20 days ago
Excellent idea. That's a great example question, too.
@canadiannomad2330 20 days ago
One of the tests I like for checking just how censored a model is, is asking chemistry questions around topics it would normally censor... often placating it by saying I'm licensed and have permits.
@dannyc3124 20 days ago
I think with the right prompt any AI can do a word count. It just needs a nudge to plan or review its work. Does this work for anyone else? Prompt: "How many words are in your response? Before answering, make a plan to ensure accuracy." It worked for me on ChatGPT, Llama, and Gemini (Gemini mentioned doing a word count script).
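If anyone wants to check a model's self-reported count, a two-line Python check works (an illustrative sketch; the response string is a hypothetical example and should be replaced with the model's actual reply):

response = "This reply contains exactly six words."  # paste the model's answer here
print(len(response.split()))  # naive whitespace split; prints 6 for this example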
@peterwood6875 20 days ago
It is great for conversations about mathematics, at least on par with Claude 3 Opus. But it does occasionally make mistakes, such as suggesting that the K-groups of the Cuntz algebra with 2 generators, O_2, are infinite cyclic, when they are in fact trivial.
@GrandmaSiva 20 days ago
I think it is the original GPT-2 after all of our training input. Kindergarten was in OpenAI's lab, elementary school was interacting with us, and now it has graduated. I'm looking forward to "GPT3-chatbot".
@TylerHodges1988 20 days ago
My favorite prompt to test a new model is "Give me an odd perfect number."
@I-Dophler 20 days ago
🎯 Key Takeaways for quick navigation:
00:00 🤔 Introduction and mystery model speculation - Introduction to a new, highly capable mystery model suspected to be from OpenAI. Speculation it might be GPT-4.5 or GPT-5. First impressions and intentions to test the model against a set benchmark.
01:10 💻 Testing model performance with coding tasks - Testing the model's ability to handle coding requests. Observations on response quality and speed. Successful execution of simple to complex coding tasks, highlighting potential hardware limitations.
03:00 🐍 Successful implementation of the Snake game - Description and testing of a Python script for the Snake game. Positive performance feedback with no errors upon execution. Successful gameplay demonstrating the model's coding capability.
03:58 🚫 Model's ethical constraints explored - Probing the model's ethical constraints and censorship with hypothetical scenarios. The model refrains from providing guidance on illegal activities, aligning with OpenAI's ethical guidelines.
05:09 🔢 Logical and mathematical reasoning tests - Model tested on logical reasoning and mathematical problems. Demonstrates ability to handle sequential logic and complex calculations accurately.
08:34 🌐 Advertisement and model's practical applications discussion - Advertisement from the video's sponsor. Discussion on the model's practical applications in real-world scenarios, emphasizing its effectiveness.
10:12 🧠 Advanced reasoning and problem-solving challenges - Testing the model with more intricate reasoning and problem-solving scenarios. Evaluating the model's reasoning depth and its response accuracy on challenging questions.
15:44 ⛏️ Practical implications of teamwork in physical tasks - Analysis of teamwork dynamics in physical tasks using a hypothetical scenario. Model considers practical limitations, providing a more nuanced understanding of team efficiency.
16:11 🛠️ Hard coding challenge test - Model tested with a difficult coding problem from an online platform. Initial challenges with implementation, followed by a successful resolution demonstrating the model's coding prowess.
Made with HARPA AI
@peterkonrad4364 20 days ago
It could be a small model like Phi-3 or Llama 3 8B that is trained on quality synthetic data instead of the entire internet. The 2 could be a hint that it is only 2B parameters or something, i.e. very small like GPT-2 was back then, but now as powerful as GPT-4 due to new training methods.
@willbrand77 20 days ago
Every model seems to assume that the cup has a lid (microwave problem)
@gijosh2687 20 days ago
Always perform all questions, maybe add more as you go. Make the Jack question a secondary question (you don't have to film it every time), but leave it there as a test in case we go backwards.
@peterkonrad4364 20 days ago
A cup seems to be something ambiguous, i.e. it can be a cup made out of cardboard that you get from Starbucks with a potential lid on it, or it can be a cup made out of porcelain like you have at home to drink coffee from. Also, the term "cupholder" in automotive contexts refers to cups like you get from Starbucks, not cups with a handle.
@cyanophage4351 20 days ago
Maybe it has lookahead, so that's why it could get the "words in the answer to this prompt" question right. It seemed to pause right before the word "ten".
@yassineaqejjaj 17 days ago
Is there any chance of getting those tests somewhere?
@MarcAyouni 20 days ago
You are the new benchmark. They are training on your examples
@PaulAllsopp 19 days ago
Have you tried the car parking scenarios? All AI to date gets this wrong, because they don't understand that "to the right of" (or left of) does not mean "next to": "car C is parked to the right of car A", but car B is in between them. AI assumes car C is next to car A because it assumes there is an order when nobody mentions an order. To be fair, many people get this wrong also.
@tomenglish9340 20 days ago
A while back, someone at OpenAI (Andrej Karpathy, IIRC) said that performance is related to the number of tokens processed. So I'm not particularly surprised to see OpenAI produce better responses by tuning the system to generate longer, more detailed responses. What I want to know is whether they did the tuning with a fully automated method of reinforcement learning. (In any case, I doubt highly that they'll share the details of what they've done anytime soon.)
@abdelrahmanmostafa9489 20 days ago
Keep going with the LeetCode test, but try testing with new questions so that the question isn't in the training data.
@TheUnknownFactor 20 days ago
Wild to see a model just put out there without announcement
@sil1235 20 days ago
The formatting is just LaTeX; ChatGPT 3.5/4 uses the same on their web UI. So I guess chat.lmsys just can't render it.
@Maximo10101 20 days ago
It could be GPT-4 with Q* training (Q* is a method of training any LLM that provides the ability to think by testing its response against itself and iterating before outputting), giving it 'thinking' capabilities rather than just predicting the next token.
@smurththepocket2839 20 days ago
The formatting issue you are referring to might simply be LaTeX formatting. It is best suited to mathematical expressions. It might therefore be an indication that it can handle more complex equations ;)
@PeterSkuta 20 days ago
Noooooo, gpt2-chatbot disappeared from the full leaderboard and is only in direct chat, which is also rate limited!!!!!
@Maximo10101 20 days ago
It's no longer available for direct chat
@MrRandomPlays_1987 19 days ago
13:27 - I thought the marble is left on the table, since the cup was upside down and was taken away, so obviously the ball would not come with it since it is simply already resting on the table. So I did get it right pretty quickly; for a second I thought the bot was right somehow and that it was a tricky question, but it's cool to see that I'm not that stupid :)
@haroldpierre1726 20 days ago
I am sure new models are trained on your questions.
@CrisBlyth 20 days ago
HOLY MOLEY.... getting even more interesting... if that's even possible!
@Dan-Levi 20 days ago
The cursor span is just for looks; it's the text cursor, but it shows up as an HTML string.
@christosnyman8655 20 days ago
Wow, super impressive reasoning. Almost feels like LangChain with the reasoning steps.
@n1ira 20 days ago
One piece of advice: If you are going to be skeptical of the correct answer for the 'How many words are in your response to this prompt?' question because it might be trained on it, why even ask it? If you're skeptical of a correct answer, remove the question IMO.
@nathanbanks2354 20 days ago
One of the problems of releasing videos with test questions is that these questions may always end up in the training data of future models. But imperfect questions are still useful. How could he possibly have a consistent set of questions without them getting into the training data? And without a consistent set of questions, how can we tell how the models perform against each other over time?
@n1ira 20 days ago
@@nathanbanks2354 That's exactly my point: if he thinks the question has been trained on, why include it? My problem with this question in particular is that in every video where a model gets this question right, he says he is skeptical of the answer. OK, what can the model then possibly do to satisfy him? If it answers correctly, he adds a 'but'. That's why I think the question in itself is meaningless and should be removed.
@nathanbanks2354 20 days ago
@@n1ira I could see it being improved by "Give an answer with 10 words" or "Give an answer with 14 words", because this would be harder to train for. But what if the snake game is also in the training data? Does this mean programming snakes is also meaningless?
@n1ira 20 days ago
@@nathanbanks2354 Yes! He should change it like that. When it comes to the snake game, he should try to make it have custom features, like a custom color set etc. Or he could just make it create a different game (maybe a game he made up).
@mickelodiansurname9578 20 days ago
@Matthew Berman Matt, the \quad and other notation is logic; it's marking up modal logic generally used in philosophy, or LaTeX, or perhaps TeX markup, and this is not being rendered by the front end... it seems, in some sort of shorthand. Interesting if nothing else; also rather hard for a model to go wrong if it starts engaging in modal logic during inference... although why they switched verbose mode on by default is beyond me. Also, did you notice it making a claim on the software it wrote, by saying "Snake Game by OpenAI" in the game title?
@tomenglish9340 20 days ago
`\quad` is LaTeX spacing.
@mickelodiansurname9578 20 days ago
@@tomenglish9340 Yeah, I think it's a front-end thing... but I never noticed LMSYS doing that with other models... usually it's all preformatted by the time it appears, so how come this model is tripping the formatting up? I maintain, though, that if you fine-tuned a model on modal logic, my guess is its reasoning would improve...
@alansmithee419 20 days ago
13:25 It didn't. It assumed the cup had a lid, and got it right given that assumption. Since it was never specified whether the cup had a lid, only that it was a "normal" cup, this isn't an error in the bot itself but in the question. It correctly guessed that the weight of a marble would not be able to push a lid off of most cups under gravity. The only thing we could say it did wrong is that it didn't realise the cup having a lid was an assumption it made, and was not an inherent part of the question.
@d4qatoa 20 days ago
a lid and a cup are two different things
@alansmithee419 20 days ago
@@d4qatoa A lid can be part of a cup though.
@RichardEiger 20 days ago
Hi Matthew, first of all I need to admit that I absolutely love all your videos. They are simply fantastic. I was thinking about the "marble question". Maybe it would help the LLMs to specify that it is an "open cup" (instead of a "normal cup") into which the marble gets put. Also, it may be interesting to follow up with a question of why the LLM considers the marble to remain in the upside-down cup when lifting the cup from the table, or from what information the LLM comes to the conclusion that there is a bottom of the cup that holds back the marble. Concerning the "killer" problem: wouldn't it be even more precise to reply that there are 3 killers alive and one dead killer in the room ;-)? This is coming from an AI hobbyist. Though at college back in 1985 I was the student to ask for a course in AI, and personally I was already interested in AI via neural networks and got laughed at at the time...
@tomenglish9340 20 days ago
What about a follow-up prompt to describe the cup? You'll get some idea of what's gone wrong, and perhaps also a corrected response.