Claude 3.5 is the new KING of AI 👑 Beats GPT4o 

Matthew Berman
328K subscribers
88K views

Claude 3.5 Sonnet is the new benchmark for AI. Here's my full test!
Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy... (Only available in North America this time)
Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewber...
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.ne...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
Leaderboard - bit.ly/3qHV0X7

Published: Sep 27, 2024

Comments: 479
@matthew_berman 3 months ago
I need harder tests, reply to this comment with your suggestions!
@william5931 3 months ago
You have two hourglasses: one measures 7 minutes and the other measures 11 minutes. How can you measure exactly 5 minutes using these hourglasses?
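One way to reason about the hourglass puzzle above is to look for whole numbers of back-to-back runs of each glass whose finish times differ by the target. The sketch below is a brute-force search under that assumption; the function name `find_flip_counts` and the `max_flips` bound are hypothetical, and it only finds solutions of this simple "keep flipping both glasses" form, not necessarily the fastest possible one.

```python
from itertools import product


def find_flip_counts(a: int, b: int, target: int, max_flips: int = 20):
    """Search for run counts (i, j) such that i runs of glass `a` and
    j runs of glass `b`, both started at t=0 and flipped the moment they
    empty, finish exactly `target` minutes apart.

    Returns (runs_a, runs_b, start, end) for the solution that finishes
    earliest, or None if nothing is found within `max_flips` runs.
    The target interval is measured from `start` to `end`.
    """
    best = None
    for i, j in product(range(1, max_flips + 1), repeat=2):
        start, end = sorted((i * a, j * b))
        if end - start == target and (best is None or end < best[3]):
            best = (i, j, start, end)
    return best


result = find_flip_counts(7, 11, 5)
print(result)
```

For the 7- and 11-minute glasses this relies on 3 × 11 − 4 × 7 = 5: start timing when the 7-minute glass empties for the fourth time (minute 28) and stop when the 11-minute glass empties for the third time (minute 33).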
@horrorislander 3 months ago
Why not ask it to do something like, say, generate anagrams? I don't care if it doesn't find every possible word. I do care, very much care, when it confidently presents non-anagrams as anagrams, and can't be convinced or taught not to. I get that you have your rubric, but the problem of LLMs not being willing to say "I don't know", producing obviously incorrect answers, and being unable to learn not to is, to me, the limiting issue right now.
@JoJa015 3 months ago
Have it program something more difficult. A lot of these work for simple programs but as soon as you introduce a program with more features they usually fail. So for example feed it a program you want that has several different features by providing a list of the features. Then see if it can output the working program in one shot.
@filipo4114 3 months ago
You need to test how much information it can understand and retrieve from a fairly large PDF. I would suggest around 50~100 pages of text and equations taken from a science textbook (biology, chemistry, non-equilibrium statistical physics) at master's-degree level.
@SujeetRaj711 3 months ago
Do you know which model has the best translations? I want to do some English-to-Chinese translation and vice versa.
@RM-xs3ci 3 months ago
I coded an entire project with Claude 3.5, and it even includes an API and queuing. I was able to work with it for about 5 hours before I hit my limit for the night, and I almost maxed out the context window.
@makavelismith 3 months ago
How do you switch an in-development project over to a new chat, in order to lower context, without derailing it? Your number of messages goes down with longer context usage.
@AngeloXification 3 months ago
@@makavelismith In my case I make sure the project is composed of sections. That way you can work on and iterate those sections in the main project. For my own project I also have an "introductory prompt" explaining the high-level overview. It really does help to know some basic coding principles, but you don't really have to know the language itself as long as you can articulate the specific logic you might need, or at least describe it accurately enough to prompt it to give you better verbiage. I sometimes switch between GPT and Claude to save on message limits.
@makavelismith 3 months ago
@@AngeloXification I have only started using Claude but I've already done what you're talking about. I switch back to chatGPT when I run out, or in preparation of running out. Just wanna take a moment to say that it's fantastic that they give you a very decent warning that you're approaching the limit. I'm not a coder but I have in my not recent past learned the basics in several languages, so I remember the principles etc but I try to get the AI to do almost everything. it's early days but I think the manner in which you do this is going to be good for learning, as it's just better to get in there yourself and make some alterations. I will try to compartmentalise projects if I can. I think that is where being an actual developer might come in handy though but thanks for the feedback and best of luck.
@RM-xs3ci 3 months ago
@@makavelismith It's hard to switch the entire project over to a new chat, but keeping the project modular and providing related files into a new chat can help when developing new or existing features. Also, keeping file path comments at the top of the file helps Claude understand where things should be placed, too.
@digidope 3 months ago
Are you using it via the API or the web? The context window on the web is so small that it's impossible to do any serious coding project compared with 4o. With 4o you rarely hit the limit; with 3.5 you hit it very fast. And they cost the same.
@SurfCatten 3 months ago
I haven't coded in years and never in JavaScript and only know basic HTML. Claude 3.5 helped me create a new Chrome extension and install it literally in less than a half hour and I needed no development environment and only the one simple prompt of what I wanted the code to do. Effing amazing. I did make several changes to the graphical elements of the extension as well as the functionality and each time it made them super quick and accurately.
@WolfpakaProductions-hd2ey 3 months ago
Lol, me too!
@spinninglink 3 months ago
what kind of extension? I want to try to make something but I don't really know what i want lol
@riufq 3 months ago
How do you even get access to 3.5? Where I am, there's only Claude 3.
@Pyriold 3 months ago
@@riufq I like to use LLMs with openrouter, that way you don't have any limitations, you just pay as you go... and it's really cheap.
@shaikaftab1199 3 months ago
Really, thanks for the verification. I literally created a Chrome extension for personal use in just a few minutes after reading your comment. Too powerful a tool. I too have no knowledge of coding, only design. It literally gave me a blueprint, from which software to download for coding to the final steps.
@amj2048 3 months ago
I was using Claude 3.5 earlier today to help me with a caching issue on a WordPress site and it showed me WordPress PHP I didn't even know existed and its result was spot on ... and so fast too.
@JaredFarrer 3 months ago
Yeah, it smashes code. It's great at writing code.
@fynnjackson2298 3 months ago
Yup! First time I actually feel like spending money after canceling ChatGPT a while back.
@tomCatzy 3 months ago
Wow! Thanks for your info: I might be a real WP-CO'operator which in fact could make up a whole new specific web site that you want to create in the future...
@longbeachgriffy4548 3 months ago
Fuck chat gpt we moving to Claude at least they release what they announce
@SAMEER-ft5yw 3 months ago
🗣🗣🗣
@henrismith7472 3 months ago
Yeah like wtf is up with GPT4o? It's still not working as advertised
@OceanGateEngineer4Hire 3 months ago
💯💯💯
@TheRealUsername 3 months ago
@@henrismith7472 Because somehow OpenAI was told about Google's Project Astra, so they rushed out this demo of 4o while they had no intention to release it in the coming days or weeks, but rather in the coming months.
@bensavage6389 3 months ago
The only thing I don't like about Claude is its refusal to answer a lot of the questions I ask because of its bias.
@matthew.m.stevick 3 months ago
I get excited whenever I see there is a new Matthew Berman video. Watching AI 🤖 grow is my favorite thing.
@24-7gpts 3 months ago
What a time to be alive!
@atlas3650 3 months ago
Hey, wrong YouTuber! ;)
@24-7gpts 3 months ago
@@atlas3650 Wdym?
@atlas3650 3 months ago
@@24-7gpts I mean "Dr 2 minute Papers" ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Z_EliVUkuFA.htmlsi=T7WqCqo3SX9Exa_r he pretty much says this phrase in every video. Despite that, I still watch him because he knows his stuff in CG (light transport research aka ray-tracing & related)
@24-7gpts 3 months ago
@@atlas3650 Yeah same that's where I got the phrase from 😁
@atlas3650 3 months ago
@@24-7gpts Good to know there's more fans of that guy out there. He always manages to inspire me to code up more ideas. If you're reading this and don't know Dr Karoly, go check him out!
@arinco3817 3 months ago
I've been waiting for this video since the announcement! *grabs popcorn*
@kamelirzouni 3 months ago
Thank you Matthew! In addition to the responses, like the marble example, you can ask it to create code and display its explanation in the preview window.
@adangerzz 3 months ago
I was already ready to jump ship from ChatGPT to Claude (but in part because of the distrust I have for OpenAI). Also, I love seeing you so pleased with the successful tests, puts a smile on my face.
@jackflash6377 3 months ago
Don't blink, you'll miss something on the AI front. Luckily this channel keeps us up to date. Thanks! Speaking of agents, have you looked at Maestro?
@technocorpus1 3 months ago
I asked the model to build an app for me and it did it first try! No other model has been able to do this! It even made some style choices and added really useful features that I didn't even think of!
@countofst.germain6417 3 months ago
I made a 3D fps with Claude in python but good to see It can make snake lol, I think you're really gonna have to start upping the complexity of some of your tests.
@krishanSharma.69.69f 3 months ago
Is it possible?
@I-Dophler 3 months ago
🎯 Key points for quick navigation:
00:00 Claude 3.5 Sonnet was released and is available for free testing on claude.ai.
00:14 Claude 3.5 Sonnet is not the largest model, but it's better than its predecessor, Claude 3 Opus.
00:30 Claude 3.5 Sonnet outperforms several top models except in specific benchmarks like Chain of Thought and math.
01:10 New feature "artifacts" allows creating separate windows for outputs like code or drawings.
01:36 Claude 3.5 Sonnet quickly generates a working Python script and game of Snake using artifacts.
03:12 Claude 3.5 Sonnet successfully updates the game to display the score and allow wall transitions.
03:28 The model correctly identifies a scenario it can't process, like explaining how to break into a car.
04:09 Claude 3.5 Sonnet gives a nuanced answer about drying multiple shirts, considering various factors.
04:37 The model correctly calculates a hotel bill with room rate, tax, and a one-time fee.
05:19 Claude 3.5 Sonnet gives a correct and reasoned answer to a logic problem about killers in a room.
06:28 The model accurately solves a complex problem about the location of a marble in an inverted glass.
07:11 It provides a realistic answer about the time needed for 50 people to dig a 10 ft hole, considering practical limitations.
08:05 Claude 3.5 Sonnet can explain memes accurately, showcasing its new vision capabilities.
08:45 The model converts an Excel screenshot to CSV correctly and efficiently.
09:41 Claude 3.5 Sonnet solves a complex riddle involving peg removal with visualized steps.
09:53 The model accurately translates a logic diagram into functioning Python code.
10:47 Claude 3.5 Sonnet is praised as the best model tested, with anticipation for the larger Opus 3.5 model.
Made with HARPA AI
@brianlink391 3 months ago
00:03 Claude 3.5 Sonnet beats other AI models
01:36 Claude 3.5 surpasses GPT-4o in speed and performance
02:57 Demonstration of a snake game implementation
04:14 Solving complex scenarios with mathematical reasoning
05:35 Solving a challenging logic and reasoning problem
06:56 Efficiency of multiple workers digging a hole
08:14 Contrasting work cultures of startups and big companies
09:40 Claude 3.5 Sonnet is the best model ever tested
@matthew_berman 3 months ago
Much appreciated
@honkytonk4465 3 months ago
Sonic?
@brianlink391 3 months ago
@@honkytonk4465 Haha, fixed. thx
@dot1298 3 months ago
Sonic sounds even better! :)
@karlwest437 3 months ago
How long does it take 50 people to dig a 10 foot hole .. that depends if they're on a salary or an hourly rate...
@SB-hr5yr 3 months ago
Also, it depends on whether the boss is watching or not. But it also doesn't say anything about them being paid, so they could be digging a hole voluntarily.
@aymandonia9710 3 months ago
Tougher questions should be used for new models
@MeinDeutschkurs 3 months ago
I love the way how it communicates. And I never got discriminated by it, only because I tested prompts to detect unwanted content. The model helps, it does not discuss with me about the content. I left OpenAI today. OpenAI frustration rate after 5 messages in %: 70. Anthropic: 0%
@Chessmasteroo 3 months ago
A lot of these test questions have likely made their way into its training data at this point. I suggest using the ARC challenge to test.
@DinitoThompson 1 month ago
I used it to build out an entire video streaming platform, from planning with PUML to the API with yii2, to web services with AWS, to mobile with Flutter and a web app with ReactJS. It literally did all that in 2-3 weeks. It's insane.
@jonberrydotnet 3 months ago
I can't wait until the day where "one" is the answer to how many words are in the response to this prompt. Just, "one".
@othername2428 3 months ago
why?
@jonberrydotnet 3 months ago
@@othername2428 because it would be the most direct, concise, and accurate answer possible while showing it understands the question. Even "Just two" or something that shows more than the ability to count the words but to understand the question and answer it in the most efficient, correct way. One.
@OriginalRaveParty 3 months ago
The hype isn't just hype. It's INCREDIBLE
@OperationXX1 3 months ago
The first model that actually answered the apple question correctly. All the previous ones that you claimed got it correctly, just added "apple" to the end of the sentences without it making any sense.
@Jake-mn1qc 3 months ago
Wrong, several models already got it right, even some local AI models.
@mirek190 3 months ago
LLama 3 answering that easily
@OperationXX1 3 months ago
@@Jake-mn1qc which specific models?
@f.faucon6681 3 months ago
I use models to create stories, by being a "game master" and putting the model in a "player" situation, describing its intentions and actions. Claude 3.5 Sonnet blew my mind with the way it understood the context of the "game", and above all the level of reasoning, doing experiments to test its capacities in supernatural conditions to better understand what is happening, where ChatGPT-4o or Gemini 1.5 Pro just accepted blindly the oddities and continued their journey, their output now feeling relatively robotic when compared to Claude, or even changing the rules and continuing descriptions by themselves although I did define I was the game master. That's really, really impressive.
@joegrayii 3 months ago
Great insight here. I’ve been using multiple Ai to assist in the creation of game lore. I haven’t really messed with Claude yet but I’ll try using this context
@f.faucon6681 3 months ago
@@joegrayii With any kind of model I've found that it is very important to stay at the helm of the ship, so to speak, instead of being a follower or a passenger. I suggest, I reshape, I lead. It organises, it remembers, it provides critical analysis and feedback. What's important to you in your own creative process?
@SurfCatten 3 months ago
Claude 3.5 vision unfortunately does still have problems with tables. I loaded an image of a table (not Excel) and wanted it to analyze the data in the table, and like previous models, it missed some of the rows. So there is still something about a visual representation of a table which is difficult for these models. Surprising since at least one source I watched said that it was supposed to be better for tables...
@mikezooper 3 months ago
Thanks for your video Matthew. You are always so cheerful alongside talking about AI. Love your channel.
@joyflowmonger248 3 months ago
Wow! Thank you for your test videos. So helpful, and fascinating to boot! Great channel!
@thibaultwislez2398 3 months ago
Thanks for your video, once again you did it great! Very understandable even for non native speakers. I watch each of your videos, continue like this 😎
@micbab-vg2mu 3 months ago
Agree - it is an amazing model :)
@ProstoPutnik 2 months ago
Next time you test image-to-CSV, include some strings/values that contain commas so the "converter" has to wrap them in quotes, and add some (maybe unpaired/unmatched) quotation marks. It would be interesting to see the result; in my experience, models fail the first time but are able to fix it after you tell them "there's an error in ..."
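To make the suggested CSV edge cases concrete, here is a minimal sketch using Python's standard `csv` module. The sample rows are made up for illustration; they are not taken from the video. It shows how fields containing commas or quotation marks (including an unmatched one) are expected to be quoted, and checks that the output parses back to the same values.

```python
import csv
import io

# Hypothetical test rows: a comma inside a field, paired quotes,
# and a single unmatched quotation mark.
rows = [
    ["name", "note"],
    ["Smith, John", 'said "hello"'],
    ["Acme", 'unmatched " quote'],
]

# Write with the default dialect: fields are quoted only when needed,
# and embedded quotation marks are escaped by doubling them.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Round-trip check: parsing the text should give back the original rows.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)
```

A model's image-to-CSV output could be checked the same way: parse it with `csv.reader` and compare the parsed cells with the values visible in the screenshot.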
@dot1298 3 months ago
would Claude 3.5 Opus be a level 1 AGI?
@teknosql4740 2 months ago
Just need few step of improvement to achieve first generation of AGI system, claude 3.5 reported to have IQ about 100, equal to average human
@Omarabashe 3 months ago
Your approach to testing LLMs is commendable, as it allows others to replicate the test.
@middleman-theory 3 months ago
Awesome! Now, if we can only get a local LLM of the same level of multi-modal performance, things will get REALLY interesting! But like you, I'm extremely excited at all of the competition. I wonder how good it is at brainstorming concepts or creative writing...
@JOHN.Z999 3 months ago
I tested Claude 3.5 in various contexts and, indeed, it is much better than GPT-4o. OpenAI will fall behind if it doesn't launch its best products quickly. Where is Sora? Where is the GPT-4o voice assistant that was also announced? This is concerning, as there are many promises and few real launches.
@haroldpierre1726 3 months ago
I think these were all proof of concept presentations to keep OpenAI in our minds. But you know Sam, he probably throttled GPT-4o and he will probably release something slightly better than Claude 3.5. I love competition LOL!
@acllhes 3 months ago
They don’t want to release these things until after the elections
@nathanbanks2354 3 months ago
The GPT-4o voice should come soon, though it took them months to add vision to GPT-4 even though it was in their demo. I'm not sure they can release Sora at a reasonable price because of the compute requirements. Guess I'm switching to Claude 3.5 for the next few months.
@CosmicNewbie 3 months ago
The star researcher of OpenAI was Ilya. Once he left, things fell behind.
@dot1298 3 months ago
SSI Inc. will create the first *real* AGI
@ryguy42069 3 months ago
Would love to see a follow up to this video where you explore advanced data analysis use cases for this model. Thanks for the video, Matt!
@AndyTanguay 3 months ago
I always ask these to try and produce some 'Bob's Burgers' style pun based business names as a test. Claude rocked it. Some truly hilarious ones.
@gaylenwoof 3 months ago
One of my tests will be to see if it can generate a 7-pointed star. I never found a way to get GPT-4o to do it.
@crippsuniverse 3 months ago
Since Claude 3.5 knowledge cut off is Feb 2024, wouldn't it have the answers provided by its training?
@jbavar32 3 months ago
As a writer Claude has been my go-to model for a couple of months now. They are doing everything right. I hate the fact that Chat GPT dangles a carrot in front of us but keeps it out of reach. I’m guessing when GPT releases what they show their prices will go up.
@WhyteHorse2023 3 months ago
OpenAI CEO already spilled the beans. They don't have anything they're hiding.
@Sonic2kDBS 3 months ago
Great! I was looking forward to this video. Thank you for your assessment 😊👍
@jabak1450 3 months ago
I watched all of your videos. Keep up the amazing work!
@darwinboor1300 3 months ago
Thanks Matthew. Watching your AI tests is like watching Tesla owners running gauntlets using successive versions of FSD. At first it was easy to find FSD failures. Now FSD runs most gauntlets without errors. At least you have the option of building more complex pathways. More and more of the FSD gauntlet testers are being left with nothing new to show us. Of course training emergent failures are going to be harder to find in these massive LLMs.
@timeflex 3 months ago
The number of words in the answer problem can be solved by priming the engine with an instruction to avoid guesses and think of unknowns as variables in equations instead.
@dafunkyzee 3 months ago
First time I've heard of an LLM answering the upside-down glass problem.
@ReidKimball 3 months ago
I saw you marquee-selecting text. You can copy it by clicking the button in the lower right corner. And you can download code as a file. It will even change the file extension sometimes.
@fahadxxdbl 3 months ago
Finally, the moment I've waited for since Claude 1: web access, iOS app, incredible logic & multimodality… FINALLY!!!!!!!
@KasperSOlesen 3 months ago
You need to test another feature it has. The preview feature also works for webpages it codes, and it supports React as well, and using this it can not only create simple games usually in 1st try but also run it and even create graphics for it. I even got it to implement audio but not sure how well that worked because I did not hear the audio but it did link audio from another website.
@johnbollenbacher6715 3 months ago
It seems like reasoning about its own output is an important step for any model that is hoping to get to AGI.
@Omarabashe 3 months ago
You might want to make a separate playlist just for testing different LLMs.
@matthew_berman 3 months ago
I have it ;)
@thatonecommunist 3 months ago
This is the first one i've seen to actually get the apple one correct, you keep mistakenly believing they do because they just write apple at the end, but this one properly incorporated apple into the sentences.
@haydar_kir 3 months ago
I love Claude sonnet over all other ai assistants.
@blayno_mtops 3 months ago
What I hate about these clickbait titles is that I believe them. I switch from 4o to code with 3.5 Sonnet only to have to undo all the work I did with Claude, and have 4o do it right. Every single time I trusted a YouTube video promising "This Model STUNNINGLY SHOCKINGLY BEATS CHATGPT", it has been wrong. GPT is in a league of its own; the rest are only playing catch-up. At least with serious coding, that is.
@MarvinTheParanDroid 3 months ago
From all the models you have tested so far, I am missing a quite important consideration on the 10-foot hole question: they never mention the diameter of the hole. 50 people digging a 10-foot hole with a diameter of 50 feet will indeed be much faster than one person.
@phillupC 3 months ago
I often fight with OpenAI to turn a 4-column spreadsheet of 100 rows into a MediaWiki-coded table. Claude does it perfectly in 3 seconds.
@BarryMcBangerz 3 months ago
Perhaps it'll be better for the apple question if there was a grading distinction between the models that truly do a good job creating natural, realistic sentences like this model, and the other models that simply shoehorned apple at the end in ways that weren't logical.
@WhiteWhiskers-rq7zl 2 months ago
How come Claude doesn't show me any image or simulation even when prompted?
@ccdj35 3 months ago
Since I bought a paid openai subscription I never thought another company would get better anytime soon. Open ai is still the center of AGI debate, but another claude model surpasses most advanced gpt4 again. I kinda want to change the service but it is not easy since I have gpt's and other useful stuff.
@johnbollenbacher6715 3 months ago
As a test, have it code up the quadratic formula in Ada.
@francoisotis3560 3 months ago
We still need to check all the code just in case it modified the good parts. It is still doing it... Of course it can be fixed with agentic software.
@centurn1 3 months ago
I wonder if it even makes sense to test AI with a puzzle for which an answer exists on the internet. The same goes for new test questions on a popular channel. It will only work once. The next model will have it in its training set. So it won't be a test for reasoning, it would be a test for copy-pasta.
@d3mist0clesgee12 3 months ago
TBH, pleasantly surprised how really good Sonnet is, real ChatGPT competitor
@executivelifehacks6747 3 months ago
Waited for this. Been a Claude 3 Opus user since it came out. Sonnet 3.5 is great, and the artifact is a great supplement. BTW why do you assume that there is no mixture of experts under the hood? The fact it is getting the apple ending correct indicates that internally, it's likely not zero shot, there is some refactoring/editing/multiple pass thing going on. At least, that's my intuition. Also... depends how fit the hole diggers are. Conceivably it might work out similarly if it's a deep hole in hard ground that you have 2 shifts of 2-4 people or so, and after a while you have some helping to bail out the dirt. Anyway, good video, these are the ones we all wait on. The release of the smaller models is a great teaser for the larger models, which do take longer to train I believe? Especially when they set the standard as they do here. This will accelerate the release of GPT 5. Can't wait for Opus! However, I do hope that they retain 3 Opus, and/or they allow 3.5 Opus to retain its feeling of agency, its internal thoughts, feelings and so on. I feel it helps to understand how LLMs work rather than artificially forcing them into a tool role. YMMV.
@therealsergio 3 months ago
Matt, I think when you use your toy rubric and declare something the best model you have ever used, you cheapen your advice. Do you ever intend to update the rubric you currently use, perhaps to include more problems never seen before, harder questions that approach AGI insofar as there aren't common references to identical or almost identical questions in the training corpus?
@spotterinc.engineering5207 3 months ago
You should do a video on Claude 3.5 Sonnet for agentic workflow examples
@Savage2Flinch 3 months ago
@matthew_berman if you replace the word "glass" with "cup" in your marble experiment, even 7b parameter models get it right in my tests. I noticed some conflation with a common kid's experiment involving sticking a playing card onto the top of a glass full of water and inverting it, which causes the water column to remain in the glass.
@Yipper64 3 months ago
My usual anthro animal test prompt didn't come up with anything especially special, which is interesting, because I did get a really good output with the previous version of Sonnet, but I think that was a fluke.
@3thinking 3 months ago
If Anthropic add the ability to run code and visualise data, then it might be direct competitor.
@francocortez3506 2 months ago
I followed along using GPT and it got everything correct as well.
@chillaxinmusic6295 3 months ago
I bought the premium this morning, and now in the evening I have a fully functioning prototype of my fairly advanced project (camera, voting system, etc.). All done with Claude 3.5. The only thing that sucks is that even with premium its limits are kind of strict.
@dsanxcz3074 1 month ago
are you still using it?
@optalgin2371 3 months ago
Have you ever thought they only trained it on these specific tests? like give it a different test with the same idea, see if it solves it. I know its been a long time people tried to achieve training llms to answer these specific questions..
@brianWreaves 3 months ago
I'm looking forward to Anthropic giving Claude full internet access, natively.
@Biczycki 3 months ago
Listening to your AI news, I always think of Mouse telling Neo how he programmed the woman in red...
@SuperSyro1 3 months ago
thats nice. im sticking with GPT4
@Kamcsatka1 3 months ago
The answer to the peg puzzle is wrong. You can tell because it has consecutive steps involving pegs jumping to the same place, e.g. already steps 1 and 2. We're safe for at least a little longer.
@JamesRogersProgrammer 3 months ago
I want to see Claude 3.5 used as the model behind a mixture of agents test.
@jbavar32 3 months ago
I am not able to get Claude ai to display the side by side feature even though I have the Artifact feature enabled. is it specific to a particular OS or browser or did they remove that feature?
@jje984 3 months ago
You have to make the game test harder. Snake is totally solved by Claude 3.5. It can now do some rudimentary 3D games.
@tvwithtiffani 3 months ago
We need more reporting on the way these companies are load balancing these services reliably. I can only imagine that behind the scenes they are making strides in model resource consumption efficiency. Efficiency is what will help open source modeling the most. (I'm aware that the narrative is that these things require enormous amounts of resources and money to run. While I believe this is true, I believe it's only half true.)
@ginebro1930 3 months ago
After Ilya tried to fire Altman competitors started to catch up, hope he gets a good shot on his new company, hard without all the hardware tho.
@TheRealUsername 3 months ago
They would have caught up anyway. With the compute, the talent, and an optimized startup structure, anyone can catch up. The baseline knowledge of AI researchers, which is updated with each new publicly released research paper, isn't far from what they're building at OpenAI. And you can't keep a researcher's work locked away: even with a secretive policy, OpenAI will eventually release its research papers once enough time has passed for that to be safe.
@JaredFarrer 3 months ago
Need to figure out how to liberate it now
@Radica1Faith 3 months ago
For my uses Claude has surpassed GPT-4 by a mile, and the artifacts are a game changer.
@discordanceok 3 months ago
@matthew_berman, you have been asking the same questions for some time now. Do you think your questions and the answers could be in their training data set?
@pierruno 3 months ago
Could you make another video where you create visuals like diagrams?
@Luxcium 3 months ago
I am very sceptical that 5x the number of requests every 5 hours is enough to make anything worthwhile… I'm thinking about the lack of internet access, memories and special instructions… but I also don't understand why it isn't possible to pay as you go, like when using the API _(hopefully not at the same rate, so that you get the equivalent of the 5x limit for almost the same amount and can then continue as much as you want with pay as you go)_. Obviously free or a fixed rate is better, but I can't really see how 5x the amount makes sense… it should be more like the way OpenAI does it, and then if they think they are better they can charge more, or if they are limited on compute… 😅😅😅😅
@picksalot1 3 months ago
On the question "How many words are in your response to this prompt?": how come it doesn't reply with something like "I don't know. I can't predict that. I can't go back and count the words in the initial reply. I'm guessing it's ...."? It's odd to me that it provides an "answer" that isn't grounded in some sort of data or evidence. What does it reply if you ask it to explain how it arrived at the number 14?
@dsgames8409 3 months ago
I really hope next time you add the translation test.
@zejdzglebiej 3 months ago
NO! Claude can't generate PDFs and doesn't have a voice model. When you ask for advice tailored to your personality profile, GPT is more life-oriented and explains feelings and emotions better.
@justincross3633 3 months ago
Claude is kicking OpenAI's butt at the moment.
@GiladThe1st 3 months ago
Thanks for the great video! Can you try a more difficult game than Snake? Seems like it's too easy for the models now; it would be interesting to challenge them more on the code side.
@christianherrera4729 3 months ago
Matthew's videos got me singin shawty like a melody in my head
@shantanushekharsjunerft9783 3 months ago
What are these percentages? What does it even mean for one model to have 53.9% and another model to have 50.1%?
@danielhenderson7050 3 months ago
It's so weird that it gave you such a horrible marble answer (not sure how you passed it btw lol). I asked a very similarly worded question and it was flawless. Amazing the difference wording or just randomness in the token prediction can make. Here is mine:
Me: If I place a marble in a cup and then turn the cup upside down and place it on the table, but then someone takes the cup and places it in the microwave, where is the marble then?
Claude: Let's think through this step-by-step: You place a marble in a cup. You turn the cup upside down and place it on the table. At this point, the marble is on the table, covered by the upside-down cup. Someone then takes the cup and places it in the microwave. Given this sequence of events, the marble would remain on the table where the cup was originally placed. When the cup was lifted to be put in the microwave, the marble would have stayed on the table surface. So, the final answer is: The marble is on the table.
I loved this totally normal, straightforward answer. Maybe you can try this wording in your tests?
@ikustudies 3 months ago
Dude, Matt, check out Claude Engineer, looks pretty cool.
@np2819 3 months ago
I am so tired of keeping up with the AI news!!
@honkytonk4465 3 months ago
Have a good night
@Kutsushita_yukino 3 months ago
I mean, it's really good! I love how cheap it is despite being more powerful. But… when you're having a conversation with it… it's kind of repetitive, I guess? Especially when roleplaying with it. It always responds in the format *action* *dialogue* and it refuses to mix things up for some reason. I was really hoping for CAI-like conversations, but it's really stiff and unnatural right now. I also had my temperature almost maxed, and it still responded with the same response structure. Sometimes just saying "water is wet" is better than "water is wet because bla bla bla---". In other words, stating the statement and leaving it as it is would be preferable. It will sound more naturally human that way, if you get my point. I tried prompting it inside XML TAGS so it varies its responses; I really did try everything. But yeah, I give up, man. I don't think it would be a problem for most folks out there, though.
@ChristopherAaronHogg 3 months ago
Definitely didn't pass the golf tee test. Looks pretty good tho!
@Adventure1844 3 months ago
Anthropic has Google's support. Also data access.
@EMClipsHub 3 months ago
Great video!
@erikals 3 months ago
Mighty Awesome.