Other youtubers: OPEN AI JUST UNVEILED STRAWBERRY AS RUMORED: AGI??? NO, ASI AND IT SHOCKS THE INDUSTRY TO ITS CORE! BILLIONS OF JOBS LOST OVERNIGHT! Oh, and by the way, the rest of this 40-minute video essay will be about fantasies of a world with superintelligent robots.
Here is what OpenAI won't tell you. It does some more chain-of-thought stuff, even though they say it's not that, and it's still super dumb with very simple things. It's a small improvement using brute force, because they are desperate and have no new ideas. Their voice AI is still not out because it keeps talking like the users, and Sora was beaten by the Chinese, so they are scrambling to release whatever they can to keep themselves on people's minds. They've been losing coders to Claude 3.5, and this will start bringing them back.
Please don’t burn yourself out getting your video out immediately. I think many of us who watch your videos are always looking forward to your in-depth analysis, but we also understand that it might take you a little bit longer to put your information out due to the amount of thought and work you put in.
sounds like an alternative ending to original blade runner _I've seen things, you people wouldn't believe, mm_ _Attack ships on fire off the shoulder of Orion_ _I've watched C-beams glitter in the dark, near the Tannhauser gate_ _All those moments will be lost in time, like tears in rain_ *_stochastic parrots can fly so high_* _time to die_
I haven't even noticed there's a new model. Apparently I'm not following AI development closely enough because no smart algorithms have pushed the info to me. But YouTube did, and rightly so, because I'm always happy to watch your videos; at this time of day I even have time to watch them. 😄
I used to follow Two Minute Papers quite closely, but with the whole AI boom it started to sound a little too hype-based to me. For one, repeatedly calling this one "Einstein in a box". AI Explained is so much more balanced.
@@njpm Excited is the wrong word. It's impressed. You can try finding someone else who goes through hours of reading material on the day a product gets released and uploads a detailed 20-minute report video on it early the next morning.
This is not a paradigm shift for my test cases.

1. It failed my custom ball physics test (not the one everyone else uses, since that one is making its way into the fine-tuning/training, as seen in the Two Minute Papers video), because it still doesn't understand physics intuitively like a human. It assumes a ball would not fall out of an upside-down ceramic coffee cup that is held above a table. The model says: "2. Cup is held upside down above the table: Assuming the ball doesn't fall out, it remains inside the inverted cup."

2. It was unable to solve a react.js viewport lazy-load bug. The issue is that more and more images keep loading as you scroll, so the viewport jumps around, because it loads more images before the previous ones have loaded. It doesn't understand this without me doing most of the work and reasoning for it. It kept trying to adjust image sizes and CSS that were not the problem.

3. It failed to find the optimal solution for setting a square post at a 45-degree angle to a wall. I said I had only a tape measure, but it kept wanting me to mark out spots and measure to the center of the hole. My solution involved simply turning the post and measuring until both corners are the same distance from the wall. It wanted me to triangulate the hole on the other side of the wall and assumed the post would be in the center of the hole (it's at a 5-degree lean, and is not). It did realize that my solution was better and more direct.

4. I asked it "What is the prime factorization of 1,090,101, providing the prime factors and their respective exponents?" and it got it wrong. It does not have access to Python or a calculator; once it does, that will likely be a big improvement. ChatGPT-4o gets this one correct by running Python code. I've seen ChatGPT-4 do some impressive stuff when told to solve a problem with code using brute force, even better than ChatGPT o1 is now.

As for what it's good at: it can understand and output much more code.
It does better at refactoring and dealing with code that involves multiple files. It's definitely an improvement, but about the same as Claude 3.5 with its big context length vs ChatGPT-4 when it comes to coding ability.

If you want a real paradigm shift, have it analyze its own text for assumptions, and follow those up with a question before continuing. This one change made it solve the ball-and-cup problem 100% of the time. Just ask: "Did you make any assumptions? If so, correct them, or follow up with a question if more information is needed." Response when using this method: "Assumption Made: I initially assumed the ball stayed inside the cup. Reality: Unless the cup has a lid or the ball is held in place, gravity would cause the ball to fall out when the cup is inverted." Also: "Follow-Up: Was there anything preventing the ball from falling out when the cup was inverted (e.g., a lid, the ball being stuck, or someone holding it in place)? Is there additional information about the ball's behavior during the inversion?" All very important questions about my scenario that I was not 100% clear about, although I was specific about a ceramic coffee cup because they have no lid and can't be squeezed to hold the ball in.

This solves many of the weird responses, because the model gets a chance to notice and correct them. It's trained on noticing poor responses just as it's trained on giving the correct response the first time, which means you can increase its ability by tapping into this knowledge. This will lead to much improved answers, and actual interaction to break down and solve a problem, rather than it taking a best guess each time. Most people don't use it to solve the types of problems they are training it on. They need to make it more deductive rather than more predictive. That way it's better at doing research and finding the correct answer in the noise, rather than knowing the answer outright. It needs a BS detector and needs to be good at using tools.
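For what it's worth, the prime factorization question from point 4 is exactly the kind of thing tool use makes trivial. Here is a minimal trial-division sketch in Python (my own illustration, not the code any model actually ran):

```python
def prime_factorization(n: int) -> dict[int, int]:
    """Return {prime: exponent} for n >= 2 via trial division."""
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:  # whatever remains after trial division is prime
        factors[n] = factors.get(n, 0) + 1
    return factors

print(prime_factorization(1_090_101))
```

Multiplying each prime raised to its exponent recovers the original number, which is exactly the kind of self-check a model with a code interpreter can run on its own answer.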
This is why the LLM will never be better than a calculator, so why are they making it do math from memory? You're welcome OpenAI.
@@mirek190 I was responding to the video, which is about the preview, and it says "-preview" in the model name, so yes, I know. "60% better in reasoning": there is no standard measure of what that even means. It can't reason its way out of these problems, because it's not reasoning at all, just predicting text output. Calling it "reasoning" is a joke.
Your channel was the first I ever subscribed to, just over 18 months ago. Since then, because of the quality, such as today's video, I have never missed an episode. Thanks for explaining things so clearly and without the hype.
I'm glad they are trying ideas other than simply scaling. These unique ideas on top of scaling will be needed to reach human level reasoning and analysis.
Yeah, scale smart, not brute force. Reasoning and world-understanding should probably have a different system architecture than LLMs. But I'm not afraid that we won't get there. There will be a paradigm shift eventually, thanks to all the geniuses working on AI research, and of course also the AI helpers.
@@wildfotoz Too bad, the name already stuck. I, for one, don't mind calling the field Artificial Intelligence, because I'm too used to video game "AI", which is also just algorithms, so I guess I got desensitized to that term being used for something not actually intelligent. And a lot of the time, at least, something like GPT can give you suspension of disbelief on that one, especially if you're not familiar with LLMs (or ones of this caliber, at least). The same definitely couldn't be said about video game "AI". On that note, someone should definitely deploy neural networks and the like in video games; it's a big problem that the computer never really acts that smart, and the ways to increase difficulty usually involve giving handicaps to the player or cheats to the computer rather than making the computer act better.
I saw another video pop up which said that they released o1. I didn't watch it but waited for your video on it. You are the best, and also the most reliable and super fast in releasing these videos.
It seems every model so far, because of how it is trained, simply cannot conclude that it isn't sure or doesn't know something. Not remarking that it can't pull real URLs, and making some up instead, is a great example. That inability shouldn't be punished; the realization that one can't do something or doesn't know is crucial to figuring out the facts. I've even tried on an uncensored local model to break this habit with constant reminders that it's okay to disagree or say it doesn't know, but it always leans into the "give the human what they want to hear" mode.
Yeah, I agree. I think it's part of the broader positivity bias as well, where the LLM is expected to provide a helpful, constructive response but isn't good at figuring out when it can't really give one, and therefore it tries to force a solution of some sort, sometimes leading to agreeing with you when it should simply tell you you're wrong, or, yeah, just state that it doesn't know.
Ask your LLM "how would a man without arms wash his hands?" this is the answer I got from chatgpt 4o. "A man without arms could wash his hands using assistive devices or by adapting techniques. For example, he could use his feet, specialized prosthetics, or mouth to operate faucets or use automated systems like sensor-activated sinks. He might also use tools specifically designed for individuals with disabilities that enable greater independence in personal care tasks."
Ah, that's a good one. Thanks for sharing. I just tried 3 different LLMs. None of them pointed out that a person without arms likely doesn't have hands.
@@ticketforlife2103 The reason these LLMs fail at this question has less to do with their capabilities and more to do with limitations. Understanding such questions requires the AI to be capable of inferring deceit, which requires it to be capable of deceit. OpenAI is actively trying to avoid an AI that can be deceptive, ergo it is unlikely that they will ever let an AI be able to answer a question like this.
Love your videos! You're hands down by far the best AI channel-just like the leap from GPT-3.5 to o1-ioi. Straight to the point, without the noise. Keep up the great work!
Your videos are the best in the AI domain. Just when I'm doing my own research, reading the papers, and formulating my impressions, you release a video that often mirrors my own impressions. Thanks as always!
BOSS, you are _on_ this already! I was secretly hoping that there would be an upload from you already, but it seems I was just underestimating you. Relentless!
Thanks for your thoughtful analysis as always Philip 👏🏿♥️. Looks like OpenAI is back to shipping something cool. Only a matter of time before Anthropic launches something big
Amazing video, amazing ending, and amazing quote: "LLMs are dumpsters and we attach rockets to them". I'm also almost sure GPT-5 will indeed be an avatar-based model.
Man, I have been saying for months that the scratchpad is the way to get to the next level. That's why it's so annoying that people are sleeping on the efficiency gains. You turn that into multiple passes per given response, and ta-da!
AI can already do my job better and much much faster than I can. The missing component is a user interface. Once someone figures out how to input the data, I'm toast.
The video I was waiting for (how dare you be abroad when o1 was released - holidays aren't allowed ! :) and it didn't disappoint. Balanced and informative as always. 22:34 Yeah, this is what struck me about o1. Designing the system to "reflect" on its responses and "aspire" to better answers is a step _closer_ to it having an actual goal. AGI seems unlikely/impossible _without_ being goal driven (and self-reflective) but clearly _that_ will be when it starts to get really dangerous (in the 'Skynet' sense rather than "just" the economic meltdown and civil unrest sense :).
13:37 "They don't look like they are leveling off to me" I wonder if it has anything to do with the fact that the X axis is in log while the Y axis is linear.
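To put a number on that observation: if the y axis is linear and the x axis is log-scale, a straight line means each fixed jump in score costs a constant multiple of compute. A toy sketch (the slope and intercept are made up for illustration, not read off OpenAI's chart):

```python
import math

def score(compute: float, slope: float = 10.0, intercept: float = 20.0) -> float:
    """Hypothetical benchmark score that is linear in log10(compute)."""
    return intercept + slope * math.log10(compute)

# On a log-x plot this is a straight line, but in raw compute terms
# every additional 10 points requires 10x more compute.
for c in (1e3, 1e4, 1e5, 1e6):
    print(f"compute={c:.0e}  score={score(c):.1f}")
```

So "not leveling off" on a log-x chart is still compatible with sharply diminishing returns per unit of compute.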
Now we just have to wait for scale through increasing compute and energy so there's longer inference time? Then combine that with bigger training data to get PhD level GPT 5 by late 2025?
The GOAT AI youtuber that first predicted GPT 4 capabilities is here and is explaining a new type of model. Your analysis is on a different plane. Been waiting for this one since the moment o1 was released. Finally ❤❤🎉
Thank you, Phillip, for this update on o1-mini, and for delivering it as quickly as you have. I can't imagine the amount of work you have to do to stay on top of these constant evolutions in AI; it really is greatly appreciated. Looking forward to your forthcoming follow-ups on this product, and I imagine the other announcements that might pull you off track for a moment. Thank you, and whoever the elves are who might assist you in this work. Take care of yourself and be safe. Peace.
Great video, I love your informed take on these releases. Interestingly, I just tried the ice-cubes-in-a-frying-pan question on this model, and it got the right answer (unlike your test, in which it got it wrong). I ran it 3 times in separate chats and got the same correct answer each time. Here are the last steps in its reasoning: ... "Ice cubes in a hot frying pan (especially one frying a crispy egg) will melt quickly - typically within a minute. By the end of minute 3, any ice cubes placed before or during that minute would have melted. Calculate Remaining Ice Cubes at the End of Minute 3: Considering rapid melting, it's realistic to assume that no whole ice cubes remain in the pan by the end of minute 3. Choose the Most Realistic Option: Given the melting rate of ice in a hot pan, the most realistic number of whole ice cubes remaining is 0."
The fact that we have two scaling laws on top of each other now is so mind-boggling. Plateau? Yeeeaaah, I don't think so, anytime soon. There is so much crazy stuff this has now unlocked. Imagine a GPT-5-class omnimodel where any output modality has an o1-like reasoning chain. That's a totally different world of capabilities. 2025 is gonna be fun, although I doubt we get such a full-force system before the end of '25, if at all; who knows if government officials deem it too dangerous (or costly?).
I hope so, but in truth we never know if there's a plateau. Many people say there is, many others say there isn't. I'm just observing from afar and following, because I don't really know the answer, and I'll wait to see where it goes
17:34 “As models become larger and more capable, they produce less faithful reasoning on most tasks we study.” I don’t even get why people, especially those in the field, would think that what these models are outputting as “chains of thought” (a complete misnomer to me) _are_ the steps the model is taking. People can give reports of what they “think” when responding to this or that question and it’s not clear even in _those_ cases what connection, if any, those thoughts have to the response. These LLMs are a step removed from that (if not more)-they don’t have inner verbal behavior, which, again, may or may not be really relevant as “steps.” To use a human analogy, it’s all going on at a “neuronal” level-these LLMs _can’t_ report what is going on there (it’s like expecting people to give accurate verbal reports on some unconscious physiological process in the brain). It’s almost like some pre-scientific theory of reasoning.
I think that's the problem with "interpretable AI" in general. The word "interpretable" seems to be used in two different senses. One is mechanical, in the sense of reducible to some mathematical model or axiomatic foundation - like a lower-dimensional description of the neural net, precisely characterizable dynamics in the parameter space, results exactly verifiable by logical systems like Lean, or even interpretability by another AI system (like the study by OpenAI using GPT4 to "interpret" GPT2 neuron behaviors, which I find quite ironic in that it seems to hint at some kind of infinite regress. Though by analogy you can say it's as mathematical logic accepts analyzing arithmetic in axiomatic set theory as a standard practice, so it technically still counts as a type of mechanical reducibility). The other, of course, is human understandability, which by itself is hopelessly nebulous. Language is just a passable approximation to the amorphous goop sandwiched by neuronal activity from below and all possible mathematical descriptions from above. If we run an LLM which feeds formal outputs to Lean to verify, then we have this precise structure where the system is tethered to the mechanical computation of neural nets on one end and a precise axiomatic system on the other. The linguistic manifestation of "deduction" in the middle does not have any precise status, but just imitates the arbitrary constraints of human thought we bootstrapped it from. Reducing to either end feels insufficient, but what else can we do? Often one has to give up either precision or "understandability". And even if the individual logical steps in a precise reduction feel understandable, the whole thing could still very well not be, due to combinatorial explosion of logical complexity. Just like how natural language has theoretically unbounded recursion depth, but use more than 3 layers and you start to lose track.
In the end these CoT steps are like comments written by human programmers for other humans to read: They usually provide some approximation, in the goop layer, to the underlying precise logical process, but ultimately are not causally linked to them. They might help debugging. That's all there is to it.
@@YT-gv3cz Thanks! I appreciate that highly technical explanation, even if it is well beyond my capacity to understand. (My background is in behavioral science, not computer science.) "In the end these CoT steps are like comments written by human programmers for other humans to read…" I'm not even sure _that's_ true - again, the steps might just be _emulating_ those. The closest things I can think of in human terms - and it's not all that close - are (1) how, in split-brain experiments, the verbal hemisphere of the brain produces plausible, yet _wholly fictitious,_ explanations of what the non-verbal hemisphere is doing (it _can't_ know, because the connections between the two hemispheres are severed), _except_ here the LLM is not even observing its own output (it's not clear _what_ it's observing, if it _is_ observing anything or if it _can_ observe anything, i.e., "reflect"), and (2), more generally, as I said in the first comment, how people _can't_ give accounts of how the neurons are collectively firing to give rise to whatever behavior they give rise to - human brains are simply not wired like that. I guess these models _could_ be reporting the interim results of the various internal calculations as "steps" - does the architecture even _allow_ that? - but it seems a lot more likely (to me, anyway) that they're just producing what would be the most plausible steps (whether or not those have any connection to the actual processes), just as they produce _any_ verbal output. Obviously, _CoT_ _works_ (that's why people are interested in it), but saying the model is "reasoning" better is at best a description (and it might be a highly misleading one at that) and not an explanation of what is going on. I'm not sure what we _can_ do, but what I think would be good to _stop_ doing is viewing these steps as necessarily having any connection with how these models come up with the outputs they produce.
The best we can say, it seems to me, is when the models are prompted for “steps,” _that’s_ the output they produce. It’s simply more LLM output to be explained.
o1 is gpt4 turbo with reasoning capabilities. That's why it costs about the same to run as turbo does in the api, why turbo jailbreaks that don't work on gpt4o still work on o1, and why it costs 3x more to run than gpt4o.
Wouldn't be surprised. GPT 4 Turbo is a significantly smarter model than 4o. I don't care how many benchmarks 4o wins at, Turbo's reasoning skills are just better.
Man, there are not that many things greater than technological progress. An upgrade to an AI model that significantly pushes the frontier is almost orgasmic :) Can't wait for the next significant upgrade.
The best phrase to express my thoughts about the future comes from a Russian meme: "Scary, very scary. If only we knew what this is... we don't know what this is."
brilliant work. as always. you never hype unless it really is appropriate. my most trusted channel on AI out there. keep up the good work💪 cheers from Germany. Mat
@@aiexplained-official I'd call it Mannin, but I'm sure you know it by its English name of Isle of Man [tho Isle of _Mann_ is, indigenous-culturally at least, our preferred spelling] As we're only a population of
I am nowhere close to fully keeping up with the AI game, but your vids are so perfect for helping someone like me stay abreast of the industry. Thank you!
From memorizing answers to memorizing answer programs - a step change from implicit to explicit. It is still limited to what's in its dataset, though, and can't handle novelty (the ARC-AGI blog post explains this well).
Phillip, I think you should consider converting some of your videos/work into podcasts, I don't actually listen to podcasts yet, many other people do, but I realized that as I was listening to your video while doing dishes, that your commentary is so good that it doesn't always require the screenshots. I'm so happy for you man, you're kickin' ass.
here in the Philippines, I literally said can't wait to go to sleep and then wake up to AI Explained on o1 preview. You're so quick dude! You're the bomb.
"If your domain doesn't have starkly correct 0/1, yes/no, right answers/wrong answers, then improvements will take far longer." I feel like this is an interesting area for discussion. The obvious question is: if there aren't 0/1, yes/no, right/wrong answers in these domains, then how are we adjudicating the results? If there are no objective results that we can reference, then how do we know that our assessments of the "correctness" of the outputs are, themselves, objectively correct?
Very good video my friend, was eagerly awaiting your first impression on this. Looking forward to more in depth testing Also, take care of yourself always :)
Consider this your cheat sheet for applying the video's advice: 1. Understand that o1 excels in STEM fields and can be used to solve complex problems in areas like physics, math, and coding. 2. Test o1's capabilities by throwing challenging problems at it, but remember to double-check its reasoning as it is still under development. 3. Stay informed about future updates and advancements in o1's capabilities. 4. Exercise caution when deploying o1 in sensitive domains, as it's important to use this technology responsibly and ethically.
I’ve done one prompt so far with o1-mini. Basically, “this code runs/works, but isn’t doing quite what I want. What I want is [several things it’s not doing now].” Claude had failed this enough that I had decided I’d fix it myself. Then o1 dropped, and fixed it in one shot. It’s an anecdote, but a promising one. And honestly, I don’t care if it can’t do things that are easy for me if it can do things that are hard, annoying, or tedious. A hammer can’t wash the dishes - doesn’t mean it’s useless.
Was gone for the weekend, and I come back to this. Crazy how fast things go in AI, huh? No, but seriously, that's kinda crazy, because now they can use this new method on a 100x system, and if that doesn't yield crazy results, I don't know what will.
There is one important thing to understand about o1: judging by everything, it is a rather small model - I think less than a hundred billion parameters. It is clearly smaller than Omni. And this means that the base model itself is rather weak; it obviously does not have a very developed model of the world, etc., simply due to its small size. The Strawberry algorithm is good, but if it works on small "brains", then the performance of the entire system will not be very high.
It's just gpt4 turbo with reasoning capabilities. That's why it costs about the same to run as turbo does in the api, why turbo jailbreaks that don't work on gpt4o still work on o1, and why it costs 3x more to run than gpt4o.
@@TechnoMinarchist That makes less sense than gpt4o still being the base. Since they added training for the model to be innately capable of CoT, it's possible for it to deviate from gpt4o at the seams while being based on it. There is a slim chance that they would go back to the older, massive gpt4 model for this inference-costly method; that is why it costs more - it's literally generating so much more data, not necessarily that it's based on gpt4. Add in the naming, and that there's a 'mini' model, and it's just less reasonable to think it's gpt4-based.
as always, a pleasure to see you go through any subject! I confess I wasn't hyped for this one. but after this video, I think I'll subscribe for chatgpt plus for a month and try it again. last time I did so, was about a year ago and it was pretty underwhelming. hopefully now I can get some cool things done with this new model (mainly interested in programming, being not a programmer)
Can you imagine what the world will be like 10 years from now? Feels like everything is going to change drastically , in an existential way. I believe we are in a truly existential paradigm shift. I know most of you reading this agree. It's starting to become real though which is crazy. Feels like a movie
Yup! I’ve been saying that since 2017... I met the head of Google AI in 2017, and he said it would be 50 years before we have an AI that can reason, use logic and chains of thought, see, hear, and talk (that's how AGI was defined back then), and now we have it! What a time to be in.
I would love to hear how models based on proper first-principles reasoning are being developed. Clearly, this stochastic parrot has room to fly higher, but at the end of the day, it is still a stochastic parrot. If we want models capable of developing new science, then we need actual first-principles reasoning. I suspect that won't be transformer-based. I know people are working on this.
It's been trained to do its reasoning the exact same way I was trained to do math and science in Asia: 1. Copy the teacher's methodology exactly. 2. Show your solutions or you don't get credit. 3. Right minus wrong in your final score.