@prolamer7 But they were going for full media buzz. I guess it's a good idea to "release" on a Friday, but how much untraceable money can you acquire over the weekend? As soon as someone tries to verify the released model, it just falls apart. Maybe I'm wrong, but I don't expect them to go for a quick cash grab and then disappear to some weird country tomorrow.
@TimoGraw Yeah, right? That does not quite make sense... Maybe they overestimated the hype and chaos around it and underestimated the voice, speed and diligence of the people who would test the model. Someone who relies heavily on the irrationality of others might not act so rationally themselves. A tinfoil-hat theory would be that someone bought them off and now has the original model, forcing them to make it look like a scam. ;-) That would kind of fit the details, too. But the chance of there actually being a 70B model that's somehow stronger than the current frontier models is just too unlikely. Will we get there eventually? Probably, but it would probably require more than just a bit of finetuning.
I wonder if this spiraled out of control: they trained "something", got great benchmark results via a bug or screw-up, didn't bother to re-run, and decided to work on the PR campaign instead. And then, once they saw the lack of benchmark repro, figured they could buy time with an API wrapper, lol. Man, I think I am giving them too much credit. Just two grifters that couldn't help themselves, simple as that.
Berman is once again promoting and pushing something akin to the Rabbit R1 (remember how excited he was about that trash?). Berman will post anything for clicks today.
Probably the way his grandiose delusion comes out, like when he tells the story of his creation surpassing the best AI labs' work in only 3 weeks as a low-priority side project.
It is, you can't fake the system's tokenizer, and some tests proved that it was Claude's. It's just a Claude 3.5 wrapper with slower inference due to the added system prompt. It's likely they also steal the data they get from your convos. A fraud indeed.
I must say I am envious, there is no other word. In the past year I had so many semi-working prompt ideas. But those guys just finetuned a 70B model with a prompt strategy in 3 weeks... they are GREAT!
So, it sounds like they allow the model to doubt its answer, then rethink the answer or check for ambiguity, and then assess alternative ways to address the prompt again.
In my mind, this is just a step towards better prompt engineering. Most likely, in the long run, someone will invent something where the model does this output in JSON and we just don't show it, because it'll be generated so quickly.
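To make that concrete, here is a toy version of the idea, assuming the model is prompted to emit structured JSON (the schema here is invented for illustration): the reasoning travels in a hidden field, and the UI renders only the final answer.

```python
import json

# Hypothetical structured model reply: "reasoning" is produced
# but never shown to the user; only "answer" is rendered.
raw = '{"reasoning": "count the letters one by one", "answer": "3"}'
reply = json.loads(raw)
print(reply["answer"])  # the user only ever sees this field
```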
>>> How many r's are there in the word "strawberry"?
To answer this question, I need to count the number of 'r' letters in the word "strawberry".
Let me spell out the word: s-t-r-a-w-b-e-r-y
Now, let's count the 'r's:
1. There's an 'r' at the beginning.
2. There's another 'r' at the end.
So, there are 3 r's in total.
Wait, I made a mistake. Let me recount: There's only one 'r' at the beginning (s-t-r-). And there's no 'r' at the end (it ends with y).
The correct count is 1 'r'. There is 1 r in the word "strawberry".
>>> /bye
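For reference, the count the model fumbles above is trivially checkable outside an LLM; the actual answer is 3:

```python
# Sanity check of the transcript above: "strawberry" really
# does contain three 'r' characters, not 1.
word = "strawberry"
print(word.count("r"))  # prints 3
```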
Another simple overlooked idea: use a model like Llama 3.1 8B in the same way, but without finetuning or prompt engineering. Just do agentic multi-turn calling of the same small model, with the last output added to the context of the repeat request. Each of those multiple requests can also serve a different purpose. For example: a first request with the original prompt just to decide if it requires additional thinking; if not, a second request to prompt normally; otherwise, the next two or three steps gather entities, entity relations and a summary of the prompt, with each answer added to the overall context. Finally, with the fully analyzed context, a last request with the full context (including the original prompt) can be used to get the desired output. It gives much better results. I made a POC of it; it works, but I have not tested it at scale.
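A minimal sketch of that pipeline, with `ask` standing in for whatever client actually calls the local model (an Ollama HTTP call, for example); the routing question and step wording here are my own invention, not from the POC:

```python
# Multi-turn "agentic" use of one small model: the same model is
# called several times, each answer appended to a shared context
# before the final request. `ask(prompt) -> str` is a placeholder
# for a real client calling e.g. a local Llama 3.1 8B endpoint.
def multiturn(prompt, ask):
    # Step 1: ask the model whether the prompt needs deeper analysis.
    routing = ask(
        "Does this prompt need multi-step analysis? "
        "Answer yes or no.\n" + prompt
    )
    if "yes" not in routing.lower():
        return ask(prompt)  # simple prompts get one normal call

    # Steps 2-4: gather entities, relations, and a summary,
    # feeding each answer back into the growing context.
    context = prompt
    for task in (
        "List the key entities.",
        "Describe how the entities relate.",
        "Summarize the request in one sentence.",
    ):
        context += "\n" + ask(task + "\n" + context)

    # Final call: fully analysed context plus the original prompt.
    return ask("Using this analysis, answer the original prompt.\n" + context)
```

With a real client plugged in as `ask`, a complex prompt costs five model calls and a simple one costs two.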
Agreed. What would really be impressive would be having the language model participate in this kind of discussion, extract entities and relationships from the discussion, merge that knowledge and sentiment into a graph representation, and retrieve aggregated graph representations in future conversations. There is a misalignment between all the noise about AGI and the fact that we are not yet seeing demonstrations of collective intelligence that keep the human in the loop.
@johnkintree763 My thoughts exactly. What is stopping people from using a local model to build a graph representation, like you mentioned, of the current prompt as well as past prompts and responses in the conversation, continuously adding to the graph and using it as context for the next request? Also, logging every request and response in a standardized dataset format, which could be used to train personal LoRAs for preference-based responses, and which users could share with the community. Not to mention training models to analyze and ask questions if something is unclear before collecting the response into context and generating the final response in a second or nth turn. It could be even more effective in voice mode with Whisper.
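The logging half of that idea is almost free to build; here is a minimal sketch, assuming a JSONL file of chat-style records (the record layout is my own choice, loosely following the common `messages` finetuning format):

```python
import json
import time

# Append every request/response pair as one JSONL record.
# A file like this can later feed LoRA training or be shared.
def log_turn(path, prompt, response):
    record = {
        "timestamp": time.time(),
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```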
@johnkintree763 It has to be a simple open-source implementation (so that anyone can check the details) with a focus on technologies like Ollama, else trust will become an issue.
Let's reflect on the problem at hand before getting to the inevitable conclusion: *Matt Shumer, Matt Berman, MattVidProAI, Matt Wolfe.* I'm fairly sure I've missed at least one Matthew. 🤔
This technique would be valuable for agents. A multi-failsafe agent would be much more effective than the current gimmick agents, which make mistakes that users have to correct
It's fascinating to listen to, now that it is clear that Reflection 70B, as promised in terms of performance, was never real. The way Matt Shumer presents his accomplishments from the beginning raises red flags: "first VC-backed company using OpenAI", "best AI for writing". I can see how someone like that started to believe their own reality distortion field, to the point where they wouldn't realize that their next move would expose and discredit them completely.
I'd classify this as one of those architectural OOM breakthrough moments Leopold Aschenbrenner suggested we would see along the way. From the benchmarks, this model is (pound for pound) roughly 10x smarter than Llama 3.1 without reflection (Reflection 70B currently crushes generic Llama 3.1 405B, a model nearly 6x its size). 🤯 Amazing! I expect all of the major commercial AI companies to adopt this technique instantly, and some training runs might even be halted to ensure they take advantage of this.
Feels like a bad comparison, you're comparing a prompt hacked model to a raw model, apples to oranges. A system prompt on the raw llama model would equalize the performance.
Was thinking last night listening to the homies, use of this verb has to have increased exponentially this year 😂 I hear it 1-3 times daily this week alone (not including brand names)
Great job snagging this interview so quickly! Would love to see more open source community interviews which explain their approaches with Q&A like this.
Some discussion on Reddit seems to indicate that the Reflection 70B fine-tune really only works well for single-shot deduction prompts, but it fails at almost everything else, e.g. story creation, coding, multi-turn discussion, etc. Once you do your benchmarking, it might be worthwhile to test some basics in addition to your usual test suite. Perhaps it fails at your old letter-writing test?
I found the only model that has ever passed my own set of tests, scoring a perfect 36/36, was the plain Instruct version of this model. Tested again with this Reflection version: perfect score again. My testing does include things like Matt's usual questions plus my own. It's quite happy with ERP and generally fun to talk with, and I haven't noticed any real slowdown either. It's the same model, without the screwy bits.
It's a bit misleading to me... it's basically a prompt technique (reflection) baked into the model. You just boost zero-shot prompting with this, not the reasoning capacity itself. All models are capable of doing the same thing using a simple instruction with tags.
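For what it's worth, that "simple instruction with tags" can be approximated on any instruct model with nothing but a system prompt; a minimal sketch (the wording is mine, an illustration, not Reflection's actual training prompt):

```python
# A reflection-style system prompt usable with any chat API.
# The tag names mirror the ones Reflection reportedly uses, but
# this exact wording is an assumption for illustration.
REFLECTION_SYSTEM_PROMPT = (
    "Reason step by step inside <thinking> tags before answering. "
    "If you notice an error in your reasoning, correct it inside "
    "<reflection> tags. Put only the final answer inside <output> tags."
)

def build_messages(user_prompt):
    """Assemble a chat request that asks for tagged reasoning."""
    return [
        {"role": "system", "content": REFLECTION_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```

Pass the result to whatever chat-completions client you use; on most instruct models this alone reproduces the thinking/reflection/output pattern.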
Ever wonder why we don’t invite content creators to test our new models? You’d think we’d be eager to get your help in promoting our product, right? Well, let me tell you why. Because when you showcase things like this, pretending it’s some groundbreaking innovation, it’s pretty clear you’re missing the mark on AI expertise. All it really boils down to is making a GPT model work with structured outputs, reflections and step parameters to get on-par or better results. Nothing to write home about, really, when all these people only test against ChatGPT, not the GPT API. I don't know how many times people need to hear this, but the API models and the front-end models are totally different.
Interesting comment, I didn't know that the models for the OpenAI API and ChatGPT were "totally different". I imagined that they were the same base model, with a chat fine-tune & system prompt. I'm curious if there's more documentation on this!
Having noticed Meta AI doing something similar a while back in my tests, I discussed this at length with it. It told me it has employed "iterative refinement" for a while. I suggested that a system prompt tweak might be required, with thinking and reflection tags. It proceeded to create its own system prompt (in its response, I mean), adding thinking and reflection tags, then counted letter by letter and was accurate 100% of the time for each word I tried. BUT only after I suggested that the reason it got the count in "strawberry" wrong (2) might be that it was starting its count with the first letter as 0. After it began counting the first letter as 1, it got every subsequent letter count right, regardless of word length.
So many questions come to mind:
1. Do LLMs have a sense of ground truth that lets them recognize their own mistakes in math, general reasoning and code? If yes, this is news to me, since LLMs are hallucinators/gossipers with no way to establish truth beyond a probability distribution.
2. Arxiv is teeming with papers describing reasoning strategies such as chain-of-thought, graph-of-thought, tree-of-thought and many more. Why are we only discussing CoT here? Is that good enough for most prompts?
3. I'm flabbergasted that no other model has baked CoT and other reasoning strategies into its training, as opposed to system prompts. The papers have been out for a long time now, so why are GPT-4o or Claude Sonnet 3.5 not already leveraging these? What are we missing?
Simple prompt fails: "Write a script that implements the “tree -L 2” functionality, but in bash without using tree. Make a one-line script, without arguments, for the current directory." ANY other LLM can do this more or less correctly, except Reflection (70b-q8_0 tested). Reflection's code just does something useless.
The last step will be a directed simulation (prediction of everything): a true intelligence, impossible to teach a human (maybe with Neuralink), but relatively easy to train into a machine, though at a cost in time and power consumption. The only limit is the limit of information.
I'm really interested in whether they could do a tree-of-thought-style model with just this tag-based training, like different IDs for branches of thought, or maybe a sub-brain style of thinking that composes different styles of thinking/reflection, etc., like internal agents. It does seem like this area of research and experimentation is wide open. Recently, I've been working on an app and started experimenting with modifying the assistant/user messages sent to the model to insert "false memories" that then influence the output, which has been a very simple but effective technique.
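The "false memories" trick is easy to sketch, assuming an OpenAI-style messages list (the helper name and the fabricated turns are my own, purely illustrative):

```python
# "False memory" injection: fabricated user/assistant turns are
# spliced into the history so the model treats the exchange as
# having already happened, steering its subsequent outputs.
def inject_false_memory(history, claim, acknowledgement):
    fake_turns = [
        {"role": "user", "content": claim},
        {"role": "assistant", "content": acknowledgement},
    ]
    # Keep any system message first, then the fabricated exchange,
    # then the real conversation.
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + fake_turns + rest
```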
I'm 20 minutes in and I don't understand a single thing to an adequate level. If you are the kind of person who doesn't accept hand-waving, my advice is wait for the paper.
Thanks “These moments are more of a measure of the past and present based upon accessibility to perspectives at scale, not only for AI agents but the humans that behold and behave accordingly.” “It is healthy to acknowledge mistakes… so long as we reflect upon them regardless of answers assumed are correct” “The blind leading the blind… expects to remain ahead of those in tow.” This will not continue people. “Please reconsider your intentions and interactions with AI and human beings.” We will all be responsible regardless what anyone thinks. Jeremy
Thanks for getting more background info on this, I thought it was more complicated than just fine tuning Llama. You asked good questions and it makes sense how they did it. Sometimes the most obvious answer is the answer.
I think this might be what Ilya saw! If it is this smart at 70 billion parameters, imagine how smart it will be at 400 billion! 1 trillion parameters is probably ASI!
My question is if you took Meta Llama 3.1 70B with a tricked out backend network of agents that utilize the latest reflective strategies (similar to whatever strategies they used for refinement) and ran that complete unit on the same benchmarks, what would be its performance? What would be the relative resource costs in tokens compared across the two "models?"
I am actually playing with Reflection using Ollama. No model can get this answer correctly!!!
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?
The only way I am getting the models out of hallucinations is a vector store: have the LLM look things up in a vector DB with a few-shot examples!! However, using Reflection, the LLM reflects its way out of the correct answer into the wrong one!
OKAY, LET'S DO THE MATH. We have conditional probability at play here, since the events are dependent on each other. We can calculate the total probability of winning by the switching strategy and compare it to the original probability. Let A be the event that we picked a bad door at the start (A' that we picked the good door), and B the event that we win after switching. Then P(B) = P(B|A) * P(A) + P(B|A') * P(A'). Reasoning: P(B|A) * P(A) is the probability of the intersection of A and B, and the same goes for the other term. Since the host opens nothing, P(B|A) = 1/2 and P(B|A') = 0, so P(B) = (2/3) * (1/2) + 0 = 2/6 = 1/3. Conclusion: nothing changed, no advantage and no disadvantage.
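That arithmetic can be sanity-checked with a quick simulation. Note this is the variant in the prompt above, where the host opens no door and simply offers door 2, so (unlike classic Monty Hall) switching reveals no information:

```python
import random

# Simulate the no-reveal variant: you pick door 1, the host opens
# nothing and offers door 2. With no information revealed, staying
# and switching should each win about 1/3 of the time.
def simulate(trials=100_000, seed=0):
    rng = random.Random(seed)
    stay_wins = switch_wins = 0
    for _ in range(trials):
        gold = rng.randint(1, 3)      # gold bar behind a random door
        stay_wins += (gold == 1)      # strategy A: keep door 1
        switch_wins += (gold == 2)    # strategy B: switch to door 2
    return stay_wins / trials, switch_wins / trials

stay, switch = simulate()
print(f"stay={stay:.3f} switch={switch:.3f}")  # both come out near 0.333
```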
Too bad the model is not as good as they claimed... I was so hyped to see their results! But maybe it was a bit too early to take them at their word. At least they did not sell an (expensive) service/product...
wth, 2 guys just beat all of OpenAI??? this IS the future: small genius teams with a huge work ethic and a bit of money to rent what they need. and i agree, anything below 20b is too dumb
I really like this, and it reminds me of "I Am a Strange Loop" by Hofstadter. Also, my question wasn't addressed in the stream, as it was portrayed as a novel approach, but hasn't Anthropic done a similar thing with its thinking tag exposed by Pliny the Liberator?
What does not make sense to me is why limit it to chain-of-thought reasoning? Where is graph-of-thought reasoning? Where is abstract reasoning? Where is abstract graph-of-thought reasoning? There is huge room for improvement in the architectures we already have.
very promising, but failed this test:
>>> how many r's in 'triple strawberry'
1. First, I need to identify the individual words:
- triple
- strawberry
2. Now, I'll count the R's in each word:
- triple: 1 R
- strawberry: 3 R's
Wait, that's not correct. Let me recount the R's in "strawberry". There are actually only 2 R's in "strawberry".
3. Finally, I'll sum up all the R's: 1 + 2 = 3
There are 3 R's in 'triple strawberry'.
still fun to use
>>> r u 100% certain?
I made a mistake earlier. Let me double-check:
- "Triple" indeed has 1 R.
- "Strawberry" actually has 2 R's, not 3 as I initially thought.
So the correct count is: 1 (from "triple") + 2 (from "strawberry") = 3
I apologize for any confusion caused by my earlier response. Yes, I am now 100% certain that there are indeed 3 R's in "triple strawberry".
Yes, I'm 100% certain now. There are 3 R's in 'triple strawberry'.
How about this from Llama 3 70B Groq on POE: A clever question! Since GPT-4o-Mini already told us that the word "strawberry" contains three 'r's, we can deduce that "triple strawberry" would contain three times as many 'r's. Therefore, the answer is 3 x 3 = 9 'r's in "triple strawberry".