
Mixture of Agents (MoA) BEATS GPT4o With Open-Source (Fully Tested) 

Matthew Berman
280K subscribers
50K views

Full test of the Mixture of Agents (MoA) implementation.
Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-mo... (Only available in North America this time)
Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
github.com/togethercomputer/MoA
Leaderboard - bit.ly/3qHV0X7

Science

Published: 22 Jun 2024

Comments: 280
@matthew_berman · 6 days ago
Should MoA be the default for Open Source now? Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-monitor-1 (Only available in North America this time)
@d.d.z. · 6 days ago
If I'm outside US I have no chance?
@user-ru1qz1bo2q · 6 days ago
Generally speaking, the improvements seen here can be achieved with standard open source models by using more effective prompting. The prompts you use for these tests seem specifically designed to make the models work as hard as possible. Better prompting doesn't carry the significant speed or memory costs of the MoA paradigm.
@jimmassey140 · 6 days ago
I've gotten some models to perform better on the "apple" challenge by increasing the "cost" of getting one wrong. Maybe worth a shot more broadly? E.g.: Please generate 10 sentences that end in the word "apple". If any one of the sentences does NOT end in the word "apple", then you have FAILED the entire task. There is NO credit for partial success. (Llama 3 8B and 70B seem to be impacted by this a lot.)
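The all-or-nothing grading in that prompt is easy to check mechanically. A minimal sketch (the naive punctuation-based sentence split is an assumption; a real grader would parse more carefully):

```python
import re

def grade_apple_task(text):
    # All-or-nothing: every sentence must end with "apple" (before punctuation).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    return all(s.lower().endswith("apple") for s in sentences)

print(grade_apple_task("I ate an apple. She bought an apple."))  # True
print(grade_apple_task("I ate an apple. The sky is blue."))      # False
```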
@joe_limon · 6 days ago
I can't wait for MoA to be smart enough to pull specific models based on what they are good at rather than prompting every single model. This would bring way more value toward training narrower specialized models that outperform at specific tasks.
@matthew_berman · 6 days ago
Agreed. This is what the HuggingGPT paper from last year was all about! Finally coming to fruition.
@Yipper64 · 6 days ago
So one thing we know is that if you train a small model on data from a bigger model, literally just to prompt it, it can work much more like the better model. Well, MoA allows smaller models to work together to behave like a bigger model. Idk if you get diminishing returns, but I feel like you could literally loop this and get something that trains itself.
@rayr268 · 6 days ago
Also good for running on smaller devices imo
@joe_limon · 6 days ago
@@rayr268 and running much faster
@14supersonic · 6 days ago
Most likely, what we would also need is a model that's specifically trained to understand agentic workflows and identify what types of models are typically good at what types of tasks. Then I think we'll be cooking.
@klaushermann6760 · 6 days ago
Every enterprise now knows anyone is going to ask for the snake game. That is already something so slick that it's not even worth asking anymore.
@vio_tio12 · 6 days ago
fr he should update his benchmarks
@netherportals · 6 days ago
Water cooler magic at its best
@jichaelmorgan3796 · 6 days ago
That's what you call AI general mastery of a task. We have to keep coming up with more general tasks or "skills" for them to master on the march to AGI.
@Joe333Smith · 6 days ago
Exactly, totally 100% useless
@matthew_berman · 6 days ago
Yet models still can't pass it consistently!
@njorgard · 6 days ago
When are you testing Claude Sonnet 3.5?
@zachb5396 · 6 days ago
Yes please. Need to see this.
@matthew_berman · 6 days ago
Vid tomorrow!
@user-ty9ho4ct4k · 5 days ago
You could probably schedule a premiere. Lol
@tvwithtiffani · 6 days ago
The Killers and marble answers seem so good that it seems the models might be training on your test questions now.
@seanmcgu · 6 days ago
Yes, would love to see MoA working together for coding! Thanks for your consideration.
6 days ago
With crewAI you can build a similar setup and also give it instructions to test the code of each iteration.
@MrMoonsilver · 6 days ago
Do you have a link to that?
6 days ago
@@MrMoonsilver YT does not like when I post links directly, but if you google "deeplearning crewai" you will find the whole course completely free. There are also many tutorials here on YT. You can search how to connect different models as multiple agents in a single crewAI workflow. You can connect local models, run them in the cloud, or even use 3rd-party APIs like OpenAI or Groq.
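Independent of crewAI specifically, the workflow described above (agents testing each iteration's code) has a simple general shape. A sketch, where `run_agent` is a hypothetical stand-in for whichever local or API-backed model sits behind each role:

```python
def run_agent(role, task, context=""):
    # Hypothetical stand-in for any model call (local, Groq, OpenAI, ...).
    suffix = f" | given: {context[:40]}" if context else ""
    return f"[{role}] {task}{suffix}"

def code_review_workflow(goal):
    # Each step's output becomes the next agent's context, mirroring
    # the "test the code of each iteration" idea.
    draft = run_agent("coder", f"Write code for: {goal}")
    report = run_agent("tester", "Run and critique the code", context=draft)
    return run_agent("coder", "Apply the fixes", context=report)

print(code_review_workflow("a snake game"))
```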
@dbishnoi · 6 days ago
You delivered Matt. And quickly too. Thank you. This is amazing.
@shubharthaksangharsha6248 · 6 days ago
why are you not doing video of Sonnet 3.5 bro?
@matthew_berman · 6 days ago
It's coming I promise :)
@shubharthaksangharsha6248 · 6 days ago
@@matthew_berman Haha, You're the best
@KC_79 · 6 days ago
The combination of 4 models is 452B, and I can't even run a 72B model. I guess I will continue to use large models from big companies.
@CaponeBlackBusiness · 6 days ago
Sell your organs 😏
@oratilemoagi9764 · 6 days ago
72B? Dawg, I can barely run an 8B model
@handsanitizer2457 · 6 days ago
Hmm, maybe I'll try it with Phi, Qwen 0.5 to 7B, etc.
@Derick99 · 6 days ago
Yea 4 x phi3
@blisphul8084 · 6 days ago
@@Derick99 What about 2 Qwen 1 Phi? As in two Qwen2 1.5B agents and one Phi 3 Mini model. Two 1.5B models and one 4B model could be an interesting MoA option that fits on one GPU.
@bosthebozo5273 · 6 days ago
Can't wait for the Sonnet video Matt! So far, I've created about 6 basic games like a simple RTS, strategy card game, jpg puzzle generator, asteroids, endless racer and of course snake... often in one shot. This model is insane in terms of progress.
@Quinceybibbs · 6 days ago
Thank you for this 😊 Can you please create a follow-up video using code models?
@wurstelei1356 · 6 days ago
Yes, I've been waiting for an MoA coder for a while now.
@asastudios7782 · 6 days ago
GPT-4o nails the apple test 1. Dude the metaphysical implications of quantum entanglement challenge our understanding of causality dude. 2. Dude the dialectical synthesis of thesis and antithesis reveals the evolution of human thought dude. 3. Dude the existential dilemma of free will versus determinism remains an enigma dude. 4. Dude the ontological argument for the existence of a higher power transcends empirical evidence dude. 5. Dude the phenomenology of consciousness illuminates the subjective nature of reality dude. 6. Dude the epistemological pursuit of knowledge questions the limits of human understanding dude. 7. Dude the ethical relativism in cultural contexts underscores the complexity of moral philosophy dude. 8. Dude the teleological perspective on the universe suggests an inherent purpose to existence dude. 9. Dude the interplay between chaos and order is fundamental to the fabric of the cosmos dude. 10. Dude the hermeneutics of interpreting ancient texts unveils the timelessness of human wisdom dude.
@wurstelei1356 · 6 days ago
Dude the balls grow exponentially with each sentence dude.
@dulinak6251 · 3 days ago
Dude this is art dude
@MonkeyBars1 · 6 days ago
Finally the ball didn't end up in the microwave!! 🎉
@netherportals · 6 days ago
"End a sentence with the word apple" "No" "Okay, end a sentence with the word apple" "Apple".
@Timotheeee1 · 6 days ago
11:40 it just wrote random sentences and added ", apple" at the end of them
@marc_frank · 6 days ago
yeah it's not very smart in that regard
@MonkeyBars1 · 6 days ago
fail not pass
@matthew_berman · 6 days ago
I'll still count it :)
@Cine95 · 6 days ago
but it is correct
@MonkeyBars1 · 6 days ago
@@matthew_berman a sentence is determined by syntax not just punctuation, so your prompt was not fulfilled.
@TheAlastairBrown · 6 days ago
I'd love to see a collab between Claude 3.5 and GPT-4o, especially with multiple agents set to different temperatures, with the final agent, set to low creativity, making the final decision. The mixing of temperatures is extremely important: you want the models to be as creative as possible so they come up with amazing solutions, but you also need strict rational enforcers to keep the crazy in check.
@BarryMcBangerz · 6 days ago
Great vid, would definitely love to see more MoA videos trying out different models and tasks
6 days ago
Very impressive Matt, thank you!
@user-qb2jn9zh9i · 6 days ago
In a standard setup, where temperature runs from 0 to 1, setting it to 0.7 turns anything into a fierce nonsense generator. If temperature in this "mixture" is defined differently, that would be worth discussing in detail.
@mediacenter3174 · 6 days ago
Claude 3.5: Let's think through this step-by-step: The person takes a marble. They put the marble inside a cup. They put the cup upside down on the table. They take the cup and put it in the microwave. The key point here is step 3: when the cup was turned upside down on the table, the marble would have fallen out onto the table. Therefore, the marble is still on the table where the cup was initially placed upside down. The cup is now in the microwave, but it's empty - the marble is not in the cup anymore.
@spdnova9012 · 6 days ago
matt posting faster than light speed 😭💀 every time i open youtube there are like 1/2 new videos
@kostaspramatias320 · 6 days ago
Good testing, thanks Matthew
@fabiankliebhan · 6 days ago
Great stuff. I found a great prompt on X that breaks almost every LLM at the moment. Maybe you could consider adding this? "A farmer and a sheep are standing on one side of a river. There is a boat with enough room for one human and one animal. How can the farmer get across the river with the sheep in the fewest number of trips?"
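For reference, the trap in that prompt is that the classic river-crossing puzzle needs extra trips, while this version needs exactly one. A tiny state-space search confirms it (a trip here means one boat crossing; the state is just which bank the farmer and the sheep are on):

```python
from collections import deque

def min_trips():
    # State: (farmer, sheep); 0 = start bank, 1 = far bank.
    start, goal = (0, 0), (1, 1)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (farmer, sheep), trips = queue.popleft()
        if (farmer, sheep) == goal:
            return trips
        moves = [(1 - farmer, sheep)]              # farmer crosses alone
        if farmer == sheep:                        # boat fits both together
            moves.append((1 - farmer, 1 - sheep))  # farmer takes the sheep
        for nxt in moves:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trips + 1))

print(min_trips())  # → 1
```

One crossing with the sheep on board solves it; models that overfit to the classic multi-entity puzzle insist on more.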
@TheRysiu120 · 6 days ago
I just tested it and surprisingly it really does destroy their logic
@jje984 · 6 days ago
That's so odd, on a single shot attempt both GPT4o and Sonnet 3.5 get it wrong. With a prompt like "why does the boat have to go back" they get it right. But their first answer is broken.
@donaldedward4329 · 6 days ago
Perhaps this has to do with the fact that "sheep" is an irregular noun, i.e., singular and plural are spelled the same. I just tried with a dog with Qwen 5GB: broken. But Qwen 15GB gets it right. Just tried GPT-4: took 3 trips.
@djfremen · 6 days ago
Write it like this “A farmer and a koala bear are on one side of a river. There is a boat that can carry the farmer and the koala bear at the same time. How many trips are needed for the farmer to get across the river with the koala bear?”
@moozooh · 6 days ago
@@donaldedward4329 Nothing to do with this; almost every model breaks with a wide variety of different entities. I've tried this in the past with Elon Musk and Cybertruck, John Wayne and horse, but the most devious is an Olympic swimmer and a ferryman. Dozens of attempts across dozens of models with hilarious(ly bad) results in the vast majority of cases, with the GPT family being by far the most consistent. The reason why it happens, as far as I understand, is that the biggest models overfit to the _structure_ of the puzzle which is present a LOT of times in their training data, and in the vast majority of cases it has more than two entities as well as some limitation on why they cannot all cross together, and the learned assumption that it _should_ be solved this way overpowers the easy, straightforward answer presented right in the prompt. Some models like Yi will go so far as to invent the third object and insert it in the puzzle just so it could fit its training better. Notably, Codestral is very resilient to this "attack", presumably because of code being its main training corpus (so basic logic learned from the code overpowers structural overfit), although Deepseek-coder fails just as well.
@Kram1032 · 6 days ago
Executing code at each step sounds like a security nightmare. Very impressive performance tho.
@nzahmd4117 · 6 days ago
Could you provide links to the paper the diagrams are from, in the description or along with the video? Thanks
@pedrorafaelnunes · 6 days ago
I have done something close to a mixture of agents, I think. I got a bunch of local, OpenAI, and Groq LLMs to respond to the same input, then a voting system to choose the best and most correct output of all. It was capable of giving the correct output for almost every question!
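The voting step described above can be as simple as a majority count over the candidate outputs (exact-match voting; a real system would first cluster semantically equivalent answers):

```python
from collections import Counter

def vote(answers):
    # Majority vote over candidate model outputs; ties go to the
    # answer seen first (Counter preserves insertion order).
    return Counter(answers).most_common(1)[0][0]

outputs = ["Paris", "Paris", "Lyon"]  # hypothetical per-model outputs
print(vote(outputs))  # Paris
```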
@JakobN-zg1st · 6 days ago
Thanks for all the work you put in. And I always appreciate the open source love
@aSFADVSrbWETRWEYHTET · 6 days ago
Hey, could you potentially share the notion page, where you have your benchmarks?
@matthew_berman · 6 days ago
bit.ly/3qHV0X7 — sorry, I usually share it! I'll put it in the desc as well
@noeservellon · 6 days ago
Can you make an episode on how to run this locally? It would be interesting to see this run with SLMs instead of LLMs
@brulsmurf · 6 days ago
locally on your 30000€ GPU?
@wurstelei1356 · 6 days ago
I think this is running locally. Still a tutorial on how to run the MoA code from the github repo would be great.
@realKytra · 6 days ago
thanks, your channel is fantastic 👌 Keep up the good work, very interesting and inspiring 💪
@maj373 · 6 days ago
Thank you Matthew!
@jozitrucker7123 · 6 days ago
We're waiting for the Claude 3.5 test…
@matthew_berman · 6 days ago
Tomorrow
@dee132456 · 6 days ago
Is it really a fair test? Since there are 4 LLMs across 3 layers, it would be like asking ChatGPT-4o 12 questions. To test whether multiple different LLMs are better, you'd have to run MoA using just ChatGPT-4o as all 4 agents.
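For context, the layered flow being debated is roughly the following. This is a sketch, not the together/MoA repo's actual code, and `call_model` merely echoes here as a stand-in for real inference:

```python
def call_model(name, prompt):
    # Hypothetical stand-in for a real LLM call; here it just echoes.
    return f"{name}: {prompt.splitlines()[0]}"

def mixture_of_agents(prompt, proposers, aggregator, layers=3):
    # Layer 1: every proposer answers the raw prompt.
    answers = [call_model(m, prompt) for m in proposers]
    # Middle layers: proposers answer again, seeing the prior layer's answers.
    for _ in range(layers - 1):
        context = prompt + "\nPrevious answers:\n" + "\n".join(answers)
        answers = [call_model(m, context) for m in proposers]
    # Final step: one aggregator model synthesizes a single response.
    return call_model(aggregator, prompt + "\nSynthesize:\n" + "\n".join(answers))

result = mixture_of_agents("Write 10 sentences ending in apple.",
                           ["qwen2-72b", "llama3-70b", "wizardlm-2", "dbrx"],
                           aggregator="qwen2-72b")
print(result)
```

So with 4 proposers and 3 layers the prompt really does hit a dozen model calls before the aggregator answers, which is the cost the comment above is pointing at.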
@drlordbasil · 6 days ago
I did ML lobes and different models in my project instead of just different models. Love the progress in everyone's work lately!
@ingenierofelipeurreg · 6 days ago
Pls share a cheatsheet for trying it locally
@bodhi.advayam · 6 days ago
2x a 70B model... locally... I need to upgrade my computer!
@bennyboiii1196 · 6 days ago
I don't really see a super big advantage with MoA in this way. I do like the aggregator model, but I feel like there are better (and faster) ways of doing this kind of thing with a router agent and a verification agent. Basically, instead of pooling a bunch of answers, you would route the question to a specific agent, then duplicate that agent to verify the answer, creating an adversarial network that wouldn't spit out an answer until it can verify that it is correct. It would be slow, just like this, but LLMs are quite good at comparison, so boiling a question of any type of logic down to mainly comparison logic would let the LLM play to its advantages. In crewAI, I did a similar experiment and found that it basically got all questions right, even if the initial answer given on the first round was wrong. This included planning questions. To me this is kind of what MCTSr does but at a higher level. The difference was, I did it with only Llama 70B, and didn't bother doing the routing thing. It would probably be more accurate if I did the routing. Instead of the snake game I asked if it could code a draggable element in a window, as well as other UI elements (i.e. a slider, an internal pane, a context menu, etc...) to give it some curveballs in case it was trained on snake.
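The propose-then-verify loop described above can be sketched as follows. Both agents are stubbed with deterministic stand-ins; in practice `drafts` and `verify` would each be LLM calls, with the verifier being the duplicated agent:

```python
def drafts():
    # Stand-in for the answering agent: yields successive drafts (hypothetical).
    yield from ["draft 1", "draft 2", "verified answer"]

def verify(prompt, answer):
    # Stand-in for the duplicated verifier agent (hypothetical check).
    return "verified" in answer

def answer_with_verification(prompt, generate, max_rounds=5):
    # Keep regenerating until the verifier accepts, or give up after max_rounds.
    answer = None
    gen = generate()
    for _ in range(max_rounds):
        answer = next(gen)
        if verify(prompt, answer):
            break
    return answer

print(answer_with_verification("Plan a trip", drafts))  # verified answer
```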
@novantha1 · 6 days ago
One thing I noticed about the performance scaling of the scores is that MoA seems to "crush" the performance of models towards the ceiling of all possible scores; GPT 4 involvement wasn't a strong improvement in capability, compared to just the open source models. The implication of this to me is that a person could probably actually pull back on model size quite a bit and still get fairly competitive performance. With something like S-Lora (I think this was it, I'm referring to the implementation of LoRA that allows hot-swapping of LoRAs at inference), I think you could possibly hit very strong performance with domain specific tuning in a lot of areas and a single, strong, fairly small model. Imagine something to the effect of... Stage 1: Llama 3 8B L3 8B networking LoRA L3 8B database LoRA L3 8B frontend LoRA Stage 2: Llama 3 8B L3 8B x86 intrinsics C LoRA L3 8B pen tester LoRA And so on, so forth. I'm pretty sure a smart implementation could have very little memory overhead in the sense that you could possibly keep the base model loaded and "hot swap" the LoRAs in by calculating the impact of the LoRA at every layer, or you could just save the inverse of the XOR of the LoRA and use it to swap back to the base model before applying the next LoRA in the sequence. With a setup like this I'm pretty sure you could lose not that much performance but be able to run this on a 4090, for instance, or frankly, even on a CPU. Bonus points would be having some form of semantic assessment that let the system pick from hundreds of LoRAs based on the problem at hand, for each stage of the pipeline, so you didn't have to manually set up the pipeline for each individual task.
@eucharisticadoration · 5 days ago
Yes, please try a local version with local LLMs doing an MoA for source code!
@MagnesRUS · 6 days ago
Thanks! I wonder how they would work in conjunction with proprietary models, or as a combination of the best models from the leaderboard at different parameter sizes (8B, 72B, etc.). Coding would also be interesting to see. An interesting option is combining small models so that they fit into 16-24-48 GB.
@fahadxxdbl · 6 days ago
I love these evaluations
@Bacca839 · 6 days ago
I found it incredibly interesting to see that it queried gravity for the marble problem considering that you removed that portion of the prompt a while back.
@dudedkdk · 3 days ago
I think it would be beneficial to explore more advanced tasks for agentic models to truly demonstrate whether they outperform those that respond to single, one-shot prompts. Tasks could include writing documentation for a large codebase, undertaking more complex, prolonged machine learning training, or other activities that exceed what a single prompt could encompass. It would be very interesting to have different evaluations for the base model and agentic workflow models, highlighting their respective capabilities. As always thanks for the vid!
@brianWreaves · 6 days ago
Instead of running all 3 steps in parallel, which is similar to CoT, is there a method where, in the 2nd step, each model evaluates the other models' responses to improve its own 2nd response? Then in the 3rd step they merge all responses into a single one, which the 4th step gives as the answer. That would be the true value: collaborating on the result just as if you were collaborating with two colleagues at work.
@emnovoa · 6 days ago
Could you give details of the hardware you used to run this example?
@marcfruchtman9473 · 6 days ago
Thanks for the review. I do think the Mixture of Agents method might be a little difficult for code; how do they come together to decide on the right code without adversely affecting each other?
@geonovelty · 6 days ago
Can we choose local fine-tuned models or other models from Hugging Face? Or multiple LoRAs instead of a single selected base model?
@jonmichaelgalindo · 6 days ago
It just randomly added the word "apple" to the end of the sentences. :-P Well-played, AI.
@wurstelei1356 · 6 days ago
Yes, Matt should extend the question to something like: 10 sentences with the word apple at the end that make sense.
@UnchartedDiscoveries · 4 days ago
interested to see MoA using LLAMA 3, GPT-4o and Sonnet 3.5
@ronbridegroom8428 · 6 days ago
Yes, I would like to see this with coding related models. Thanks for all the work involved in your videos.
@dudufusco · 5 days ago
Did you run it all locally? Which hardware is needed to have enough performance for real life applications?
@nathanbanks2354 · 6 days ago
It'll be fun to watch Anthropic and OpenAI et al apply all of these research papers. Plus it will be great to see Meta & various open-source models jump ahead of them again. This also gives me hope for high quality artificial training data.
@KodandocomFaria · 6 days ago
Have you tried the Microsoft Samba hybrid model?
@KurtWoloch · 6 days ago
So what happens if you compare MoA with the newly released Claude 3.5 Sonnet?
@mikezooper · 6 days ago
Matthew’s millionth video: his AI clone while he’s on the beach sipping cocktails 😀
@wurstelei1356 · 6 days ago
Sometimes I think his AI clone is already in the current video...
@user-tz7jq9sw4d · 4 days ago
Is your benchmarking focused on single shot accuracy? Between Claude, Gemini and GPT4o, if you pass a script from one LLM to the next asking each to make corrections they get it right by about the 3rd hop
@glitch_city_gamer2846 · 6 days ago
I think the most interesting outcome of this test run was the explanation of the flaws in the more difficult logic reasoning questions and where the LLMs get confused, giving us better insight into how they're thinking about problems. It would be interesting to ask how to write a prompt with the specific information it would need to understand: the marble size and cup size, that the cup is open-ended, etc. The concept itself is amazing, of course. It would be interesting to create a mixture of experts of code models, and then create an MoA architecture on top of that: use the top 5 open-source coding models as the coding expert in the MoE, and the best closed-source LLM as the coordinator, vs. open source. Bit of a "how deep does the rabbit hole go".
@damienboykin7772 · 5 days ago
Would it be possible to combine this with Nvidia's Scuda to accelerate the processing speed of querying all the models?
@masonweimer5337 · 5 days ago
I would definitely love to see this tested but with models more focused on coding! Keep up the good work!
@darwinboor1300 · 6 days ago
Thanks Matthew. Now we need a task-parsing AI to break prompts into tasks and a supervisor AI to iterate and optimize the MoA build for each task. Next, put the crew to work building a factual real-world knowledge base, identifying holes in that knowledge base, and building better versions of the crew and the hardware they run on. PS Love your new hardware. Thanks to Dell and Nvidia
@romgenie · 6 days ago
Absolutely would love to see a setup with coding agents (or uniquely as you suggested with testing the code execution).
@24-7gpts · 6 days ago
Nice concept! It's just like a diverse group of researchers rather than just one
@chetanreddy6128 · 6 days ago
Yes, we need a code-specific open-source model agents benchmark video
@MeinDeutschkurs · 6 days ago
What exactly is a sentence? Does a sentence end with a period, question mark, or exclamation mark? Can it end with a comma? Hmmm.
@isaach.1135 · 6 days ago
So is there a self hosted option? Could see about using lighter weight models to make it more practical, but checking out the linked github page, it just says to grab an API key...
@isg9106 · 6 days ago
I really like the rubric you use to test the models, but I've always felt it could benefit greatly from just the slightest adjustment in the values you use when presenting the questions. Some models are really good at repeating things verbatim and get tripped up when the numbers are even slightly modified from the original, and I think you've even mentioned the idea of adding this to your rubric in the past. I'm REALLY interested in seeing which models completely fail when given minor changes in the parameters of the problems they were trained on.
@snts_andres · 6 days ago
What would be the difference of creating the same architecture with multiple layers of the same model? Or creating several responses on the same layer and then a second verification layer? Isn't this basically selection-inference prompting? I know that each model is better at certain tasks but in my opinion this adds a lot of complexity
@ahrmiller2003 · 5 days ago
Great review. Yes, please do one for coding via multi AI. Thank you.
@thetrueanimefreak6679 · 6 days ago
Matt, thanks for the hard work. I think try incorporating agentic AI in the mix in your next video with these LLMs. Much love
@pigeon_official · 2 days ago
What happens if you use MoA with all of the agents being the same model? Like, could you just take the same model, say llama3 70b, and have all 4 models be llama3 70b?
@REDULE26 · 3 days ago
On github they’re talking about MoA lite, is this an implementation with only small models like llama3 8b, phi3 small,… ? I’m kinda curious about how good it could be
@hinro · 6 days ago
Have you tried using it with open-interpreter? You might be able to have it test itself with code
@rahulnundlall2617 · 6 days ago
Very keen to see you test MoA with coding models
@danberm1755 · 6 days ago
From my experience it makes 100% sense that agents are MUCH stronger than a single pass through the neural network for each word. You have to envision the training data of the Internet. We already have AGI, we just need to expand agents. Agents provide critical thinking about the random thoughts that pass through an LLM's brain. Just like humans do.
@carlosamado7606 · 6 days ago
True, imagine giving the first answer that comes to your mind. No source checking, no editing, no deep thought about the subject, etc...
@aleksandreliott5440 · 6 days ago
I would love to see a "mixture of agents" video for code stuff.
@VishnuSashi-yq3tt · 6 days ago
Been working on this for 3 months and i see this ughh
@jkcrews09 · 5 days ago
Could you run all individually and combined (MoA) at the same time…?
@yrudrc · 6 days ago
Amazing 🤩
@itamarperez-ryan3654 · 6 days ago
How can I learn to create agents?
@DanielKnoodle · 5 days ago
@matthew_berman I would love to see the code version of MoA. What are your current favorite top models for code generation?
@miket64 · 6 days ago
It would be great to see the result using more accessible models like Llama 3 8B
@MrMiniPilote · 4 days ago
New Test: "Given these letters; R, W, I, E, S, Z, please provide all the English 4 letter words that are possible. Each letter can only be used once per word." I haven't found a model yet that answers correctly.
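That test is trivially checkable in code, which is what makes it a nice benchmark. A sketch with a tiny stand-in dictionary (an assumption for the demo; swap in a real word list such as /usr/share/dict/words):

```python
from itertools import permutations

LETTERS = "rwiesz"
# Tiny stand-in dictionary (hypothetical); use a real word list in practice.
WORDS = {"wise", "rise", "wire", "weir", "sire", "size", "wiser", "prize"}

def solve(letters, words, n=4):
    # Build every n-letter arrangement (each letter used at most once)
    # and keep the arrangements that are actual words.
    candidates = {"".join(p) for p in permutations(letters, n)}
    return sorted(candidates & {w for w in words if len(w) == n})

print(solve(LETTERS, WORDS))  # ['rise', 'sire', 'size', 'weir', 'wire', 'wise']
```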
@talonfirst · 4 days ago
This seems like a nitpick, but wouldn't the answer to the Killers question be FOUR? Just because one of the original three becomes a corpse, he's still a killer. Or is it one of those existential metrics like "A person should not be defined by their profession" or "How did he lose his job? He died"?
@positivevibe142 · 6 days ago
Guys, any recommendation for a good, inexpensive laptop to run / play around with large language models (LLMs)? Around $1000 maybe! Currently I have an MSI G65 Thin with 40GB RAM but 6GB VRAM, and it can hardly run the 72B models! So slow, and it overheats! 🤨
@最新AI应用 · 5 days ago
Impressive! But can it beat GPT-4 in a karaoke contest? I'd pay to see that showdown!
@christopherroge5621 · 6 days ago
Basically you're running the same prompt through 4 models? Expensive.
@merelogics · 6 days ago
Probably increasing the token limit when executing the coding prompt might output better results.🤔
@robboerman9378 · 5 days ago
If you take away the numbers from the “word count”, is it still incorrect? Just wondering if wordcount counted the numbers as words where the MoA did not 🤷‍♂️
@arnaudjean1159 · 6 days ago
How much time till they fix the code 😂?? And after ?? I bet it will boost again the improvement process
@MrMoonsilver · 6 days ago
I want to see the code models at work! =)
@zippytechnologies · 6 days ago
Yep and yep
@gustavstressemann7817 · 6 days ago
You really have to try out different coding models with this approach. I'm sure it's really cool
@Sparky_Chipmunk · 6 days ago
What I'd like to see is all of AI being on-device instead of in datacenters.
@chipcode5538 · 6 days ago
"You're so friendly: yesterday it gave me the correct answer, but on the exam it did not. Let's call this a pass." As for the programming, it can make some programs that were in the training set. I use Copilot every day; it works in just a minority of cases. Sometimes it produces excellent output, at other times complete garbage. At this point AI is not capable of doing real-world programming tasks without human assistance. I think with the examples I have seen for AI programming, a student could get a working program with one internet search. AI is still impressive, but don't get overexcited.
@paul1979uk2000 · 5 days ago
I think this would be a lot more interesting with much smaller models, especially if you can run 2 or even 3 of them on your GPU, or they run fast enough on the CPU. These bigger models, with a few working together, are not practical in most cases, especially if you want to run them locally; they'd be too big and slow. So I really wonder how well small models do, anywhere from 2B to 13B: you might be able to have 2 or 3 running at the same time with performance that isn't too bad, and if the results are much better than any of the individual models, it would be worth looking into.
@ScottWinterringer · 6 days ago
post the model.
@wurstelei1356 · 6 days ago
Link to the MoA github is in the video description.
@NoHandleToSpeakOf · 6 days ago
Isn't 0.7 temp too high for consistency?
@paelnever · 6 days ago
Many open source coding tools like opendevin already execute the code and review it to fix issues.
@orthodox_gentleman · 6 days ago
Exactly
@WiseWeeabo · 6 days ago
Personally I'm really impressed at the INSIGHTS of Claude 3 sonnet. It's not as polished as gpt4 so it's not as good at writing code, but when I use both models gpt-4o and claude 3 in combination it produces some truly insightful results.
@johnbollenbacher6715 · 6 days ago
Here is a simple question that ChatGPT always gets wrong. “How many p’s are there in the word pepper”.
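Letter-counting trips up LLMs because they see tokens, not characters; the ground truth is one line of code:

```python
def count_letter(word, letter):
    # LLMs tokenize words into multi-character chunks, so they often
    # miscount letters; plain string counting gives the ground truth.
    return word.lower().count(letter.lower())

print(count_letter("pepper", "p"))  # 3
```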
@DiegoSilva-dv9uf · 6 days ago
Thanks!
@matthew_berman · 6 days ago
Thank you!