I haven't seen a single model get the hex problem right. That was the best butterfly so far of the models I've seen you test, though. Thanks for all the reviews and keep up the good work.
I think gpt-4o mini has like a 50% chance of getting the math problem when I try it on lmsys, and perhaps slightly higher with a bit of performance prompting.
DeepSeek is cheap, and it works extremely well in aider, with large context and large outputs. To me, aider --deepseek is the best option other than the paid ones, and the reasoning quality is very good, sometimes better than Sonnet. I never see that rate-limit stuff. And on the web it takes 300-500 lines of context no problem; try pasting some in and see. It can generate whole, long answers, and very rarely cuts the output or prints half a file and stops.
DeepSeek to me is the best, but since they're from China, EU and US companies don't like mentioning them or using their models. Once they release a multimodal model, it's going to have an impact. DeepSeek is made for coding, and I think that's the best approach: instead of making a big model that's good at everything, you focus on making your model good at one thing, and that's what they're doing with DeepSeek. It's a coding model first.

Yes, I sound like a fanboy, but I think you should also improve your prompt, because the more detail you give, the better the result. How about not doing only zero-shot benchmarks? They also mentioned in the tweet at 0:55 that you should update the system prompt and temperature, and this is something nobody does when testing models. A 200B+ model shouldn't fail at the first 2 questions in your benchmark. It's like using a text-to-image model like Midjourney or Stable Diffusion without changing the sref or seed values. You should adapt the system prompt and temperature based on the question type. Question 11 about generating the SVG code for a butterfly, for example, sounds like a coding question, but it's an artistic question first and needs a different system prompt. I saw a huge improvement with my local models when using the Claude 3.5 Sonnet system prompt.
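To make that concrete, here's a minimal sketch of per-question-type settings against DeepSeek's OpenAI-compatible API. The base URL, model name, preset prompts, and temperature values are my assumptions for illustration, not anything from the video or the tweet:

```python
# Minimal sketch: adapt system prompt and temperature per question type.
# The base_url, model name, and the preset values below are assumptions
# for illustration, not settings from the video.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# Hypothetical presets: creative tasks get a different persona and a
# higher temperature than strict coding tasks.
PRESETS = {
    "coding": ("You are a careful programmer. Output only working code.", 0.0),
    "artistic": ("You are a visual artist who writes expressive SVG.", 1.0),
}

def ask(question: str, kind: str) -> str:
    system_prompt, temperature = PRESETS[kind]
    resp = client.chat.completions.create(
        model="deepseek-chat",
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# e.g. the butterfly question would go through the "artistic" preset:
print(ask("Generate SVG code for a butterfly.", "artistic"))
```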
I'm confused. I cannot understand your relatively positive rating for a model of that size with so many fails. Could you please compare the results of the current model to the former models? It all feels so random. Btw: a single zero-shot run doesn't say anything about consistency.
It's not underwhelming at all. The first 2 questions it fails are language questions, which are not DeepSeek's strong suit, because it's a Chinese model mostly trained on Chinese, not English. It excels in coding, apart from the Game of Life, which could be fixed with some simple extra prompting. It's better than Llama-3.1 405B in my mind, because once models go above ~30B parameters almost no one can run them locally, and at that point you have to consider inference pricing instead of parameter count and such, and on inference pricing it's worth the cost.
@AICodeKing Ok, so it's not native in English. Maybe we should first translate the prompt with the help of qwen2:72b (or the Google Translate API) before feeding it to DeepSeek V2.5? Sounds strange, but maybe it works better. The back-translation could be a challenge, but why not? For more than 1.5 years, I translated image prompts from my native language to English to get better results at image generation.
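Something like this could work as a rough sketch of that pipeline, assuming qwen2:72b is served locally via Ollama's OpenAI-compatible endpoint and DeepSeek through its own API; the URLs, model names, and prompts are assumptions, not a tested setup:

```python
# Sketch of translate -> ask -> back-translate: if DeepSeek is really
# strongest in Chinese, translate the English prompt into Chinese with
# qwen2:72b, query DeepSeek, then translate the answer back to English.
# Endpoints and model names are assumptions for illustration.
from openai import OpenAI

translator = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
deepseek = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def chat(client, model, prompt):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def ask_in_chinese(english_prompt):
    # 1. English -> Chinese with qwen2:72b.
    zh_prompt = chat(translator, "qwen2:72b",
                     "Translate to Chinese, output only the translation:\n" + english_prompt)
    # 2. Ask DeepSeek in Chinese.
    zh_answer = chat(deepseek, "deepseek-chat", zh_prompt)
    # 3. Back-translate the answer (the step flagged above as the hard part).
    return chat(translator, "qwen2:72b",
                "Translate to English, output only the translation:\n" + zh_answer)
```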
I agree, we have plenty of models that will answer all the fluff questions. How about digging into some serious coding and making a benchmark for that? I use AI for a few minor things besides coding, but coding is the majority of my use. Also, one-shot code answers are cool and all, but not real world. How well can it do if you give it multiple shots? Can it make efficient, working code?
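As a sketch of what a multi-shot benchmark loop could look like: generate a script, run it, and feed any traceback back to the model for another attempt. The endpoint, model name, and the crude extraction/execution harness below are assumptions for illustration, not a real benchmark:

```python
# Minimal sketch of a multi-shot coding loop: generate code, run it, and if
# it errors out, feed the traceback back to the model for another attempt.
# The endpoint, model name, and this crude harness are illustrative only.
import subprocess, sys, tempfile
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def generate(messages):
    resp = client.chat.completions.create(model="deepseek-chat", messages=messages)
    return resp.choices[0].message.content

def run(code):
    # Write the candidate script to a temp file and execute it.
    # (A real harness would sandbox this and catch timeouts.)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    return subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=30)

def solve(task, max_shots=3):
    messages = [{"role": "user", "content": task + "\nReply with only a Python script."}]
    for _ in range(max_shots):
        # Crude fence stripping in case the model wraps the code in markdown.
        code = generate(messages).strip().removeprefix("```python").removesuffix("```")
        result = run(code)
        if result.returncode == 0:
            return code  # a real benchmark would also verify the output is correct
        # Feed the failure back for the next shot.
        messages += [{"role": "assistant", "content": code},
                     {"role": "user", "content": "That failed with:\n" + result.stderr + "\nFix it."}]
    return None
```

A real benchmark would also check the program's output against expected results and track how many shots each model needs to converge.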