I've built a similar system, but I noticed that the judge model sometimes hallucinates and gives high marks to obviously wrong solutions. I tried using a jury of multiple judges (different big models); it improved judging quality, but made everything about 8x slower. Also, with multiple judges you need to fuse their judgements into some consensus, and that's just pretty slow, and all of the models still hallucinate.
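For what it's worth, the fusion step doesn't have to be another LLM call. A cheap approach is to have each judge emit a numeric score and take the median, flagging big disagreements for a re-run. Just a sketch; the judge names, the 0-10 scale and the threshold are made-up placeholders:

```python
from statistics import median

def fuse_judgements(scores: dict[str, float], spread_threshold: float = 3.0):
    """Fuse per-judge scores (0-10) into one consensus score.

    scores: mapping of judge model name -> score it gave the solution.
    Returns (consensus, needs_review); needs_review is True when the judges
    disagree so much that a human (or another pass) should take a look.
    """
    values = list(scores.values())
    consensus = median(values)                        # robust to one hallucinating judge
    needs_review = (max(values) - min(values)) > spread_threshold
    return consensus, needs_review

# Hypothetical example: three judges, one of them hallucinating a high mark
print(fuse_judgements({"gpt-4": 3.0, "claude": 2.5, "gemini": 9.0}))
# -> (3.0, True)  -- the median ignores the outlier, but the spread flags it
```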
One of the problems almost all models suck at is the "fox, chicken and sack of grain" puzzle (also known as the wolf, goat and cabbage problem). All models recognize that it's a classic puzzle, but only a few can give a coherent solution without weird glitches.
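The funny thing is the puzzle is trivial to solve exactly with a tiny search, which also makes it easy to check a model's answer against the real one. A rough BFS sketch (state is just which items are on the starting bank):

```python
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")

def unsafe(bank):
    # Without the farmer present, the wolf eats the goat or the goat eats the cabbage.
    return "farmer" not in bank and (
        {"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank
    )

def solve():
    start = frozenset(ITEMS)            # everyone starts on the left bank
    goal = frozenset()                  # everyone must end up on the right bank
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        left, path = queue.popleft()
        if left == goal:
            return path
        farmer_on_left = "farmer" in left
        side = left if farmer_on_left else frozenset(ITEMS) - left
        # The farmer crosses alone or with one item from his current side.
        for cargo in [None] + [i for i in side if i != "farmer"]:
            moved = {"farmer"} | ({cargo} if cargo else set())
            new_left = left - moved if farmer_on_left else left | moved
            right = frozenset(ITEMS) - new_left
            if unsafe(new_left) or unsafe(right) or new_left in seen:
                continue
            seen.add(new_left)
            queue.append((new_left, path + [cargo or "nothing"]))

print(solve())
# e.g. ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']
```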
aya:35b blows everything else out of the water. Not ten times better than ChatGPT but a hundred times better. It's slow since it's 35B running locally, but I love it. Besides that I use llama3 for most everyday tasks.
In the bubble sort evaluation, all the models that were marked as wrong (Mistral, Codestral, etc.) had a syntax error on line 1 because the output text was included as a line of code; the code itself was sound in all of them. So it's not a proper eval: you need to check your own code and figure out why it worked for a couple of models but not the others, because a syntax error that came from your harness rather than the LLM's code doesn't make for a proper eval. Other than that, it's a cool idea.
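That class of failure usually comes from executing the raw model reply instead of just the code block inside it. A small extraction step before running the snippet avoids penalising a model for its surrounding prose. A sketch, assuming the models wrap their code in markdown fences:

```python
import re

FENCE = chr(96) * 3  # the markdown code-fence delimiter, built without literal backticks

def extract_code(reply: str) -> str:
    """Pull the first fenced code block out of an LLM reply.

    Falls back to the raw reply if the model didn't use fences,
    so plain-code answers still pass through untouched.
    """
    pattern = FENCE + r"(?:\w+)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

# Hypothetical reply: the prose on line 1 would be a syntax error if executed as-is
reply = (
    "Here is a bubble sort implementation:\n"
    + FENCE + "python\n"
    + "print(sorted([3, 1, 2]))\n"
    + FENCE
)
print(extract_code(reply))   # -> print(sorted([3, 1, 2]))
```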
What is the point of evaluating many models with some more powerful model if that's required for every problem? It would be much faster to just ask GPT-4 for the answer to the problem directly.
Because ChatGPT can't be run locally. If you can evaluate which small local model is best for a task, then you can use that model locally on your PC. If you have sensitive code or sensitive information, you don't want to pass it through ChatGPT since OpenAI will take your data, so you run locally. Not to mention, running locally is completely free, whereas using the ChatGPT API is going to cost you. The whole point of the test is basic test examples, so you can then pick the model to do a similar, more complex task.
@@TwoWayOrbitalStation The issue I see is that the results might be different for slightly changed tasks. For this particular task you get the right result, but if you ask for something similar yet different, the answer might be wrong. So if you want to use small models locally, you need some other way to evaluate the results that doesn't depend on ChatGPT.
@@JohnDoe-zx8bu In this system, ChatGPT (being the best model) provides a rough approximation of the "best" solution from those provided, saving you the tokens it would spend producing its own lengthy results. You can also use this in a multi-agent loop with whatever LLM it picked to improve the output entirely on the user's side, with no additional tokens. This is just where documentation and an understanding of what you're asking of the LLMs comes into play. If you're letting the blind lead the blind, of course it's going to be horrible. However, if you need a bit of help doing this one thing you just can't get right, then problem solved. We are not at the stage where these tools are autonomous fix-alls for every problem. I've seen many people saying things like "ChatGPT almost broke my computer because I tried to get it to help me do this thing", but the reality is, THEY almost broke their computer because they had no fking idea what ChatGPT was telling them. Accountability is on the end user here to determine what is and is not a useful output, and how it can then be applied.
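To make the "improve output on the user side" part concrete, here's a minimal sketch of that kind of loop against a locally running Ollama server. The model name, prompts, and fixed round count are placeholders; the point is just that once the strongest model has picked a local model for the task, the iteration can stay entirely on your machine:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def ask(model: str, prompt: str) -> str:
    """One non-streaming completion from a locally running Ollama model."""
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]

def refine(model: str, task: str, rounds: int = 3) -> str:
    """Draft-then-critique loop that runs entirely on the local machine."""
    draft = ask(model, task)
    for _ in range(rounds):
        critique = ask(model, f"Point out any mistakes in this answer to '{task}':\n{draft}")
        draft = ask(
            model,
            f"Task: {task}\nPrevious answer:\n{draft}\nCritique:\n{critique}\nWrite an improved answer.",
        )
    return draft

# Hypothetical usage with whichever local model the evaluation step picked
print(refine("llama3", "Write a Python function that bubble sorts a list."))
```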