I use this 32b model primarily for coding now. It's done so well that I wonder if they trained it against Claude 3.5 coding output, because it is very good. I wish one of these companies would make a hyper-focused coding-corpus model so that it can fit into 48GB of VRAM at very high precision.
It went off the rails because you keep reusing the same Open WebUI chat and overflowed the default Ollama context size, which is 2k for any model. Use different chats for different topics; it will save you from pushing the entire chat history to the model even when the previous messages are no longer relevant to what you're asking. And increase the context size to something like 8k.
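One way to raise the context window is per request through Ollama's API options. A minimal sketch, assuming a local Ollama server on the default port and that the model shown is already pulled (the model name and prompt are placeholders):

```python
import requests

# Minimal sketch (assumes a local Ollama server on the default port and that
# "qwen2.5:32b" is already pulled; swap in whatever model you actually run).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",
        "prompt": "Summarize the conversation so far.",
        "stream": False,
        # Override Ollama's 2k default context window for this request.
        "options": {"num_ctx": 8192},
    },
    timeout=600,
)
print(resp.json()["response"])
```

The same num_ctx parameter can also be baked into a Modelfile (PARAMETER num_ctx 8192) or, if I recall correctly, set per chat under Open WebUI's advanced parameters.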
wdym "failed" the first 2 tests. Does the game actually work if you put a png image in the corresponding folder? Btw there's a solid argument that could be made that if the scenario you proposed was the "best" the entire human kind was able to put together as a plan, the correct thing to do not to save us
Size vs. quantization: when I can only fit a larger model at q2, I can instead fit a smaller model at q8 in the same VRAM. Comparing the two, the larger q2 model is much more likely to give me gibberish. The only advantage I find with larger models at lower quantization is that they handle a larger system prompt better.
Yeah, my floor is q4 from what I have seen returned. Have you compared q8 vs fp16 on many models? The gains at fp16 in Llama 3.1/3.2 do not seem to make a big difference.
@@DigitalSpaceport I don’t find any difference in quality between q8 and fp16. Even q6 tends to be about the same for my use case. Below q6 I can tell a difference with long inputs. One thing I do see a difference in is the size of the output. q8 and fp16 seem to output about the same amount, but q6 will often output less, which can be a problem for me if I have a large output structure in the system prompt.
Good to get your observation on that. q8 does seem like the sweet spot and qwen 2.5 q8 is mind-blowingly good if I need to have 4 models loaded in for RAG.
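As a rough illustration of the size-vs-quantization tradeoff discussed above, here's a back-of-the-envelope sketch; the bits-per-weight figures are assumptions that vary by quant scheme, and KV cache plus runtime buffers come on top of the weights:

```python
# Back-of-the-envelope sketch (assumed bits-per-weight figures; real quant
# schemes vary, and KV cache plus runtime buffers come on top of this).
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GB: params * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"72B @ ~2.6 bpw (q2-ish): {approx_weight_gb(72, 2.6):.0f} GB")  # ~23 GB
print(f"32B @ ~8.5 bpw (q8_0):   {approx_weight_gb(32, 8.5):.0f} GB")  # ~34 GB
print(f"14B @ ~8.5 bpw (q8_0):   {approx_weight_gb(14, 8.5):.0f} GB")  # ~15 GB
```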
Add some tool-use tests, like web searching and text summarization. Open WebUI comes with web search and community tools you can equip. Ask it something like finding you some good AMD Ryzen laptops from 2024 and composing a comparison table with all the specifications and prices.
I've found the Qwen2.5 7b model to be the best of the current crop of 7b models. I've tried Llama3.1 7b, InternLM2.5 7b and Mistral 7b. My second-place choice is the InternLM model. Great video, by the way. Nice to hear an honest opinion about the benchmarks; they are completely 'gamed' and pretty much meaningless. The only way is to gauge them yourself, as you have done here. Good work.
Exactly my experience at the 7b size. My use case builds quite large prompts, and they all struggle at this size, but InternLM was my go-to. I find that Qwen2.5 and InternLM are about the same, but I prefer Qwen's output and formatting.
The ethical question is a very difficult issue, maybe unsolvable given that we want these things to be agents. The problem is that if you can give it just the right ethical scenario and the AI will do as you say, then any bad actor could simply lie to their AI and have it go on a killing spree. That's not too desirable either. But then how do we ensure that these things are making reasonable decisions when we give them any level of autonomy? I'm not sure how we resolve that contradiction.
In my opinion, having the primary initiator's decision-making scaffolding be the entire chain is an issue I can see leading to real problems. Independent, unassociated evaluative recommendation systems, not running on homogeneous base models, feeding into an arbiter that weights them, is a decent solution that could work better. We already have machines making life and death decisions autonomously at scales we likely don't see day to day.
Hi, I tried several versions of Qwen+Calme 70b, 72b, and 78b on LM Studio with all sorts of quants; Q5 and Q6 seem to perform best, but I didn't find any with a sufficient conversational speed. The 3090 seems to work. While I have read the definitions of K_S, K_M and so on, I haven't really fully absorbed the concept yet, and from one model to the next, the "best performing model for my hardware" isn't always the same. The cozy spot is around 16GB even though the device has 24GB. What am I missing? What settings should I tweak?
I'd be interested in this too. There is so little content specific to 24GB VRAM machines. The demand should be there, since it is the only affordable solution for most.
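On the "cozy spot around 16GB" question, a likely explanation is that the runtime needs headroom beyond the weights for the KV cache and compute buffers. A rough sketch of how the KV cache grows with context length; the layer and head counts below are illustrative ballpark figures for a 32B-class model with grouped-query attention, not taken from any specific model card:

```python
# Illustrative sketch: on top of the quantized weights, the runtime needs room
# for the KV cache, which grows linearly with context length. A common fp16
# KV cache estimate:
#   2 (K and V) * layers * kv_heads * head_dim * context_tokens * 2 bytes
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx_tokens: int) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_tokens * 2 / 1e9

# Assumed ballpark architecture: 64 layers, 8 KV heads, head_dim 128.
for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(64, 8, 128, ctx):.1f} GB of KV cache")
```

So a model whose weights fill 20GB or more of a 24GB card can run out of room once the context gets long, which is one reason the comfortable weight budget lands well under the card's full capacity.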
It would be helpful to see ollama ps output and see how much is actually on the GPUs and how much is run by the CPU. I noticed that the four 4090s only ran at 1/4 compute utilization, and seeing the execution context might shine some light on the discrepancy. Please consider including this; it is especially important with GPUs that have less VRAM.
I am mindful of this in all tests, and all models reviewed here fit fully in VRAM; I do check. Yes, the workload is split into 4 and each GPU runs at 1/4 speed on 1/4 of the workload. This is how llama.cpp, the model processor behind Ollama, currently does parallelism. vLLM enables an alternate way to do parallelism that may significantly improve on that, which I will test here.
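For anyone wanting to check the VRAM/CPU split themselves, the ollama ps command (or the underlying /api/ps endpoint) reports how much of each loaded model is resident in VRAM. A small sketch, assuming a local Ollama server on the default port:

```python
import requests

# Sketch assuming a local Ollama server: GET /api/ps lists loaded models with
# their total footprint ("size") and the portion resident in VRAM ("size_vram").
# If size_vram is smaller than size, part of the model spilled to system RAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
for m in resp.json().get("models", []):
    size = m.get("size", 0)
    vram = m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0
    print(f"{m['name']}: {size/1e9:.1f} GB total, {vram/1e9:.1f} GB in VRAM ({pct:.0f}%)")
```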
Oh, it's because I tested the P2000, as I have one on hand. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CJ2NHOCQ2yY.html I'm pretty much testing everything I've got around. I don't have a P5000, but yeah, extra VRAM and more CUDA cores FTW!
Do you think this 72B model can be run with 64GB of RAM and an RTX 3060? Btw, I wouldn't use a red background colour for PASSED; use green instead. Red seems more suited to a word like FAILED.
If I were benchmarking just for high numbers, probably. We have plenty of those benchmarks and synthetics, though. I'm interested in how usage goes for normies like me. I often don't make a clean new chat per topic, and the ones I do make eventually meander off the originally conceived topic. It isn't scientific testing, but rather mimics normie usage patterns. That choice is purposeful.
@@DigitalSpaceport Good intent on your part, but I think we should promote using LLMs well. Making new chats for new topics is simply something everyone has to learn, and YouTube videos like yours should educate on this.
The AI model madness is getting too complicated. As an end user I have to evaluate a freaking lot of parameters: hardware needs, fine-tunes, dedicating a role to the model, the main subject it's trained on, top-p value, top-k value, penalty, temperature, etc., etc. I need an AI model to help me find the most easy-going one for my needs! And by the way, they are big in size and the hardware needs are going crazy. Someone please collect all the AI model knowledge in one place and create an easy interface with few parameters. Right now, getting, using, and hosting an AI model is becoming more expensive and complicated than owning a real brain.
Haha, fair enough, but I don't know what you've got, given that supercomputer setup. M-Ultra chips actually seem like the economical option for mid-sized LLM inference, but I haven't been able to see enough testing results to confirm that; they're weirdly difficult to find.
It's easy to toss hay from the sidelines, which is why I urge everyone to get into YouTube themselves. It is a very humbling journey as a solo producer. Especially hard are the viewers who catch some detail you must have missed, but you have no idea what they are talking about because they don't give it any context.