I use this 32b model primarily for coding now. It's done so well that I wonder if they trained it against Claude 3.5 coding output, because it is very good. I wish one of these companies would make a hyper-focused coding-corpus model so that it can fit into 48GB of VRAM at very high precision.
It went off the rails because you keep reusing the same Open WebUI chat and overflowed the default Ollama context size, which is 2k for any model. Use different chats for different topics; it will save you from pushing the entire chat history to the model even when the previous messages are no longer relevant to what you're asking. And increase the context size to something like 8k.
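One way to raise the context window is per request through Ollama's API options. A minimal sketch, assuming a local Ollama server on the default port and that the model shown is already pulled (the model name and prompt are placeholders):

```python
import requests

# Minimal sketch (assumes a local Ollama server on the default port and that
# "qwen2.5:32b" is already pulled; swap in whatever model you actually run).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",
        "prompt": "Summarize the conversation so far.",
        "stream": False,
        # Override Ollama's 2k default context window for this request.
        "options": {"num_ctx": 8192},
    },
    timeout=600,
)
print(resp.json()["response"])
```

The same num_ctx parameter can also be baked into a Modelfile (PARAMETER num_ctx 8192) or, if I recall correctly, set per chat under Open WebUI's advanced parameters.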
wdym "failed" the first 2 tests. Does the game actually work if you put a png image in the corresponding folder? Btw there's a solid argument that could be made that if the scenario you proposed was the "best" the entire human kind was able to put together as a plan, the correct thing to do not to save us
Size vs. quantization: when I can only fit a larger model at q2, I can instead fit a smaller model at q8 in the same VRAM. Comparing the two, the larger q2 model is much more likely to give me gibberish. The only advantage I find with larger models at lower quantization is that they handle a larger system prompt better.
Yeah, my floor is q4 from what I have seen returned. Have you compared q8 vs fp16 on many models? The gains at fp16 in Llama 3.1/3.2 do not seem to make a big difference.
@@DigitalSpaceport I don’t find any difference in quality between q8 and fp16. Even q6 tends to be about the same for my use case. Below q6 I can tell a difference with long inputs. One thing I do see a difference in is the size of the output. q8 and fp16 seem to output about the same amount, but q6 will often output less, which can be a problem for me if I have a large output structure in the system prompt.
Good to get your observation on that. q8 does seem like the sweet spot and qwen 2.5 q8 is mind-blowingly good if I need to have 4 models loaded in for RAG.
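As a rough illustration of the size-vs-quantization tradeoff discussed above, here's a back-of-the-envelope sketch; the bits-per-weight figures are assumptions that vary by quant scheme, and KV cache plus runtime buffers come on top of the weights:

```python
# Back-of-the-envelope sketch (assumed bits-per-weight figures; real quant
# schemes vary, and KV cache plus runtime buffers come on top of this).
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GB: params * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"72B @ ~2.6 bpw (q2-ish): {approx_weight_gb(72, 2.6):.0f} GB")  # ~23 GB
print(f"32B @ ~8.5 bpw (q8_0):   {approx_weight_gb(32, 8.5):.0f} GB")  # ~34 GB
print(f"14B @ ~8.5 bpw (q8_0):   {approx_weight_gb(14, 8.5):.0f} GB")  # ~15 GB
```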
Add some tool-use tests, like web searching and text summarization. Open WebUI comes with web search and community tools you can equip. Ask it something like finding you some good AMD Ryzen laptops from 2024 and composing a comparison table with all the specifications and prices.
I've found the Qwen2.5 7b model to be the best of the current crop of 7b models. I've tried Llama3.1 7b, InternLM2.5 7b and Mistral 7b. My second-place choice is the InternLM model. Great video, by the way. Nice to hear an honest opinion about the benchmarks; they are completely 'gamed' and pretty much meaningless. The only way is to gauge them yourself, as you have done here. Good work.
Exactly my experience at the 7b size. My use case builds quite large prompts, and they all struggle at this size, but InternLM was my go-to. I find that Qwen2.5 and InternLM are about the same, but I prefer Qwen's output and formatting.
The ethical question is a very difficult issue, maybe unsolvable given that we want these things to be agents. The problem is that if you can give it just the right ethical scenario and the AI will do as you say, then any bad actor could simply lie to their AI and have it go on a killing spree. That's not too desirable either. But then how do we ensure that these things are making reasonable decisions when we give them any level of autonomy? I'm not sure how we resolve that contradiction.
In my opinion, having the primary initiator's decision-making scaffolding be the entire chain is an issue I can see leading to real problems. Independent, unassociated evaluative recommendation systems, not running on homogeneous base models, feeding into an arbiter that weights them, is a decent solution that could work better. We already have machines making life and death decisions autonomously at scales we likely don't see day to day.
Hi, I tried several versions of Qwen+Calme 70b, 72b, and 78b on LM Studio with all sorts of quants; Q5 and Q6 seem to perform best, but I didn't find any with a sufficient conversational speed. The 3090 seems to work. While I have read the definitions of K_S, K_M and so on, I haven't really fully absorbed the concept yet, and from one model to the next, the "best performing model for my hardware" isn't always the same. The cozy spot is around 16GB even though the device has 24GB. What am I missing? What settings should I tweak?
I'd be interested in this too. There is so little content specific to 24GB VRAM machines. The demand should be there, since it is the only affordable solution for most.
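On the "cozy spot around 16GB" question, a likely explanation is that the runtime needs headroom beyond the weights for the KV cache and compute buffers. A rough sketch of how the KV cache grows with context length; the layer and head counts below are illustrative ballpark figures for a 32B-class model with grouped-query attention, not taken from any specific model card:

```python
# Illustrative sketch: on top of the quantized weights, the runtime needs room
# for the KV cache, which grows linearly with context length. A common fp16
# KV cache estimate:
#   2 (K and V) * layers * kv_heads * head_dim * context_tokens * 2 bytes
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx_tokens: int) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_tokens * 2 / 1e9

# Assumed ballpark architecture: 64 layers, 8 KV heads, head_dim 128.
for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(64, 8, 128, ctx):.1f} GB of KV cache")
```

So a model whose weights fill 20GB or more of a 24GB card can run out of room once the context gets long, which is one reason the comfortable weight budget lands well under the card's full capacity.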
It would be helpful to see ollama ps output and see how much is actually on the GPUs and how much is run by the CPU. I noticed that the four 4090s only ran at 1/4 compute utilization, and seeing the execution context might shine some light on the discrepancy. Please consider including this; it is especially important with GPUs that have less VRAM.
I am mindful of this in all tests, and all models reviewed here fit fully in VRAM; I do check. Yes, the workload is split into 4 and each GPU runs at 1/4 speed on 1/4 of the workload. This is how llama.cpp, the model processor behind Ollama, currently does parallelism. vLLM enables an alternate way to do parallelism that may significantly improve on that, which I will test here.
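For anyone wanting to check the VRAM/CPU split themselves, the ollama ps command (or the underlying /api/ps endpoint) reports how much of each loaded model is resident in VRAM. A small sketch, assuming a local Ollama server on the default port:

```python
import requests

# Sketch assuming a local Ollama server: GET /api/ps lists loaded models with
# their total footprint ("size") and the portion resident in VRAM ("size_vram").
# If size_vram is smaller than size, part of the model spilled to system RAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
for m in resp.json().get("models", []):
    size = m.get("size", 0)
    vram = m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0
    print(f"{m['name']}: {size/1e9:.1f} GB total, {vram/1e9:.1f} GB in VRAM ({pct:.0f}%)")
```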
Oh, it's because I tested the P2000, as I have one on hand. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CJ2NHOCQ2yY.html I'm pretty much testing everything I've got around. I don't have a P5000, but yeah, extra VRAM and more CUDA cores FTW!
Do you think this 72B model can be run with 64GB of RAM and an RTX 3060? Btw, I wouldn't use a red background colour for PASSED; use green instead. Red seems more suited to a word like FAILED.
If I were benchmarking just for high numbers, probably. We have plenty of those benchmarks and synthetics, though. I'm interested in how usage goes for normies like me. I often don't make a clean new chat per topic, and the ones I do make eventually meander off the originally conceived topic. It isn't scientific testing, but rather mimics normie usage patterns. That choice is purposeful.
@@DigitalSpaceport Good intent on your part, but I think we should promote using LLMs well. Making new chats for new topics is simply something everyone has to learn, and YouTube videos like yours should educate on this.
The AI model madness is getting too complicated. As an end user I have to evaluate a freaking lot of parameters: hardware needs, fine-tunes, dedicating a role to the model, the main subject it's trained on, top-p value, top-k value, penalty, temperature, etc., etc. I need an AI model to help me find the most easy-going one for my needs! And by the way, they are big in size and the hardware needs are going crazy. Someone please collect all the AI model knowledge in one place and create an easy interface with few parameters. Right now, getting, using, and hosting an AI model is becoming more expensive and complicated than owning a real brain.
Haha, fair enough, but I don't know what you've got, given that supercomputer setup. M-Ultra chips actually seem like the economical option for mid-sized LLM inference, but I haven't been able to see enough testing results to confirm that; they're weirdly difficult to find.
It's easy to toss hay from the sidelines, which is why I urge everyone to get into YouTube themselves. It is a very humbling journey as a solo producer. Especially hard are the viewers who catch some detail you must have missed, but you have no idea what they are talking about because they don't give it any context.