LocalAI LLM Single vs Multi GPU Testing: scaling to 6x 4060 Ti 16GB GPUs

RoboTF AI
93 subscribers
3.3K views

An edited version of a demo I put together for a conversation amongst friends about single vs multiple GPUs when running LLMs locally. We walk through testing from a single GPU up to 6x 4060 Ti 16GB VRAM GPUs.
Github Repo: github.com/kkacsh321/st-multi...
See the Streamlit app and results here: gputests.robotf.ai/
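
For anyone who wants to reproduce a rough version of these numbers, below is a minimal sketch (an illustration, not code from the repo above) that times a single completion against LocalAI's OpenAI-compatible API and estimates tokens per second; the base URL and model name are placeholders for your own deployment.

```python
import time
import requests  # third-party HTTP client, assumed installed

BASE_URL = "http://localhost:8080/v1"   # placeholder LocalAI endpoint
MODEL = "llama-13b-q4"                  # placeholder model name

def measure_tokens_per_second(prompt: str) -> float:
    """Time one chat completion and estimate generation speed from the usage stats."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    elapsed = time.time() - start  # includes prompt processing, so this is a rough figure
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    tps = measure_tokens_per_second("Explain GPU tensor splitting in two sentences.")
    print(f"~{tps:.1f} tokens/s")
```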

Science

Published: Mar 23, 2024

Comments: 32
@NevsTechBits · 6 days ago
Thank you for this contribution to the internet, brother! I have learned from this. Subbed, liked, and commented to show support. Good stuff, sir! This will help us all.
@RoboTFAI · 6 days ago
I appreciate that!
@andre-le-bone-aparte · 5 days ago
Just found your channel. Excellent content - another subscriber for you, sir!
@RoboTFAI · 3 days ago
Welcome aboard!
@andre-le-bone-aparte · 5 days ago
Question: Do you have a video going through the Kubernetes setup for using multiple GPUs? It would be helpful for those just starting out.
@RoboTFAI · 3 days ago
No, but I def could put one together. Thanks for the idea!
@maxh96-yanz77 · 2 months ago
Your test is very interesting. I use an old GTX 1060 with 6 GB to run Ollama or LM Studio, and the most significant performance factor is the size of the LLM model. BTW, I've been trying to find someone who has tested the RX 7900 XT using ROCm and couldn't find anything on all of YouTube.
@RoboTFAI · 1 month ago
I agree, it's about the LLM quant/size much more than raw hardware performance. I don't have any AMD cards to test with, but I would if I did. I do have several generations of Nvidia cards I might run through some tests.
@blackhorseteck8381 · 1 month ago
The RX 7900 XT (not the XTX) should be about on par with the RTX 3080 Ti to 3090 (around 23 tk/s), info sourced from a Reddit post (can't send links on YT, sadly). For a point of comparison, my RX 6700 XT hovers around tk/s. Hope this helps and the info isn't too late :)
@malloott · 21 days ago
One number is missing, but thanks, it helps me! @@blackhorseteck8381
@six1free · 1 month ago
Thank you very much for this, it must have cost a small fortune
@RoboTFAI · 28 days ago
Thanks! It did cost a bit, but not as much as my other node you folks haven't seen yet - coming soon!
@six1free · 28 days ago
@@RoboTFAI What single card would you recommend, considering I'm looking for the 75% option - not state-of-the-art mega bucks, but able to keep up? A 4080? Will a 4070 have hope? (specifically thinking huge context)
@RoboTFAI · 28 days ago
Hmm, it always depends on the use case and the models you want to run - and whether you have lots of RAM to load into if there isn't enough VRAM. I don't have a 4070/4080 to run tests with... but I could run some tests of 4060s vs A4500s (more enterprise-level cards that can be had fairly cheap these days for what they are).
@six1free · 28 days ago
@@RoboTFAI Don't you think a 4080 would have been cheaper than 6x 4060s? ...or is that hindsight?
@RoboTFAI · 28 days ago
I am sure it would have been if I were looking for a single large-VRAM card with speed; however, that is not my specific use case with these cards. I use these to run multiple smaller LLM models (7B/13B) in parallel, and I'm not concerned about having them move as fast as possible, just decently - agent-based/bot workflows, etc., and playing around. For the much larger models I use in my daily work, I have another GPU node with several A4500s (20GB VRAM) for running 70B+ models at speed. That's why I run this all in Kubernetes, so multiple services can use multiple LLM models across different GPUs as needed, in parallel. And yes, I am crazy...
@jackinsights · 1 month ago
Hey RoboTF, my thinking is that 6x 16GB 4060 Ti's, for a total of 96GB of VRAM, will allow you to run 130B params (Q4) and easily run any 70B model unquantised.
@RoboTFAI · 1 month ago
Yeah, I run mostly 70B+ models in my work/playing for more serious things. It depends, however, on the quant and how much context you set the model up for. Newer Llama 3 models with huge context will eat your VRAM real fast. If you mix offloading to VRAM and RAM (slower), you can run some very large models.
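
A rough back-of-the-envelope check on that VRAM math (a sketch assuming typical effective quant sizes; it ignores the KV cache and runtime overhead, which grow with context):

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory footprint in GiB (no KV cache, no runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(f"130B Q4  : ~{weights_gib(130, 4.5):.0f} GiB")  # ~68 GiB at ~4.5 bits/weight
print(f"70B fp16 : ~{weights_gib(70, 16):.0f} GiB")    # ~130 GiB unquantised
print(f"70B Q8   : ~{weights_gib(70, 8.5):.0f} GiB")   # ~69 GiB
```

By that estimate, a 130B Q4 fits in 96GB with some headroom for context, while an unquantised (fp16) 70B would not and would need offloading to RAM.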
@akirakudo5950 · 29 days ago
Hi, thanks for sharing a great video! Would you please also share a hardware list for this test if you get a chance? I am very interested in how the GPUs are connected to the mainboard.
@RoboTFAI · 28 days ago
Sure, I do have the specs listed in the app, but I'm happy to do a follow-up video on the lab machines I use for these types of tests and the hardware involved in running that many GPUs on a single node.
@akirakudo5950 · 28 days ago
Thanks for replying! I look forward to watching your new videos, cheers!
@aminvand · 12 days ago
Thanks for the content. So two 16GB GPUs will act as 32GB for the models?
@RoboTFAI · 12 days ago
Yes! Most software for running/training LLMs supports splitting the model between multiple GPUs; the newest version of LocalAI (llama.cpp under the hood) even supports splitting models across multiple machines over the network.
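
As a concrete illustration of that kind of split across two 16GB cards, here is a sketch using the llama-cpp-python bindings rather than LocalAI's own YAML config; the model path and split ratios are placeholders:

```python
from llama_cpp import Llama  # llama-cpp-python, assumed built with CUDA support

# Placeholder GGUF path; tensor_split spreads the layers across two GPUs,
# so a model larger than one card's VRAM can still be fully offloaded.
llm = Llama(
    model_path="/models/llama-2-13b.Q4_K_M.gguf",
    n_gpu_layers=-1,          # offload every layer to GPU
    tensor_split=[0.5, 0.5],  # even split between GPU 0 and GPU 1
    n_ctx=4096,
)

out = llm("Q: What does tensor_split do?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```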
@jeroenadamdevenijn4067 · 21 days ago
I'm curious about my use case, which is coding (needs high enough quantization). What t/s would I get with Codestral 22B 6-bit quantized with a moderate context size? 2x 4060 Ti 16GB should be enough for that and leave plenty of room for context. And secondly, what would be the speed penalty when going for 8-bit quantized instead of 6-bit? Around 33%, or am I wrong?
@RoboTFAI · 21 days ago
I could throw together a quick test with some of the 4060s and Codestral 22B at different quant levels. Tokens per second isn't just about the hardware, but also the model (and model architecture, MoE/etc.), context size, and so on.
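
For a very rough feel for the 6-bit vs 8-bit question (a back-of-the-envelope sketch, not a measurement): single-stream decode is usually memory-bandwidth-bound, so tokens per second is roughly memory bandwidth divided by the quantized model size. The 288 GB/s figure below is the 4060 Ti 16GB's rated bandwidth, and the bits-per-weight values are typical effective sizes for Q6_K and Q8_0.

```python
def rough_tps(params_billion: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: one full pass over the weights per generated token."""
    model_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / model_gb

for bits in (6.5, 8.5):  # ~Q6_K vs ~Q8_0
    print(f"22B at ~{bits} bpw: ceiling of about {rough_tps(22, bits, 288):.0f} tokens/s")
```

By that ceiling, the 8-bit file is roughly a third larger than the 6-bit one, which works out to roughly 20-30% lower decode speed - in the same ballpark as the 33% guess - before context size and multi-GPU splitting shave off more.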
@CelsiusAray · 29 days ago
Thank you. Can you try larger models?
@RoboTFAI · 28 days ago
Certainly can show some tests with bigger models - if you have specifics, let me know! For this test, 13B was just enough to fill one card and show the results when split. Happy to do model vs model comparisons also.
@jasonn5196 · 1 month ago
Doesn't look like it splits the workload well. It could have sent an iteration simultaneously to each GPU, but it doesn't look like it does that.
@data_coredata_middle · 8 days ago
What is your PC setup to support 6 GPU cards? I would greatly appreciate it, as I would like to set up something similar :)
@RoboTFAI · 7 days ago
I go over some of the specs of my GPU nodes in other videos, but plan to do an overview video on each node soon!
@data_coredata_middle · 7 days ago
@@RoboTFAI That would be great, thanks for the response :)