
LocalAI LLM Testing: How many 16GB 4060TI's does it take to run Llama 3 70B Q4 

RoboTF AI
813 subscribers
6K views

Answering some viewer questions and running Llama 3 70B Q4_K_M with the 4060 Ti's - how many does it take to run it?
Just a fun night in the lab, grab your favorite relaxation method and join in.
Recorded and best viewed in 4K
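
For a rough sense of the VRAM math behind that question, here is a back-of-the-envelope sketch in Python. The bits-per-weight, overhead, and usable-VRAM numbers are illustrative assumptions, not measurements from the video.

```python
# Back-of-the-envelope estimate: how many 16GB cards to fully offload a
# quantized model. All constants below are illustrative assumptions.
import math

def cards_needed(params_b: float, bits_per_weight: float,
                 overhead_gb: float = 6.0, vram_per_card_gb: float = 16.0,
                 usable_fraction: float = 0.9) -> int:
    """Estimate the number of identical GPUs needed for full offload.

    params_b        -- parameter count in billions (70 for Llama 3 70B)
    bits_per_weight -- effective bits per weight (~4.8 for a Q4_K_M GGUF)
    overhead_gb     -- rough allowance for KV cache and compute buffers
    usable_fraction -- fraction of each card's VRAM you can realistically fill
    """
    weights_gb = params_b * bits_per_weight / 8      # quantized weights
    total_gb = weights_gb + overhead_gb
    return math.ceil(total_gb / (vram_per_card_gb * usable_fraction))

if __name__ == "__main__":
    # ~42GB of weights plus overhead lands in the 3-4 card range,
    # which is exactly the question the video sets out to answer.
    print(cards_needed(70, 4.8))
```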

Published: 6 Sep 2024

Comments: 58
@dllsmartphone3214 • 1 month ago
We need more channels like this that perform and showcase proper testing and hardware requirements for different models. Good job. I hope you will produce more related and useful content like this in the future. I wish your channel massive growth!
@RoboTFAI • 1 month ago
Wow, I appreciate that! Just playing around in the lab and hoping people get something from it
@JimmyHacksThings • 1 month ago
You're a mad man! I've been working on a GPU node for my home k3s cluster, and getting into hosting a few jupyter containers. My hardware GPUs are mostly AMD 7900XTXs, but I look forward to testing out ollama and the 70B model on multiple GPUs. Also trying to do some locally hosted fine tuning. If all else fails, I can roast a few marshmallows. Hope the pizza was good!
@RoboTFAI • 1 month ago
Sounds like you are also a mad man, prob with a good beard or mustache! If you are rocking those AMD cards - the viewers here would love to see some collab/results - lots of questions about AMD....
@gaius100bc • 1 month ago
Ah, nice, that's what I was looking for - not TPS, which we could already predict, but the combined power draw of 3 or 4 cards. I expected the power draw to be fairly low, but I was still surprised that all the cards together never pulled more than 200W at any given time during inference. Thanks for the test!
@RoboTFAI • 1 month ago
Yea I concur, these 4060's sip power - and that was a large reason I chose them for my specific needs originally. Still surprises me though.
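
For anyone who wants to reproduce the combined power-draw observation on their own rig, here is a minimal sampling sketch using the NVML Python bindings (nvidia-ml-py). The 2-second interval is an arbitrary choice; run it in a second terminal while inference is going.

```python
# Per-GPU power sampler via NVML (pip install nvidia-ml-py).
# Prints each card's draw plus the combined total until interrupted.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        watts = [pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0 for h in handles]  # mW -> W
        per_card = "  ".join(f"GPU{i}: {w:5.1f}W" for i, w in enumerate(watts))
        print(f"{per_card}  |  total: {sum(watts):6.1f}W")
        time.sleep(2)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```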
@jackflash6377 • 1 month ago
Outstanding! So with a small investment a person could run a 70B model locally. In the past I have not been so happy with the Q4 size. It would be very interesting to see a comparison with the Q6 or Q8 model. Thanks for the time spent and the valuable info.
@RoboTFAI • 1 month ago
Sure, no problem - we can do higher quants!
@JazekFTW • 28 days ago
Can the motherboard handle that much power draw from the 3.3V and 12V rails on the PCIe slots for the GPUs, without powered risers or the extra PCIe power connectors that some workstation motherboards have?
@InstaKane • 6 days ago
Can you test the Phi-3.5 model? Would I be able to run it with two RTX 3090s?
@firsak • 13 days ago
Please record your screen in 8K next time, I'd like to put my new microscope to good use.
@rhadiem • 10 days ago
I would love to know the cheapest GPU you'd need for a dedicated text-to-speech application running something like XTTSv2, for having your LLMs talk to you as quickly as possible. I imagine speed will be key here, but how much VRAM, and what is fast enough? Inquiring minds... I mean, we all want our own Iron Man homelab with JARVIS to talk to, right?
@RoboTFAI • 8 days ago
I don't work much with TTS, or STT - but we can go down that road and see where it takes us
@rhadiem • 10 days ago
It seems to me, looking at the output of the 4060's running here, that the 4060 is a bit too slow to be a productive experience for interactive work, and is better suited for automated processes where you're not waiting on the output. I see you have your 4060's for sale - would you agree? What is your take on the 4060 16GB at this point?
@RoboTFAI • 8 days ago
I think the 4060 (16GB) is a great card - people will flame me for that, but hey. It's absolutely usable for interactive work if you are not expecting lightning-fast responses, and on small models the 4060 really flies for what it is and how much power it uses. Lower, lower, lower your expectations until your goals are met..... I did sell some of my 4060's, but only because I replaced them with A4500's in my main rig, so they got rotated out. I kept 4 of them, which most days are doing agentic things in automations/bots/etc while sipping power, or sitting in my kids' gaming rigs.
@milutinke • 1 month ago
Thank you so much for this
@RoboTFAI • 1 month ago
Thanks for watching! I hope there is valuable information for the community, or at least some fun going on here
@TazzSmk • 1 month ago
190W with 4 GPUs seems pretty decent for the performance it gives!
@RoboTFAI • 1 month ago
I concur!
@fulldivemedia • 1 month ago
Thanks for the great content. Can you recommend any of these motherboards for local ML plus light gaming and content creation? MSI MEG X670E ACE, ProArt X670E, ROG Strix X670E-E Gaming WiFi
@GermanCodeMonkey • 1 month ago
Thank you very much for this 🙂
@RoboTFAI • 1 month ago
My pleasure 😊
@fooboomoo • 1 month ago
I would be very curious how well those AMD cards will run
@Viewable11 • 1 month ago
AMD cards are one third slower than Nvidia for LLM inference.
@JimmyHacksThings • 1 month ago
"For Science": running llama3 7B on a single 24GB 7900XTX card with the prompt "Make me a good Gobi Manchurian recipe." yielded:
Prompt eval count: 22 tokens
Prompt eval rate: 1193.7 tokens/s
Eval count: 783 tokens
Eval duration: 8.3s
Eval rate: 94.31 tokens/s
Server specs:
Mobo: MSI MPG Z490 ATX
CPU: i9-10850K
RAM: 96GB
OS: Ubuntu Server 22.04 (integrated GPU used for the console, so the AMD card is dedicated to ollama workloads)
Drive: 1TB NVMe Samsung SSD 970 EVO
PSU: 1000W Corsair RM1000x
I need to get a 4U case and a more capable motherboard to extend my GPU count.
@koraycosar1979 • 1 month ago
Thanks 👍
@RoboTFAI • 1 month ago
Thank you too
@pLop6912 • 1 month ago
Good day. Do you have all of the 4060s on full x16 PCIe lanes? I would like to see a test with them cut down to x8 lanes and how it affects the speed.
@RoboTFAI • 1 month ago
4060 Ti's only use/support x8 - and these tests are all running at x8
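
If you want to verify what link width your cards are actually negotiating, the same NVML bindings can report it; cards often downshift the link at idle, so check while a model is loading or running. This is just a quick diagnostic sketch.

```python
# Report each GPU's current vs. maximum PCIe link width and generation.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    if isinstance(name, bytes):   # older bindings return bytes
        name = name.decode()
    cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    print(f"GPU{i} {name}: PCIe gen{gen} x{cur} (max x{max_w})")
pynvml.nvmlShutdown()
```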
@246rs246 • 1 month ago
Do you know if the speed of DDR5 RAM on new motherboards is fast enough to partially store models on it?
@RoboTFAI • 1 month ago
I run some models not fully offloaded on machines with only DDR4 in them - it's not fast by any means. I don't have any machines with DDR5 atm... except my MBP - and that wouldn't be a fair comparison. We can test mixing CPU (RAM) and GPU (VRAM)
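
For anyone curious how that RAM/VRAM mix is actually configured, here is a minimal sketch with llama-cpp-python: n_gpu_layers decides how many layers go to the GPU(s), and the rest stay in system RAM. The model path and layer count are placeholders, not the setup used in the video.

```python
# Partial offload sketch: some layers in VRAM, the rest in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,   # e.g. roughly half of a 70B model's 80 layers on GPU
    n_ctx=4096,
)

out = llm("Explain in one sentence why partial offload is slower.", max_tokens=96)
print(out["choices"][0]["text"])
```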
@IamSH1VA • 1 month ago
Can you also add *Stable Diffusion* to your tests?
@RoboTFAI • 1 month ago
I don't have a ton of experience on the image/video generation side, but we absolutely could start doing those types of tests and learn together
@ecchichanf • 1 month ago
I use a 7900XTX and two 7600XTs to run 70B Q4_K_M. I get 5-7 tokens per second, but I'm still playing with this setup. So an RTX 3090/RTX 4090 plus 2x RTX 4060 Ti would be enough to run 70B. Two fast 24GB cards like the RTX 3090/4090/7900XTX would be better for speed.
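
As a purely illustrative sketch of splitting one model across cards with unequal VRAM (llama.cpp's CUDA and ROCm backends both support this), llama-cpp-python takes a tensor_split list of proportions. The path and ratios below are hypothetical and would need tuning per setup.

```python
# Uneven multi-GPU split sketch: ratios roughly mirroring a 24GB card
# plus two 16GB cards. tensor_split values are proportions, not GB.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,            # -1 = offload every layer
    tensor_split=[24, 16, 16],  # relative share of the model per GPU
    n_ctx=4096,
)
print(llm("Say hello.", max_tokens=16)["choices"][0]["text"])
```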
@bt619x • 1 month ago
What are your thoughts/experiences on individual 4060 cards vs sets of 3090 cards with NVLink?
@RoboTFAI • 1 month ago
Wish I had some thoughts on it. I don't have 3090's in the lab to test with (there may or may not be one in the very near future).... the A4500's support NVLink, but I haven't bothered to go down that route for inference and just let CUDA/etc do its job. I could make assumptions on speed/etc, more so for loading, but they would just be that: assumptions.
@Phil-D83 • 1 month ago
Intel Arc A770 any good for this? (With ZLUDA?)
@RoboTFAI • 1 month ago
I don't have any to test with, but I do believe llama.cpp etc. support Intel Arc with SYCL: localai.io/features/gpu-acceleration/#intel-acceleration-sycl
@user-yi2mo9km2s • 1 month ago
One 4090 + 192GB DDR5 and you're good to go.
@RoboTFAI • 1 month ago
Yep more than enough for most things depending on your needs, and how deep your budget is.
@iamnickdavis • 1 month ago
200W, crazy
@RoboTFAI • 1 month ago
Agreed!
@maxh96-yanz77 • 1 month ago
Thanks so much. Your experiment shows that running a 70B model on just 3x 16GB GPUs is optimal in terms of cost/performance. Can we assume that 3x RX 6800 XT 16GB (second-hand, about 150 USD cheaper) could more or less handle a 70B model?
@RoboTFAI • 1 month ago
I am sure they would handle it as far as offloading goes, can't say what kind of speed of course. I am also not experienced with ROCm, or with splitting across AMD cards, but just to make a point: I could dust off 24GB M40's from 2015 and run Llama 3 70B on them, or even run it purely on CPU with 48GB+ of RAM....it just wouldn't be quick at all. I can show you folks that if you really want!
@maxh96-yanz77 • 1 month ago
Thank you Mr. @RoboTFAI, it looks cool. I use a 1060 6GB with 64GB RAM; it works, but the tokens/sec were very bad. I am very confused about choosing between a 4060 Ti 16GB, an RX 6800 XT 16GB, or even an RX 7900 GRE 16GB, because I would like to try the Llama 3 70B model at minimal cost, and 3x RX 6800 XT seems affordable. The constraint with choosing the RX cards, as I saw in the Tom's Hardware benchmarks, is that for Stable Diffusion they are really, really bad.
@Steamrick • 1 month ago
Hmm... I think rather than 3x 4060 Ti 16GB I'd prefer trying for 2x RTX 3090. Similar total price point, same overall memory and should be about 70% faster.
@connmatthewk • 1 month ago
That's what I'm running and I'm extremely pleased with the results.
@RoboTFAI • 1 month ago
Very plausible depending on needs, pocketbook, power bill, etc - but that's probably exactly why people are watching this guy make questionable decisions with my money: we're all trying to figure it out
@mjes911 • 1 month ago
Still selling 4060s?
@RoboTFAI • 1 month ago
Still have a few - reach out at robot@robotf.ai or ping me on Reddit/etc
@BahamutCH • 1 month ago
Test with the EXL2 format instead of GGUF. =) RTX is for ExLlama, not for llama.cpp. =)
@TheMrDrMs • 1 month ago
hmmm to go 2x 3090 or 3x 4060Ti....
@RoboTFAI • 1 month ago
Check out the newer video on the channel where I bring a 3090 into the lab - maybe it will help inform ya?
@yaterifalimiliyoni9929 • 1 month ago
Is the model actually giving coherent, useful answers? Accuracy should be tested. What's the point of running a model that's inaccurate?
@RoboTFAI • 1 month ago
Yep - Llama 3 is really pretty good. I don't focus on accuracy here (so far), as that is fairly subjective depending on what you are using specific models for, and a tough subject to broach. I could attempt it, but I would point folks to @matthew_berman (www.youtube.com/@matthew_berman), who I think does a really good job of comparing open-source models when they get released.
@yaterifalimiliyoni9929 • 1 month ago
@RoboTFAI Thanks for the reply. I really appreciate the tests you're doing. I'm new and purchased my 3060 laptop thinking I could run some models. I quickly realized there either isn't enough (VRAM) power or these models suck, and thought running locally was hype. I think watching his channel is what led to yours showing up on my timeline.
@doityourself3293 • 1 month ago
Do you have Windows memory compression turned off? It's the biggest bottleneck in Windows.
@RoboTFAI • 1 month ago
I don't have Windows machines in the lab - everything is Linux-based in the Kubernetes clusters