@@quinxx12 AFAIK to run LLMs effectively the data needs to be held in VRAM, as graphics cards can process data significantly faster than CPUs. I didn't fully understand this video btw, so I assume it's some kind of hack to run the 70B model in system memory with CPU processing.
How old are you guys? I assume 20 or younger, because 10 years ago 16 GB of RAM was okay, and today MacBooks still ship with 8 GB of RAM. 10 years in the future this will still hold up.
I don't want to disappoint you, but I am quite sure you would get the same 1.4 t/s running the 70B-parameter model purely on the CPU, and it would use half of the memory. So theoretically you would be able to run 180B models on the CPU (q4_K_M version). The thing is that on current PCs it's not compute power that is the limiting factor but memory bandwidth, and since both the iGPU and the CPU use the same memory, you will get very similar speeds. Make a follow-up video; maybe I am wrong, and if so I will be happy to learn that.
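A rough back-of-the-envelope for the bandwidth-bound claim above (a sketch; the ~40 GB model size and ~60 GB/s sustained bandwidth figures are assumptions, not measurements from the video):

```python
# Token generation streams every weight from RAM once per token,
# so tokens/s is roughly bounded by bandwidth / model size.
model_size_gb = 40    # assumed: a 70B model at q4_K_M is ~40 GB
bandwidth_gbs = 60    # assumed: dual-channel DDR5 sustains ~60 GB/s in practice

print(f"upper bound: {bandwidth_gbs / model_size_gb:.1f} tokens/s")  # ~1.5 t/s
```

That lands right around the 1.4 t/s shown in the video, which is why CPU and iGPU results converge when they share the same memory.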
Strix Halo will have a 256-bit memory controller. DDR6 will be 2x DDR5 speed. Potentially we will 4x memory bandwidth in 2-3 years. The expensive Mac Studio Ultra has 800 GB/s memory bandwidth right now.
@@Fordance100 I was talking about the Core Ultra 5 125H. The Strix Halo iGPU will be in its own league, and I am impatient to see its test results. The Mac Studio has unified memory, which is basically soldered onto the package. As I understand it, Intel is also going to employ this approach for its ultra-thin series. Let's see, let's see.
This is my favorite sub-sub-subgenre, because figuring out how to run fast, smart LLMs on consumer equipment is hard today. Gaming GPUs (too small) and Mac Studios (too expensive) are stop-gap solutions. I think these will have huge applications in business once Groq-like chips are available and we don't have to send most LLM requests to frontier models.
It's not just the computation speed of the 4090; its VRAM has extremely high bandwidth (and is therefore so expensive, but also crazily power-hungry). Apple Silicon doesn't have "just more RAM": Pro/Max/Ultra each double the base M-series memory width/bandwidth. So the M2 Ultra, at 800 GB/s, gets close to a 4090's >1 TB/s. LLM token generation is mainly dependent on memory bandwidth. THIS (and power consumption) is why many buy a Mac Studio instead of multiple 4090s if they do just LLM inference and not machine learning. But NVIDIA is nearly without peer for ML because of its raw compute power.
The lowest price you could pay for hardware that isn't overworked by a model this size, meaning over 70 GB of memory, would be a dual A6000 48GB setup, which would cost you around $8,000 US. An 80 GB H100 starts at $30,000 US, and an 80 GB A100 would also likely cost you over $20,000 US. And we're talking about the video card only! Well, not really a video card, but a PPU. So this $800 box that can somewhat mimic/work the 70B LLM is quite okay and actually fascinating. For long tasks and creative work it would probably overheat, and a 50-page document would take hours, but $800 US is, you are right, not cheap. It is like ultra cheap!!!
Been considering two MacBooks, each with 128 GB, to run Q4 Llama 3.1 405B, and that would run over $10k. If you ask ChatGPT to build out a system that can run full-size 405B, it'd be $500k+.
@@ShihanQu why would you choose a MacBook? You could go with a normal PC and dual Nvidia cards, or get a quality ARM server with two processors at like 24 cores each, and it would work like magic.
Running Ollama with Phi-3.5 and multimodal models like minicpm-v on an Amazon DeepLens, basically a camera that Amazon sold to developers that is actually an Intel PC with 8 GB of RAM and some Intel-optimized AI frameworks built in. Amazon discontinued the cloud-based parts of the DeepLens program, so these perfectly functional mini PCs go for as little as $20 on eBay. I have 10. :)
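For anyone wanting to replicate that kind of low-RAM setup, a minimal sketch using the official `ollama` Python client (assumes the Ollama server is already running locally; the model name and prompt are just examples):

```python
# pip install ollama -- talks to a local Ollama server.
# phi3.5 is ~2.2 GB at its default quantization, so it fits in 8 GB of RAM.
import ollama

response = ollama.chat(
    model="phi3.5",
    messages=[{"role": "user", "content": "Describe what a camera might see at dusk."}],
)
print(response["message"]["content"])
```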
Hmmm, besides RAM/VRAM size, it's mostly RAM bandwidth that determines llama.cpp's token-generation speed (the 4090 has >1 TB/s, the M2 Ultra has 800 GB/s). GPU horsepower is mainly useful for (batched) prompt processing and training. And for RAM size, it's not just the model! With large-context models like Llama 3.1, RAM requirements totally explode if you start the model with its default 128k token limit. But cool video, thanks!!!!
Is there a bottleneck for discrete cards moving memory, though, versus a shared memory bus that can load it faster? As in, the 4090's bandwidth only exists within the card itself, and the regular RAM used to feed it is 3-5x slower? Unfortunate that Nvidia will most likely never make something in the middle of the 4090 - A4400 for ML and AI people.
@@xpowerchord12088 it's simpler, yet complicated: a transformer has to go through its entire compute graph for each and every token it generates. So it has to pump ALL billions of parameters, as well as the transformer's KV cache (which can add many more GBs at 128k context sizes), from memory (RAM or VRAM) via the (multi-level but small) on-chip caches to the ALUs (in the GPU or CPU or NPU). Token generation (unlike prompt processing) is not batched, so this has to be repeated for EACH and every token it generates. Modern CPUs (with their matrix instructions), GPUs, and NPUs have very many ALUs calculating/working in parallel. Because of this, it's not the calculating but the pumping of the parameters/KV cache from memory to those ALUs that becomes the bottleneck. Current NVIDIA (e.g. 4090) is able to pump more than 1 TB/s with ultra-fast and wide RAM. Apple Silicon uses 128-bit wide RAM in the M, 256-bit in the M Pro (except for the M3 Pro, which is crippled), 512-bit in the M Max, and 1024-bit in the M Ultra. Combined with the RAM's transactions/s, this yields 120-133 GB/s for the M4 with its LPDDR5X (new Intel/AMD and Snapdragon X do similarly), more for the Pro, and up to 800 GB/s for the M2 Ultra (with its older LPDDR5). Hope this clarifies; sorry for the long-winded explanation.
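To put a number on the KV-cache part of that explanation, a quick sketch (the layer/head counts are the published Llama-3.1-70B config values; treat the result as an estimate):

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem
layers, kv_heads, head_dim = 80, 8, 128   # Llama-3.1-70B uses grouped-query attention
context = 128 * 1024                      # the default 128k token limit
fp16_bytes = 2

kv = 2 * layers * kv_heads * head_dim * context * fp16_bytes
print(f"fp16 KV cache at 128k: {kv / 2**30:.0f} GiB")      # ~40 GiB
print(f"q4_0 KV cache at 128k: {kv / 4 / 2**30:.0f} GiB")  # ~10 GiB
```

Those ~40 GiB at fp16 and ~10 GiB at 4-bit match the exllamav2 q4_0 KV-cache numbers a commenter cites further down.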
It's a software issue. Ultra processors are optimized to use the NPU for artificial intelligence, NOT the GPU. You're using the wrong part (a very slow GPU instead of a very fast NPU), but I understand it's because the software you're using doesn't let you use the NPU. You should also SHOW how many tokens per second the CPU alone can process (NOT using the GPU) to compare the performance. I insist, you're using a very slow GPU; maybe even the CPU is better. In science, you always have to check all the factors experimentally and not take anything for granted. Good luck 🍀
The same library seems to support multi-GPU setups. Hence an 8x Intel Arc Pro A60 setup, totalling 96 GB of VRAM, could in theory be attempted and still be more competitive than a Mac Studio from a TOPS-per-dollar perspective. Don't expect the same size, silence, and power efficiency though…
For a small lab, maybe. Functionality would be way down though. An M3 Max with 96 GB would be a better all-around deal for an individual. You should see the pro-level Nvidia cards that can be linked, each with 48 GB of RAM. Too bad Nvidia will not jump into this world when it has the gaming and pro sectors nailed.
Thanks for testing this out. I thought about trying this for myself using the Minisforum version of this mini PC. There seems to be another way of running LLMs, using the actual NPU of the Ultra CPUs instead of the Arc GPU, by running them via OpenVINO. I would be very much interested in some more testing on Linux + OpenVINO.
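A minimal sketch of that OpenVINO route using the `openvino-genai` package (the model directory is a placeholder, and NPU support for LLMs depends on the model and driver stack, so treat this as an assumption to verify):

```python
# pip install openvino-genai
# The model must first be exported to OpenVINO IR, e.g. via optimum-cli.
import openvino_genai

model_dir = "llama-3.1-8b-instruct-ov"  # placeholder: an IR-converted model
pipe = openvino_genai.LLMPipeline(model_dir, "NPU")  # or "GPU" / "CPU"
print(pipe.generate("Why is memory bandwidth the LLM bottleneck?", max_new_tokens=100))
```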
I just discovered these mini PCs (sue me), and I gotta tell you, these are pretty darn good. I just ordered one to run my website cuz DigitalOcean's prices are too high compared to these in the long run.
1.43 t/s is kinda OK, but realistically it's not very useful. I think the bang-for-buck option would be a couple of Tesla P40s to get like 5 t/s. It won't look pretty, but if you chuck it in the garage or something, it's not a problem.
Thank you very much! I have the newest 70B model on both an MSI laptop AND an MSI desktop, each with 64 GB DDR5. They run somewhat slow, but usable, and FASTER than your demo! 😮
Same CPU and GPU? Anyway, please give token rates, quantization, and relevant machine specifications. I would assume better performance on the desktop, at least because of better cooling versus a mini PC.
Running Llama 3 (8B Q4) on Ubuntu 24.04 with a Ryzen 7 6800H and its iGPU (680M)! I get about 30 tok/s consistently. Ditch Windows if you want the best results.
Yes, I also find Linux allows for unique configurations that cannot be done in Windows in order to run large LLMs with lower memory requirements. Windows 11 is still bloatware that hogs resources which could be better used for LLMs. Heck, Windows runs faster in a Linux virtual machine than installed directly.
You totally can run such models with a 4090; part of the model just gets offloaded to the CPU, but it should still be much faster than 1.4 tokens per second. And this mini PC costs $800 for the 32 GB RAM version; by no measure is that cheap. That's $100 more than I paid for a used 3090. On 8B models I'm getting over 70 tokens/s, and on 70B it varies between 2 and 3 tokens/s. I know the mini PC is better performance-per-watt-wise, but there is no room for expansion or adding more GPUs, and the lower power consumption is not really worth the difference.
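For reference, the partial offload described above looks roughly like this with `llama-cpp-python` (a sketch; the GGUF filename and layer count are assumptions you would tune to your VRAM):

```python
# pip install llama-cpp-python (built with CUDA support).
# Layers that fit go to the GPU; the rest run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # assumed: roughly what fits in 24 GB alongside the KV cache
    n_ctx=8192,       # a modest context keeps the KV cache small
)
out = llm("Q: Why offload only some layers? A:", max_tokens=80)
print(out["choices"][0]["text"])
```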
@@AZisk Cool, especially since you do some development on Windows but for some reason use WSL2 even when it's not needed (it can be quite a bit slower for some stuff than native...)
I knew Windows was going to win at this LLM thing. I knew that Intel Arc would also win. Nice one. I want you to know that some of us might never have access to a Mac due to location. And the fact that you can just buy a RAM upgrade is amazing.
My mini PC for general usage is a Ryzen 7 APU with integrated AMD graphics and 64 GB DDR4 RAM, 56 GB of which has been set as dedicated graphics memory in the BIOS. It's slow, it's AMD, but it runs stuff on the GPU and is still a lot faster than CPU-only (it still sucks at running Cyberpunk 2077).
.. keep wondering how a modern AMD desktop CPU (*G model) with a load of CPU cores, a decent integrated GPU, and 128 or 256 GB of fast DDR memory would handle things. Certainly the cheapest way to get (close to) 256 GB of memory behind a GPU that I can think of; you could have a rack of them for the cost of the Nvidia GPUs you would need to reach that 256 GB.
@@jelliott3604 more cores don't help; you'll actually get better performance with hyperthreading off. Single-core benchmarks are a better indicator for LLM/AI, as it's about clock speed/turbo boost x RAM bandwidth throughput.
@@jelliott3604 AMD keeping AVX-512 in their consumer line is gonna make the competition really interesting for CPU-centric builds though, maybe as soon as this next gen refresh. Intel is making all the wrong moves.
@whodis5438 my gaming box, the one in the nice case with all the ARGB lighting, is another Haswell-E CPU (i7-5960X) on an LGA 2011 board, with hyper-threading turned off and that octa-core 3 GHz processor clocked up to just under 4.6 GHz on all cores.
I've been using an Intel i7-1255U in a mini PC to run GPT4All with some pretty good results, as long as you stick with smaller, highly quantized models.
In theory, DDR5 should allow for up to 256 GB with 2 sticks, or 512 GB with 4 sticks, EVEN MORE with ECC RDIMMs and LRDIMMs, but those would certainly require a different processor and motherboard.
If your only unit of measure for success is that it can run at all, regardless of how quickly, I made Llama 3 8B run on a Khadas VIM4 Pro using Ollama. Every CPU core spikes and pins at 100% for the majority of the output, but that's expected of an IoT SBC.
I have a Core i7-9750H and am running the Llama 3.1 model pretty well. I'm just now getting into AI models and learning about this stuff, and it's pretty crazy. I want to scale up and mess with this stuff, but finances are the limit lol. It's crazy to think that in 8 or so years we'll likely have something far better running on our phones without a problem. It doesn't have to be perfect, just "good enough" to help people with their work.
"but what's impressive is that this tiny little bo can run a 70B LLM... like a snail. So if you're really REAAALLY patient, this is possibly a solution for you" Lol
While the 4090 can't hold the whole model, it can still speed up the process significantly, as you can offload some layers to the GPU. BTW, a year ago I was able to run Llama 2 70B on my laptop with a 6900HS 8-core CPU, and I only have 24 GB of RAM, so it was using swap (virtual memory), aka the internal SSD. I was getting one token of output every 10 secs. I only had a 3060 6GB, so I couldn't offload much to the GPU.
It can load the whole model (Google exllamav2). And it can do so much more, in particular using q4_0 on the KV cache, bringing it down from 40 GB to 10 GB at 128k context.
Wife reading Alex's credit card statement: $309 - Very comfy chair... check, "he needs it, poor thing, works too hard making content for all those geeks." $2000 - Nvidia Video Card 4090... check, "2000 dollars for a video card???????? well cheaper than a macbook, ok." $500 - Microsoft X-Box Game Pass Ultimate 1 year Subscription... check, "Alex, dear, are you playing those "online" games with these fans of yours?"
No. NPUs only get used when a particular piece of software uses their API, i.e. Photoshop, Apple Intelligence, Copilot. NPUs are proprietary, underutilized, and stuck on the hype train. GPUs are much better at this.
From my understanding, NPUs are for apps that Apple provides the API for. It's proprietary. They are mostly for hype right now and not used for running AIs, mostly just in apps that do AI and video imaging.
@@univera1111 NPUs are basically GPUs from my understanding, but proprietary and not used for this. Not much actually uses the NPU aside from the OS and companies who get the API, like Adobe.
I was running Llama 3.1 70B on an old server: 2x Xeon chips, 128 GB of RAM running at 1333 MHz... total cost for the server = $125 off Facebook Marketplace (PowerEdge R710). Responses took a while, but it ran.
That is so weird. I wanted to ask you to test exactly that! I was thinking about upgrading my lab box to run Llama, but I was not sure if that was any better than a paid subscription, or if it would work at all. Thank you! The only thing I would love you to try is setting up Llama inside a virtual machine on this very box. That might be a significantly more challenging task though. :)
Sorry, this system is useless; 96 GB of RAM is not enough for running useful AI models. Even my PC with a Ryzen 9 + 192 GB DDR5 + 4090 can't run 130B+ models at a good speed.
I find that 4-bit quantization makes models borderline incoherent. I don't think it's actually useful to run a giant model if you have to heavily quantize it.
Oh come on Alex, I'm sure this chair is very nice and all, but I don't believe for a second that you were using the cheapest possible version of the Herman Miller Aeron (as there are 2 different versions of lumbar support for the Aeron, one of which is a $13 eBay upgrade if you didn't get it from the factory). I feel like you removed the lumbar support just for that shot :/ Edit: sorry, I just got an Aeron so I'm defensive lol
Nvidia cards can now use system RAM as VRAM. It is of course quite a bit slower, but it makes some tasks possible that previously were not. My RTX 4070 Ti Super has 16 GB of VRAM, but if I look in Settings > Display > Advanced display settings > Display adapter properties for Display 1, I see:
Total Available Graphics Memory: 32729 MB
Dedicated Video Memory: 16384 MB
System Video Memory: 0 MB
Shared System Memory: 16385 MB
Note: I have 32 GB of system RAM. So don't feel that you will be limited to only 24 GB of VRAM on your 4090 if you have some system memory available.
Great vid Alex! Please 🙏 consider another vid where you do an Ubuntu install. I'm running Ollama on the GMKtec M3 with 16 GB RAM and am very pleased with inference on Gemma 2 and Llama 3.1, using Open WebUI as a front end on any device. Thanks for your awesome content 👏
Technically, Nvidia could allow the user to use normal system RAM as VRAM, similar to swapping. They could also use memory compression for VRAM; it's pretty usual for system RAM. Maybe they already do this for VRAM too, I'm not sure. Yes, it'd be slower if they used these techniques, but it'd be better than not being able to run the task at all.
Technically you could just mount X amount of RAM as a volume and use that as a swap disk as well, though; no need for Nvidia to do anything. If they allow swapping, any volume could be swapped to.
Pretty sure the limiting factor here is just the amount of memory. I can run a 70B model with my 3070 Ti; only a very small portion of it is offloaded to the GPU since it's a 12 GB card, and the rest is on the CPU, and it's... comparably slow. Maybe a bit faster.
Hey Alex, is there a direct correlation between the amount of RAM and the number of parameters a system can support? I'm just thinking of getting an M4 Mac mini and wondering what difference it would make to get 16 or 32 GB of RAM. What kind of LLM would I be able to run on such a small system?
Yes, there is a direct correlation. For a non-quantized model I generally assume I need about twice as much RAM as parameters, but I could be totally off base. You can get an estimate by looking at the size of the model download file on Hugging Face. On a Windows machine, I believe only 75% of the RAM is available to the integrated GPU, which is why he only had 55 GB of VRAM available and not 96. You can see that he still used the whole 96 though.
On a 32 GB machine you should be able to run a 7 or 8 billion parameter model. Some people say they can run a heavily quantized 33-billion model; I even saw one claim of running a 70-billion model. However, even the heavily quantized 33B model was only doing 7-8 tokens per second. I think you could run the same model on a 16 GB machine if it was heavily quantized, but my guess is it would take up most of the system, so you couldn't do much else on the machine. I would buy as much RAM as you reasonably can; that is my plan. These models are very RAM hungry. Plus, you may be running multiple models at the same time, or at least keeping them loaded at the same time, once the baked-in Apple models come out.
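A rough sketch of the rule of thumb above (the 1.2x overhead factor for context and runtime is an assumption, not a measured value):

```python
# Approximate RAM needed to load a model: parameters x bytes/param x overhead.
def est_ram_gb(params_b: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """params_b in billions; overhead covers KV cache and runtime (assumed 1.2x)."""
    return params_b * (bits_per_param / 8) * overhead

for params, bits, label in [(8, 16, "8B fp16"), (8, 4, "8B q4"),
                            (33, 4, "33B q4"), (70, 4, "70B q4")]:
    print(f"{label}: ~{est_ram_gb(params, bits):.0f} GB")
# 8B fp16: ~19 GB, 8B q4: ~5 GB, 33B q4: ~20 GB, 70B q4: ~42 GB
```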
P.S. ALSO, very sorry about repeated replies ! Although I am very careful, Google YT has been accepting my carefully worded replies, but then -- later when I again access a channel -- some of my replies are simply 'gone'. Days later -- when I access this and other channels again -- some (but not all) of my replies "reappear". This current comment may not survive, but this problem happens more with replies to individuals. Often, I will receive Gmail notifications of replies that showed on YT days later. Quote: "I have a bad feeling about this!".
Although it's cool to see that it works at all, I can't think of how it would be usable at such a low output speed, besides maybe confirming that the model runs. I'm comparing it with my M1 Max MacBook as a 70B reference, which gives usable generation speeds (reading speed, 5-8 tokens/s depending on quantization).
AI suggests there is a way to use the 4090 and have it spill over into system RAM. With the 4090 being a lot faster at the AI calcs, it might beat your little box method. I would like to see you run 3.1 70B, and also Reflection 70B (assuming that model is legit).
4 bit quant though. Why? 16 or bust. 8 if you're in a pinch. Soon though, give it another year or two. Models need to perform a bit better, and hardware needs to get a bit better, or just cheaper.
I have a Core Ultra 7 155H in my laptop. Unfortunately, I can't run the 70B-parameter model since the RAM is soldered (because it's, well, a laptop), but I have run the 8B-parameter Llama.
Task Manager isn't reporting allocated VRAM for the iGPU anymore? I max out at 16 GB on my Ryzen 5750G with 64 GB of RAM installed, set in the BIOS. Maybe the new systems are truly HSA. But it looks like there is a copy of the data in the non-GPU section? I wish AMD hadn't ditched their HSA efforts 15 years ago.
I'm interested in whether my 2019 Intel machine with 128 GB of RAM will finally have the opportunity to use all of it. The most I ever really needed was like 55 GB or so. My 16 GB M1 hates me. Haha.