@@quinxx12 AFAIK to run LLMs effectively the data needs to be held in VRAM, as graphics cards can process data significantly faster than CPUs. I didn't fully understand this video btw, so I assume it's some kind of hack to run the 70B model in system memory with CPU processing.
How old are you guys? I assume 20 or younger, because 10 years ago 16 GB of RAM was okay, and today MacBooks still ship with 8 GB of RAM. 10 years in the future this will still hold up.
I don't want to disappoint you, but I am quite sure you would get the same 1.4 t/s running the 70B-parameter model purely on the CPU, and it would use half of the memory. So theoretically you would be able to run 180B models on the CPU (q4_K_M version). The thing is that on current PCs it's not compute power that is the limiting factor but memory bandwidth, and since both the iGPU and the CPU use the same memory, you will get very similar speeds. Make a follow-up video; maybe I am wrong, and if so I will be happy to learn that.
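A rough back-of-the-envelope for the bandwidth-bound claim above (a sketch; the ~40 GB model size and ~60 GB/s sustained bandwidth figures are assumptions, not measurements from the video):

```python
# Token generation streams every weight from RAM once per token,
# so tokens/s is roughly bounded by bandwidth / model size.
model_size_gb = 40    # assumed: a 70B model at q4_K_M is ~40 GB
bandwidth_gbs = 60    # assumed: dual-channel DDR5 sustains ~60 GB/s in practice

print(f"upper bound: {bandwidth_gbs / model_size_gb:.1f} tokens/s")  # ~1.5 t/s
```

That lands right around the 1.4 t/s shown in the video, which is why CPU and iGPU results converge when they share the same memory.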
Strix Halo will have a 256-bit memory controller. DDR6 will be 2x DDR5 speed. Potentially we will 4x memory bandwidth in 2-3 years. The expensive Mac Studio Ultra has 800 GB/s memory bandwidth right now.
@@Fordance100 I was talking about the Core Ultra 5 125H. The Strix Halo iGPU will be in its own league, and I am impatient to see its test results. The Mac Studio has unified memory, which is basically soldered onto the package. As I understand it, Intel is also going to employ this approach for its ultra-thin series. Let's see, let's see.
This is my favorite sub-sub-subgenre, because figuring out how to run fast, smart LLMs on consumer equipment is hard today. Gaming GPUs (too small) and Mac Studios (too expensive) are stop-gap solutions. I think these will have huge applications in business once Groq-like chips are available and we don't have to send most LLM requests to frontier models.
It's not just the computation speed of the 4090; its VRAM has extremely high bandwidth (and is therefore so expensive, but also crazily power-hungry). Apple Silicon doesn't have "just more RAM": Pro/Max/Ultra each double the base M-series memory width/bandwidth. So the M2 Ultra, at 800 GB/s, gets close to a 4090's >1 TB/s. LLM token generation is mainly dependent on memory bandwidth. THIS (and power consumption) is why many buy a Mac Studio instead of multiple 4090s if they do just LLM inference and not machine learning. But NVIDIA is nearly without peer for ML because of its raw compute power.
The lowest price you could pay for hardware that isn't overworked by a model this size, meaning over 70 GB of memory, would be a dual A6000 48GB setup, which would cost you around $8,000 US. An 80 GB H100 starts at $30,000 US, and an 80 GB A100 would also likely cost you over $20,000 US. And we're talking about the video card only! Well, not really a video card, but a PPU. So this $800 box that can somewhat mimic/work the 70B LLM is quite okay and actually fascinating. For long tasks and creative work it would probably overheat, and a 50-page document would take hours, but $800 US is, you are right, not cheap. It is like ultra cheap!!!
Been considering two MacBooks, each with 128 GB, to run Q4 Llama 3.1 405B, and that would run over $10k. If you ask ChatGPT to build out a system that can run full-size 405B, it'd be $500k+.
@@ShihanQu why would you choose a MacBook? You could go with a normal PC and dual Nvidia cards, or get a quality ARM server with two processors at like 24 cores each, and it would work like magic.
Running Ollama with Phi-3.5 and multimodal models like minicpm-v on an Amazon DeepLens, basically a camera that Amazon sold to developers that is actually an Intel PC with 8 GB of RAM and some Intel-optimized AI frameworks built in. Amazon discontinued the cloud-based parts of the DeepLens program, so these perfectly functional mini PCs go for as little as $20 on eBay. I have 10. :)
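For anyone wanting to replicate that kind of low-RAM setup, a minimal sketch using the official `ollama` Python client (assumes the Ollama server is already running locally; the model name and prompt are just examples):

```python
# pip install ollama -- talks to a local Ollama server.
# phi3.5 is ~2.2 GB at its default quantization, so it fits in 8 GB of RAM.
import ollama

response = ollama.chat(
    model="phi3.5",
    messages=[{"role": "user", "content": "Describe what a camera might see at dusk."}],
)
print(response["message"]["content"])
```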
Hmmm, besides RAM/VRAM size, it's mostly RAM bandwidth that determines llama.cpp's token-generation speed (the 4090 has >1 TB/s, the M2 Ultra has 800 GB/s). GPU horsepower is mainly useful for (batched) prompt processing and training. And for RAM size, it's not just the model! With large-context models like Llama 3.1, RAM requirements totally explode if you start the model with its default 128k token limit. But cool video, thanks!!!!
Is there a bottleneck for discrete cards moving memory, though, versus a shared memory bus that can load it faster? As in, the 4090's bandwidth only exists within the card itself, and the regular RAM used to feed it is 3-5x slower? Unfortunate that Nvidia will most likely never make something in the middle of the 4090 - A4400 for ML and AI people.
@@xpowerchord12088 it's simpler, yet complicated: a transformer has to go through its entire compute graph for each and every token it generates. So it has to pump ALL billions of parameters, as well as the transformer's KV cache (which can add many more GBs at 128k context sizes), from memory (RAM or VRAM) via the (multi-level but small) on-chip caches to the ALUs (in the GPU or CPU or NPU). Token generation (unlike prompt processing) is not batched, so this has to be repeated for EACH and every token it generates. Modern CPUs (with their matrix instructions), GPUs, and NPUs have very many ALUs calculating/working in parallel. Because of this, it's not the calculating but the pumping of the parameters/KV cache from memory to those ALUs that becomes the bottleneck. Current NVIDIA (e.g. 4090) is able to pump more than 1 TB/s with ultra-fast and wide RAM. Apple Silicon uses 128-bit wide RAM in the M, 256-bit in the M Pro (except for the M3 Pro, which is crippled), 512-bit in the M Max, and 1024-bit in the M Ultra. Combined with the RAM's transactions/s, this yields 120-133 GB/s for the M4 with its LPDDR5X (new Intel/AMD and Snapdragon X do similarly), more for the Pro, and up to 800 GB/s for the M2 Ultra (with its older LPDDR5). Hope this clarifies; sorry for the long-winded explanation.
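To put a number on the KV-cache part of that explanation, a quick sketch (the layer/head counts are the published Llama-3.1-70B config values; treat the result as an estimate):

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem
layers, kv_heads, head_dim = 80, 8, 128   # Llama-3.1-70B uses grouped-query attention
context = 128 * 1024                      # the default 128k token limit
fp16_bytes = 2

kv = 2 * layers * kv_heads * head_dim * context * fp16_bytes
print(f"fp16 KV cache at 128k: {kv / 2**30:.0f} GiB")      # ~40 GiB
print(f"q4_0 KV cache at 128k: {kv / 4 / 2**30:.0f} GiB")  # ~10 GiB
```

Those ~40 GiB at fp16 and ~10 GiB at 4-bit match the exllamav2 q4_0 KV-cache numbers a commenter cites further down.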
It's a software issue. Ultra processors are optimized to use the NPU for artificial intelligence, NOT the GPU. You're using the wrong part (a very slow GPU instead of a very fast NPU), but I understand it's because the software you're using doesn't let you use the NPU. You should also SHOW how many tokens per second the CPU alone can process (NOT using the GPU) to compare the performance. I insist, you're using a very slow GPU; maybe even the CPU is better. In science, you always have to check all the factors experimentally and not take anything for granted. Good luck 🍀
The same library seems to support multi-GPU setups. Hence an 8x Intel Arc Pro A60 setup, totalling 96 GB of VRAM, could in theory be attempted and still be more competitive than a Mac Studio from a TOPS-per-dollar perspective. Don't expect the same size, silence, and power efficiency though…
For a small lab, maybe. Functionality would be way down though. An M3 Max with 96 GB would be a better all-around deal for an individual. You should see the pro-level Nvidia cards that can be linked, each with 48 GB of RAM. Too bad Nvidia will not jump into this world when it has the gaming and pro sectors nailed.
Thanks for testing this out. I thought about trying this for myself using the Minisforum version of this mini PC. There seems to be another way of running LLMs, using the actual NPU of the Ultra CPUs instead of the Arc GPU, by running them via OpenVINO. I would be very much interested in some more testing on Linux + OpenVINO.
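A minimal sketch of that OpenVINO route using the `openvino-genai` package (the model directory is a placeholder, and NPU support for LLMs depends on the model and driver stack, so treat this as an assumption to verify):

```python
# pip install openvino-genai
# The model must first be exported to OpenVINO IR, e.g. via optimum-cli.
import openvino_genai

model_dir = "llama-3.1-8b-instruct-ov"  # placeholder: an IR-converted model
pipe = openvino_genai.LLMPipeline(model_dir, "NPU")  # or "GPU" / "CPU"
print(pipe.generate("Why is memory bandwidth the LLM bottleneck?", max_new_tokens=100))
```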
I just discovered these mini PCs (sue me), and I gotta tell you, these are pretty darn good. I just ordered one to run my website cuz DigitalOcean's prices are too high compared to these in the long run.
1.43 t/s is kinda OK, but realistically it's not very useful. I think the bang-for-buck option would be a couple of Tesla P40s to get like 5 t/s. It won't look pretty, but if you chuck it in the garage or something, it's not a problem.
Thank you very much! I have the newest 70B model on both an MSI laptop AND an MSI desktop, each with 64 GB DDR5. They run somewhat slow, but usable, and FASTER than your demo! 😮
Same CPU and GPU? Anyway, please give token rates, quantization, and relevant machine specifications. I would assume better performance on the desktop, at least because of better cooling versus a mini PC.
Running Llama 3 (8B Q4) on Ubuntu 24.04 with a Ryzen 7 6800H and its iGPU (680M)! I get about 30 tok/s consistently. Ditch Windows if you want the best results.
Yes, I also find Linux allows for unique configurations that cannot be done in Windows in order to run large LLMs with lower memory requirements. Windows 11 is still bloatware that hogs resources which could be better used for LLMs. Heck, Windows runs faster in a Linux virtual machine than installed directly.
You totally can run such models with a 4090; part of the model just gets offloaded to the CPU, but it should still be much faster than 1.4 tokens per second. And this mini PC costs $800 for the 32 GB RAM version; by no measure is that cheap. That's $100 more than I paid for a used 3090. On 8B models I'm getting over 70 tokens/s, and on 70B it varies between 2 and 3 tokens/s. I know the mini PC is better performance-per-watt-wise, but there is no room for expansion or adding more GPUs, and the lower power consumption is not really worth the difference.
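For reference, the partial offload described above looks roughly like this with `llama-cpp-python` (a sketch; the GGUF filename and layer count are assumptions you would tune to your VRAM):

```python
# pip install llama-cpp-python (built with CUDA support).
# Layers that fit go to the GPU; the rest run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # assumed: roughly what fits in 24 GB alongside the KV cache
    n_ctx=8192,       # a modest context keeps the KV cache small
)
out = llm("Q: Why offload only some layers? A:", max_tokens=80)
print(out["choices"][0]["text"])
```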
@@AZisk Cool, especially since you do some development on Windows but for some reason use WSL2 even when it's not needed (it can be quite a bit slower for some stuff than native...)
I knew Windows was going to win at this LLM thing. I knew that Intel Arc would also win. Nice one. I want you to know that some of us might never have access to a Mac due to location. And the fact that you can just buy a RAM upgrade is amazing.
My mini PC for general usage is a Ryzen 7 APU with integrated AMD graphics and 64 GB DDR4 RAM, 56 GB of which has been set as dedicated graphics memory in the BIOS. It's slow, it's AMD, but it runs stuff on the GPU and is still a lot faster than CPU-only (it still sucks at running Cyberpunk 2077).
.. keep wondering how a modern AMD desktop CPU (*G model) with a load of CPU cores, a decent integrated GPU, and 128 or 256 GB of fast DDR memory would handle things. Certainly the cheapest way to get (close to) 256 GB of memory behind a GPU that I can think of; you could have a rack of them for the cost of the Nvidia GPUs you would need to reach that 256 GB.
@@jelliott3604 more cores don't help; you'll actually get better performance with hyperthreading off. Single-core benchmarks are a better indicator for LLM/AI, as it's about clock speed/turbo boost x RAM bandwidth throughput.
@@jelliott3604 AMD keeping AVX-512 in their consumer line is gonna make the competition really interesting for CPU-centric builds though, maybe as soon as this next gen refresh. Intel is making all the wrong moves.
@whodis5438 my gaming box, the one in the nice case with all the ARGB lighting, is another Haswell-E CPU (i7-5960X) on an LGA 2011 board, with hyper-threading turned off and that octa-core 3 GHz processor clocked up to just under 4.6 GHz on all cores.
I've been using an Intel i7-1255U in a mini PC to run GPT4All with some pretty good results, as long as you stick with smaller, highly quantized models.
In theory, DDR5 should allow for up to 256 GB with 2 sticks, or 512 GB with 4 sticks, EVEN MORE with ECC RDIMMs and LRDIMMs, but those would certainly require a different processor and motherboard.
If your only unit of measure for success is that it can run at all, regardless of how quickly, I made Llama 3 8B run on a Khadas VIM4 Pro using Ollama. Every CPU core spikes and pins at 100% for the majority of the output, but that's expected of an IoT SBC.
I have a Core i7-9750H and am running the Llama 3.1 model pretty well. I'm just now getting into AI models and learning about this stuff, and it's pretty crazy. I want to scale up and mess with this stuff, but finances are the limit lol. It's crazy to think that in 8 or so years we'll likely have something far better running on our phones without a problem. It doesn't have to be perfect, just "good enough" to help people with their work.
"but what's impressive is that this tiny little bo can run a 70B LLM... like a snail. So if you're really REAAALLY patient, this is possibly a solution for you" Lol
While the 4090 can't hold the whole model, it can still speed up the process significantly, as you can offload some layers to the GPU. BTW, a year ago I was able to run Llama 2 70B on my laptop with a 6900HS 8-core CPU, and I only have 24 GB of RAM, so it was using swap (virtual memory), aka the internal SSD. I was getting one token of output every 10 secs. I only had a 3060 6GB, so I couldn't offload much to the GPU.
It can load the whole model (Google exllamav2). And it can do so much more, in particular using q4_0 on the KV cache, bringing it down from 40 GB to 10 GB at 128k context.
Wife reading Alex's credit card statement: $309 - Very comfy chair... check, "he needs it, poor thing, works too hard making content for all those geeks." $2000 - Nvidia Video Card 4090... check, "2000 dollars for a video card???????? well cheaper than a macbook, ok." $500 - Microsoft X-Box Game Pass Ultimate 1 year Subscription... check, "Alex, dear, are you playing those "online" games with these fans of yours?"
No. NPUs only get used when a particular piece of software uses their API, i.e. Photoshop, Apple Intelligence, Copilot. NPUs are proprietary, underutilized, and stuck on the hype train. GPUs are much better at this.
From my understanding, NPUs are for apps that Apple provides the API for. It's proprietary. They are mostly for hype right now and not used for running AIs, mostly just in apps that do AI and video imaging.
@@univera1111 NPUs are basically GPUs from my understanding, but proprietary and not used for this. Not much actually uses the NPU aside from the OS and companies who get the API, like Adobe.
I was running Llama 3.1 70B on an old server: 2x Xeon chips, 128 GB of RAM running at 1333 MHz... total cost for the server = $125 off Facebook Marketplace (PowerEdge R710). Responses took a while, but it ran.
That is so weird. I wanted to ask you to test exactly that! I was thinking about upgrading my lab box to run Llama, but I was not sure if that was any better than a paid subscription, or if it would work at all. Thank you! The only thing I would love you to try is setting up Llama inside a virtual machine on this very box. That might be a significantly more challenging task though. :)
Sorry, this system is useless; 96 GB of RAM is not enough for running useful AI models. Even my PC with a Ryzen 9 + 192 GB DDR5 + 4090 can't run 130B+ models at a good speed.
I find that 4-bit quantization makes models borderline incoherent. I don't think it's actually useful to run a giant model if you have to heavily quantize it.
Oh come on Alex, I'm sure this chair is very nice and all, but I don't believe for a second that you were using the cheapest possible version of the Herman Miller Aeron (as there are 2 different versions of lumbar support for the Aeron, one of which is a $13 eBay upgrade if you didn't get it from the factory). I feel like you removed the lumbar support just for that shot :/ Edit: sorry, I just got an Aeron so I'm defensive lol
Nvidia cards can now use system RAM as VRAM. It is of course quite a bit slower, but it makes some tasks possible that previously were not. My RTX 4070 Ti Super has 16 GB of VRAM, but if I look in Settings > Display > Advanced display settings > Display adapter properties for Display 1, I see:
Total Available Graphics Memory: 32729 MB
Dedicated Video Memory: 16384 MB
System Video Memory: 0 MB
Shared System Memory: 16385 MB
Note: I have 32 GB of system RAM. So don't feel that you will be limited to only 24 GB of VRAM on your 4090 if you have some system memory available.
Great vid Alex! Please 🙏 consider another vid where you do an Ubuntu install. I'm running Ollama on the GMKtec M3 with 16 GB RAM and am very pleased with inference on Gemma 2 and Llama 3.1, using Open WebUI as a front end on any device. Thanks for your awesome content 👏
Technically, Nvidia could allow the user to use normal system RAM as VRAM, similar to swapping. They could also use memory compression for VRAM; it's pretty usual for system RAM. Maybe they already do this for VRAM too, I'm not sure. Yes, it'd be slower if they used these techniques, but it'd be better than not being able to run the task at all.
Technically you could just mount X amount of RAM as a volume and use that as a swap disk as well, though; no need for Nvidia to do anything. If they allow swapping, any volume could be swapped to.
Pretty sure the limiting factor here is just the amount of memory. I can run a 70B model with my 3070 Ti; only a very small portion of it is offloaded to the GPU since it's a 12 GB card, and the rest is on the CPU, and it's... comparably slow. Maybe a bit faster.
Hey Alex, is there a direct correlation between the amount of RAM and the number of parameters a system can support? I'm just thinking of getting an M4 Mac mini and wondering what difference it would make to get 16 or 32 GB of RAM. What kind of LLM would I be able to run on such a small system?
Yes, there is a direct correlation. For a non-quantized model I generally assume I need about twice as much RAM as parameters, but I could be totally off base. You can get an estimate by looking at the size of the model download file on Hugging Face. On a Windows machine, I believe only 75% of the RAM is available to the integrated GPU, which is why he only had 55 GB of VRAM available and not 96. You can see that he still used the whole 96 though.
On a 32 GB machine you should be able to run a 7 or 8 billion parameter model. Some people say they can run a heavily quantized 33-billion model; I even saw one claim of running a 70-billion model. However, even the heavily quantized 33B model was only doing 7-8 tokens per second. I think you could run the same model on a 16 GB machine if it was heavily quantized, but my guess is it would take up most of the system, so you couldn't do much else on the machine. I would buy as much RAM as you reasonably can; that is my plan. These models are very RAM hungry. Plus, you may be running multiple models at the same time, or at least keeping them loaded at the same time, once the baked-in Apple models come out.
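A rough sketch of the rule of thumb above (the 1.2x overhead factor for context and runtime is an assumption, not a measured value):

```python
# Approximate RAM needed to load a model: parameters x bytes/param x overhead.
def est_ram_gb(params_b: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """params_b in billions; overhead covers KV cache and runtime (assumed 1.2x)."""
    return params_b * (bits_per_param / 8) * overhead

for params, bits, label in [(8, 16, "8B fp16"), (8, 4, "8B q4"),
                            (33, 4, "33B q4"), (70, 4, "70B q4")]:
    print(f"{label}: ~{est_ram_gb(params, bits):.0f} GB")
# 8B fp16: ~19 GB, 8B q4: ~5 GB, 33B q4: ~20 GB, 70B q4: ~42 GB
```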
P.S. ALSO, very sorry about repeated replies ! Although I am very careful, Google YT has been accepting my carefully worded replies, but then -- later when I again access a channel -- some of my replies are simply 'gone'. Days later -- when I access this and other channels again -- some (but not all) of my replies "reappear". This current comment may not survive, but this problem happens more with replies to individuals. Often, I will receive Gmail notifications of replies that showed on YT days later. Quote: "I have a bad feeling about this!".
Although it's cool to see that it works at all, I can't think of how it would be usable at such a low output speed, besides maybe confirming that the model runs. I'm comparing it with my M1 Max MacBook as a 70B reference, which gives usable generation speeds (reading speed, 5-8 tokens/s depending on quantization).
AI suggests there is a way to use the 4090 and have it spill over into system RAM. With the 4090 being a lot faster at the AI calcs, it might beat your little box method. I would like to see you run 3.1 70B, and also Reflection 70B (assuming that model is legit).
4 bit quant though. Why? 16 or bust. 8 if you're in a pinch. Soon though, give it another year or two. Models need to perform a bit better, and hardware needs to get a bit better, or just cheaper.
I have a Core Ultra 7 155H in my laptop. Unfortunately, I can't run the 70B-parameter model since the RAM is soldered (because it's, well, a laptop), but I have run the 8B-parameter Llama.
Task Manager isn't reporting allocated VRAM for the iGPU anymore? I max out at 16 GB on my Ryzen 5750G with 64 GB of RAM installed, set in the BIOS. Maybe the new systems are truly HSA. But it looks like there is a copy of the data in the non-GPU section? I wish AMD hadn't ditched their HSA efforts 15 years ago.
I'm interested in whether my 2019 Intel machine with 128 GB of RAM will finally have the opportunity to use all of it. The most I ever really needed was like 55 GB or so. My 16 GB M1 hates me. Haha.