
LocalAI LLM Testing: Viewer Questions using mixed GPUs, and what is Tensor Splitting AI lab session 

RoboTF AI
491 subscribers
1.2K views

Attempting to answer good viewer questions with a bit of testing in the lab.
We will be taking a look at using different GPUs in a mixed scenario, and going down the route of tensor splitting to get the best out of your mixed-GPU machines.
We will be using LocalAI, and an Nvidia 4060 Ti with 16GB VRAM along with a Tesla M40 24GB.
Grab your favorite after-work or weekend enjoyment tool and watch some GPU testing.
Recorded and best viewed in 4K
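
For anyone who wants to replicate the setup, here is a minimal sketch of what a LocalAI model definition with tensor splitting might look like. The field names follow LocalAI's llama.cpp backend options (gpu_layers, tensor_split, main_gpu); the model file and the 40/60 ratio are illustrative assumptions, not the exact config from the video, so check them against your LocalAI version.

```yaml
# Hypothetical LocalAI model definition (models/example.yaml).
# The 40/60 tensor_split roughly mirrors a 16GB 4060 Ti + 24GB M40 pair.
name: example-model
backend: llama-cpp          # backend name varies by LocalAI version
context_size: 4096
parameters:
  model: example-model.Q8_0.gguf   # placeholder GGUF file
gpu_layers: 99              # offload as many layers as possible to GPU
tensor_split: "40,60"       # proportion of tensors per visible GPU
main_gpu: "0"               # GPU that coordinates the split
```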

Science

Published: 2 Aug 2024

Comments: 26
@jackflash6377 · 22 days ago
Outstanding! Glad I found this channel. Thank you, sir.
@RoboTFAI · 21 days ago
Thanks for watching!
@kevinclark1466 · 9 days ago
Great video! Looking forward to trying this…
@RoboTFAI · 9 days ago
Have fun!
@six1free · 25 days ago
Hands down one of the best YouTube channels out there - and I'm not just saying that because you flashed my question :D I really do love how thoroughly you've taken to answering it... this being the pause point, I'm going to guess that CUDA will do it all for you ("as if" - I'm sure :D). I am so envious of your test rig... as it is, though, I'd need a data center for power. As for adding the other cards: further research into tensors, and rewatch this video when applicable :D Downloaded and saved to my good tutorials (very long) playlist... enjoy the well-deserved follow-through.
@RoboTFAI · 24 days ago
Thanks for the idea!
@SphereNZ · 20 days ago
Great video, great info, really appreciate it, thanks.
@RoboTFAI · 19 days ago
Appreciated!
@246rs246 · 25 days ago
I'm blown away by this comprehensive answer to my question. Thumbs up and I'm looking forward to more interesting videos.
@RoboTFAI · 24 days ago
Awesome, thank you!
@AkhilBehl · 25 days ago
This is absolutely awesome stuff.
@RoboTFAI · 24 days ago
Thanks!
@CoderJon · 11 days ago
Love your videos. I appreciate that you leave the interpretation of the results to us, but I would love a video talking about your interpretations of the data. For example: why your results for prompt tokens per second were higher with the 90/10 split. I can assume it's because there is some sort of parallel processing happening on the interpretation of the prompt, but I am still new to the AI world, so I would love the education.
@RoboTFAI · 10 days ago
Much appreciated! I attempt to keep my mouth shut and let the data show the info. Definitely not an expert, and just learning like everyone else. I never intended to create an actual channel; the first video was to prove out a conversation with friends with hard data, and the testing app is for other uses in my lab. It's just turning into a place where we can all share some data and learn from it - or at least burn some of my power bill together!
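
For readers who want to run this kind of measurement themselves, here is a minimal sketch that times requests against a LocalAI server, assuming only its standard OpenAI-compatible /v1/chat/completions endpoint; the URL, port, and model name are placeholders, and the long-vs-short prompt comparison is just a rough way to see where prompt processing dominates.

```python
# Minimal sketch: time completions against a LocalAI server via its
# OpenAI-compatible API. URL and model name are placeholders.
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port
MODEL = "example-model"  # placeholder model name

def timed_completion(prompt: str, max_tokens: int = 128) -> dict:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    usage = data.get("usage", {})
    return {
        "seconds": elapsed,
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }

if __name__ == "__main__":
    # A long prompt makes prompt processing dominate the wall time; a
    # short one makes generation dominate. Comparing the two hints at
    # which phase a given tensor split is helping.
    for name, prompt in [("long", "word " * 2000), ("short", "Hi")]:
        print(name, timed_completion(prompt))
```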
@mbike314 · 14 days ago
Thank you for creating this valuable content. I am pleased to have discovered it. I am interested in some of the 4060s you mentioned. I sent an email. Please keep going with this channel! Wonderful stuff!
@RoboTFAI · 10 days ago
Thanks a ton! Didn't see any email - reach out to robot@robotf.ai or ping me on Reddit, etc.
@mbike314 · 4 days ago
Thank you. I did send it to the wrong address. Just resent it to the correct address.
@andre-le-bone-aparte · 23 days ago
Question: @03:14 - NVTOP is showing 90+ degrees Fahrenheit (86 on the M40) on each of those cards... WITHOUT any active usage? That seems excessive. Currently running a 4x3090 setup at 79 degrees or lower in between queries.
@RoboTFAI · 23 days ago
The 4060's are stacked right next to each other on the bench node in this test (I don't recommend that - they could use space between them since they have side-facing fans, which is why I normally use a lot of PCIe extenders), and they don't run their fans unless there is a load. The M40 in this test has an active fan on all the time. Also, I live in a hot climate and it's been 85-100 degrees (75+ in the workshop, as it's not conditioned) 🔥
@andre-le-bone-aparte · 22 days ago
@RoboTFAI 👍 - Just looking to learn ways to extend the life of these GPUs and increase performance for LLM usage when running 10 hours a day (work day, remote work, as a code assistant).
@tsclly2377 · 24 days ago
I think loading is still an important factor. Do you use NVMe drives, like the large, high-write-endurance Optane P900 series, for fast loads? And FPGAs for pre-staging data (like video, pictures) reconstructed in a faster-to-use form?
@RoboTFAI · 23 days ago
I normally leave the unloaded-model test off, as it doesn't allow as much resolution in the smaller charts. I use Gen 4 NVMe M.2 drives in each of these systems (rated up to 5000/4800 MB/s... yeah, right).
@Zeroduckies · 17 days ago
Or you can get 1TB of RAM and have a 500GB ramdisk ^^
@tsclly2377 · 16 days ago
@Zeroduckies Using HP ML350p machines, one only gets up to 768GB of DRAM, which has to be LRDIMM, and that RAM runs on three channels, which actually slows it down compared to the two-channel 256GB configuration because of the required 'blocking in' and processing - it's all in the specification PDF from HP. It is only when going to the G11 model that one actually gets significantly faster RAM (PCIe 5.0 - HP skipped the 4.0 architecture in these machines) and a larger capacity, at an astronomical increase in price. So when getting a 'loaded' 256GB-DRAM ML350p G8 for a trade of an older gamer machine with a GTX 1660 Ti and a less-than-tenth-gen i7 (about a $300 value), one must be looking for a fast, economical memory solution, and that is where the Optane P900 cards come in (with their 4000GB/s burst) - and one must also compare that against the rate the GPU can actually take in. So this is a cheap way to move data in (and out) in a manner comparable to DRAM... plus you are only occupying a PCIe x4's worth of lanes. Now this is all fine and dandy, but in dual-CPU chipsets the PCIe lanes go all over the place, and that is a major consideration, as the right and left sides are controlled by different CPUs, and SLI or NVLink can be required for OS recognition of the linked GPU cards, which is inherently required for proper function logging... and the PCIe controllers on these machines are going to be slower than single-CPU, purpose-designed motherboards made by other companies, such as the multi-PCIe-16x SuperMicro or Gigabyte professional models that have come out specifically for this type of application and use NVMe arrays for storage... and then you are back to the amount of writes that are going to be applied to the storage.
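
To put rough numbers on the load-speed discussion above, here is a minimal back-of-the-envelope sketch, assuming the drive's rated sequential read is the only bottleneck (real loads also pay for allocation and transfer to VRAM); the model size and drive figures are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope model load times: file size / sequential read rate.
# All figures below are illustrative assumptions, not measurements.
model_gb = 24.0  # e.g. a ~24GB GGUF file

drives_mb_s = {
    "Gen 4 NVMe (rated ~5000 MB/s)": 5000,
    "SATA SSD (~550 MB/s)": 550,
    "RAM disk (~10000+ MB/s)": 10000,
}

for name, mb_s in drives_mb_s.items():
    seconds = model_gb * 1024 / mb_s  # GB -> MB, then divide by MB/s
    print(f"{name}: ~{seconds:.1f}s to read {model_gb:.0f}GB")
```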
@tbranch227 · 13 days ago
Can you run a larger model when you span cards? Or does your model need to be able to fit on each card that you tensor-split across? And what happens to performance if you can run larger models by aggregating card RAM?
@RoboTFAI · 11 days ago
You can absolutely span the larger model between cards! These tests are actually doing that. Performance depends on the cards you are splitting between, but it will land somewhere between your lowest-end and highest-end cards (if they are different models). Running multiple cards doesn't necessarily increase performance; it's really for expanding your VRAM capacity.
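
To make that concrete, here is a minimal sketch of the proportional layer assignment that llama.cpp-style tensor splitting performs, using the 16GB + 24GB pair from this video as the ratio; the layer count and the split-by-VRAM heuristic are illustrative assumptions, not the video's exact settings.

```python
# Sketch: how tensor-split ratios divide a model's layers across GPUs.
# llama.cpp-style splitting assigns layers roughly in proportion to the
# ratios, so usable VRAM is (roughly) the sum of the cards.
def split_layers(n_layers: int, ratios: list[float]) -> list[int]:
    total = sum(ratios)
    per_gpu = [round(n_layers * r / total) for r in ratios]
    per_gpu[-1] += n_layers - sum(per_gpu)  # absorb rounding drift
    return per_gpu

# 4060 Ti 16GB + Tesla M40 24GB, split by VRAM size: ratios 16:24.
print(split_layers(n_layers=80, ratios=[16, 24]))  # -> [32, 48]
```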