Llama3 speed test on Dual Nvidia 3090 

Lev Selector

Llama3 speed test on a Linux PC with two Nvidia RTX 3090 cards (24GB each, 48GB total).
Presented by Lev Selector - May 13, 2024
Slides - github.com/lselector/seminar/...
--------- My websites:
- Enterprise AI Solutions - EAIS.ai
- Linkedin - / levselector
- GitHub - github.com/lselector
--------- Contents of today's video:
We have tested llama3 (70b and 8b) on a desktop with two Nvidia RTX 3090 24GB video cards.
- CPU: AMD Ryzen™ 9 3900X (12 cores, 24 threads)
- RAM: 32GB
- WSL2 (Windows Subsystem for Linux).
- We used ollama to run llama3 8b and 70b
- Actual models were: llama3:latest, llama3:70b, llama3:70b-instruct-q4_K_M
- We also compared with performance on an Apple MacBook Pro M3 Max 128GB
The prompt was:
Please make a numbered chronological list of the last ten (10) US presidents in reverse order. The list should start like this: 1. Joe Biden (2021-present); 2. Donald Trump (2017-2021); 3. Barack Obama (2009-2017); the list should contain 10 rows. Important - make a fresh list. Disregard the chat history. Output only the list itself, nothing else. Output each list element on a separate line.
Results below show the total duration of the response (in seconds) and output speed (in tokens/s).
We compare the Linux machine (Windows WSL2) with the latest MacBook Pro M3 Max 128GB.
70b models:
- Linux: 9.7 s, 14 t/s
- Mac: 16.5 s, 8.6 t/s
8b models:
- Both: 2.6 s, 56 t/s
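The duration and tokens/s numbers above can be collected programmatically: ollama's REST API returns `total_duration`, `eval_count`, and `eval_duration` (all durations in nanoseconds) with each non-streaming response. A minimal sketch, assuming an ollama server is running on the default local port (the helper names here are illustrative, not part of ollama):

```python
import json
import urllib.request

def tokens_per_second(eval_count, eval_duration_ns):
    """Output speed: generated tokens divided by eval time (ollama reports ns)."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model, prompt, host="http://localhost:11434"):
    """Run one non-streaming generation and return (total_seconds, tokens_per_s)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    total_s = body["total_duration"] / 1e9
    return total_s, tokens_per_second(body["eval_count"], body["eval_duration"])
```

For example, the Linux 70b run above (14 t/s) corresponds to roughly 140 tokens generated in 10 seconds of eval time.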

Published: May 12, 2024

Comments: 2
@lev-selector, 25 days ago:
When running the 70b model, you can see that it is loaded completely into the two GPUs (using most of their combined memory). If we repeat the same test with only one GPU, the generation speed drops 10+ times, to approx. 1 token/second. It is amazing how a thin, quiet Mac laptop can match the performance of a powerful and noisy desktop with a 1.5 kW power supply :)
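The single-GPU comparison described in this comment can be reproduced by hiding one card from the ollama server with CUDA_VISIBLE_DEVICES. This is a config sketch, not a verified command sequence; it assumes ollama is installed and running under CUDA on Linux/WSL2, and the prompt is abbreviated here:

```shell
# Hypothetical single-GPU run: expose only GPU 0 to the ollama server,
# then re-run the same prompt and compare the reported eval rate.
CUDA_VISIBLE_DEVICES=0 ollama serve &
sleep 5   # give the server a moment to start
ollama run llama3:70b --verbose "Please make a numbered chronological list..."
```

The `--verbose` flag makes `ollama run` print timing statistics, including the eval rate in tokens/s, after the response.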
@qGte, 25 days ago:
Did you use SLI or no?