This is a very interesting development, I'd never heard of it. As I remember, llama.cpp, which all the open models run on, was originally made at Stanford after Meta "leaked" their models. For now these C++ cores are quite behind the times: they can't load multimodal models yet, nor add new capabilities to older models.
Great, thanks for sharing the knowledge! Do they even expose any APIs the way Ollama does? And by the way, how do you make videos so frequently, man? You deserve more views and subscribers. I subscribed already, though :)
Well, it's not as clear-cut as that. 1-bit quantization has a lot of untapped potential according to tests and other studies, and can probably perform about as well as normal floating-point quantization. But if I had to guess, the trade-off will most likely be precision and the overall ability to produce precise answers. In my personal opinion, though, these kinds of models will mostly end up on mobile devices for less complicated tasks.
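For a rough idea of what the "1.58-bit" idea means in practice, here is a minimal sketch of absmean ternary quantization in the style described for BitNet b1.58; the function names are mine, and a real kernel would pack the ternary values and fold the scale into the matmul rather than dequantizing like this.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with one per-tensor scale.

    Absmean scheme: scale by the mean absolute weight, round, clip to [-1, 1].
    """
    scale = np.mean(np.abs(w)) + eps                 # per-tensor scale gamma
    w_ternary = np.clip(np.round(w / scale), -1, 1)  # values in {-1, 0, +1}
    return w_ternary.astype(np.int8), scale

def dequantize(w_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate float matrix from the ternary weights."""
    return w_ternary.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
    q, s = absmean_ternary_quantize(w)
    mse = np.mean((w - dequantize(q, s)) ** 2)
    print(f"unique values: {np.unique(q)}, scale: {s:.5f}, MSE: {mse:.2e}")
```

The mean-squared error printed at the end is exactly the "lost precision" the comment above is worried about; the open question is how much of it the model can absorb during training.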
When I think about it, we don't really need blazing-fast token generation; what we need is decent token-generation speed, measured against the maximum rate at which a human can take in the AI's output. I mean, if we can follow a video at 2x speed at most, then we don't need generation faster than that for human-facing inference. For agentic workflows it might be a different story: if this kind of inference can produce quality results, it might change how fast agents run...
@@NLPprompter The idea is to have inference speed so fast it surpasses compiling speed. That helps with automating agentic workflows. Interesting areas: maybe it could replace certain compilers; imagine a game running entirely on an inference model instead of an engine. These are just some use cases. On the other hand we also want quality responses, so inference speed alone is not enough.
@@imranmohsin9545 Yes, I agree with that too. I can imagine that in the future our laptops might not have a UI anymore; we'll invoke UI and functionality in real time by speaking to the AI.
It's yet another quantization technique, but the catch is how precise these models are. Can a Llama 3.1 at this precision even be compared to Llama 2 in terms of response quality, given how heavily the precision is reduced?
Really interesting. Do we have any hardware recommendations? Would a server CPU and a ton of RAM help here? If so, do we need cores or clock frequency on the CPU, and would old server DDR4 be enough? And how does its performance compare to a GPU setup?
What happens if a tensor is a texture? I was thinking of using this for tracking and depth estimation… depth is a 32-bit texture… the performance looks amazing. If it can run an LLM, then inferring depth and YOLO is like drinking water. And I could keep the GPU for more attractive interactions!
I have run Llama 3.2 3B Instruct locally in Python from VS Code. Code generation is very, very slow compared to running it through Ollama in the terminal. What is the reason for that?
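One common cause (an assumption, since the Python script isn't shown) is that the Python path loads the full-precision weights on CPU via transformers, while Ollama serves a quantized GGUF through llama.cpp with GPU offload. A minimal sketch of the equivalent setup using llama-cpp-python, which wraps the same backend Ollama uses; the GGUF file path is hypothetical:

```python
# Sketch using llama-cpp-python; point model_path at your local quantized GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # quantized weights, not fp16
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

If the speeds still differ after matching the quantization and offload settings, comparing the reported tokens/second from both runs would narrow it down further.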
Thanks for the model intro. During the GGUF conversion I am getting an error: INFO:root:Converting HF model to GGUF format... ERROR:root:Error occurred while running command: Command '['C:\Users\user\anaconda3\envs\bitnet-cpp\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' returned non-zero exit status 3221225477. Check details in logs\convert_to_f32_gguf.log
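For context (not from the video): exit status 3221225477 is the Windows access-violation code 0xC0000005, and with this conversion step one plausible trigger is simply running out of memory while writing the roughly 32 GB f32 intermediate for an 8B-parameter model. A quick hypothetical pre-flight check, assuming psutil is installed; the parameter count is an estimate:

```python
# Hypothetical pre-flight check before the f32 conversion step (pip install psutil).
import os
import psutil

PARAMS = 8e9                 # Llama3-8B-1.58-100B-tokens has roughly 8B parameters
F32_BYTES = int(PARAMS * 4)  # --outtype f32 stores ~4 bytes per parameter (~32 GB)

ram_free = psutil.virtual_memory().available
disk_free = psutil.disk_usage(os.getcwd()).free

print(f"f32 GGUF needs roughly {F32_BYTES / 2**30:.0f} GiB")
print(f"available RAM:  {ram_free / 2**30:.0f} GiB")
print(f"available disk: {disk_free / 2**30:.0f} GiB")
```

If memory isn't the issue, the traceback in logs\convert_to_f32_gguf.log is the place to look for the real cause.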