This is a very interesting development, I'd never heard of it. As I remember, llama.cpp, which all the open models run on, was originally made at Stanford after Meta "leaked" their models. For now these C++ cores are quite behind the times: they can't load multimodal models yet, nor add new capabilities to older models.
Great, thanks for sharing the knowledge! Do they even expose any APIs the way Ollama does? And by the way, how do you make videos so frequently, man? You deserve more views and subscribers. I subscribed already, though :)
Well, it's not as clear-cut as that. 1-bit quantization has a lot of untapped potential according to tests and other studies, and can probably perform about as well as normal floating-point quantization. But if I had to guess, the trade-off will most likely be precision and the overall ability to produce precise answers. In my personal opinion, though, these kinds of models will mostly end up on mobile devices for less complicated tasks.
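For a rough idea of what the "1.58-bit" idea means in practice, here is a minimal sketch of absmean ternary quantization in the style described for BitNet b1.58; the function names are mine, and a real kernel would pack the ternary values and fold the scale into the matmul rather than dequantizing like this.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with one per-tensor scale.

    Absmean scheme: scale by the mean absolute weight, round, clip to [-1, 1].
    """
    scale = np.mean(np.abs(w)) + eps                 # per-tensor scale gamma
    w_ternary = np.clip(np.round(w / scale), -1, 1)  # values in {-1, 0, +1}
    return w_ternary.astype(np.int8), scale

def dequantize(w_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate float matrix from the ternary weights."""
    return w_ternary.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
    q, s = absmean_ternary_quantize(w)
    mse = np.mean((w - dequantize(q, s)) ** 2)
    print(f"unique values: {np.unique(q)}, scale: {s:.5f}, MSE: {mse:.2e}")
```

The mean-squared error printed at the end is exactly the "lost precision" the comment above is worried about; the open question is how much of it the model can absorb during training.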
When I think about it, we don't really need blazing-fast token generation; what we need is decent token-generation speed, measured against the maximum rate at which a human can take in the AI's output. I mean, if we can follow a video at 2x speed at most, then we don't need generation faster than that for human-facing inference. For agentic workflows it might be a different story: if this kind of inference can produce quality results, it might change how fast agents run...
@@NLPprompter The idea is to have inference speed so fast it surpasses compiling speed. That helps with automating agentic workflows. Interesting areas: maybe it could replace certain compilers; imagine a game running entirely on an inference model instead of an engine. These are just some use cases. On the other hand we also want quality responses, so inference speed alone is not enough.
@@imranmohsin9545 Yes, I agree with that too. I can imagine that in the future our laptops might not have a UI anymore; we'll invoke UI and functionality in real time by speaking to the AI.
It's yet another quantization technique, but the catch is how precise these models are. Can a Llama 3.1 at this precision even be compared to Llama 2 in terms of response quality, given how heavily the precision is reduced?
Really interesting. Do we have any hardware recommendations? Would a server CPU and a ton of RAM help here? If so, do we need cores or clock frequency on the CPU, and would old server DDR4 be enough? And how does its performance compare to a GPU setup?
What happens if a tensor is a texture? I was thinking of using this for tracking and depth estimation… depth is a 32-bit texture… the performance looks amazing. If it can run an LLM, then inferring depth and YOLO is like drinking water. And I could keep the GPU for more attractive interactions!
I have run Llama 3.2 3B Instruct locally in Python from VS Code. Code generation is very, very slow compared to running it through Ollama in the terminal. What is the reason for that?
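One common cause (an assumption, since the Python script isn't shown) is that the Python path loads the full-precision weights on CPU via transformers, while Ollama serves a quantized GGUF through llama.cpp with GPU offload. A minimal sketch of the equivalent setup using llama-cpp-python, which wraps the same backend Ollama uses; the GGUF file path is hypothetical:

```python
# Sketch using llama-cpp-python; point model_path at your local quantized GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # quantized weights, not fp16
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

If the speeds still differ after matching the quantization and offload settings, comparing the reported tokens/second from both runs would narrow it down further.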
Thanks for the model intro. During the GGUF conversion I am getting an error: INFO:root:Converting HF model to GGUF format... ERROR:root:Error occurred while running command: Command '['C:\Users\user\anaconda3\envs\bitnet-cpp\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' returned non-zero exit status 3221225477. Check details in logs\convert_to_f32_gguf.log
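For context (not from the video): exit status 3221225477 is the Windows access-violation code 0xC0000005, and with this conversion step one plausible trigger is simply running out of memory while writing the roughly 32 GB f32 intermediate for an 8B-parameter model. A quick hypothetical pre-flight check, assuming psutil is installed; the parameter count is an estimate:

```python
# Hypothetical pre-flight check before the f32 conversion step (pip install psutil).
import os
import psutil

PARAMS = 8e9                 # Llama3-8B-1.58-100B-tokens has roughly 8B parameters
F32_BYTES = int(PARAMS * 4)  # --outtype f32 stores ~4 bytes per parameter (~32 GB)

ram_free = psutil.virtual_memory().available
disk_free = psutil.disk_usage(os.getcwd()).free

print(f"f32 GGUF needs roughly {F32_BYTES / 2**30:.0f} GiB")
print(f"available RAM:  {ram_free / 2**30:.0f} GiB")
print(f"available disk: {disk_free / 2**30:.0f} GiB")
```

If memory isn't the issue, the traceback in logs\convert_to_f32_gguf.log is the place to look for the real cause.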