Thanks for the coverage! I'd be interested in a tool-use / RAG / other-utilities comparison against Llama 3.1 8B quantized aggressively enough to close the gap in RAM and performance.
The MoE wasn't wrong: the correct answer for that calculation was exactly 9.9996, and rounding _is_ the next step. So I'd say it actually did better on that specific question.
Unfortunately, every Phi model I've tested so far has suffered model collapse after 3 to 5 queries. I only see this with Microsoft models, or with models I truncated myself. I don't understand the hype and I don't trust the benchmarks. To be clear: I have about 15 different official, untampered models running locally, and none except the Microsoft ones have this issue.
Does anyone know of a source for community/conversation on LLMs and business? I'm a technologist developing an app and would really like to find a good source for discussing ideas and what's working/not working.
It's funny. Every time a new Phi model comes out I get insanely bearish on LLMs, because these models always suck. They game the benchmarks but are horrendous to actually use.
How much longer are we going to pretend these are in any way practical? No one can run them on-prem except large corporations, and many of the privacy issues open source was supposed to address come right back once you start using someone else's hardware. It's great to see smaller models improve and push the foundation models, but if you want to do anything real with these, especially agentic processes gobbling thousands of tokens, the latency and performance demands push you toward a hosted service... at which point you might as well use the free Flash or mini tiers, with no setup or hosting issues.
Well, you actually can run a crew of Phi models on a MacBook Pro. An M3 Pro with 36 GB of unified memory can allocate around 27 GB of that pool solely to the GPU for inference.
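For a rough sense of what fits in that budget, here's a back-of-envelope sketch. The quantization, KV-cache, and overhead figures are my assumptions, not measurements:

```python
# Rough VRAM budget for running several quantized Phi-3-mini instances
# in Apple Silicon unified memory. All per-instance numbers are assumptions.

GPU_BUDGET_GB = 27.0       # ~75% of 36 GB unified memory usable by the GPU
PARAMS_B = 3.8             # Phi-3-mini parameter count, in billions
BYTES_PER_PARAM_Q4 = 0.5   # assuming ~4-bit quantization
KV_CACHE_GB = 1.0          # assumed KV cache at a modest context length
OVERHEAD_GB = 0.5          # assumed runtime buffers, activations, etc.

per_instance_gb = PARAMS_B * BYTES_PER_PARAM_Q4 + KV_CACHE_GB + OVERHEAD_GB
print(f"~{per_instance_gb:.1f} GB per instance")
print(f"fits ~{int(GPU_BUDGET_GB // per_instance_gb)} instances in {GPU_BUDGET_GB:.0f} GB")
```

On those assumptions, each 4-bit instance needs about 3.4 GB, so roughly seven could share the 27 GB GPU budget.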
@pwinowski It's not about can/can't. What tokens/sec are you getting doing that locally? Now consider hitting the gemini-flash API with 128k tokens 15 times a minute, for free.