It would be best to have alternatives to all of these that are free and open source. Maybe later down the line. The video is really cool though! Thanks Matthew
Matthew - for those of us who develop line-of-business apps for SMEs - local LLM deployment is a must. Would certainly like to see you demo RouteLLM with orchestration - thanks!
I am a cybersecurity analyst who knows very little about coding, so between your videos and just straight-up asking ChatGPT or Claude, I am ham-fisting my way through getting AI to run locally. Please keep making tutorial videos - I am excited to see how to implement RouteLLM!
I agree; I'm pretty much in the same boat as you. The problem is that my knowledge is outdated by the time I finally figure things out, because there is so much advancement in so little time. I think we need a "checkpoint" how-to covering how to do things now, as opposed to 3 months ago.
If you don't know much about anything, like me, but want to run LLMs locally, you just need to install LM Studio. No need to understand anything - the app even has options to download, install, and run models. That's what I use. Now that I've learned a bit more, I will try to install Open WebUI, Ollama, and Docker; those are way more complicated. 🎉❤
Yes! Please show us a comprehensive breakdown of this great tool! I’m also interested in your sponsor’s product, LangTrace. Can you possibly show us how to use it?
Great breakdown, much appreciated. I definitely foresee local LLMs becoming dominant for organisations as soon as next year. My advice during consults is for them not to invest a massive amount in high-end, data-secure cloud systems, but to hang on a little, work with dummy data on current models to build up foundational knowledge, and then, once capable local options exist, start diving into more sensitive analytics.
Hey Matt, yes, it would be great if you could show a demo of how to set up this model on Azure OpenAI or Azure Databricks and then use it in an application.
I feel like everyone is realising things at the same time. I started two projects: an LLM co-ordination system, and chain-of-thought processing on specific models
Just popping up to say thanks Matthew. You have become almost my only required source for AI news because your take is right up my street every time. Great work, keep it coming
It's time people admit that benchmarking against GPT-4 is stupid. When GPT-4 came out, it was amazing. Now it's no better than any other LLM. Ever since OpenAI introduced the cheaper Turbo models, the quality has gone downhill. They sacrificed intelligence for speed to the point where they have plateaued in quality, and it's not getting better no matter how many new models they release.
I agree, man - I don't care about speed as much as I care about accuracy. I'll happily wait for a better response rather than rapidly go through 2-3 quick responses that need more time in the oven.
Thanks for this video - very informative. Please make a full tutorial about setting up RouteLLM and what the recommended specs for the local PC should be. Thank you in advance!
There seems to be a hold-up on the highest-end models, as the leading companies continually try to improve safety while watching their competition. Nobody seems to want to jump in and release a new/better model at the risk of the potential "dangerous" label being applied to them. So a lot of the progress remains hidden in the lab, waiting for competition to finally engage.
I think you misrepresented the graph. The "ideal router" point on the graph is likely just that - the ideal. I don't think it's claiming actual results.
Yes please - full tutorial on setting this up to run locally. Also, I'd like to know how to set up multi-modal support so I can show it my images and casually talk to it (locally).
I really don't find this to be a big deal. I expect people to select the model themselves on a per-task basis, based on what they believe is the most appropriate one for the task. For me the decision process is really simple:
1. Is it code, or does it require complex problem-solving? -> Claude 3.5 Sonnet
2. Do I want a deep conversation with a creative partner? -> Claude 3 Opus
3. Is it anything the other models would refuse? -> GPT-4o
4. Is it too private for any of the above? -> Local LLM
I don't need a router for this, and I wouldn't trust it to reliably choose the same way I would either.
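A per-task decision process like this can be sketched as a few ordered rules. This is a toy illustration only - the keyword checks and model names are placeholders, not a real intent classifier (note the privacy check has to run first so "too private for any of the above" actually overrides the rest):

```python
# Toy per-task model picker mirroring a manual decision process.
# Keyword matching is a crude stand-in for real intent detection.

def pick_model(prompt: str) -> str:
    p = prompt.lower()
    if any(w in p for w in ("password", "medical", "confidential")):
        return "local-llm"            # too private for any cloud model
    if any(w in p for w in ("code", "debug", "algorithm")):
        return "claude-3.5-sonnet"    # code / complex problem-solving
    if any(w in p for w in ("story", "brainstorm", "poem")):
        return "claude-3-opus"        # creative, conversational work
    return "gpt-4o"                   # everything else

print(pick_model("debug this code"))  # claude-3.5-sonnet
```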
First thing I did a year and a half ago was route between different LLMs via a zero-shot classifier. Looks like RouteLLM has done the same thing, lol. I figured it was common sense.
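The idea of classifier-based routing can be sketched with a minimal stand-in: score the query against natural-language label descriptions and route to the model mapped to the best label. Here the "classifier" is just bag-of-words cosine similarity; a real setup would use an NLI-based zero-shot model instead, and the label texts and model names are made up for illustration:

```python
from collections import Counter
from math import sqrt

# Route a query by scoring it against label descriptions, then
# dispatching to the model associated with the best-matching label.
LABELS = {
    "writing or debugging program code": "strong-model",
    "simple factual question or chit-chat": "weak-model",
}

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str) -> str:
    q = _vec(query)
    best = max(LABELS, key=lambda lbl: _cosine(q, _vec(lbl)))
    return LABELS[best]
```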
I'd be interested in what hardware is needed to run something like this locally. I was waiting until late fall or early next year to buy, but I might need to get an intern system to train up. I am big on local control except when needed to reach out.
Hi Matthew, I love your channel. I'm curious if you would be willing to explore Pi AI? It doesn't compare to the others in the same way - maybe it's hard to test - but it's very interesting. It's trained to be empathetic, and you can actually have a voice conversation with it that feels satisfying.
After developing exclusively on GPT models, then joining an org with a ridiculous amount of free GCP credits and being pushed to use Gemini family instead- I can honestly say that while differences on benchmarks may seem small, they end up being really extreme in practice. I spent days smashing my head against a wall trying to get Gemini to provide quality responses, and after switching to 4o, I was literally ready to deploy. There still don't seem to be great benchmarks that represent performance of generative models well.
The problem with this method is that there are some trade-offs. While it may be cheaper at answering a question directly, you sacrifice social intelligence. Even if you get the right answer, the way the answer is phrased can be the difference between a toddler and a graduate student. Personally, I would want to talk with the graduate student.
This sounds interesting, but does it offer the same capability that the OpenAI API offers with customizable assistants, RAG, and function calling? I still have yet to find anything that compares. Would love to see something open source that can do this.
I suspect these features, plus dedicated non-GPU hardware, will eventually reduce energy costs per "thought" to less than the human brain's. Currently, Perplexity (using Sonnet 3.5) thinks GPT-4 uses 25x more.
I think they are saying the brown dot is where an ideal LLM would be placed; I'm not sure that RouteLLM is better than Claude 3 Opus. So I'm not sure where on that chart their router actually is - probably down with Llama 3 8B, because its only job is to route.
Promising, but a bit of a mess with naming. They use "GPT-4" to mean at least GPT-4 Turbo and GPT-4 Omni in various places. I'm not even sure whether in some places they actually mean the older GPT-4 model.
Could you do a tutorial on RAG (retrieval-augmented generation)? I think it'll be a pretty massive thing in agentic architecture. Also, I think RAG might soon cover more than just text and PDFs 😂 in the not-too-distant future.
It doesn’t change anything. LLMs are good at certain tasks (most of which aren’t as useful as we need, and most don’t help us earn money). AI has plateaued. They haven’t replaced software engineers.
This wasn't just released; it has been around for a while. Now that GPT-4o and Claude 3.5 Sonnet exist, things are much cheaper. I can understand using a local LLM alongside these two, but overall the cost savings are not as big a deal as before.
While this looks promising, it is just a router that forwards simple queries to weak models while forwarding hard queries to strong models. This assumes that the queries can be divided between strong and weak models. If your work is truly intensive, I don't see much reduction here as it still requires querying strong models most of the time.
Dear Matt, you could create a presentation to show all this rather than just reading from the website. Please take a bit more time when creating videos; I watch your videos to learn things faster, not the opposite. Thank you
90% quality and 80% cheaper? I'm actually not sure if I should be impressed or not. Sure, on the surface that seems like a small decrease in quality for massively reduced cost, but isn't it normal for that last ~10% quality to be a lot harder to achieve? I think I'd be more impressed to see a model that's just 5% better quality for an 80% increase in cost.
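Taking the comment's figures at face value, a quick quality-per-dollar check shows why the first trade looks good on paper and the hypothetical second one doesn't (treating "quality" as a linear score, which is itself the comment's open question):

```python
# Quality-per-dollar comparison using the figures in the comment.
baseline_quality, baseline_cost = 1.00, 1.00

router_quality, router_cost = 0.90, 0.20      # "90% quality, 80% cheaper"
premium_quality, premium_cost = 1.05, 1.80    # "5% better, 80% pricier"

print(round(router_quality / router_cost, 2))    # 4.5
print(round(premium_quality / premium_cost, 2))  # 0.58
```

Of course, if that last ~10% of quality is what your task actually needs, quality-per-dollar is the wrong metric - which is the commenter's point.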
Batch API - 50% cheaper! Can you, or have you, talked about batch processing? I see OpenAI is 50% cheaper for batch processing on all models, input and output. These are asynchronous groups of requests that don't require immediate turnaround; they clear within 24 hours and come with higher rate limits. Links are available on the OpenAI API pricing page.
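The Batch API takes a JSONL input file where each line is one request with a unique `custom_id`. A sketch of building that file locally (the model name and questions are just examples; the upload and submission steps need an API key and are omitted - see OpenAI's batch docs for those):

```python
import json

# Build a batch input file for OpenAI's Batch API: one JSON request
# per line. This file would then be uploaded and submitted as a batch
# with a 24-hour completion window.

questions = ["What is 2+2?", "Name a prime number."]

lines = [
    json.dumps({
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # example model
            "messages": [{"role": "user", "content": q}],
        },
    })
    for i, q in enumerate(questions)
]

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))
```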