Wow. I am impressed to find a genuinely useful AI-related channel. I mean, you show things running with your code, you state real problems you find, and you discuss your own results. Please continue with that 🙏 and thank you very much!
It would be cool if they had an option to load multiple models at the same time (assuming there's enough RAM/VRAM). The current workaround is to dockerize an Ollama instance and run multiple of them on the same GPU.
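For anyone curious, a rough sketch of that workaround using the official ollama/ollama Docker image; the container names, volume names, ports, and models are just examples:

```
# Two Ollama containers sharing one GPU, each on its own host port
docker run -d --gpus=all -v ollama_a:/root/.ollama -p 11434:11434 --name ollama_a ollama/ollama
docker run -d --gpus=all -v ollama_b:/root/.ollama -p 11435:11434 --name ollama_b ollama/ollama

# Pull a different model into each instance
docker exec ollama_a ollama pull llama2
docker exec ollama_b ollama pull mistral
```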
Thank you for another very informative video. It would indeed be cool to hear more about using Ollama and local LLMs with AutoGen and for a fully local RAG system.
Great canter through the recent updates. Have to say I am a fan of Ollama and have switched to using it almost exclusively in projects now, not least as it's easier for others in my team to pick up. Really short learning curve to get up and running with local LLMs.
Please create a video about hosting an LLM server with Ollama on Google Colab (free T4) available via API. That might be a cost-efficient way of hosting "local" models.
Essentially this is llama.cpp embedded in Go, but strangely it cannot handle concurrency. I love Ollama and use it a lot, but to run it in a production setting you basically have to spin up multiple Ollama servers, each of which can take a queue. In other words, a load balancer setup with nginx or something.
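A rough sketch of what that nginx setup could look like, assuming two Ollama instances on local ports (ports and timeouts are illustrative):

```
upstream ollama_pool {
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
}

server {
    listen 8080;
    location / {
        proxy_pass http://ollama_pool;
        proxy_read_timeout 600s;   # generation can take a while
    }
}
```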
I just noticed some of these updates yesterday, and they let me simplify some bits of my stack and remove litellm. It's actually kind of scary how quickly all of this stuff is becoming commodity parts.
@samwitteveenai I'll throw in a suggestion: using DSPy for an LLM agent with tool usage! IMO DSPy seems really powerful for bootstrapping examples for optimal answers. Let's say we have an LLM agent that serves five or six different main purposes, with one or two functions for each purpose. If we could use DSPy to optimize the pipeline for each of those purposes, it would be amazing.
Ollama is awesome, however there are some minor issues with it:
1. Single threaded, so it can't run on a server serving a single URL to the team. It's a big issue; I don't want everyone on my team to have to install Ollama on their machine.
2. With streaming responses it's not easy to create a client app, as the response format isn't the same as OpenAI's.
3. CORS issues, so you need a wrapper around the APIs, which means you have to install Ollama and an API wrapper on every machine.
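A minimal sketch of the kind of wrapper mentioned in point 3: a small CORS-enabled proxy sitting in front of a local Ollama server. The port is Ollama's default; the proxy itself is hypothetical and ignores streaming responses:

```python
# Hypothetical CORS proxy in front of a local Ollama server (default port 11434).
import httpx
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import Response

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],   # relax for browser clients; tighten in production
    allow_methods=["*"],
    allow_headers=["*"],
)

OLLAMA_URL = "http://localhost:11434"

@app.post("/api/{path:path}")
async def proxy(path: str, request: Request):
    # Forward the request body to Ollama unchanged and return its response.
    body = await request.body()
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{OLLAMA_URL}/api/{path}", content=body)
    return Response(content=upstream.content, media_type="application/json")
```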
Heh, run ollama run llama-pro:text "what are you" about 10 times and confirm that I'm not going crazy, it's the model... that thing is outputting its fine-tuning data verbatim, AFAIK.
Maybe if you actually take the time to check for yourself, you'll notice that there is a web interface available; you just need to point it to your Ollama instance. It's exactly the same as ChatGPT, actually it's even better 🙃
Does Ollama support the same grammar specification that restricts your output, the way llama.cpp does? That’s a great feature which I’ve used in a project recently to force JSON output.
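As far as I know, Ollama doesn't expose llama.cpp's full GBNF grammars, but it does have a JSON mode on its REST API. A minimal sketch, assuming a local server on the default port (the model name is just an example):

```python
# Minimal sketch of forcing JSON output via Ollama's /api/generate endpoint.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "List three colours under the key 'colours'. Respond in JSON.",
        "format": "json",   # constrains the output to valid JSON
        "stream": False,
    },
)
data = resp.json()
print(json.loads(data["response"]))
```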