It was fascinating to hear a behind-the-scenes perspective on the development of the transformer architecture. Aidan is so humble and unassuming. Very impressive.
I get nervous when I hear (starting at 48:08) discussions around "outsourcing" mundane intellectual activity. There is no doubt that I was a better mathematician / math enthusiast when I had limited access to computing devices and unfettered access to the math stacks at a university library. I don't believe there is a shortcut to internalizing and understanding the patterns in rote processes without actually carrying out the rote processes. I would never have gotten through Algebra without thousands of hours of elementary computation in grammar school and high school that helped my mind intuit patterns and primed me for the theorems that formalized them. That served as a foundation to iterate between increasingly complex rote processes and increasingly abstract patterns, formalized by increasingly bonkers theories. All this to say: if we give up the mundane tasks, we hand the greatest part of learning over to the imperfect neural net/model. IMHO.
I think the best learning of new knowledge (knowledge that is new to you) happens with teachers, though, in a conversational format. Becoming good at a skill of course comes through practice, but there's a reason people still pick uni over watching videos. Being able to query back and forth on a topic is immensely useful not only in learning but in understanding and applying.
@@edz8659 Great observation. Videos ≠ uni. The best of my learning happened at a dusty grad-student chalkboard, scribbling thoughts with fellow grads after a mind-imploding lecture. Ultimately, the key ingredients: a great mentor/teacher, great peers, and dedicated time on task.
"Relief" is the first thing that came to mind when I saw this video pop up. I was already afraid that MLST had gone on hiatus. It's still one of the most underrated YT channels out there given the production and content quality IMO! If nothing else, future generations will be thankful.
I wonder what the transformer could do with the information JWST collects; just thinking of this model blows my mind. Or what about a language model that represents all the particle data something like CERN has collected over the last 20 years, down to the cookie crumbs... "emergent" is the right word.
HF is a different category entirely, i.e. deploying your own, much smaller models (in most cases supervising the deployment yourself; you need to know what you are doing). Because the models are smaller and less generalisable, they are fine-tuned for specific tasks, rather than the LLM/Cohere idea that you can use a single planetary-scale LLM to do anything. OpenAI is the only other player in this game at this scale that I am aware of. It's certainly true that most things you need to do as an app developer are narrowly defined (e.g. NER), so simple pointillistic models would work... but then you have the ML DevOps / engineering problem, which blows up fast with many models in an app platform. Imagine if a lot of that complexity melted away because it was all just a single LLM.
@@MachineLearningStreetTalk Open-source models are free, and just as with Stable Diffusion, there will soon be plenty of large open-source language models (see BLOOM). For 99% of business use cases (not blogspam generation), a smaller fine-tuned model will perform much better than a huge general model trained on random web data. Cohere will have to learn the hard way that companies with people who know how to use the outputs of machine learning models usually also have people capable of training their own models on domain-specific data, and that, given the cost of hosting these large models, it will make more sense for most of their large customers to develop things in house. Deep learning research being so pro open source also means that in 6-12 months, new methods will be available to everyone that are as good as whatever they spent building internally. Also, as you mentioned in the interview, the latency of sending all of your data over HTTP to an external service is suboptimal; you'll get much better throughput by doing batch inference in your own VPC, or save money by moving the model to customer devices (obviously not trillion-parameter language models).
Thanks for the thoughtful response! It will be interesting to see how this pans out. I agree that Stable Diffusion was a watershed moment, i.e. the community suddenly got a very capable generative vision model to play with (although there has recently been a lot of drama, as you will know from watching Yannic's videos!). Currently, public LLMs are nowhere near as good, and you need to know what you are doing, i.e. do the engineering yourself, be a good software engineer, and create a pointillistic solution. You may well be right that a similar watershed moment will arrive soon for LLMs, but it's definitely not here yet. The interesting story for me is that large-scale, generalisable LLMs might remove some of the complexity of software engineering, as it becomes a matter of prompt engineering and language becomes the interface.
@@MachineLearningStreetTalk A big hurdle is getting business people and web/mobile devs comfortable dealing with functions that output the wrong/stupid answer a decent percentage of the time. We went through the same thing with CV at Clarifai: CNNs were hard to train and deploy in 2014, but that didn't last long, and every summer there were new ImageNet results coming from teams all over the world, something no single team can stay ahead of. Back then, staying ahead was also way easier, because machine learning was a much smaller field and things like TensorFlow and PyTorch didn't exist.