Lighthearted bite-sized ML videos for your AI Coffee Break! 📺 Mostly videos about the latest technical advancements in AI, such as large language models (LLMs), text-to-image models and everything cool in natural language processing, computer vision, etc.!
We try to post twice a month, if not more often! 🤞 But, you know, a PhD thesis still has to be worked on.
Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer.
I understand completely nothing. I was focused on this guy's aspergery passion. How can I work with such cyborgs? :D Sorry, I am a hater here, but I started questioning my role on the job market after watching it.
It's done by adding an extra image to the generated image during inference. This extra image is computed as the gradient of CLIP's output with respect to the image. It's a bit like DeepDream, if you are old enough to know about it.
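A rough sketch of that guidance idea, with a toy differentiable score standing in for CLIP (the real method backpropagates through the CLIP model itself; the shapes, step size, and target here are arbitrary):

```python
import numpy as np

# Toy stand-in for CLIP's image-text similarity: just the negative squared
# distance to a "target" image, so the gradient is analytic. Real CLIP
# guidance would backpropagate through the CLIP network instead.
def toy_clip_score(img, target):
    return -np.sum((img - target) ** 2)

def toy_clip_grad(img, target):
    return -2.0 * (img - target)  # d(score)/d(img)

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))   # the "generated" image being denoised
target = np.ones((8, 8))            # an image the score function likes

# Guidance: at each inference step, add a small multiple of the gradient
# of the score w.r.t. the image -- the "extra image" mentioned above.
for _ in range(100):
    img = img + 0.05 * toy_clip_grad(img, target)

print(round(toy_clip_score(img, target), 4))  # close to 0: score maximized
```

After enough steps the image drifts toward whatever the score function rewards, which is exactly how CLIP can steer a generator toward a text prompt.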
Hello! I'm a researcher in the NLP field, and despite reading the various papers on CoT, this whole "keep some secrets for competitive advantage" attitude really bothers me and makes understanding the advancements really complex. This video was able to remove some of the forced opaqueness behind the latest o1 developments, so thank you very much!
It is a shame that the video does not mention our paper, which first showed that LLMs can easily learn impossible languages (see the reference below; Chomsky does cite our paper in his NY Times article). And as noted by @TheRyulord, the models in Kallini et al. did learn the impossible languages almost as easily (other than the random one, which is not an impossible language; it is not a language at all, just a random bag of words). Add this to the fact that LLMs need to be trained on orders of magnitude more data than humans, and the findings are perfectly consistent with Chomsky, that is, with the claim that the brain has priors that make it easier for humans to learn language with minimal data, at the cost that they can only learn some types of languages. Mitchell, J., & Bowers, J. (2020, December). Priorless recurrent networks learn curiously. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 5147-5158). aclanthology.org/2020.coling-main.451/
Thanks for the review! The scariest thing for me personally is that as the models' capabilities improve, the hallucinations become more and more convincing, and this is what worries me.
I make all the visuals (including the drawings) in PowerPoint. 😅 I use Adobe Premiere Pro for editing (this is also the stage where Ms. Coffee Bean comes into the picture).
@@AICoffeeBreak This is so impressive. I couldn't even imagine that you could do that many visuals using PowerPoint. You must be at guru level in PowerPoint. Maybe that's a topic for another video to make. :)
In a standard MLP network, we always multiply a weight W with a data item X. We never multiply two data items X1 and X2. That's also the case for a convolutional network, right? However, in an attention layer, when we compute the dot product of key and query vectors, we multiply data elements with each other. That (partially) justifies calling attention layers "dynamic".
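That contrast can be made concrete in a few lines of NumPy (shapes and weight matrices here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # 4 tokens, dimension 8 -- the data
W = rng.standard_normal((8, 8))    # learned weights

# MLP / conv layer: every multiplication pairs a weight with a data value.
mlp_out = X @ W

# Attention: queries and keys are both computed FROM the data, so the
# score matrix multiplies data-derived vectors with each other.
Wq, Wk = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
Q, K = X @ Wq, X @ Wk
scores = Q @ K.T                   # (4, 4): token-to-token, data x data

print(mlp_out.shape, scores.shape)
```

The (4, 4) score matrix depends on the input on both sides of the product, which is what makes attention input-dependent ("dynamic") in a way a fixed weight matrix is not.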
"The most important thing about transformers" 3:10 Exactly, and *nobody* talks about this. The number of parameters is independent of sequence length. BTW, transformers are quadratic in time but linear in space -- see the FlashAttention paper.
Time is quadratic, but memory is linear -- see the FlashAttention paper. But the number of parameters is constant -- that's the magic! Thanks for the excellent videos! 👍
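A toy illustration of that point (the model dimension is chosen arbitrarily): the weight count is fixed once the model dimension is picked, while the naive attention score matrix grows with the square of the sequence length.

```python
d = 16  # model dimension (toy value)

# Self-attention weights: Wq, Wk, Wv, Wo. None of them depends on the
# sequence length, so the parameter count is fixed once d is chosen.
n_params = 4 * d * d

for seq_len in (8, 64, 512):
    scores_size = seq_len * seq_len  # entries in the naive attention score matrix
    print(seq_len, n_params, scores_size)
# n_params stays 1024 for every length; scores_size grows quadratically.
```

This is why the same transformer weights can, in principle, be applied to sequences of very different lengths.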
Hi Letitia. First, I want to thank you for such a great video, I really loved it. :) Second, could you please explain for a dummy the logic behind the numbers you presented? For instance, according to the example in the video, the word "Queen" has been vectorized into the sequence 0.33 | 0.71 | 0.91 | 0.23 | 0.15. Then, when you showed the sin/cos functions applied, you ended up with the sequence 0.2 | 0.7 | 0.1 | 0.99 | 0.01 as the positional vector for "Queen". And this is where I get completely lost: I don't understand the math behind it. How did you come up with these numbers for the positional vector, given the "Queen" vector? And second, in your example you started from the end of the word; is there any reason for exactly this order (starting from the end), or is it a made-up example? I apologize for the dummy question, but I really would like to jump into the nitty-gritty of the real calculus. I would really appreciate it if you could provide more details. Best regards, Maks.
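For what it's worth, the standard sinusoidal formula from "Attention Is All You Need" can be sketched in a few lines; the specific numbers shown in a video like this are usually illustrative rather than computed from the formula:

```python
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal encoding from 'Attention Is All You Need':
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / 10000 ** (i / d_model)
        pe[i] = np.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(angle)
    return pe

# Position 0 always encodes to [0, 1, 0, 1, ...]:
print(np.round(positional_encoding(0, 6), 2))  # [0. 1. 0. 1. 0. 1.]
print(np.round(positional_encoding(3, 6), 2))  # encoding for position 3
```

Each dimension oscillates at a different frequency, so every position gets a distinct, smoothly varying fingerprint that the model can add to the word vector.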
I am sorry, but this prof was smart enough to take you on as a PhD student and make student life at the institute better, beyond the academic value you added to their team! <3 Congrats again!
Hi! I have a question. Can I use multimodal transformers for inputs that are only text? I didn't find any method to feed my multiple text inputs separately into the pipeline other than multimodal transformers. The reason is that I have some primary and secondary text input columns in my dataset.
Multimodal transformer encoders such as ViLBERT are by now quite an old paradigm, and they are meant to take both image and text as input. However, the text-only model BERT can accept two inputs with a separator token between the two sentences. Modern LLMs (transformer decoders) can also accept multiple text inputs if you make clear in the prompt how exactly they differ. This is all I can say from the insight into your problem I got from your description.
@@DerPylz Based on the bibliographic research I have done, I don't need to go through the video. I maintain that these algorithms "don't think," and, to be more precise, researchers in this area should start using different terminology, because that's what creates the hype and misinformation.
I couldn't find a satisfying explanation anywhere. This video finally made me understand things in a bit more detail, especially the use of sine and cosine functions across multiple dimensions. Thank you! You're awesome.
I already did this today using Claude 3.5 Sonnet and very good prompting. Claude can reason and think through common-sense questions, zero-shot, if you know HOW to prompt-engineer. I'm not seeing any huge leaps that justify paying OpenAI extra money.
Great video as always, Letitia! I really liked your phrasing around describing how these o1 models “think” in terms that don’t needlessly anthropomorphize the models. I think some of the hype is justified, but it’s nice to also have that important grounding around how this is more of a preview than anything else while also having some fundamental limitations.
I got excited about o1 yesterday because it happened to solve the particular problem I threw at it really well. Silly me for falling for the sample-of-one trap again. Since LLMs are still at the core, "Jagged Intelligence" still applies. Thanks for reminding me to stay skeptical :)
@@therainman7777 Do you have a source for that claim? My understanding is that the jagged boundary just changes; it's not necessarily getting flatter. In other words, larger, newer models sometimes fail at tasks that smaller, older models did better at, and it's not predictable, so you really have to test each new model for your specific task.
@@BrianPeiris Yes, that is true in certain cases, but those cases are the minority, not the majority. The simplest source to look at for this claim is the fact that on every benchmark we have and care about, the newer generation of models performs better than the previous generations. Sometimes slightly better, and sometimes much better. But the fact that performance on all these benchmarks is improving over time indicates that the "jagged edge" is not merely changing shape, but becoming less and less prevalent. If it were not, then newer models would not be able to outperform the previous generation of models to a statistically significant degree, across thousands of questions on dozens of benchmarks. In addition to benchmarks, there are also less standardized evaluation measures, such as the Elo scores on LMSys Chatbot Arena. In virtually all cases, performance is increasing over time.
@@therainman7777 I suppose in the end what I care about is whether these LLMs, or optimized chain-of-thought systems, can actually reason in a generalized way. So, despite the improvements in the usual benchmarks, if the most advanced systems still have a jagged edge or fundamental lack of true reasoning that causes them to fail on simple tasks, I wouldn't classify them as improving on what actually matters. If you watch Chollet's recent AGI-24 talk on the ARCprize channel, he makes a distinction between skill and intelligence. Most benchmarks measure skill. You can train for skill without actually improving on general intelligence. So ultimately this method of AI is fundamentally incapable of surpassing its training. There is a hard limit on LLM-derived systems. They can be useful for some applications, but if we care about general intelligence, we have to look elsewhere for solutions.
Such a great format of video! Very short, and I love it more than 2 Minute Papers because of the author's explanations. Now to the questions: Great insight on words and tokens. Does this mean that we need bigger models that will learn on their own how to transform words into tokens? From my understanding, words are just less computationally easy tokens. So if we do go to 400B or larger models, following the idea that "size is all you need", maybe it will work. I wonder what the performance of the largest published model would be.
Thanks for the kind words! Now, to your question: Yes, we already kind of have such models, which are tokenizer-free and work directly on character or byte representations, but this makes the input sequence length grow much bigger. So, models like MAMBA or long-context transformers could eventually, if big enough, surpass existing tokenizer-based models. But we have not seen that happen yet at GPT-4 scale. I do not know if they even tried that path.
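A quick illustration of that sequence-length blow-up for tokenizer-free (byte-level) input, using whitespace words as a rough stand-in for subword tokens:

```python
text = "The quick brown fox jumps over the lazy dog."

word_tokens = text.split()                 # rough stand-in for a subword tokenizer
byte_tokens = list(text.encode("utf-8"))   # tokenizer-free: one token per byte

print(len(word_tokens), len(byte_tokens))  # 9 vs 44: much longer sequences
```

With quadratic attention, roughly 5x longer sequences mean roughly 25x the compute for the same text, which is why byte-level models lean on efficient architectures.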
I agree. But I am sure that some sentences like this were in there. Every LLM today trains on Wikipedia, and they must have seen the entries on Douglas Hofstadter and his books.
@@AICoffeeBreak Adding I Am a Strange Loop to my reading list. I agree that it is in Wikipedia, but how do you explain "how many R's in strawberry", if not as a tokenization bug? That suggests we need a different kind of model (or tokenizer) to solve this.
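The tokenization explanation can be made concrete (the chunking below is hypothetical; real BPE merges vary by tokenizer):

```python
# Hypothetical BPE-style split of "strawberry" -- real tokenizers differ,
# but the word is typically NOT split into individual letters.
token_chunks = ["str", "aw", "berry"]

# Counting letters is trivial on characters...
assert "strawberry".count("r") == 3

# ...but a model only "sees" opaque chunk ids, never the letters inside
# each chunk, so it would have to memorize per-chunk letter counts.
print(sum(chunk.count("r") for chunk in token_chunks))  # 3
```

The letter information is technically recoverable from the chunks, but the model never observes characters directly, which is a plausible reason letter-counting questions trip LLMs up.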
It's really, really difficult to design any system with even close to the degrees of freedom of even the simplest transformer without it being "Turing complete". My guess is transformers have the special property of being complete in the most frustrating and useless way possible.
Training on data exactly like this might make models perform better on these examples. But I doubt these models didn't see any self-referential statements somewhere in their training data.
This was very well explained, thank you so much! I couldn't help but wonder: since these models are trained on data, how much data do you need for a model to be reasonably accurate? A quick Google search told me that GPT-4 was trained on approximately 500,000,000,000 pages of text, which is absolutely insane to me! I want to know whether there are models we can develop that train on less data but still provide accurate results, and what these models would look like.
Thanks a lot, especially since this is a very old video. We have made a new transformer explainer: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ec9IQMiJBhs.html About your question: Unfortunately, in deep learning, great performance comes with big data, because the models only work well in the domains and on the kinds of data they have seen so far (in distribution). And the motto is: nothing will be out of distribution if we include the entire world in the training data, no? (This is a tongue-in-cheek comment, just flagging it.) 😅 So, if you are willing to sacrifice a lot of performance, there are models that can work with less data, going back to older NLP based on word embeddings and tf-idf representations. But I cannot say more until I know your specific use case. If you want a chatbot that can talk about almost anything, then you need trillions of tokens of text; at least this is what we learned from ChatGPT et al.
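As a concrete example of those lighter-weight representations, a bare-bones tf-idf weighting needs no neural network at all (toy documents, standard log-idf variant):

```python
import math

# Three toy "documents"; in practice these would be your text columns.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency in this doc
    df = sum(term in d for d in tokenized)          # docs containing the term
    idf = math.log(N / df)                          # rarer terms weigh more
    return tf * idf

print(round(tf_idf("cat", tokenized[0]), 3))  # 0.183: rare, informative word
print(round(tf_idf("the", tokenized[0]), 3))  # 0.135: common, less informative
```

Such sparse vectors feed classic classifiers (logistic regression, SVMs) that can work with far less data than an LLM, at the cost of much weaker generalization.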
Oh wow, I didn't even realize I was on the older video, I will definitely check out the new one, and thanks for your answer! The motivation for my question was that since we typically don't have a lot of data on endangered languages, could there be language models that produce helpful results in these languages despite the lack of data? I guess the broader question would be: what kinds of language models could we apply to endangered languages, for things such as documenting them or aiding that kind of research?
@@ChengyiLi-t1k I'm not an expert in multilingual AI, but I have heard from experts in the field. Your question reminds me of two points. * In multilingual AI, people still try to scrape all the monolingual data they can, automatically produce back-translations, and then train a multilingual model that hopefully transfers its knowledge from high-resource languages to the low-resource one. But you need some decent amount of data from every language you aim to learn. We've made a video on this approach, find the link in the description. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-1gHUiNLYa20.html * If you have a very powerful model, of the class of GPT-4, Gemini, et al., then you can hope that the representations it already has are strong enough to be elicited with few-shot prompting. So, if you have the context length of Gemini, of multiple millions of input tokens, then you can many-shot a language from scratch by feeding in its dictionary and a grammar book. This is what Gemini 1.5 did for Kalamang: www.reddit.com/r/singularity/comments/1arla9z/gemini_15_pro_can_learn_to_zero_shot_translate_of/ It was meant as an out-of-distribution test, because the authors were sure that there is no trace of Kalamang on the internet that Gemini was trained on.