AI Coffee Break with Letitia
Lighthearted bite-sized ML videos for your AI Coffee Break! 📺 Mostly videos about the latest technical advancements in AI, such as large language models (LLMs), text-to-image models and everything cool in natural language processing, computer vision, etc.!

We try to post twice a month, if not more often! 🤞 But, you know, there is still a PhD thesis to be worked on.

Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer.


Impressum: aicoffeebreak.com/impressum.html
Comments
@MariaM-pu4fx
@MariaM-pu4fx 13 hours ago
LOVE IT. Dr., any idea on research in this area?
@MariaM-pu4fx
@MariaM-pu4fx 13 hours ago
I understand completely nothing. I was focused on this guy's aspergery passion. How can I work with such cyborgs :D Sorry, I am a hater here, but I started questioning my role on the job market after watching it.
@MariaM-pu4fx
@MariaM-pu4fx 13 hours ago
Changed my mind, I like the explanation. Sorry for my ADHD.
@declan6052
@declan6052 2 days ago
At 13:14: is this "CLIP-guided diffusion" done by adding a term to the loss function, or via a different method?
@AICoffeeBreak
@AICoffeeBreak 2 days ago
It's done by adding an extra image to the generated image during inference. This extra image is the gradient of CLIP's output (the image-text similarity) with respect to the generated image. It's a bit like Deep Dream, if you are old enough to know about it.
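For readers who want the mechanics, here is a minimal PyTorch sketch of that idea. All names (`denoise`, `clip_image_encoder`, `text_embedding`) are hypothetical placeholders, not any particular library's API, and real implementations differ in details (e.g., whether CLIP sees the raw noisy image or a denoised estimate of it):

```python
import torch

def clip_guided_step(x_t, t, denoise, clip_image_encoder, text_embedding,
                     guidance_scale=100.0):
    """One denoising step with CLIP guidance (illustrative sketch only)."""
    x_t = x_t.detach().requires_grad_(True)

    # CLIP similarity between the current image and the encoded text prompt.
    image_embedding = clip_image_encoder(x_t)
    similarity = torch.cosine_similarity(image_embedding, text_embedding, dim=-1).sum()

    # Gradient of CLIP's output w.r.t. the image: which pixel changes
    # make the image match the prompt better?
    grad = torch.autograd.grad(similarity, x_t)[0]

    # Ordinary diffusion step, then nudge the result along the CLIP gradient.
    x_prev = denoise(x_t, t)
    return x_prev.detach() + guidance_scale * grad
```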
@SandhyaPatil20
@SandhyaPatil20 3 days ago
❤ I am a fan ❤
@AICoffeeBreak
@AICoffeeBreak 3 days ago
Thank you!
@yannickpezeu3419
@yannickpezeu3419 3 days ago
Do we have an evaluation of the difficulty of learning each human language, in terms of FLOPs needed to achieve a given perplexity?
@joelbelafa
@joelbelafa 3 days ago
Yep, still there. I assume an MoE would make one's life a bit difficult when hacking the token bias feature. Correct?
@AICoffeeBreak
@AICoffeeBreak 3 days ago
Great point.
@enkk
@enkk 4 days ago
Hello! I'm a researcher in the NLP field, and despite reading the various papers on CoT, this whole "keep some secrets for the competitive advantage" business really bothers me and makes understanding the advancements really complex. This video was able to remove some of the forced opaqueness behind the latest o1 developments, so thank you very much!
@AICoffeeBreak
@AICoffeeBreak 3 days ago
Thank you!
@anomanees
@anomanees 6 days ago
At 3:45, 256x256 = 65536, so wouldn't we have that many tokens, instead of 63504? Very nice video, thank you!
@AICoffeeBreak
@AICoffeeBreak 3 days ago
Yes, indeed. Thanks for noticing!
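For the record, the arithmetic: 256 × 256 = 65536, whereas 252 × 252 = 63504. Since 63504 happens to be exactly 252 squared, the figure in the video was likely computed from a 252x252 grid rather than 256x256.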
@b_01_aditidonode43
@b_01_aditidonode43 7 days ago
Such an incredible explanation!
@AICoffeeBreak
@AICoffeeBreak 5 days ago
Thank you, it's great to hear this, especially about such an old video!
@jeffbowers7899
@jeffbowers7899 10 days ago
It is a shame that the video does not mention our paper, which first showed that LLMs can easily learn impossible languages (see the reference below; Chomsky does cite our paper in his NY Times article). And as noted by @TheRyulord, the models in the Kallini et al. paper did learn the impossible languages almost as easily (other than the random one, which is not an impossible language; it is not a language at all, just a random bag of words). Add this to the fact that LLMs need to be trained on orders of magnitude more data than humans, and the findings are perfectly consistent with Chomsky. That is, the claim is that the brain has priors that make it easier for humans to learn language with minimal data, at the cost that they can only learn some types of languages. Mitchell, J., & Bowers, J. (2020, December). Priorless recurrent networks learn curiously. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 5147-5158). aclanthology.org/2020.coling-main.451/
@harumambaru
@harumambaru 12 days ago
Thanks for the review! The scariest thing for me personally is that, with the improving capabilities of the models, the hallucinations are becoming more and more convincing, and this is what worries me.
@harumambaru
@harumambaru 12 days ago
7:50 what is thoot anyhow? :)
@AICoffeeBreak
@AICoffeeBreak 12 days ago
thooot! 🤣
@AICoffeeBreak
@AICoffeeBreak 12 days ago
Nothing, just my mouth doing random things.
@AbdallahAbdallah
@AbdallahAbdallah 13 days ago
What software and video editing tools do you use for creating this great content?
@AICoffeeBreak
@AICoffeeBreak 11 days ago
I make all the visuals (including the drawings) in PowerPoint. 😅 I use Adobe Premiere Pro for editing (this is also the stage where Ms. Coffee Bean comes into the picture).
@AbdallahAbdallah
@AbdallahAbdallah 11 days ago
@@AICoffeeBreak This is so impressive. I couldn't even imagine that you could do that many visuals using PowerPoint. You must be guru level in PowerPoint. Maybe this is the topic of another video to make :)
@AICoffeeBreak
@AICoffeeBreak 11 days ago
@AbdallahAbdallah
@davidespinosa1910
@davidespinosa1910 14 days ago
The octopus just needs more training data! 🙂
@davidespinosa1910
@davidespinosa1910 14 days ago
Transformer memory space is linear, not quadratic -- see the FlashAttention paper.
@AICoffeeBreak
@AICoffeeBreak 12 days ago
True with FlashAttention! Yes.
@davidespinosa1910
@davidespinosa1910 14 days ago
In a standard MLP network, we always multiply a weight W and a data item X. We never multiply two data items X1 and X2. That's also the case for a convolutional network, right? However, in an attention layer, when we compute the dot product of key and query vectors, we multiply data elements with each other. That (partially) justifies calling attention layers "dynamic".
@AICoffeeBreak
@AICoffeeBreak 11 days ago
Great point!
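A minimal numeric illustration of that contrast (illustrative PyTorch, not any particular model's internals):

```python
import torch

n, d = 5, 8                       # sequence length, feature dimension
X = torch.randn(n, d)             # data

# MLP / conv layer: learned weight times data; two data items are never
# multiplied with each other.
W = torch.randn(d, d)             # learned parameter
mlp_out = X @ W                   # weight x data

# Attention scores: (X W_q) @ (X W_k)^T multiplies two data-derived
# matrices with each other, so the resulting n x n mixing weights depend
# on the input itself -- one sense in which attention is "dynamic".
W_q, W_k = torch.randn(d, d), torch.randn(d, d)
scores = (X @ W_q) @ (X @ W_k).T / d ** 0.5   # data x data, shape (n, n)
```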
@davidespinosa1910
@davidespinosa1910 14 days ago
"The most important thing about transformers" 3:10 Exactly, and *nobody* talks about this. The number of parameters is independent of sequence length. BTW, transformers are quadratic in time but linear in space -- see the FlashAttention paper.
@davidespinosa1910
@davidespinosa1910 15 days ago
Time is quadratic, but memory is linear -- see the FlashAttention paper. But the number of parameters is constant -- that's the magic! Thanks for the excellent videos! 👍
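Since the linear-memory point comes up in several comments here: the trick is to stream over key/value blocks with an online softmax, so the n x n score matrix is never materialized. Below is a simplified PyTorch sketch of that idea; the real FlashAttention kernel also tiles the queries and runs fused on-chip, so treat this as a didactic approximation:

```python
import torch

def attention_streaming(Q, K, V, block=128):
    """Online-softmax attention in O(n) extra memory: the n x n score
    matrix is never materialized. A didactic sketch of the FlashAttention
    idea, not the actual fused kernel."""
    n, d = Q.shape
    scale = d ** -0.5
    out = torch.zeros_like(Q)                    # running weighted sum of values
    row_max = torch.full((n, 1), float("-inf"))  # running max score per query
    row_sum = torch.zeros(n, 1)                  # running softmax denominator

    for start in range(0, n, block):             # stream over key/value blocks
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale              # shape (n, block), never (n, n)

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        rescale = torch.exp(row_max - new_max)   # correct the old accumulators
        p = torch.exp(scores - new_max)

        row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
        out = out * rescale + p @ Vb
        row_max = new_max

    return out / row_sum

# Sanity check against the quadratic-memory reference:
Q, K, V = (torch.randn(1000, 64) for _ in range(3))
reference = torch.softmax((Q @ K.T) / 64 ** 0.5, dim=-1) @ V
assert torch.allclose(attention_streaming(Q, K, V), reference, atol=1e-4)
```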
@AmitSheth
@AmitSheth 16 days ago
Is "chain of thought" close in any way to "thinking" in general?
@MaxZaikin
@MaxZaikin 16 days ago
Hi Letitia. First, I want to thank you for such a great video, I really love it :) Second, could you please explain for a dummy the logic behind the numbers you presented? For instance, according to the example in the video, the word Queen has been vectorized into the following sequence: 0.33 | 0.71 | 0.91 | 0.23 | 0.15. Then, when you showed the sin/cos functions applied, you ended up with the following sequence as the positional vector for the word Queen: 0.2 | 0.7 | 0.1 | 0.99 | 0.01. And this is where I get completely lost: I don't understand the math behind it. How did you come up with these numbers for the positional vector in accordance with the Queen vector? And second, in your example you started from the end of the word. Is there any reason for exactly this sequence (starting from the end), or is it a made-up example? I apologize for the dummy question, but I really would like to jump into the nitty-gritty of the real calculus. I would really appreciate it if you could provide more details. Best regards, Maks.
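The numbers shown in the video are illustrative rather than computed from the real formula, which is probably why they are hard to reproduce. For reference, the sinusoidal encoding from "Attention Is All You Need" depends only on the position and the dimension index, never on the word itself; the same vector is added to whichever word sits at that position. A minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angle = pos / 10000 ** (two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions get cosine
    return pe

# One row per position; row 0 is added to the first word, row 1 to the second, ...
print(sinusoidal_positional_encoding(4, 6).round(2))
```

As for reading the vector from the end: that was most likely just a presentation choice in the animation; the formula itself indexes dimensions left to right.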
@DaNa-px7ol
@DaNa-px7ol 16 days ago
I am sorry, but this prof was smart enough to take you on as a PhD student and make student life at the institute better, beyond the academic value that you added to their team! <3 Congrats again!
@AICoffeeBreak
@AICoffeeBreak 16 days ago
It's so heart-warming to hear about the students, thanks Dana!
@johnson8743
@johnson8743 17 days ago
Smartest person on YouTube.
@Arvolve
@Arvolve 17 days ago
Great video, thanks for sharing your insights and expertise! I just found your channel and subscribed.
@AICoffeeBreak
@AICoffeeBreak 17 days ago
Thank you, welcome!
@samyukthareddy7218
@samyukthareddy7218 17 days ago
Hi! I have a question. Can I use multimodal transformers for inputs which are text only? I didn't find any methods to take my multiple text inputs separately into the pipeline other than multimodal transformers. The reason is that I have some primary and secondary text input columns in my dataset.
@AICoffeeBreak
@AICoffeeBreak 16 days ago
Multimodal transformer encoders such as ViLBERT are by now quite an old paradigm, and they are meant to take both image and text as input. However, the text-only model BERT can accept two inputs with a separator token between the two sentences. Modern LLMs (transformer decoders) can also accept multiple text inputs if you make it clear in the prompt how exactly they differ. This is all I can say from the insight into your problem that I got from your description.
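For the BERT sentence-pair route, a minimal example with the Hugging Face transformers library (the column contents are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT-style sentence pairs: pass the two text columns as separate
# arguments; the tokenizer inserts the [SEP] separator itself.
enc = tokenizer("text from the primary column", "text from the secondary column")
print(tokenizer.decode(enc["input_ids"]))
# [CLS] text from the primary column [SEP] text from the secondary column [SEP]
```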
@jeffg4686
@jeffg4686 17 days ago
Starbucks sponsor? Coffee is not good, btw (I would have said it worse to a guy). It causes a huge number of health problems.
@Thomas-gk42
@Thomas-gk42 17 days ago
Reward? Like in human brains😁?
@MuammarElKhatib
@MuammarElKhatib 18 days ago
These algorithms "don't think."
@DerPylz
@DerPylz 17 days ago
That's what she explains at 1:35. And why she puts "think" in quotes.
@MuammarElKhatib
@MuammarElKhatib 16 days ago
@@DerPylz Based on the bibliographic research I have done, I don't need to go through the video. I echo that these algorithms "don't think," and to be more precise, researchers in that area should start using different terminology, because that's what creates the hype and misinformation.
@parthvashisht9555
@parthvashisht9555 18 days ago
I couldn't find a satisfying explanation anywhere. This video finally made me understand things in a bit more detail, especially the use of sine and cosine functions across multiple dimensions. Thank you! You're awesome.
@AICoffeeBreak
@AICoffeeBreak 18 days ago
@frommarkham424
@frommarkham424 19 days ago
4:39 mann the diminishing returns be hitting real hard today💀
@frommarkham424
@frommarkham424 19 days ago
3:22 thanks for the knowledge🙏we gonna make it out the data center with this tutorial🗣🗣🗣🗣
@mrd6869
@mrd6869 19 days ago
I already did this today using Claude 3.5 Sonnet and very good prompting. Claude can reason and think thru common sense questions, zero shot, if you know HOW to prompt engineer. I'm not seeing any huge leaps to justify paying OpenAI extra money.
@MaJetiGizzle
@MaJetiGizzle 19 days ago
Great video as always, Letitia! I really liked your phrasing around describing how these o1 models “think” in terms that don’t needlessly anthropomorphize the models. I think some of the hype is justified, but it’s nice to also have that important grounding around how this is more of a preview than anything else while also having some fundamental limitations.
@garyhuntress6871
@garyhuntress6871 19 days ago
Don't anthropomorphize the models, they hate it when you do that
@MaJetiGizzle
@MaJetiGizzle 19 days ago
@@garyhuntress6871 It’s not that people hate it. It’s just largely unhelpful and a crude/unfair comparison.
@garyhuntress6871
@garyhuntress6871 18 days ago
@@MaJetiGizzle FYI it may not have come across that my comment is the classic joke about anthropomorphism.
@BrianPeiris
@BrianPeiris 19 days ago
I got excited about o1 yesterday because it happened to solve the particular problem I threw at it really well. Silly me for falling for the sample-of-one trap again. Since LLMs are still at the core, "Jagged Intelligence" still applies. Thanks for reminding me to stay skeptical :)
@therainman7777
@therainman7777 14 days ago
The important point is that the intelligence is becoming less jagged over time, with each new breakthrough.
@BrianPeiris
@BrianPeiris 14 days ago
@@therainman7777 Do you have a source for that claim? My understanding is that the jagged boundary just changes; it's not necessarily getting flatter. In other words, larger, newer models sometimes fail at tasks that smaller, older models did better at, and it's not predictable, so you really have to test each new model for the specific task.
@therainman7777
@therainman7777 14 days ago
@@BrianPeiris Yes, that is true in certain cases, but those cases are the minority, not the majority. The simplest source to look at for this claim is the fact that on every benchmark we have and care about, the newer generation of models performs better than the previous generations. Sometimes slightly better, and sometimes much better. But the fact that the performance on all these benchmarks is improving over time indicates that the "jagged edge" is not merely changing shape, but becoming less and less prevalent. If it were not, then newer models would not be able to outperform the previous generation of models to a statistically significant degree, across thousands of questions on dozens of benchmarks. In addition to benchmarks, there are also less standardized evaluation measures, such as the Elo scores on the LMSys Chatbot Arena. In virtually all cases, performance is increasing over time.
@BrianPeiris
@BrianPeiris 14 days ago
@@therainman7777 I suppose in the end what I care about is whether these LLMs, or optimized chain-of-thought systems, can actually reason in a generalized way. So, despite the improvements on the standard benchmarks, if the most advanced systems still have a jagged edge or a fundamental lack of true reasoning that causes them to fail on simple tasks, I wouldn't classify them as improving on what actually matters. If you watch Chollet's recent AGI-24 talk on the ARCprize channel, he makes a distinction between skill and intelligence. Most benchmarks measure skill. You can train for skill without actually improving on general intelligence. So ultimately this method of AI is fundamentally incapable of surpassing its training. There is a hard limit on LLM-derived systems. They can be useful for some applications, but if we care about general intelligence, we have to look elsewhere for solutions.
@theosalmon
@theosalmon 19 days ago
As these things grow smarter and more capable, we will look to you for guidance and survival.
@CodexPermutatio
@CodexPermutatio 19 days ago
Reasoning is the key to unlock true intelligence. So, every little step counts!
@AhmetTungaBayrak
@AhmetTungaBayrak 19 days ago
Could using quantum superposition for probability distribution in text generating diffusion models be a thing?
@DerPylz
@DerPylz 19 days ago
"anyone who is not at the same level of maths as they were in their youth" - I feel seen 😅
@theshow3376
@theshow3376 19 days ago
I see you, I feel you.
@OnStageLighting
@OnStageLighting 19 days ago
Sadly, I have exactly the same level of maths as in my youth. It's still poor.
@Thomas-gk42
@Thomas-gk42 17 days ago
😂 @@OnStageLighting
@ZafyArimbola
@ZafyArimbola 21 days ago
Can his tutorial on the computational expressivity of language models be found on YouTube?
@AICoffeeBreak
@AICoffeeBreak 20 days ago
I'm afraid not. Only written form and slides here: acl2024.ivia.ch/ acl2024.ivia.ch/about
@harumambaru
@harumambaru 21 days ago
Such a great format of video! Very short, and I love it more than 2min papers because of the author's explanation. Now to the questions: Great insight on words and tokens. Does this mean that we need bigger models that will learn on their own how to transform words into tokens? From my understanding, words are just less computationally easy tokens. So if we do go to 400B or larger models, following the idea that "size is all you need", maybe it will work. I wonder what is the performance if largest model published would be
@AICoffeeBreak
@AICoffeeBreak 21 days ago
Thanks for the kind words! Now, to your question: Yes, we already kind of have such models that are tokenizer-free and work directly on characters or byte representations, but this makes the input sequence length grow much bigger. So, models like Mamba or long-context transformers could eventually, if big enough, surpass existing tokenizer-based models. But we have not seen that happen yet at GPT-4 scale. I do not know if they have even tried that path.
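A quick illustration of that sequence-length blow-up in plain Python (whitespace splitting stands in for a subword tokenizer here, for illustration only):

```python
text = "Tokenization is the root of many LLM quirks."

# Tokenizer-free (byte-level) input: one token per byte.
print(len(text.encode("utf-8")))   # 44 tokens

# A subword (BPE-style) tokenizer covers the same text in far fewer tokens;
# whitespace splitting is only a rough stand-in for one.
print(len(text.split()))           # 8 "tokens"
```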
@kathyh8047
@kathyh8047 21 days ago
I mean I'd definitely assume that sentences like these don't occur a whole lot in the training data
@AICoffeeBreak
@AICoffeeBreak 21 days ago
I agree. But I am sure that some sentences like this were in there. Every LLM today trains on Wikipedia, and they must have seen the entries on Douglas Hofstadter and his books.
@harumambaru
@harumambaru 21 days ago
@@AICoffeeBreak Adding I Am a Strange Loop to my reading list. I agree that it is in Wikipedia, but then how do you explain "how many R's in strawberry", if not with a tokenization bug? Doesn't that mean we need a different kind of model (or tokenizer) to solve this?
@AICoffeeBreak
@AICoffeeBreak 21 days ago
Agreed.
@anind3r
@anind3r 22 days ago
That E and F shapes test doesn't make sense unless fed as an image.
@kathyh8047
@kathyh8047 21 days ago
Well, it would still look different to the model from just feeding the sentences in as normal. I think most LLM tokenizers even preserve the line breaks.
@googleyoutubechannel8554
@googleyoutubechannel8554 22 days ago
It's really, really difficult to design any system with even close to the degrees of freedom that even the simplest transformer has without it being "Turing complete". My guess is transformers have the special property of being complete in the most frustrating and useless way possible.
@juanmanuelcirotorres6155
@juanmanuelcirotorres6155 22 days ago
Wow, I can't believe it, two of my idols together! Congrats, Tristan! Go Contextual!
@simonstorf7080
@simonstorf7080 22 days ago
Interesting, another reason might be that the task is just too far out of distribution?
@AICoffeeBreak
@AICoffeeBreak 22 days ago
Training on data exactly like this might make models perform better on these examples. But I doubt these models didn't see any self-referential statements somewhere in their training data.
@syedmustahsan4888
@syedmustahsan4888 23 days ago
Very good explanation. Amazing. Thank you very much, madam.
@gettingdatasciencedone
@gettingdatasciencedone 24 days ago
Great summary and discussion. Thank you, Letitia. Also, some really great points raised here in the comments.
@AICoffeeBreak
@AICoffeeBreak 23 days ago
Yes, so many, and so heated that it is hard to keep up. 😅 But indeed, they bring up interesting points.
@ChengyiLi-t1k
@ChengyiLi-t1k 24 days ago
This was very well explained. Thank you so much for this. I couldn't help but wonder: since these models are trained on data, how much data do you need for a model to be reasonably accurate? A quick Google search told me that GPT-4 was trained on approximately 500,000,000,000 pages of text, which is absolutely insane to me! I want to know if there are models we can develop that train on less data but still provide accurate results, and what those models would look like.
@AICoffeeBreak
@AICoffeeBreak 22 days ago
Thanks a lot, especially since this is a very old video. We have made a new transformer explainer: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ec9IQMiJBhs.html About your question: unfortunately, in deep learning, great performance comes with big data, because the models only work well in the domains and on the kinds of data they have seen so far (in distribution). And the motto is: nothing will be out of distribution if we include the entire world in the training data, no? (This is a tongue-in-cheek comment, just flagging it. 😅) So, if you are willing to sacrifice a lot of performance, there are models that can work with less data, going back to older NLP based on word embeddings and tf-idf representations. But I cannot say more until I know your specific use case. If you want a chatbot that can talk about almost anything, then you need trillions of tokens of text; at least, this is what we learned from ChatGPT et al.
@ChengyiLi-t1k
@ChengyiLi-t1k 22 days ago
Oh wow, I didn't even realize I was on the older video; I will definitely check out the new one, and thanks for your answer! The motivation for my question was that we typically don't have a lot of data on endangered languages: could there be language models that produce helpful results in these languages despite the lack of data on them? I guess the broader question is what kinds of language models we could apply to endangered languages for things such as documenting them or aiding in that kind of research.
@AICoffeeBreak
@AICoffeeBreak 21 days ago
@@ChengyiLi-t1k I'm not an expert in multilingual AI, but I have heard from experts there. Your question reminds me of two points.
* In multilingual AI, people still try to scrape all the monolingual data they have, automatically produce back-translations, and then train a multilingual model that hopefully can transfer its knowledge from high-resource languages to the low-resource ones. But you need some decent amount of data from every language you aim to learn. We've made a video on this approach, find the link in the description. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-1gHUiNLYa20.html
* If you have a very powerful model, of the class of GPT-4, Gemini, et al., then you can hope that the representations are strong enough to elicit with few-shot prompting. So, if you have the context length of Gemini, of multiple millions of input tokens, then you can many-shot a language from scratch by feeding in its dictionary and a grammar book. This is what Gemini 1.5 did for Kalamang: www.reddit.com/r/singularity/comments/1arla9z/gemini_15_pro_can_learn_to_zero_shot_translate_of/ It was meant as an out-of-distribution test, because the authors were sure that there is no trace of Kalamang on the internet that Gemini was trained on.
@ChengyiLi-t1k
@ChengyiLi-t1k 21 days ago
@@AICoffeeBreak Thank you very much! I appreciate the response.