
How AI 'Understands' Images (CLIP) - Computerphile 

Computerphile
2.4M subscribers
164K views

With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

Published: 24 Apr 2024

Comments: 250
@michaelpound9891 · 1 month ago
As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!
@adfaklsdjf · 1 month ago
we gotcha 💚
@harpersneil · 1 month ago
Phew, for a second there I thought you were dramatically more intelligent than I am!
@ArquimedesOfficial · 1 month ago
Omg, I've been your fan since Spider-Man 😆, thanks for the lesson!
@adfaklsdjf · 1 month ago
thank you for "if you want to unlock your face with a phone".. i needed that in my life
@alib8396 · 1 month ago
Unlocking my face with my phone is the first thing I do when I wake up every day.
@orange-vlcybpd2 · 1 month ago
The legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
@pyajudeme9245 · 1 month ago
This guy is one of the best teachers I have ever seen.
@sebastyanpapp · 1 month ago
Agreed
@edoardogribaldo1058 · 1 month ago
Dr. Pound's videos are on another level! He explains things with such passion and such clarity, rarely found on the web! Cheers
@joker345172 · 1 month ago
Dr Pound is just amazing. I love all his videos
@aprilmeowmeow · 1 month ago
Thanks for taking us to Pound town. Great explanation!
@pierro281279 · 1 month ago
Your profile picture reminds me of my cat! It's so cute!
@pvanukoff · 1 month ago
pound town 😂
@rundown132 · 1 month ago
pause
@aprilmeowmeow · 1 month ago
@pierro281279 that's my kitty! She's a ragdoll. That must mean your cat is pretty cute, too 😊
@BrandenBrashear · 1 month ago
Pound was hella sassy this day.
@MichalKottman · 1 month ago
9:45 - wasn't it supposed to be "minimize the distance on the diagonal, maximize elsewhere"?
@michaelpound9891 · 1 month ago
Absolutely yes! I definitely should have added "the distance" or similar :)
@ScottiStudios · 1 month ago
Yes, it should have been *minimise* the diagonal, not maximise.
@rebucato3142 · 1 month ago
Or it should be "maximize the similarity on the diagonal, minimize elsewhere"
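To make the corrected wording above concrete, here is a minimal PyTorch sketch of the symmetric contrastive objective CLIP uses; the batch size, embedding size and temperature are illustrative placeholders, not the paper's settings. The cross-entropy over each row and column is what pushes the similarity up on the diagonal, where the matching image/caption pairs sit, and down everywhere else.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style loss for a batch of N matching (image, caption) pairs.

    image_emb, text_emb: (N, D) outputs of the two encoders.
    The similarity matrix is N x N; entry (i, j) compares image i with
    caption j, so the matching pairs sit on the diagonal.
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N)

    # Target: row i should pick column i (and vice versa), i.e. the
    # diagonal similarities are pushed up, everything else down.
    targets = torch.arange(image_emb.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, targets)     # images -> captions
    loss_texts = F.cross_entropy(logits.t(), targets)  # captions -> images
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings standing in for encoder outputs.
images, captions = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(images, captions))
```

Note that only pairs within the same training batch compete with each other in this loss, which is one reason very large batches help contrastive training of this kind.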
@chloupichloupa · 16 days ago
That cat got progressively more turtle-like with each drawing.
@beardmonster8051 · 1 month ago
The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
@JohnMiller-mmuldoor · 1 month ago
Been trying to unlock my face for 10:37 and it's still not working!
@eholloway · 1 month ago
"There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
@rnts08 · 1 month ago
Understatement of the century, even for a Brit.
@skf957 · 1 month ago
These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
@letsburn00 · 1 month ago
YouTube is like you got the best teacher in school. The world has hundreds or thousands of experts. Being able to explain is really hard to do as well.
@bluekeybo · 1 month ago
The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
@Shabazza84 · 1 month ago
Excellent. Could listen to him all day and even understand stuff.
@TheRealWarrior0 · 1 month ago
A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)! After you got your embedding from the vision encoder you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding of the vision encoder produces the desired text output describing the image (and or executing the instructions in the image+prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).
@or1on89 · 1 month ago
That’s pretty much what he said after explaining how the LLM infers an image from written text. Did you watch the whole video?
@TheRealWarrior0 · 1 month ago
@@or1on89 What? Inferring an image from written text? Is this a typo? You mean image generation? Anyway, did he make my same point? I must have missed it. Could you point to the minute he roughly says that? I don't think he ever said something like "projective layer" and/or talked about how multimodality in LLMs is "bolted-on". It felt to me like he was talking about the actual CLIP paper rather than how CLIP is used on the modern systems (like Copilot).
@exceptionaldifference392 · 1 month ago
I mean the whole video was about how to align the embeddings of the visual transformer with LLM embeddings of captions of the images.
@TheRealWarrior0 · 1 month ago
@@exceptionaldifference392 to me, the whole video seems to be about the CLIP paper which is about “zero-shot labelling images”. But that is a prerequisite to make something like LLaVa which is able to talk, ask questions about the image and execute instruction based on the image content! CLIP can’t do that! I described the step from going to having a vision encoder and an LLM to have a multimodal-LLM. That’s it.
@TheRealWarrior0 · 1 month ago
@@exceptionaldifference392 To be exceedingly clear: the video is about how you create the "vision encoder" in the first place, (which does require you also train a "text encoder" for matching the image to the caption), not how to attach the vision encoder to the more general LLM.
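For readers of the thread above, a rough, hypothetical sketch of the "projection layer" idea being described (a LLaVA-style multimodal hookup). The dimensions, module shape and training details below are illustrative assumptions rather than the method from the video or any single paper: a small trained module maps the vision encoder's patch embeddings into vectors the LLM can consume as if they were token embeddings.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Projects vision-encoder patch embeddings into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (batch, n_patches, vision_dim) from the (frozen) vision encoder.
        # Output: (batch, n_patches, llm_dim) "visual tokens" that get prepended
        # to the text-token embeddings fed into the LLM.
        return self.proj(patch_emb)

# Toy usage: 2 images, 256 patch embeddings each.
visual_tokens = VisionToLLMProjector()(torch.randn(2, 256, 1024))
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```

In setups like this the vision encoder is typically kept frozen, and the projector (and sometimes the LLM) is fine-tuned on image-plus-caption or image-plus-instruction data so that the projected "thoughts" of the part that sees line up with the part that speaks.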
@rigbyb · 1 month ago
6:09 "There isn't red cats" Mike is hilarious and a great teacher lol
@uneasy_steps · 1 month ago
I'm a simple guy. I see a Mike Pound video, I click
@jamie_ar · 1 month ago
I pound the like button... ❤
@Afr0deeziac · 1 month ago
@jamie_ar I see what you did there. But same here 🙂
@BooleanDisorder · 1 month ago
I like to see Mike pound videos too.
@kurdm1482 · 1 month ago
Same
@MikeUnity · 1 month ago
We're all here for an intellectual pounding
@chuachua-hj9zd · 9 days ago
I like how he just sits in an office chair and talks. Simple but high-quality talk
@wouldntyaliktono · 1 month ago
I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured freetext queries. Embeddings are so cool.
@musikdoktor · 1 month ago
Love seeing AI problems explained on fanfold paper. Classy!
@sebastianscharnagl3173 · 1 month ago
Awesome explanation
@codegallant · 1 month ago
Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months so this is just wonderful. Thanks a ton! :)
@sukaina4978 · 1 month ago
I just feel 10 times smarter after watching any Computerphile video
@sbzr5323 · 1 month ago
The way he explains is very interesting.
@martin777xyz · 1 month ago
Really nice explanation 👍👍
@captaintrep6002 · 27 days ago
This video was super helpful. I just wish it had come out with the stable diffusion video. Was quite confused how we went from text to meaning.
@VicenteSchmitt · 1 month ago
Great video!
@jonyleo500 · 1 month ago
At 9:30, doesn't a distance of zero mean the image and caption have the same "meaning"? Therefore, shouldn't we want to minimize the diagonal and maximize the rest?
@michaelpound9891 · 1 month ago
Yes! We want to maximise the similarity measure on the diagonal - I forgot the word similarity!
@romanemul1 · 1 month ago
@michaelpound9891 C'mon. It's Mike Pound!
@stancooper5436 · 1 month ago
Thanks Mike, nice clear explanation. You can still get that printer paper!? Haven't seen that since my Dad worked as a mainframe engineer for ICL in the 80s!
@user-dv5gm2gc3u · 1 month ago
I'm an IT guy & programmer, but this is kinda hard to understand. Thanks for the video, it gives a little idea about the concepts!
@aspuzling · 1 month ago
I'd definitely recommend the last two videos on GPT from 3blue1brown. He explains the concept of embeddings in a really nice way.
@barrotem5627 · 1 month ago
Brilliant, Mike!
@Funkymix18 · 1 month ago
Mike is the best
@Stratelier · 1 month ago
When they say "high dimensional" in the vector context, I like to imagine it like an RPG character stat sheet, as each independent stat on that sheet can be considered its own dimension.
@xersxo5460 · 1 month ago
Just writing this to crystallize my understanding (and for others to check me for accuracy): So by circumventing the idea of trying to instill "true" understanding (which is a hard incompatibility in this context, due to our semantics), on a high level it's substituting case-specific discrepancies (like how a digital image is made of pixels, so only pixel-related properties are important, like color and position) and filtering against them, because it happens to be easier to tell what something isn't than what it is in this case (there are WAAAY more cases where a random group of pixels isn't an image of a cat, so your sample size for correction is also WAAY bigger).

And if you control for the specific property that disqualifies the entity (in this case, of the medium: discrete discrepancies), as he stated with the "predisposed noise subtraction to recreate a clean image" training, you can be even more efficient and effective by starting with already relevant cases. Once again, because a smattering of colors is not a cat, it's easier to go ahead and assume your images will already be in some assortment of colors similar to a cat to train on, versus the near infinite combinations of random color pixel images.

And then in terms of the issue of accuracy through specificity versus scalability, it was just easier to use the huge sample size as a tool to approximate accuracy between the embedded images and texts, because as a sample size increases, precision also roughly increases given a rule (in crude terms). And it's also a way to circumvent "mass hard coding" of associations to approximate "meaning", because the system doesn't even have to deal directly with the user inputs in the first place, just their association value within the embedded bank.

I think that's a clever use of the properties of a system as limitations to solve for our human "black box" results. Because the two methods, organic and mathematical, converge due to a common factor: the fact that digital images, in terms of relevance to people, are also useful approximations, because we literally can only care about how close an "image" is to something we know, not if it actually is or not, which is why we don't get tripped up over individual pixels in determining the shape of a cat in the average Google search. So in the same way, by relying on pixel resolution and accuracy as variables you can quantify the properties so a computer can calculate a useable result. That's so cool!
@zzzaphod8507 · 1 month ago
4:35 "There is a lot of stuff on the internet, not all of it good." Today I learned 😀
6:05 I enjoyed that you mentioned the issues of red/black cats and the problem of cat-egorization
Video was helpful, explained well, thanks
@Foxxey · 1 month ago
14:36 Why can't you just train a network that would decode the vector in the embedded space back into text (either fixed-size or using a recurrent neural network)? Wouldn't it be as simple as training a decoder and encoder in parallel and using the text input of the encoder as the expected output of the decoder?
@or1on89 · 1 month ago
Because that's a whole different class of problem and would make the process highly inefficient. There are better ways to do that using a different approach.
@IceMetalPunk · 1 month ago
For using CLIP as a classifier: couldn't you train a decoder network at the same time as you train CLIP, such that you now have a network that can take image embeddings and produce semantically similar text, i.e. captions? That way you don't have to guess-and-check every class one-by-one? Anyway, I can't believe CLIP has only existed for 3 years... despite the accelerating pace of AI progress, we really are still in the nascent stages of generalized generative AI, aren't we?
@GeoffryGifari · 1 month ago
Can AI say "I don't know what I'm looking at"? Is there a limit to how much it can recognize parts of an image?
@throttlekitty1 · 1 month ago
No, but it can certainly get it wrong! Remember that it's looking for a numerical similarity to things it does know, and by nature has to come to a conclusion.
@OsomPchic · 1 month ago
Well, in some way. It would say that the picture has these embeddings: cat: 0.3, rainy weather: 0.23, white limo: 0.1, with every number representing how "confident" it is. So with a lot of tokens below 0.5 you can say it has no idea what's in that picture
@ERitMALT00123 · 1 month ago
Monte-Carlo dropout can produce confidence estimations for a model. If the model doesn't know what it's looking at, then the confidence should be low. CLIP natively doesn't have this though
@el_es · 1 month ago
The "I don't know" answer is not treated very kindly by users, and so there is an understandable dislike of it embedded into the model ;) possibly because it also means more work for the programmers... Therefore it would rather hallucinate than say it doesn't know something.
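A small sketch of the Monte-Carlo dropout idea mentioned above, assuming a PyTorch classifier that contains dropout layers; the toy model and any threshold you would pick are placeholders. Dropout is left active at inference time and several stochastic forward passes are averaged, with the spread across passes read as a rough confidence signal.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Run several stochastic forward passes with dropout left on.

    Returns mean class probabilities and their standard deviation;
    a high std (or a flat mean) can be treated as "I don't know".
    """
    model.eval()
    # Re-enable dropout layers only, keeping e.g. batch-norm in eval mode.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

# Toy model standing in for a real classifier.
toy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 3))
mean_p, std_p = mc_dropout_predict(toy, torch.randn(1, 16))
print(mean_p, std_p)
```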
@dimitrifogolin · 1 month ago
Amazing
@Trooperos90 · 15 days ago
This video satisfies my AI scepticism.
@el_es · 1 month ago
@Dr Pound: sorry if this is off topic here, but I wonder if the problem of hallucinations in AI comes from us not treating the "I don't know what I'm looking at" answer of a model as a valid outcome? If it was treated by us as a valid neutral answer, could it reduce the rate of hallucinations?
@pickyourlane6431 · 1 month ago
I was curious: when you are showing the paper from above, are you transforming the original footage?
@robsands6656 · 13 days ago
Still have to manually generate the sentences for the embedding. It's just a convoluted way to let a computer generate its own lookup table.
@LupinoArts · 1 month ago
3:55 As someone born in the former GDR, I find it cute to label a Trabi as "a car"...
@ColibrisMusicLive · 21 days ago
Please explain 9:2, do you mean that the embeddings lying on the diagonal will receive a higher score?
@zxuiji · 1 month ago
Personally I woulda just did the colour comparison by putting the 24-bit RGB integer colour into a double (the 64-bit fpn type) and dividing one by the other. If the result is greater than 0.01 or less than -0.01 then they're not close enough to deem the same overall colour, and thus not part of the same facing of a shape.

**Edit:** When searching for images it might be better to use simple line path matching (both a 2D and a 3D one) against the given text of what to search for, and compare the shapes identified in the images to those 2 paths. If at least 20% of the line paths match a shape in the image set, then it likely contains what was being searched for. Similarly, when generating images, the line paths should then be traced to produce each image, then layered onto one image. Finally, for identifying shapes in a given image you just iterate through all stored line paths.

I believe this is how our brains conceptualise shapes in the 1st place, given how our brains have nowhere to draw shapes to compare to. Instead they just have connections between... cells? neurons? Someone will correct me. Anyways, they just have connections between what are effectively physical functions that equate to something like this in C: int neuron( float connections[CHAR_BIT * sizeof(uint)] ); Which tells me the same subshapes share neurons for comparisons, which means a bigger shape will likely be just some initial neuron to visit, how many neurons to visit, and what angle to direct the path at to identify the next neuron to visit. In other words, every subshape would be able to revisit a previous subshape's neuron/function. There might be an extra value or 2, but I'm no neural expert, so a rough guess should be accurate enough to get the ball rolling.
@jonathan-._.- · 1 month ago
Approximately how many samples do I need when I just want to do image categorisation (but with multiple categories per image)?
@IOSARBX · 1 month ago
Computerphile, this is great! I liked it and subscribed!
@WilhelmPendragon · 1 month ago
So the vision-text encoder is dependent on the quality of the captioned photo dataset? If so, where do you find quality datasets?
@Sleeperknot · 26 days ago
At 9:48, did Mike use "maximize" and "minimize" wrongly? The distances on the diagonal should be minimal, right? EDIT: I saw Mike's pinned comment only after posting this :P
@Misiok89 · 1 month ago
6:30 If for an LLM you have nodes of meaning, then you could look for "nodes of meaning" in the description and make classes based on those nodes. If you are able to represent every language with the same "nodes of meaning", that is even better for translating text from one language to another than an average non-LLM translator, and then you should be able to use it for classification as well.
@thestormtrooperwhocanaim496 · 1 month ago
A good edging session (for my brain)
@brdane · 1 month ago
Oop. 😳
@FilmFactry · 1 month ago
When will we see multimodal LLMs able to answer a question with a generated image? It could be how you wire an electric socket, and it would generate either a diagram or an illustration of the wire colours and positions. They should be able to do this, but they can't yet. Next would be a functional use of Sora: rendering a video of how you install a starter motor in a Honda.
@LukeTheB · 1 month ago
Quick question from someone outside computer science: Does the model actually instill "meaning" into the embedded space? What I mean is: Is the Angel between "black car" and "Red car" smaller than "black car" and "bus" and that is smaller than "black car" and "tree"?
@suicidalbanananana · 1 month ago
Yeah that's correct, "black car" and "red car" will be much closer to each other than "black car" and "bus" or "black car" and "tree" would be. It's just pretty hard to visualize this in our minds because we're talking about some strange sort of thousands-of-dimensions-space with billions of data points in it. But there's definitely discernable "groups of stuff" in this data. (Also, "Angle" not "Angel" but eh, we get what you mean ^^)
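For anyone wanting to see what "closer" means numerically in the exchange above: cosine similarity is just the dot product of length-normalised embedding vectors. The `encode_text` below is a random placeholder standing in for a real text encoder such as CLIP's, so the printed numbers won't show the effect; with a trained encoder, sim("black car", "red car") is expected to be clearly higher than sim("black car", "tree").

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Cosine of the angle between two embedding vectors (1.0 = same direction).
    return F.cosine_similarity(a, b, dim=-1).item()

def encode_text(s: str) -> torch.Tensor:
    # Placeholder: in practice this would be CLIP's (or any) trained text encoder.
    torch.manual_seed(sum(ord(c) for c in s))  # deterministic fake embedding per string
    return torch.randn(512)

car_black, car_red, tree = (encode_text(s) for s in ("black car", "red car", "tree"))
print(cosine_sim(car_black, car_red), cosine_sim(car_black, tree))
# With a real encoder the first number should be noticeably larger than the second.
```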
@aleksszukovskis2074 · 1 month ago
there is stray audio in the background that you can faintly hear at 0:05
@j3r3miasmg · 1 month ago
I didn't read the cited paper, but if I understood correctly, the 5 billion images need to be labeled for the training step?
@Hexanitrobenzene · 1 month ago
Or "at least" 400 million...
@felixmerz6229 · 18 days ago
Is there a specific reason why this process would have to be single-directional? There doesn't seem to be much difference in principle to an autoencoder. I assume this isn't about whether it would work or not, but rather about approximating this behavior to navigate around impossible amounts of training, or am I mistaken in this assumption?
@utkua · 1 month ago
How do you go from embeddings to text for something never seen before?
@donaldhobson8873 · 1 month ago
Once you have CLIP, can't you train a diffusion model on pure images, just by putting an image into CLIP and training the diffusion model to output the same image?
@Holycrabbe · 1 month ago
So the length of the CLIP array training the diffusion would be 400 million entries? So it defines a "corner" of the space we have spanned by the 400 million photos and photo descriptions?
@klyanadkmorr · 1 month ago
Heyo, a Pound dogette here!
@StashOfCode · 1 month ago
There is a paper on The Gradient about reverting embeddings to text ("Do text embeddings perfectly encode text?")
@AZTECMAN · 1 month ago
CLIP is fantastic. It can be used as a "zero-shot" classifier. It's both effective and easy to use.
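For anyone who wants to try the zero-shot classification trick mentioned above, here is a short sketch using the Hugging Face transformers CLIP wrappers; the model name, image path and label list are just example choices. It scores the image against one caption per candidate class, which is exactly the guess-and-check-each-label approach discussed elsewhere in these comments.

```python
# pip install transformers pillow torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a cat", "a dog", "a car", "a boat"]   # candidate classes
image = Image.open("some_image.jpg")             # placeholder path

# Embed the image once and one caption per class, then compare.
inputs = processor(text=[f"a photo of {label}" for label in labels],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # shape (1, num_labels)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```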
@hehotbros01 · 1 month ago
Poundtown.. sweet...
@proc · 1 month ago
9:48 I didn't quite get how similar embeddings end up close to each other if we maximize the distances to all other embeddings in the batch. Wouldn't two images of dogs in the same batch be pulled further away, just like an image of a dog and a cat would? Explain like Dr. Pound, please.
@drdca8263 · 1 month ago
First: I don’t know. Now I’m going to speculate: Not sure if this had a relevant impact, but: probably there are quite a few copies of the same image with different captions, and of the same caption for different images? Again, maybe that doesn’t have an appreciable effect, idk. Oh, also, maybe the number of image,caption pairs is large compared to the number of dimensions for the embedding vectors? Like, I know the embedding dimension is pretty high, but maybe the number of image,caption pairs is large enough that some need to be kinda close together? Also, presumably the mapping producing the embedding of the image, has to be continuous, so, images that are sufficiently close in pixel space (though not if only semantically similar) should have to have similar embeddings. Another thing they could do, if it doesn’t happen automatically, is to use random cropping and other small changes to the images, so that a variety of slightly different versions of the same image are encouraged to have similar embeddings to the embedding of the same prompt.
@NeinStein · 1 month ago
Oh look, a Mike!
@lucianoag999 · 15 days ago
So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption "blue dog".
@MilesBellas · 1 month ago
Stable Diffusion 3 = potential topic. Optimum workflow strategies using ControlNets, LoRAs, VAEs, etc.?
@GeoffryGifari · 1 month ago
How can AI determine the "importance" of parts of an image? Why would it output "people in front of boat" instead of "boat behind people" or "boat surrounded by people"? Or maybe the image is a grid of square white cells, and one cell then gets its colour progressively darkened to black. Would the AI describe these transitioning images differently?
@michaelpound9891 · 1 month ago
Interesting question! This very much comes down to the training data in my experience. For the network to learn a concept such as "depth ordering", where something is in front of another, what we are really saying is it has learnt a way to extract features (numbers in grids) representing different objects, and then recognize that an object is obscured or some other signal that indicates this concept of being in front of. For this to happen in practice, we will need to see many examples of this in the training data, such that eventually such features occurring in an image lead to a predictable text response.
@GeoffryGifari · 1 month ago
@michaelpound9891 The man himself! Thank you for your time
@GeoffryGifari · 1 month ago
@michaelpound9891 I picked that example because... maybe it's not just depth? Maybe there are a myriad of factors that the AI summarized as "important". For example, the man is in front of the boat, but the boat is far enough behind that it looks somewhat small... Or maybe that small boat has a bright colour that contrasts with everything else (including the man in front). But your answer makes sense, that it's the training data
@Jononor · 1 month ago
@GeoffryGifari Salience, and salience detection, is what this concept is usually called in computer vision. CLIP-style models will learn it as a side effect
@nenharma82 · 1 month ago
This is as simple as it's ingenious, and it wouldn't be possible without the internet being what it is.
@IceMetalPunk · 1 month ago
True! Although it also requires Transformers to exist, as previous AI architectures would never be able to handle all the varying contexts, so it's a combination of the scale of the internet and the invention of the Transformer that made it all possible.
@Retrofire-47 · 1 month ago
@IceMetalPunk The transformer, as someone who is ignorant, what is that? I only know a transformer as a means of converting electrical voltage from AC - DC
@JT-hi1cs · 1 month ago
Awesome! I always wondered how the hell the AI "gets" that an image is made with a certain type of lens or film stock. Or how the hell AI generates objects that were never filmed in a given way - say, The Matrix filmed on fisheye and Panavision in the 1950s.
@lancemarchetti8673 · 1 month ago
Amazing. Imagine the day when AI is able to detect digital image steganography. Not by vision primarily, but by bit inspection... iterating over the bytes and spitting out the hidden data. I think we're still years away from that though.
@genuinefreewilly5706 · 1 month ago
Great explainer. Appreciated. I hope someone will cover AI music next
@suicidalbanananana · 1 month ago
In super short: most "AI music stuff" is literally just running Stable Diffusion in the backend; they train a model on the actual images of spectrograms of songs, then ask it to make an image like that, and then convert that spectrogram image back to sound.
@genuinefreewilly5706 · 1 month ago
@suicidalbanananana Yes, I can see that; however, AI music has made a sudden, marked departure in quality of late. It's pretty controversial among musicians. I can wrap my head around narrow AI applications in music, i.e. mastering, samples etc. It's been a mixed bag of results until recently.
@or1on89 · 1 month ago
It surely would be interesting… I can see a lot of people embracing it for pop/trap music and genres with "simple" compositions… My worry as a musician is that it would make the landscape more boring than boy bands in the 90s (and it somewhat already is without AI being involved). As a software developer I would love instead to explore the tool to refine filters, corrections and sampling during the production process… It's a bit of a mixed bag… The generative aspect is being marketed as the "real revolution" and that's a bit scary… Knowing more about the tech and how ML can help improve our tools would be great…
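To make the spectrogram round trip described a few comments up concrete, here is a rough sketch using librosa; the file names are placeholders, and the exact pipeline real systems use may differ. Note that the Griffin-Lim-style inversion in `mel_to_audio` is lossy, which is one reason dedicated vocoders are often used for the spectrogram-to-audio step.

```python
# pip install librosa soundfile
import librosa
import soundfile as sf

# 1. Audio -> mel spectrogram (the "image" a diffusion model could be trained on).
y, sr = librosa.load("input_song.wav", sr=22050)        # placeholder file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# ... a generative model would produce a new array shaped like `mel` here ...

# 2. Mel spectrogram -> audio again (Griffin-Lim phase reconstruction, lossy).
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("reconstructed.wav", y_rec, sr)
```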
@bogdyee · 1 month ago
I'm curious about something. If you have a bunch of millions of photos of cats and dogs and they are also correctly labeled (with descriptions), but all these photos have the cats and dogs in the bottom half of the image, will the transformer be able to correctly classify them after training if they are put in the upper half of the image? (Or if images are rotated, colours changed, filtered, etc.)
@Macieks300 · 1 month ago
Yes, it may learn it wrong. That's why scale is necessary for this. If you have a million photos of cats and dogs it's very unlikely that all of them are in the bottom half of the image.
@bogdyee · 1 month ago
@Macieks300 That's why, for me, it poses a philosophical question. Will these things actually solve intelligence at some point? If so, what exactly might be the difference between a human brain and an artificial one?
@IceMetalPunk · 1 month ago
@@bogdyee Well, think of it this way: humans learn very similarly. It may not seem like it, because the chances of a human only ever seeing cats in the bottom of their vision and never anywhere else is basically zero... but we do. The main difference between human learning and AI learning, with modern networks, is the training data: we're constantly learning and gathering tons of data through our senses and changing environments, while these networks learn in batches and only get to learn from the training data we curate, which tends to be relatively static. But give an existing AI model the ability to do online learning (i.e. continual learning, not "look up on the internet" 😅) and put it in a robot body that it can control? And you'll basically have a human brain, perhaps at a different scale. And embodied AIs are constantly being worked on now, and continual learning for large models... I'm not sure about. I think the recent Infini-Attention is similar, though, so we might be making progress on that as well.
@suicidalbanananana · 1 month ago
@@bogdyee Nah they won't solve intelligence at some point when going down this route they are currently going down, AI industry was working on actual "intelligence" for a while but all this hype about shoving insane amounts of training data into "AI" has reduced the field to really just writing overly complex search engines that sort of mix results together... 🤷‍♂ Its not trying to think or understand (as is the actual goal of AI field) anything at all at this stage, it's really just trying to match patterns. "Ah the user talked about dogs, my training data contains the following info about dog type a/b/c, oh the user asks about trees, training data contains info about tree type a/b/c", etc. Actual AI (not even getting to the point of 'general ai' yet but certainly getting to somewhere much better than what we have now) would have little to no training data at all, instead it would start 'learning' as its running, so you would talk to it about trees and it would go "idk what a tree is, please tell me more" and then later on it might have some basic understanding of "ah yes, tree, i have heard about them, person x explained them to me, they let you all breathe & exist in type a/b/c, right? please tell me more about trees" Where the weirdness lies is that the companies behind current "AI" are starting to tell the "AI" to respond in a similar smart manner, so they are starting to APPEAR smart, but they're not actually capable of learning. All the current AI's do not remember any conversation they have had outside of training, because that makes it super easy to turn Bing (or whatever) into yet another racist twitter bot (see microsoft's history with ai chatbots)
@suicidalbanananana · 1 month ago
@@IceMetalPunk The biggest difference is that we (or any other biological intelligence) don't need insanely large amounts of training data, show a baby some spoons and forks and how to use them and that baby/person will recognize and be able to use 99.9% of spoons and forks correctly for the rest of its life, current overhyped AI's would have to see thousands of spoons and forks to maybe get it right 75% of the time & that's just recognizing it, we're not even close yet to 'understanding how to use' Also worth noting is how we (and again, any other biological intelligence) are always "training data" and much more versatile when it comes to new things, if you train an AI to recognize spoons and forks and then show it a knife it's just going to classify it as a fork or spoon, where as we would go "well that's something i've not seen before so it's NOT a spoon and NOT a fork"
@robosergTV · 1 month ago
Please make a playlist only about GenAI, or a separate AIphile channel. I care only about GenAI.
@nightwishlover8913 · 1 month ago
5:02 Never seen a "boat wearing a red jumper" before lol
@LaYvi · 7 days ago
I'm an artist and I'm very worried about my art being used to train an AI model. What can I do to prevent that? Any tips?
@RupertBruce · 1 month ago
One day, we'll give these models some high resolution images and comprehensive explanations and their minds will be blown! It's astonishing how good even a basic perceptron can be given 28x28 pixel images!
@MedEighty · 1 month ago
10:37 "If you want to unlock a face with your phone". Ha ha ha!
@eigd · 1 month ago
9:48 Been a while since I did machine learning class... Anyone care to tell me why I'm thinking of PCA? What's the connection?
@Hexanitrobenzene · 1 month ago
Hm, I'm not an expert either, but... AFAIK, Principal Component Analysis finds directions which maximise/minimise the variance of the data, which can be thought of as average distance. The drawback is that it's only a linear method and it cannot deal with high-dimensional data such as images effectively.
@unvergebeneid · 1 month ago
It is confusing, though, to say that you want to maximise the distances on the diagonal. Of course you can define things however you want, but usually you'd say you want to maximise the cosine similarity and thus minimise the cosine distance on the diagonal.
@MattMcT · 1 month ago
Do any of you ever get this weird feeling that you need to buy Mike a beer? Or perhaps, a substantial yet unknown factor of beers?
@MikeKoss · 1 month ago
Can't you do something analogous to stable diffusion for text classification? Get the image embedding, and then start with random noisy text, and iteratively refine it in the direction of the image's embedding to get a progressively more accurate description of the image.
@quonxinquonyi8570 · 1 month ago
Image manifolds are of huge dimension compared to text manifolds… so guided diffusion from a low-dimensional manifold to a very high-dimensional manifold would have less information and more noise. Basically, information-theoretic bounds still hold when you transform from a high-dimensional space to a low-dimensional embedding, but the other way around isn't as intuitive… some prior must be taken into account... but it is still a hard problem
@ginogarcia8730 · 1 month ago
I wish I could hear Professor Brailsford's thoughts on AI these days, man
@fredrik3685 · 1 month ago
Question 🤚 Up until recently, all images of a cat on the internet were photos of real cats and the system could use them in training. But now more and more cat images are AI generated. If future systems use generated images in training it will be like the blind leading the blind. More and more distortion will be added. Or? Can that be avoided?
@quonxinquonyi8570 · 1 month ago
Distortion and perceptual qualities are the tradeoff we make when we use generative AI
@EkShunya · 1 month ago
I thought diffusion models had a VAE and not a ViT. Correct me if I'm wrong
@quonxinquonyi8570 · 1 month ago
A diffusion model is an upgraded version of a VAE, with a limitation in sampling speed
@ianburton9223 · 1 month ago
Difficult to see how convergence can be ensured. Lots of very different functions can be closely mapped over certain controlled ranges, but then are wildly different outside those ranges. What I have missed in many AI discussions is these concepts of validity matching and range identities to ensure that there's some degree of controlled convergence. Maybe this is just a human fear of the unknown.
@Ankhyl · 14 days ago
Mike explains very well; however, it is very noticeable that the concept is not easily explained in 20 minutes. There are a lot of cliffhangers, and each step in the process requires its own iteration of a 20-minute video, probably at least 2 levels deep, to really understand what's going on.
@Rapand · 1 month ago
Each time I watch one of these videos, I might as well be watching Apocalypto without subtitles. My brain is not made for this 🤓
@charlesgalant8271 · 1 month ago
The answer given for "we feed the embedding into the denoise process" still felt a little hand-wavy to me as someone who would like to understand better, but overall a good video.
@michaelpound9891 · 1 month ago
Yes I'm still skipping things :) The process this uses is called attention, which basically is a type of layer we use in modern deep networks. The layer allows features that are related to share information amongst themselves. Rob Miles covered attention a little in the video "AI Language Models & Transformers", but it may well be time to revisit this since attention has become quite a lot more mainstream now, being put in all kinds of networks.
@IceMetalPunk · 1 month ago
@@michaelpound9891 It is, after all, all you need 😁 Speaking of attention: do you think you could do a video (either on Computerphile or elsewhere) about the recent Infini-Attention paper? It sounds to me like it's a form of continual learning, which I think would be super important to getting large models to learn more like humans, but it's also a bit over my head so I feel like I could be totally wrong about that. I'd appreciate an overview/rundown of it, if you've got the time and desire, please 💗
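Since attention keeps coming up in this thread, here is a bare-bones sketch of the scaled dot-product attention Dr Pound refers to (single head, no masking or learned projections, shapes chosen purely for illustration). Each position's output is a similarity-weighted mix of every position's values, which is the "related features sharing information amongst themselves" idea described above.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim). Returns (batch, seq_len, dim)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # pairwise similarities
    weights = F.softmax(scores, dim=-1)                      # each row sums to 1
    return weights @ v                                        # weighted mix of values

# Toy usage: 4 tokens with 64-dimensional features attending to each other.
x = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)  # torch.Size([1, 4, 64])
```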
@bennettzug · 1 month ago
13:54 You actually probably can, at least to an extent - there's been some recent research on the idea of going backwards from embeddings to text; maybe look at the paper "Text Embeddings Reveal (Almost) As Much As Text" (Morris et al). The same thing has been done with images from a CNN, see "Inverting Visual Representations with Convolutional Networks" (Dosovitskiy et al). Neither of these are with CLIP models, so maybe future research? (Not that it'd produce better images than a diffusion model)
@or1on89 · 1 month ago
You can, using a different type of network/model. We need to remember that all he said is in the context of a specific type of model and not in absolute terms, otherwise the lesson would go very quickly out of context and be hard to follow.
@bennettzug · 1 month ago
@or1on89 I don't see any specific reason why CLIP model embeddings would be especially intractable though
@JeiShian · 1 month ago
The exchange at 6:50 made me laugh out loud and I had to show that part of the video to the people around me 😆😆
@creedolala6918 · 1 month ago
"and we want an image of foggonstilz" me: wat
"we want to pass the text of farngunstills" me: u wot m8
@MuaddibIsMe · 1 month ago
"a mike"
@Hexanitrobenzene · 1 month ago
THE Mike :)
@zurc_bot · 1 month ago
Where did they get those images from? Any copyright infringement?
@quonxinquonyi8570 · 1 month ago
The internet has been a huge public repository since its inception
@djtomoy · 1 month ago
Why is there always so much mess and clutter in the background of these videos? Do you film them in abandoned buildings?
@willhart2188 · 1 month ago
AI art is great.
@FLPhotoCatcher · 1 month ago
At 16:20 the 'cat' looks more like a shower head.
@kbabiy · 1 month ago
[15:00] It's supposed to have a tail
@MilesBellas · 1 month ago
Stable Diffusion needs a CEO, BTW... just saying... 😅