
DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained) 

Yannic Kilcher
Subscribe · 263K subscribers
152K views

Published: Oct 4, 2024

Comments: 127
@YannicKilcher 3 years ago
OUTLINE:
0:00 - Intro & Overview
4:10 - Denoising Diffusion Probabilistic Models
11:30 - Formal derivation of the training loss
23:00 - Training in practice
27:55 - Learning the covariance
31:25 - Improving the noise schedule
33:35 - Reducing the loss gradient noise
40:35 - Classifier guidance
52:50 - Experimental Results
@TechyBen 3 years ago
Will you cover Nvidia's or Intel's "AI photorealism" examples for turning game images photorealistic? IIRC a new paper was just released on it. Still early work, but it's making better progress, as it no longer suffers from the temporal or hallucination (artifacts/errors) problems.
@ChunkyToChic 3 years ago
My boyfriend wrote these papers. Go Alex Nichol!
@taylan5376 3 years ago
And I already felt sorry for your bf
@LatinDanceVideos 3 years ago
You’ll have to compete for his attention with all the coding fanbois. Either way, lucky girl. Hold onto that guy.
@luisfable 2 years ago
With every great person, there is a great partner
@TheRyulord 2 years ago
@@LatinDanceVideos YouTube says her name is Samantha Nichol now so I guess she took your advice.
@cedricvillani8502 2 years ago
Lose 10 pounds by cutting your head off??? 😂😂
@ahmedalshenoudy1766 3 years ago
Thanks a lot for the thorough explanation! It's helping me figure out a topic for my master's degree. Much much appreciated ^^
@CosmiaNebula 3 years ago
Summary: self-supervised learning. Given dataset of good images, keep adding Gaussian noise to it to create sequences of increasingly noisy images. Let the network learn to denoise images based on that. Then the network can "denoise" completely Gaussian random pictures into real pictures. To do: learn some latent space (like VAEGAN does) so that it can smoothly interpolate between generated pictures and create nightmare arts.
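The forward process this comment summarizes can be sketched in a few lines. This is a minimal illustration, not code from the paper; the linear schedule constants match the ones commonly quoted for DDPM, but treat them as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, betas, rng):
    """Run the forward diffusion chain: repeatedly mix in Gaussian noise,
    scaling the signal by sqrt(1 - beta) so total variance stays ~1."""
    x = x0
    for beta in betas:
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

# A toy "image": after enough steps it is statistically indistinguishable
# from pure N(0, I) noise, which is what the reverse model starts from.
x0 = rng.standard_normal(1000)            # stand-in for flattened pixels
betas = np.linspace(1e-4, 0.02, 1000)     # linear schedule, illustrative
xT = forward_noise(x0, betas, rng)
print(float(xT.std()))                    # stays close to 1.0
```

The network is then trained to run this chain backwards, one small step at a time.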
@scottmiller2591 3 years ago
That notation \mathcal{N}(x_t;sqrt{1-\beta_t}x_{t-1},\beta_t \mathbf{I}) sets my teeth on edge. Doing this with P, a general PDF, is fine, but I would always write x_t ~ \mathcal{N}(sqrt{1-\beta_t}x_{t-1},\beta_t \mathbf{I}), since \mathcal{N} is the Gaussian _distribution_ with a defined parameterization. BTW, the reason for sqrt{1-\beta_t}x_{t-1} is to keep the energy of x_{t-1} approximately the same as the energy for x_t; otherwise, the image would explode to a variance of T*\beta after T iterations. It's probably a good idea to keep the neural network inputs to about the same range every time.
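The energy-preservation point above is easy to check numerically. A sketch with a single arbitrary constant beta (not the paper's schedule), comparing the scaled update against naive noise accumulation:

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta = 500, 0.01
x_scaled = rng.standard_normal(100_000)   # unit-variance start
x_unscaled = x_scaled.copy()
for _ in range(T):
    # With the sqrt(1 - beta) factor the variance has fixed point 1 ...
    x_scaled = np.sqrt(1 - beta) * x_scaled + np.sqrt(beta) * rng.standard_normal(x_scaled.shape)
    # ... without it, variances simply add up to 1 + T*beta.
    x_unscaled = x_unscaled + np.sqrt(beta) * rng.standard_normal(x_unscaled.shape)

print(float(x_scaled.var()), float(x_unscaled.var()))  # ~1.0 vs ~1 + T*beta = 6.0
```

So the scaling keeps the network's inputs in the same range at every step, exactly as the comment argues.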
@cedricvillani8502 2 years ago
Don’t forget to edit your text next time you paste it in😮
@pedrogorilla483 9 months ago
Historic video! Fun to see it now and compare it to the current state of image generation. I’ll check it again in two years to see how far we’ve got.
@ShresthShukla-h9n 9 months ago
lol :)
@andrewcarr3703 3 years ago
Love it!! It's called the "number line" in English. Keep up the great work
@impromptu3155 2 years ago
Just Amazing. I guess I might read this paper for another whole day if I missed your video. Grateful!
@sshatabda 3 years ago
Great video! I was surprised to see this after the latest paper just a few days back! Thanks for the great explanations!
@MrBOB-hj8jq 3 years ago
Can you please make a video about SNNs and the latest research on them?
@mariafernandadavila8332 1 year ago
Amazing explanation. Saved me a lot of time!! Thank you!
@linminhtoo 3 years ago
yannic, thanks for the video. the audio is a little soft even at max volume (unless I'm wearing my headphones). is it possible to make it a bit louder?
@YannicKilcher 3 years ago
Thanks a lot! Can't change this one, but I'll pay attention in the future
@abdalazizrashid 3 years ago
Yup, correct, most of your videos have quite low volume
@JurekOK 2 years ago
Maybe this is just correct -- it's a regular hi-fi audiophile loudness level. There is no need here for hyper-compression filters like in commercials and cheap music videos.
@ShawnFumo 2 years ago
@@JurekOK Maybe, but in practice, using my laptop speakers with Windows and YouTube volumes maxed out, it is still pretty low volume. I had to put subtitles on to make sure I didn't miss things here and there, and this was in a fairly quiet room.
@videowatching9576 2 years ago
Fascinating, incredible video! Really appreciate the walkthrough! Such as the cosine vs linear approach to make sure each step in diffusion is useful - very interesting!
@bg2junge 3 years ago
Any results (images) from generative models should be accompanied by the nearest neighbor (VGG latent, etc.) from the training dataset. I am going to train it on MNIST 🏋
@alexnichol3138 3 years ago
There are nearest neighbors in the beginning of the appendix!
@bg2junge 3 years ago
@@alexnichol3138 i retract my statement.
@48956l 2 years ago
@@bg2junge I demand seppuku
@Galinator9000 1 year ago
Your videos are amazing Yannic, keep it up. Much love
@binjianxin7830 1 year ago
18:46 I guess it's very likely related to Shannon's sampling theorem: reconstructing the data distribution by sampling with the well-defined normal distribution. The number of time steps and beta are closely related to the bandwidth of the data distribution.
@kxdy8yg8 2 years ago
Great material! Honestly, I really enjoy your content!! Keep it up 👏👏
@JamesAwokeKnowing 3 years ago
This makes me think that instead of super-res from a lower-res image, it could be even more effective to store a sparse pixel array (with high-res positioning). You could even have another net 'learn' a way of choosing e.g. which 1000 pixels of a high-res image to store (the pixels providing the most information for reconstruction).
@vidret 3 years ago
yes... yeeeeeesssssssssssss
@Champignon1000 3 years ago
wow that's a really great idea actually!
@stephanebeauregard4083 3 years ago
I've only listened to 11 minutes so far, but DDPMs remind me a lot of Compressed (or Compressive) Sensing ...
@thirtysixnanoseconds1086 3 years ago
Same, the Steve Brunton videos :D
@underlecht 3 years ago
Amazing review. Please do more like this, very interesting, and thank you for sharing. Subscribed!
@chaerinkong5303 3 years ago
Thanks a lot for this awesome video. I really needed it
@proinn2593 3 years ago
There is this step-wise generation in GANs, not based on steps from noise to image, but based on the size of the image, like in Pro-GAN and MSG-GAN. In these models you have discriminators for different sizes of the image, kind of.
@gustavboye6691 3 years ago
Yes, that should be the same, right?
@cedricvillani8502 2 years ago
Are you saying it’s not the size of your GAN that matters, but how You use it? 😂
@luke.perkin.inventor 3 years ago
I wonder if multiscale noise would work better. It'd fit better with convolutions. Instead of going from 0% to 100% noise, it could disturb from single pixels up to the whole image.
@bertobertoberto242 1 year ago
I would say that the sqrt(1-B) is used to converge to N(0, sigma), mainly in its "mu"; otherwise adding Gaussian noise would just (in expectation) keep x0 as the mean, instead of 0
@Kerrosene 3 years ago
Reminds me of normalising flows... the direction of the flow leads to a normal form through multiple invertible transformations...
@PlancksOcean 3 years ago
It looks like it, but the transformation (adding some noise) is stochastic and non-invertible
@TechyBen 3 years ago
Detecting signal inside the noise. Wow. It's like a super cheat for cheat sheets. And it works! :D
@nahakuma 3 years ago
By the way, these DDPM models seem very related to (a practical simplification of?) Neural Autoregressive Flows, where each layer is invertible and performs a small distribution perturbation which vanishes with enough layers
@gooblepls3985 2 years ago
True! I think the important difference (an implementational simplification) is that you have no a-priori restrictions on the DNN architecture here, i.e., the layers do not need to be invertible, and the idea is almost agnostic to what exact DNN architecture you use
@easyBob100 2 years ago
Another question. If the network is predicting the noise added to a noisy image, what do you then do with that prediction? Subtract it from the noisy image? Do you then run it back through the network to again predict noise?
When you train this network, do you train it to only predict the small amount of noise added to the image between the forward process steps? Or does it try to predict all the noise added to the image up to that point?
Or maybe it's more like the forward process: starting with latent x_T as input to the network, the network gives you an 'image' that it thinks is on the manifold (x_{T-1}). At this point it most likely isn't, but you can move 1/T towards it, like we did moving towards the Gaussian noise to get to x_T. Then repeat...?
More examples and less math always helps...
@furrry6056 1 year ago
Yes, it's a step-by-step approach. Thus, when 'destroying' the image, the image at T_i = image at T_{i-1} + noise step. You just keep adding / stacking noise, adding a bit more noise (to the previous noise) at each new step.
It isn't really 'constant' though. The variance / amount of noise added depends on the time step and the schedule. A linear schedule would be constant (adding the same amount of noise at each T_i), but if you look at the images (de)generated doing so, you get a quite long tail of images that contain nearly only noise. Therefore a cosine schedule is used, meaning the variance differs per T_i, and you also end up with more information left in the images at the later time steps.
The timestep is actually encoded into the model. Thus, the parameters that are learned to predict the noise 'shift' depending on T. (At least... in my understanding / words. I'm just a dumb linguist - I don't know any maths either 😅.) Perhaps a better way to explain it is to imagine that at small T_i, the model can depend on all kinds of visual features (edges, corners, etc.) learned to predict noise. At large T, those features / params get less informative, so you rely on other features to estimate where the noise is. (Thus it's probably not the features that shift depending on T, but their weights.)
When generating a new image, you start at T_max: pure random noise only. The model first reconstructs to T_max - 1, removing a little noise. Then, taking this image, you again remove a bit more noise, etc. It's an iterative process.
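The linear-vs-cosine point in this reply can be checked directly from the schedule definitions. The constants below follow the ones usually quoted for DDPM and Improved DDPM (linear 1e-4 to 0.02, cosine offset s = 0.008), but treat this as a sketch rather than the paper's code:

```python
import math

def alpha_bar_linear(T=1000, beta_start=1e-4, beta_end=0.02):
    # Cumulative product of (1 - beta_t) for a linearly spaced beta schedule.
    abar, out = 1.0, []
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        abar *= 1.0 - beta
        out.append(abar)
    return out

def alpha_bar_cosine(T=1000, s=0.008):
    # alpha_bar_t defined directly via a squared cosine, as in Improved DDPM.
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t + 1) / f(0) for t in range(T)]

lin, cos_ = alpha_bar_linear(), alpha_bar_cosine()
# Halfway through the chain, the linear schedule has already destroyed most
# of the signal, while the cosine schedule still keeps a lot of it.
print(lin[499], cos_[499])
```

This is exactly the "long tail of images that contain nearly only noise" the comment describes: with the linear schedule, alpha_bar drops to nearly zero long before the chain ends.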
@princeofexcess 3 years ago
Great video. Could you possibly up the volume level for the next video? I noticed this video is much quieter than other videos I watch.
@nisargshah467 3 years ago
I was waiting for this, so I didn't have to read the paper. Thanks Yannic
@johnsnow9925 3 years ago
It wouldn't be OpenAI if they actually released their pretrained models
@PaulanerStudios 3 years ago
ClosedAI
@herp_derpingson 3 years ago
@@PaulanerStudios BURN
@ShawnFumo 2 years ago
Well, it's a bit of a moot point now that Stable Diffusion has released theirs. Maybe it isn't matching DALL-E 2 in all areas yet, but it's coming pretty close, especially the 1.5 model (already on DreamStudio, though not available for download quite yet).
@CristianGarcia 3 years ago
This is me being lazy and not looking it up, but if they predict the noise instead of the image, then to actually get the image they subtract the predicted noise from the noisy image iteratively until they get a clean image?
@YannicKilcher 3 years ago
Yes, pretty much, except doing this in a probabilistic way where you try to keep track of the distribution of the less and less noisy images.
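That loop can be sketched roughly as below. The update rule follows the standard DDPM ancestral-sampling equation with the fixed sigma_t^2 = beta_t choice; the "model" here is a dummy stand-in (an assumption for the sake of a runnable example), since no trained network is available:

```python
import numpy as np

def ddpm_sample(predict_eps, betas, shape, rng):
    """Ancestral sampling: start from pure noise, subtract the predicted
    noise step by step, re-injecting a little fresh noise at each step
    (the 'probabilistic' part of the reply above)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = predict_eps(x, t)
        # Posterior mean: remove a scaled fraction of the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise  # sigma_t^2 = beta_t variant
    return x

# Dummy stand-in for a trained network: just rescales its input to unit RMS.
dummy_model = lambda x, t: x / np.linalg.norm(x) * np.sqrt(x.size)
rng = np.random.default_rng(0)
sample = ddpm_sample(dummy_model, np.linspace(1e-4, 0.02, 50), (8,), rng)
print(sample.shape)
```

With a real network in place of `dummy_model`, this is the whole generation procedure.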
@arnabdey7019 4 months ago
Please make explanation videos on Yang Song's papers too
@G12GilbertProduction 3 years ago
Diffusing noise with a foward sampling is really more entropian in context accumulation of sharing data by the transformer, but visual autoencoders is thinny for this Gaussian / or / Bayes-Gauss mixture, without a one transformer for a layer. EDIT: I thought is only the prescriptive sense of this upper statement, not evenmore.
@jakubsvehla9698 2 years ago
awesome video, thanks!
@nomadow2423 1 year ago
16:55 Denoising depends on the entire data distribution because adding random noise in one step can be done independently of all previous steps; just add a bit of noise wherever you like. But removing noise (the reverse) has to assume noise was added over some number of previous steps. Thus, in the example of denoising a small child's drawing, it's not that we're removing ALL the noise. Instead, the dependence problem arises in simply taking a single step towards a denoised picture. Can anyone clarify/confirm?
@brandomiranda6703 3 years ago
What is the main take away?
@YannicKilcher 3 years ago
make data into noise, learn to revert that process
@herp_derpingson 3 years ago
Train a denoiser, but don't add or remove all the noise in one step.
@romagluskin5133 1 year ago
50:24 "Distribution shmistribution" 🤩
@samernoureddine 3 years ago
Lightning
@johongo 3 years ago
This paper is really well-written.
@natanielruiz818 2 years ago
Amazing video.
@JTMoustache 3 years ago
44:14:
p(a|b,c) = p(a,c|b) / p(c|b)
         = p(a|b) * p(c|b,a) / p(c|b)
         = Z * p(a|b) * p(c|a,b)
and, if c is independent of b given a,
         = Z * p(a|b) * p(c|a)
But Z = 1 / p(c|b). So, given that c is independent of b given a:
p(a|b,c) = p(a|b) * p(c|a) / p(c|b)
Here a = x_t, b = x_{t+1}, c = y, Z = 1 / p(y|x_{t+1})... Then they probably consider y independent of x_{t+1} given x_t. Problem is, if they consider y independent of x_{t+1} given x_t, they should probably also consider y independent of x_t given x_{t+1}, which would basically say p(x_t|x_{t+1},y) = p(x_t|x_{t+1}). But I guess the whole point is to say that actually no: x_t contains more information about y than x_{t+1}, so y is not independent of x_t given a more noisy version of x_t (x_{t+1}).
@nahakuma 3 years ago
I think it is more natural to do your derivation with a = x_t, b = y, c = x_{t+1}. In this way, a fitting probabilistic graphical model would be y -> x_t -> x_{t+1}. So the class label y clearly determines the distribution of your image at any step, but given the current image x_t you already have a well-defined noise process that tells you how x_{t+1} will be obtained from x_t, and the label then becomes irrelevant.
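For readers skimming the thread, the factorization being debated is, in LaTeX (my transcription of the two comments above, not a quote from the paper):

```latex
\begin{aligned}
p(x_t \mid x_{t+1}, y)
  &= \frac{p(x_t, y \mid x_{t+1})}{p(y \mid x_{t+1})}
   = \frac{p(x_t \mid x_{t+1})\, p(y \mid x_t, x_{t+1})}{p(y \mid x_{t+1})} \\
  &= \frac{p(x_t \mid x_{t+1})\, p(y \mid x_t)}{p(y \mid x_{t+1})}
  \qquad \text{if } y \perp x_{t+1} \mid x_t .
\end{aligned}
```

The conditional independence used in the last step is exactly what the graphical model y -> x_t -> x_{t+1} in the reply encodes: once x_t is known, the noisier x_{t+1} carries no extra information about the label.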
@herp_derpingson 3 years ago
The audio is a bit quiet in this video.
0:00 I didn't realize any of these were generated. Totally fooled my brain's discriminator.
29:00 How can the noise be less than the accumulated noise up to that point? Are we taking into account that some noise added later might undo the previously added noise?
50:00 I am not sure how to carry the learnings over to GANs from diffusion models. The only thing I can think of is pre-training the discriminator with real images and noised real images, but that sounds so obvious I am sure 100s of papers have already done that.
All in all, I would love to see more papers which make neural networks output weird things like probability distributions instead of simple images or word tokens.
@easyBob100 2 years ago
Can someone explain the noising process with some pseudocode? Is the noise constantly added (based on t) or blended (based on percent of T)? And of course, does it make a difference and why?
EDIT: Never mind. I always figure it out after asking. :) (I generate some noise, and either blend or lerp towards it, as they are the same.)
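The blend-vs-lerp conclusion matches the closed-form property of the forward process: you can jump straight to any step t instead of looping. A minimal sketch (schedule constants illustrative, not from the paper's code):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    This closed form is equivalent in distribution to adding the per-step
    noise t times, because sums of independent Gaussians are Gaussian."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(256)
xt, eps = q_sample(x0, 500, alpha_bars, rng)
print(float(alpha_bars[500]))  # fraction of x0's variance surviving at t=500
```

Training samples a random t, computes x_t this way, and asks the network to predict `eps` back; no step-by-step loop is needed during training.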
@daniilchesakov6010 3 years ago
Hi! Amazing video, thank you a lot! But I'm a bit confused by one detail and have a stupid question. Since we train our model to predict epsilon using x_t and t, and we also have the formula
x_t = \sqrt{\bar{\alpha}_t} * x_0 + \sqrt{1 - \bar{\alpha}_t} * eps
we can get that
x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t} * eps) / \sqrt{\bar{\alpha}_t}
And here we know the alphas because they are constants, we also know x_t (just some noise), and we know eps as it is the output of our model -- why can't we calculate the answer in just one step? Would be very grateful for an answer!
@idenemmy 2 years ago
I have the same question. My hypothesis is that such an x0 would be very bad. Have you found the answer to this question?
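One way to see why that hypothesis is right: the inversion formula is algebraically exact, but any error in the predicted epsilon gets amplified by 1/sqrt(abar_t), which is enormous at t = T. A toy check of my own (constants illustrative, not from the paper):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

x0 = rng.standard_normal(64)
t = 999  # the very last step, where alpha_bar is tiny
eps = rng.standard_normal(64)
xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# With the *true* eps, the one-step inversion is exact -- the formula is fine:
x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
print(np.allclose(x0_hat, x0))

# With a slightly wrong eps, the 1/sqrt(alpha_bar_t) factor blows the error
# up, which is why a single jump from pure noise gives a bad x0:
x0_bad = (xt - np.sqrt(1.0 - alpha_bars[t]) * (eps + 0.01)) / np.sqrt(alpha_bars[t])
print(float(np.abs(x0_bad - x0).max()))
```

Iterative sampling sidesteps this: each step only needs eps to be accurate enough for a small correction, and later steps can fix earlier mistakes.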
@zephyrsails5871 3 years ago
Thank you Yannic for the video. Quick question: why does adding Gaussian noise to an image require a multivariate Gaussian instead of just a 1d Gaussian? Is the extra dimension used for the different color channels?
@PlancksOcean 3 years ago
One dimension per pixel 🙂
@mohamedrashad7845 3 years ago
What software and hardware do you use to make these videos (drawing tablet, Adobe Reader, others)?
@JTchen-sq6gs 3 years ago
It seems you used a tool to concatenate two paper PDFs together? That's cool; would you mind telling me which tool?
@erniechu3254 3 years ago
If you're on Mac, there's a native script for that: /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py. Or you can just use Preview.app lol
@mikegro3138 2 years ago
Hi, I watched the video, but this is not a topic I am familiar with. Could anyone please describe in a few sentences how this works? Especially how Disco Diffusion works. Where does it get the graphical elements for the images? How does it connect keywords from the prompt with the artists, the style, etc.? It seems I can use every keyword I want, but if there is a database, it should be limited. Is it trained somehow to learn what the different styles look like? What if I pick an uncommon keyword? So many questions to understand this incredible software. Thanks
@НикитаДробышев-ж4т
Hi! Please do something with your mic, because the video is so silent
@soumyanasipuri 2 years ago
Can anyone tell me what we mean by x0 ~ q(x0)? In terms of pictures, what is x0 and what is the data distribution? Thank you.
@lllcinematography 2 years ago
Your audio recording volume is too low; I have to increase my volume about 4x compared to other videos. Thanks for the content.
@nahakuma 3 years ago
I wonder why they state that the undefined norm ||.|| of the covariance tends to 0. Doesn't it tend to whatever the norm of a uniform covariance matrix is?
@herp_derpingson 3 years ago
Isn't the norm of a uniform covariance matrix, with mean=0, std=1, zero?
@nahakuma 3 years ago
@@herp_derpingson As far as I know, the norm of a matrix A is typically defined as the maximum norm of the vector x^TAx, with x^Tx = 1. In the case of a normal distribution you would have x^TAx=1 for any x and so the norm of the covariance would be 1. Am I wrong?
@herp_derpingson 3 years ago
@@nahakuma Nah, I am a bit out of touch with math. You are probably right.
@Farhad6th 3 years ago
The voice has a problem; it is very low. Please fix that in the next videos. Great video. Thank you.
@cedricvillani8502 2 years ago
Turn volume up?
@CristianGarcia 3 years ago
Can you use this technique to erase adversarial attacks?
@herp_derpingson 3 years ago
That's an interesting idea. Although I think we would have to train the network specifically on adversarial noise. Might not, though. Not sure, but a good idea regardless.
@jg9193 3 years ago
You'd have to be careful, because this technique relies on neural networks that can potentially be attacked
@ProfessionalTycoons 3 years ago
super dope
@piotr780 1 year ago
but this random image at the end does not contain any information !
@bertchristiaens6355 3 years ago
If you add noise from a standard normal distribution thousands of times, isn't the average noise (expected value) added close to zero, resulting in the same image?
@samernoureddine 3 years ago
Even if they were using standard Gaussians (they aren't), the sum of just two standard Gaussians X and Y is not a standard Gaussian (the variances add up)
@trevoryap7558 3 years ago
But the variance will increase so significantly that it will be just noise. (Assuming that the noise are all independent)
@bertchristiaens6355 3 years ago
@@samernoureddine Thank you! I assumed it was equivalent to sampling 1000 times (for example) from the same distribution N(0, var). Since these samples approximate the distribution N(0, var), I thought the mean of these values would be 0. But I should rather see it as a sample from N(0, var+var+...+var), right? (since we add up the samples)
@samernoureddine 3 years ago
@@bertchristiaens6355 that would be right if they just wanted the noise distribution at some time t (and if the mean were zero: it isn't). But they want the noise distribution to evolve with time, and so the total noise at time t+1 is not independent from the total noise at time t
@cerebralm 3 years ago
It's like a random walk, the more random choices you make, the further you get from where you started (but unpredictably so)
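The random-walk intuition above is easy to verify numerically. A sketch of my own, not the paper's actual schedule; here every step just adds unit-variance noise:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1000, 50_000
# Sum T independent N(0, 1) draws per "pixel": the means cancel in
# expectation, but the variances ADD, so the total is N(0, T), not ~0.
total = np.zeros(n)
for _ in range(T):
    total += rng.standard_normal(n)

print(float(total.mean()), float(total.var() / T))  # mean ~0, var/T ~1
```

This is why the thread's conclusion holds: the expected value of the added noise is zero, but each sample wanders further and further from the original image, with variance growing linearly in the number of steps.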
@GuanlinLi-l8j 2 years ago
How about explaining the code of this paper?
@peterthegreat7125 1 year ago
Still confused about the math theory
@이상윤-n7d 3 years ago
22:31 Can someone explain how eq. 12 is obtained?
@hudewei7166 3 years ago
It is obtained from the product of two Gaussian distributions, q(x_t|x_{t-1}) and q(x_{t-1}|x_0). If the chain rule is applied to eq. (12), then you get q(x_{t-1}|x_t,x_0) = q(x_t|x_{t-1},x_0) q(x_{t-1}|x_0) / q(x_t|x_0). By the Markov property, q(x_t|x_{t-1},x_0) = q(x_t|x_{t-1}). Then, if the normalization term is ignored, you get the expressions in (10) and (11).
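Writing out the posterior this reply describes, in LaTeX (my transcription in standard DDPM notation; check it against eqs. 10-13 of the paper being discussed):

```latex
\begin{aligned}
q(x_{t-1} \mid x_t, x_0)
  &= \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
   = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right), \\[4pt]
\tilde{\mu}_t(x_t, x_0)
  &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
   + \frac{\sqrt{\alpha_t}\,\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t .
\end{aligned}
```

Multiplying the two Gaussian densities and completing the square in x_{t-1} is exactly the "normalization term is ignored" step: everything that does not depend on x_{t-1} folds into the normalizing constant.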
@shynie4986 1 year ago
What's the purpose of the covariance matrix (or covariance), and why is it important to us?
@donfeto7636 2 years ago
After the updates, the paper became so complex to read, with all the math
@oleksandrskurzhanskyi2233 2 years ago
And now these models are used in DALL·E 2
@akashchadha6388 3 years ago
Schmiduber enters chat.
@XX-vu5jo 3 years ago
The problem with these solutions is their computing cost. I think they should focus more on that instead; they also rely too much on data.
@amansinghal5908 11 months ago
Why even do it when you'd do it in such a hand-wavy manner?
@fast_harmonic_psychedelic 3 years ago
Why don't they just use CLIP as a classifier? Does nobody know about this? lol
@twobob 2 years ago
Not too long
@XX-vu5jo 3 years ago
You can’t explain the equations. 🙄
@cedricvillani8502 2 years ago
The better you are detecting bullshit, the better you are at creating bullshit😂 none of my work would ever be public facing until I was sure I could always identify it and manipulate it and I’m sure that’s true for any company or skilled researcher. ❤😢
@PeterIsza 3 years ago
Video starts at 4:28.
@herp_derpingson 3 years ago
Video ends at 54:33
@DanFrederiksen 3 years ago
This seemed much too long. For instance, you don't need to belabor the notion of denoising for minutes; noise reduction should be in people's vocabulary at this level. I'd suggest going directly to what diffusion models are and preparing succinct explanations instead of just going for an hour.
@nahakuma 3 years ago
Or you could simply skip the parts you already understand ;)
@frankd1156 3 years ago
This is free knowledge... so try to criticize nicely, or move on to another resource
@banknote501 3 years ago
Ok, if it is so easy, just do a video yourself. We need videos about AI topics for viewers of all skill levels.
@mgostIH 3 years ago
This seems like fair criticism, I don't see why they are being hostile with you
@DanFrederiksen 3 years ago
@@mgostIH I understand that some feel defensive, but it wasn't meant as an attack but as an empowering observation. Communication is vastly more potent the more concise and clear it is.