Stable/Latent Diffusion - High-Resolution Image Synthesis with Latent Diffusion Models Explained 

Gabriel Mongaras

Paper found here: arxiv.org/abs/2112.10752

Published: 28 Apr 2023

Comments: 30
@jahovajenkins5947 · 9 days ago
Huge thanks dude, this explanation has saved me much agony.
@gonzalorubio1154 · 9 months ago
I love how they combined VAE, GANs (adversarial loss) and diffusion models
@thecheekychinaman6713 · 11 months ago
This is the only detailed, non-hype based walkthrough of how SD works, thanks. Especially for explaining the math.
@aesadugur · 9 months ago
Extremely underrated video. Thanks so much for all the explanations!
@acasualviewer5861 · 8 months ago
This was a really good video. It really helped me understand this diffusion concept that I didn't know about. Your videos are underrated, but I have no doubt they will gain traction over time.
@claudeclaude9991 · 3 months ago
Congratulations on the video. I've always had doubts about whether Stable Diffusion is the same thing as Latent Diffusion. Now with your explanation I understand that they are the same thing.
@lzh00 · 10 months ago
Thank you so much Gabriel! I wanted to understand the intuition behind Latent Diffusion, and watching your video saved me tons of time from actually reading through the paper.
@hiepphamduc8789 · 1 year ago
These videos are criminally underrated! This one, ViT, Attention and LoRA have helped me so much with my learning! As a CompSci student majoring in AI, going from lectures and books to reading, understanding and implementing the actual papers is a big leap, and you've made that leap a lot simpler and more digestible. Thank you so much, please never stop this series!
@gabrielmongaras · 1 year ago
Thank you! Glad you're finding my videos helpful. Hopping from lectures to papers is difficult as the style is very different and I'm happy to see my videos are helping with that. I'm also learning as I create each video, so not planning on stopping this series any time soon!
@AI_For_Scientists · 4 months ago
Really enjoyed every minute. You've got a new subscriber.
@Anonymou377 · 1 year ago
Thank you for the video. It is the first video I found about training the AE in LDMs, and I think this is the hardest part of the whole model to understand; your explanation made it very easy to follow. One thing I would like to add is that the AE in the paper is based on VQ-VAE, so L_rec uses perceptual loss and L_adv is a patch-based adversarial objective. Anyway, I hope you will continue to work on this series!
@denistimonin · 3 months ago
Thanks for the explanation, man! Amazing video!
@ahamuffin4747 · 2 months ago
Hello, thanks for the explanations! Just a few words on the Greek letters: it's "psi", not "phi", here, and the "rho"_theta you mention is actually a "tau".
@alexxiang5899 · 9 months ago
Fantastic videos! Any plan for the recently published ControlNet?
@hunterli7791 · 5 months ago
Does the text embedding serve as a label in this model? For example, I put in a pic of a penguin and describe it as "penguin". The model learns to match the pic and the text and reduce the loss.
@prateekpani9464 · 8 months ago
Hey, what's the setup you are using to write and see the paper on split-screen?
@gabrielsamberg461 · 3 months ago
Thank you very much for an amazing explanation! One question though: at minute 22:00, when explaining the autoencoder loss function, you apply the log function to the output of the discriminator. Isn't that a bit problematic, since log is not defined when the discriminator predicts "fake" (zero)?
@gabrielmongaras · 3 months ago
You're right! When the discriminator predicts "fake" (zero), then this value would be undefined. However, it's very unlikely for the discriminator to pick an extreme of 100% real or 100% fake. In the case of the discriminator predicting 100% fake (maybe due to quantization rounding), resulting in the log of 0, we can just add a small value such as 1e-5 to prevent the value from being undefined.
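The workaround described in this reply can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: `stable_log` and the `1e-5` floor are just one way to implement the small-value fix mentioned above.

```python
import math

def stable_log(p, eps=1e-5):
    # Clamp the probability away from 0 so log() stays finite
    # even when the discriminator outputs an extreme of exactly 0 or 1.
    return math.log(max(p, eps))

# A GAN-style discriminator term: log D(real) + log(1 - D(fake)).
# These extremes would make the raw log blow up to -inf.
d_real, d_fake = 0.0, 1.0
loss = -(stable_log(d_real) + stable_log(1.0 - d_fake))
print(loss)  # finite, instead of inf
```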
@gabrielsamberg461 · 3 months ago
Amazing, thank you very much!🙏🏻
@NadavBenedek · 7 months ago
Good video, but it didn't explain how the cross-attention output is actually used in the U-Net.
@tushargarg8378 · 1 year ago
How do you know the "epsilon"? Meaning, for the second training step, where you are doing MSE(noise, predicted noise), how do you know the "noise" beforehand? Is it coming through a function "P"? Also, what does it mean to train the diffusion layers? Are diffusion layers also like convolutions?
@gabrielmongaras · 1 year ago
Our noisy data, x_t, is computed by a function like (simplifying) x_t = t*x_0 + (1-t)*ε. For the ε value, we sample from a Gaussian which is the same size as our image. If our image is 3x256x256, we sample 3*256*256 values from a Gaussian. This is the tensor that is both added to our data to create x_t and the tensor being compared in the MSE loss. For each training step and for each image, we resample the epsilon value so the model learns to model the Gaussian prior distribution. By "train the diffusion layers", we just mean train a U-Net. How you construct the U-Net is fairly arbitrary, but generally it's made of ResNet blocks with cross-attention conditioning. Hope this helps!
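The training step described in this reply can be sketched with NumPy. Shapes are shrunk for the example, the linear interpolation `x_t = t*x0 + (1-t)*eps` is the simplified schedule from the reply (not an exact DDPM noise schedule), and `eps_pred` is a dummy stand-in for the U-Net's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 3x8x8 instead of 3x256x256.
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))

# Fresh Gaussian noise of the same shape, resampled every training step.
eps = rng.standard_normal(x0.shape)

t = 0.7  # interpolation coefficient for this step (simplified schedule)
x_t = t * x0 + (1.0 - t) * eps  # noisy input fed to the U-Net

# A real model would predict eps from (x_t, t); a zero tensor stands in
# here just to show what the MSE compares against.
eps_pred = np.zeros_like(eps)
mse = np.mean((eps - eps_pred) ** 2)
print(x_t.shape, float(mse))
```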
@user-mw7co5zl1s · 1 year ago
Thank you, that makes sense.
@namidasora2357 · 1 year ago
I love your videos, I've been following you since the girlfriend video. Can you please explain RWKV models?
@gabrielmongaras · 1 year ago
Glad you're enjoying the videos! I haven't heard of RWKV models before, but they look pretty interesting. I'm going to take a look at them and I think I'll be able to create a video on them!
@gabrielmongaras · 1 year ago
I was working on reading the code and formulating a video. I didn't realize they published a paper recently! Yannic made a video on it which is much better than what I could do. Check it out here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-x8pW19wKfXQ.html
@user-ox6xd7kv7y · 7 months ago
Really nice video. I have a doubt about the latent loss: 1) Are they treating the encoder-decoder output as fake and the real input as real? 2) If that assumption is true, then they need a plus sign on the 2nd term of the loss function (i.e., the output of the discriminator with the encoder-decoder output as input).
@anoubhav · 10 months ago
Why is it better to predict the noise instead of the denoised image directly in the U-Net? Thanks for your videos.
@gabrielmongaras · 10 months ago
You can actually have the model predict either the data or noise and it mathematically doesn't matter. Since the data x_t is an interpolation at any time t, and we know the interpolating function, we can equivalently get the predicted data if the model predicts the noise. Same thing with the model predicting the data instead, we can still extract the predicted noise. In practice, the model has to learn a different function, so a model that predicts noise may somehow do better than a model that predicts data. All the papers I've seen have the model predict the noise rather than the data, so I'm guessing that's the better way to go, or maybe it doesn't really matter. Would be something to experiment with if implementing your own DM.
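The interchangeability described in this reply follows directly from the simplified interpolation used earlier in the thread (`x_t = t*x_0 + (1-t)*ε`): given `x_t` and either prediction, the other is recovered algebraically. A quick numerical check under that simplified schedule:

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.standard_normal((4, 4))   # "data"
eps = rng.standard_normal((4, 4))  # "noise"
t = 0.6

x_t = t * x0 + (1.0 - t) * eps  # simplified interpolation from the thread

# If the model predicts the noise, the data estimate follows algebraically:
x0_from_eps = (x_t - (1.0 - t) * eps) / t
# And vice versa: a data prediction determines the implied noise.
eps_from_x0 = (x_t - t * x0) / (1.0 - t)

print(np.allclose(x0_from_eps, x0), np.allclose(eps_from_x0, eps))  # True True
```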
@jstamps6538 · 9 months ago
Amazing video, very accurate explanation. But you really have to get your Greek letters straight, bro.