This is one of the most intuitive lectures on VAEs that I have come across recently; the visualizations were really great and conveyed concisely what needed to be shown visually. Thanks a lot!!!
This was an intuitive explanation yet grounded in math. Such a delicate balance! Thanks for doing this! Also, @24:48 I agree the bubble-of-bubbles is indeed cute. 😄
Thank you for the excellent class. I am curious why there is such a significant difference in the distribution between epoch 5 and epoch 10 in the last Jupyter notebook cell (at 43:40 in the video). Is this variation a result of the t-SNE or of the VAE training?
Wow, this is probably the best lecture for understanding VAEs to a very good extent. A little more about the derivation of the loss function and backpropagation would have been fantastic. Thanks a lot for this lecture series!
I'm glad it was helpful. Every other VAE explanation out there is so confusing for me, so I tried to explain it in simpler terms, without neglecting too much.
@@rakshithv5073 that was a question on the exam. I recommend trying it out yourself; you won't gain much understanding by just seeing it. Moreover, there are two approaches: one involves integrals, the other manipulates implicit operators. Let me know if you need more suggestions to get started.
Hi, my doubt was specific to the expansion of the relative entropy between z and N(0, 1), which leads to the penalty term. As you suggested, I gave it a try, but I'm stuck at a point; can you give me a hint to move forward? drive.google.com/file/d/1p8MWL9B60h6nTuRNkGo6ZHTPOpNvWhkp/view?usp=drivesdk
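For reference, one possible starting point (a sketch, assuming a single latent dimension with the encoder outputting mean μ and standard deviation σ, and the standard normal prior used in the lecture):

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\big)
= \int \mathcal{N}(z;\mu,\sigma^2)\,\log\frac{\mathcal{N}(z;\mu,\sigma^2)}{\mathcal{N}(z;0,1)}\,\mathrm{d}z
= \mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}\!\left[\frac{z^2}{2}-\frac{(z-\mu)^2}{2\sigma^2}-\log\sigma\right].
$$

From here, using $\mathbb{E}[z^2]=\mu^2+\sigma^2$ and $\mathbb{E}[(z-\mu)^2]=\sigma^2$ collapses the expectation into the penalty term discussed in the lecture, and summing over the independent latent dimensions gives the full term.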
I loved the way you are using the concepts of Linear Algebra (@19:32) - in the end it's all vectors and transformations :) You are a great mentor! Note that I did not say "coach", because you are equipping each of us with the skills that can solve most problems, not just one :) Huge fan of your lectures & advice :)
Glad you like it! I generally have to "fight 👊🏻 y'all" at the beginning. But if you trust me, then we can move forward together. 🤗 I'm simply trying to "rotate 🔄" you such that you can view things from my perspective. 😋
@@alfcnz Haha, that's true and you are succeeding at it :) Can I use this model on my own image dataset? If yes, how? I really want to see unique images being generated from the images around me. I saw the training data being loaded from the torch library. Is there a way to use my own data?
@@songsbyharsha of course you can use your images. Check out the ImageFolder data set provided by TorchVision. pytorch.org/docs/stable/torchvision/datasets.html#imagefolder
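A minimal sketch of how that could look (the folder path and the 28×28 grayscale resizing are placeholder assumptions, chosen to match the MNIST-shaped model from the notebook):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# one reasonable preprocessing pipeline; adjust the size / channels to your model
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(),   # remove this line if your network expects RGB
    transforms.ToTensor(),    # scales pixels to [0, 1], as BCE expects
])

# ImageFolder expects root/class_a/xxx.png, root/class_b/yyy.png, ...
dataset = datasets.ImageFolder(root='path/to/your/images', transform=transform)
loader = DataLoader(dataset, batch_size=256, shuffle=True)
```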
@@alfcnz This feeling is priceless. I just used my own image dataset and could see stunning results. Machine learning is love! I can't thank you enough!!! I have just one doubt, though it might be trivial: the images I have are RGB, but when I read them into PyTorch, why are they turning into lemonish and dark grey shades? Am I reading them wrong? I observed this with the MNIST dataset you showed us as well. Please please let me know :)
Hey Alfredo, awesome lecture! I needed to brush up on my VAE knowledge. Combining the paper with the online resources is the way to go, since the original paper is super abstract. It would probably be beneficial to note that we don't have to deal with Gaussians in the general case. The original paper just gave some examples using Gaussian and Bernoulli distributions but mentioned many other possibilities. I'm afraid people will start "overfitting" to that particular form of the KL divergence that appears when dealing with Gaussian priors - although I am strongly in favour of giving a concrete example as you did! Just noting the abstraction (so the opposite approach to how the paper introduced it). Secondly, did you do any simple ablation studies on how computing logvar, instead of directly regressing the s.d., impacts performance (properties of the latent space and the reconstruction loss)?
This was a great explanation; however, I don't understand why at 52:00 we don't want to do the reparameterization trick during testing and only return mu. I would assume we would always want to sample from the latent distribution before passing it to the decoder? Making the encoder give a deterministic output (just mu) during testing would defeat the purpose of variational autoencoders, right?
This is honestly the best explanation of VAEs. So what I understand is: instead of just training one latent distribution z as in an AE (which has a mean and variance), we are training two parameters E and V which represent features of the images (digit 1 will have its own E and V, and digit 2's will be different), and using e (epsilon), we are able to interpolate different shapes of a digit. Did I get it right?
I'm glad you liked the video. I *highly* recommend checking out the 2021 version ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-PpcN-F7ovK0.html Back to your question: we don't have labels… so we don't have a specific distribution per digit. Nevertheless, the network will indeed have a continuous set of distributions per given input. For example, there will be a smooth probability family modelling anything that goes from a 1 to a 7 (since they look _very_ similar, especially if handwritten). The decoder will do its best to try to reproduce the input, while the latent space is packed with internal representations / codes of the input data, which live on a manifold embedded in ℝⁿ.
Thank you for your lecture. May I ask, in the reparameterise function, in eps = std.data.new(std.size()).normal_(), why do we need the .data? If we only want noise of std's size, why would you copy the data from std before overriding it?
The behaviour is not clear and may depend on the version of PyTorch itself. If you access the data, then you know for sure you won't mess with the gradients. Anyhow, I think the recommended way now is to pass std.type() to torch.rand's dtype. This didn't exist when I created the notebook.
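For anyone landing here later, a sketch of how that sampling could be written with current PyTorch, using torch.randn_like so the noise automatically matches std's shape, dtype and device (the training/evaluation split mirrors what's discussed above; this is not the exact notebook code):

```python
import torch

def reparameterise(self, mu, logvar):
    if self.training:
        std = torch.exp(0.5 * logvar)   # logvar = log(sigma^2), so this is sigma
        eps = torch.randn_like(std)     # noise with std's shape, dtype and device
        return mu + eps * std
    # at evaluation time, skip sampling and return the mean
    return mu
```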
You have given a great lecture, thanks for that. I have a question: if my dataset has numerical features (like a housing-price dataset), then how should I construct the loss and the network?
Assume that my batch size is 256 and the latent space dimension is 2, so the size of the learnt mu is [256, 2]. Why is the model learning 256×2 different mu when all my training examples belong to the same distribution? I understand that it learns 2 different mu because my latent space dimension is 2 and each mu corresponds to one of the learnt features, but why 256?
Hi Alfredo, amazing lecture. One question: for an image-retrieval task, should I go for the feature space extracted from VGG or the latent space trained with a variational autoencoder?
@@alfcnz Thanks for the suggestion. One last question: linear interpolation between two images in the input (image) domain does not work well, as it produces a cross-dissolve effect between the intensities of the two images. Can you suggest an interpolation approach that doesn't have this issue?
Alfredo, this is amazing! I've been trying to learn about VAEs on my own and this is by far the best lecture and implementation I have found. One question: in your definition of the loss function, you have the β term as in the paper from DeepMind. In the code, is your β defined as 0.5 when computing the KLD in loss_function()? Thank you for the epic tutorial!!
Yeah, I've already replied there. «Hi u/yupyupbrain, thanks for asking. Indeed we should have had `return BCE + β * KLD`, and added `β=1` in the function definition. Feel free to send a pull request on GitHub with such correction. Yes, cross-validation is how hyperparameters are selected.» Next time, plz include the time stamp, so that I can easily see what you're talking about here.
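Concretely, the corrected function could look something like this (a sketch with β exposed as an argument; the KLD line is the standard closed form for a diagonal Gaussian versus N(0, I), and the 784 assumes flattened MNIST images as in the notebook):

```python
import torch
import torch.nn.functional as F

def loss_function(x_hat, x, mu, logvar, β=1):
    # reconstruction term, summed over pixels and over the batch
    BCE = F.binary_cross_entropy(x_hat, x.view(-1, 784), reduction='sum')
    # KL divergence between N(mu, exp(logvar)) and N(0, I)
    KLD = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))
    return BCE + β * KLD
```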
Great, just like the previous lecture in the series. I must say that your visualization skills are just superb! The way you elucidated the reconstruction loss is really elegant and intuitive at the same time. I have a minor question: was it necessary to use nn.Sigmoid() at the end? I mean, you could have used BCEWithLogitsLoss and KLD, which uses the log-sum-exp trick and is said to be more numerically stable than using a sigmoid followed by BCE + KLD. I understand that the image pixels were in (0, 1), but they could be rescaled, right?
I'm glad you've enjoyed the lecture. I believe you're correct. Let's continue this discussion on an issue / PR on GitHub. As I keep saying, I'm always up to improving the material I've been making! Thanks for your feedback! ❤️
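For anyone who wants to try that variant: only the reconstruction term changes with respect to the loss sketched above, assuming the final nn.Sigmoid() is dropped so the decoder returns raw logits (again, a sketch rather than the notebook's code):

```python
import torch
import torch.nn.functional as F

def loss_function_logits(logits, x, mu, logvar, β=1):
    # the sigmoid is folded into the loss via the log-sum-exp trick
    BCE = F.binary_cross_entropy_with_logits(logits, x.view(-1, 784), reduction='sum')
    KLD = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))
    return BCE + β * KLD
```

When generating samples, you would then apply torch.sigmoid to the decoder output yourself, since the network no longer does it.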
Since I am new to the AI domain: I have studied that the sigmoid suffers from the vanishing gradient problem, and the same applies to tanh. So I have a doubt: which other activation functions do you prefer to use apart from those two? Do you think we can use ReLU and rescale the pixel values between 0 and 1, or are there other activation functions that I haven't heard of?
These lectures are heaven-sent for those who are looking to upskill! I had a question about the encoder: can the encoded vectors (mu + epsilon * std_dev) for images be used to measure image similarity using cosine distance? Thanks a ton! Cheers!
Hi Alfredo, the bubbles intuition was mind blowing. Thanks a ton for making this so intuitive! One question, can we take any other distributional assumption for the hidden space, like exponential etc.? If yes, how will it affect the hidden representations and the final decoder output?
The only other option I'm aware of is using a categorical one, where the 1 embedding (out of K) closest to the encoded input is sent to the decoder. I'm talking about the VQ (vector quantised) VAE.
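A toy illustration of that quantisation step, just to make the "closest of K embeddings" idea concrete (the codebook size, dimensions and tensors here are made up; a real VQ-VAE also needs a commitment loss and a straight-through gradient estimator):

```python
import torch

K, d = 512, 64                          # hypothetical codebook size and code dimension
codebook = torch.randn(K, d)            # the K learnable embedding vectors
z_e = torch.randn(8, d)                 # encoder outputs for a batch of 8 inputs

distances = torch.cdist(z_e, codebook)  # (8, K) pairwise Euclidean distances
indices = distances.argmin(dim=1)       # index of the nearest embedding per input
z_q = codebook[indices]                 # (8, d) quantised codes passed to the decoder
```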
Superb lecture! By the way, I am curious: could we capture the relationship between z and the output of the decoder? For example, a higher z value producing an image with darker colours. Besides, I think we can put a classifier on top of the encoder's output in an autoencoder, but I'm not sure we can do the same for a variational autoencoder.
Actually, after training, we can attach a classifier instead of the decoder to find out whether the test image fits into the latent neighbourhood distribution or not, and hence classify it as normal or abnormal.
Great video! Thanks a lot! I have a question about the visualization of z as bubbles: as I understand it, z follows a normal distribution with mean E(z) and variance V(z), so the spread of those bubbles is actually "infinite", right (there is no clear separation between "bubbles")? In that case, wouldn't they overlap anyway?
In stats.stackexchange.com/a/60699/228453 there is a −d term, but the formula at atcold.github.io/pytorch-Deep-Learning/en/week08/08-3/ uses −1. Are there any tips on how to get from the general formula for the multivariate Gaussian relative entropy to the more specific one on your site?
We can use sums because the d dimensions of the latent space are independent, since the covariance matrix of z is diagonal. In the image you draw just z1, a single dimension of the latent space, but in fact there should be z1, z2, …, zd dimensions of the latent space. Right?
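Spelling that out as a sketch: for $\Sigma=\operatorname{diag}(\sigma_1^2,\dots,\sigma_d^2)$, the general multivariate formula from the Stack Exchange answer reduces term by term,

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(0,I_d)\big)
= \frac{1}{2}\big(\operatorname{tr}\Sigma+\mu^\top\mu-d-\log\det\Sigma\big)
= \frac{1}{2}\sum_{i=1}^{d}\big(\sigma_i^2+\mu_i^2-1-\log\sigma_i^2\big),
$$

so the single $-d$ of the general formula is just the $-1$ contributed by each of the $d$ independent dimensions, which matches the per-dimension form on the course page.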
@@alfcnz you were able to compress an 80-page paper arxiv.org/pdf/1906.02691.pdf into a few nice diagrams and 30 minutes of theoretical talk that makes sense. Possibly the only thing missing (maybe you have it in some other videos) is connecting the paper's terminology with your work. That would be awesome! Say what the prior distribution would be, what the posterior distribution is, the encoder distribution, the decoder distribution, what that ELBO is. I guess ϵ is the prior, but I am not sure. w are the NN parameters (one part for the decoder and the other for the encoder). In the paper they used φ and θ: q_φ(z|x) for the encoder and p_θ(x|z) for the decoder. A hard paper to swallow. 🙂
@@Учитељица haha, I'm glad you found it useful. Regarding the notation, it's irrelevant. I'm not explaining a given paper. I'm explaining its contributions. Once you understand how it works, the way it's written is irrelevant. I would also highly recommend watching the 2021 edition of this lecture (coming out in two weeks). It's rather different. The version 2022 will have the prior / posterior nomenclature and derivation. As you've already noticed, I did not use anything from the variational inference field. Just explained how the model works.
I do not understand why at 28:04 N(0, I_d) has 0 as E(z). According to the picture, you said L_KL is enforcing z to be in small bubbles with different centres, but a centre must be at some specific point, not at 0. So your KL loss should construct something like in the picture ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-bbOFvxbMIV0.html but actually without bubbles.
The 0-vector is the mean of what we choose as prior, namely a standard normal distribution. The bubbles' centres are attracted to the origin, but the bubbles won't overlap, since the reconstruction error would increase otherwise. Furthermore, a "bubble" is simply a d-dimensional Gaussian whose parameters are given to you by the encoder. You didn't share any picture in your comment, so I'm not sure what you're talking about.