
Evidence Lower Bound (ELBO) - CLEARLY EXPLAINED! 

Kapil Sachdeva
9K subscribers
27K views

This tutorial explains what ELBO is and shows its derivation step by step.
#variationalinference
#kldivergence
#bayesianstatistics

Published: 29 Aug 2024

Comments: 124
@AndreiMargeloiu (3 years ago)
Crystal clear explanation, the world needs more people like you!
@KapilSachdeva (3 years ago)
🙏🙏
@TheProblembaer2 (6 months ago)
Again, thank you. This is incredibly well explained; the small steps and the explanations behind them are pure gold.
@KapilSachdeva (6 months ago)
🙏
@genericperson8238 (2 years ago)
Absolutely beautiful. The explanation is so insanely well thought out and clear.
@KapilSachdeva (2 years ago)
🙏
@sonny1552 (9 months ago)
Best explanation ever! I first found this video while trying to understand VAEs, but I recently realized it is also directly relevant to diffusion models. Thanks for making this video.
@KapilSachdeva (9 months ago)
🙏
@9speedbird (9 months ago)
That was great. I'd been going through paper after paper, and all I needed was this! Thanks!
@KapilSachdeva (9 months ago)
🙏
@T_rex-te3us (10 months ago)
Insane explanation, Mr. Sachdeva! Thank you so much - I wish you all the best in life.
@KapilSachdeva (10 months ago)
🙏
@thatipelli1 (3 years ago)
Thanks, your tutorial cleared my doubts!!
@danmathewsrobin5991 (3 years ago)
Fantastic tutorial!! Hoping to see more similar content. Thank you
@KapilSachdeva (3 years ago)
🙏
@FredocasS (7 months ago)
Thank you so much for this explanation :) Very clear and well explained. I wish you all the best
@KapilSachdeva (7 months ago)
🙏
@bevandenizclgn9282 (5 months ago)
Best explanation I have found so far, thank you!
@vi5hnupradeep (2 years ago)
Thank you so much, sir! I'm glad that I found your video 💯
@KapilSachdeva (2 years ago)
🙏
@AruneshKumarSinghPro (1 year ago)
This one is a masterpiece. Could you please make a video on Hierarchical Variational Autoencoders when you have time? Looking forward to it.
@KapilSachdeva (1 year ago)
🙏
@schrodingerac (3 years ago)
Excellent presentation and explanation. Thank you very much, sir.
@KapilSachdeva (3 years ago)
🙏
@kappa12385 (2 years ago)
Superbly taught, brother. Thoroughly enjoyed it.
@KapilSachdeva (2 years ago)
🙏
@brookestephenson4354 (3 years ago)
Very clear explanation! Thank you very much!
@KapilSachdeva (3 years ago)
Thanks Brooke. Happy that you found it helpful!
@Aruuuq (3 years ago)
Amazing tutorial! Keep up the good work.
@KapilSachdeva (3 years ago)
🙏
@mahayat (3 years ago)
Best and clearest explanation!
@KapilSachdeva (3 years ago)
🙏
@sahhaf1234 (10 months ago)
Good explanation; I can follow the algebra easily. The problem is this: what is known and what is unknown in this formulation? In other words, at 0:26 I think we are trying to find the posterior. But do we know the prior? Do we know the likelihood? Or is it that we do not know them but can sample from them?
@KapilSachdeva (10 months ago)
Good questions, and you have mostly answered them yourself. The prior is what you assume. The likelihood function you need to know (or model). The most difficult part is computing the normalizing constant, which is computationally intractable most of the time.
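To make the reply concrete (a sketch in generic notation, since the video's exact symbols are not quoted in this thread): Bayes' rule combines the three quantities, and the normalizing constant (the evidence) is an integral over the entire latent space,

    p(z \mid x) \;=\; \frac{p(x \mid z)\, p(z)}{p(x)},
    \qquad
    p(x) \;=\; \int p(x \mid z)\, p(z)\, dz ,

which is what becomes intractable for high-dimensional z; maximizing the ELBO is how variational inference avoids computing p(x) directly.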
@HelloWorlds__JTS (11 months ago)
Great explanations! I do have one correction to suggest: at 6:41 you say D_KL is always non-negative, but this can only be true if q is chosen to bound p from above over enough of their overlap (for the given example, i.e. reverse KL).
@KapilSachdeva (11 months ago)
🙏 Correct
@HelloWorlds__JTS (7 months ago)
@@KapilSachdeva I was wrong to make my earlier suggestion, because p and q are probabilities. I can give details if anyone requests them, but it's easy to see using total variation distance or Jensen's inequality.
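For completeness, the standard argument alluded to here (not written out in the thread): for any two densities q and p over the same space, Jensen's inequality applied to the concave logarithm gives

    \begin{aligned}
    D_{\mathrm{KL}}(q \,\|\, p)
      &= \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(z)}\right]
       = -\,\mathbb{E}_{q}\!\left[\log \frac{p(z)}{q(z)}\right] \\
      &\ge -\log \mathbb{E}_{q}\!\left[\frac{p(z)}{q(z)}\right]
       = -\log \int p(z)\, dz
       = -\log 1 = 0 ,
    \end{aligned}

so non-negativity holds with no extra condition on how q is chosen.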
@chethankr3598 (9 months ago)
This is an awesome explanation. Thank you.
@KapilSachdeva (9 months ago)
🙏
@chadsamuelson1808 (2 years ago)
Amazingly clear explanation!
@KapilSachdeva (2 years ago)
🙏
@satadrudas3675 (8 months ago)
Explained very well. Thanks.
@KapilSachdeva (8 months ago)
🙏
@abhinav9058 (2 years ago)
Subscribed, sir. Awesome tutorial; I'm learning about variational autoencoders 😃
@KapilSachdeva (2 years ago)
🙏
@lihuil3115 (2 years ago)
Best explanation ever!
@KapilSachdeva (2 years ago)
🙏
@ziangshi182 (1 year ago)
Fantastic Explanation!
@KapilSachdeva (1 year ago)
🙏
@alexfrangos2402 (1 year ago)
Amazing explanation, thank you so much!
@KapilSachdeva (1 year ago)
🙏
@alfcnz (3 years ago)
Thanks! 😍😍😍
@KapilSachdeva (3 years ago)
🙏
@UdemmyUdemmy (1 year ago)
You are a legend!
@KapilSachdeva (1 year ago)
🙏
@amaramouri9137 (3 years ago)
Very good explanation.
@KapilSachdeva (3 years ago)
🙏
@ajwadakil6892 (1 year ago)
Great explanation. Could you tell me which books or articles I should refer to for further, deeper reading on variational inference, Bayesian statistics, and the underlying probability theory?
@KapilSachdeva (1 year ago)
For Bayesian statistics I would recommend Statistical Rethinking by Richard McElreath (see xcelab.net/rm/ for more information). A good overview of variational inference is the paper "Variational Inference: A Review for Statisticians" by David M. Blei et al. (arxiv.org/abs/1601.00670). For basic/foundational variational inference, PRML is a good source: www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf. There are many books and lecture notes on probability theory; pick any one.
@easter.bunny.6 (2 months ago)
Thanks for the lecture, sir! I have a question about 4:54: how did you expand E[log p_theta(x)] into the integral ∫ q(z|x) log p_theta(x) dz? Thanks!
@kadrimufti4295 (3 months ago)
At the 4:45 mark, how did you expand the third term (an expectation) into its integral form in that way? How is it an "expectation with respect to z" when there is no z, only x?
@peterhall6656 (1 year ago)
Top drawer explanation.
@KapilSachdeva (1 year ago)
🙏
@wolfgangpaier6208 (1 year ago)
Hi, I really appreciate your video tutorial; it's super helpful and easy to understand. I only have one question left. At 10:27 you replaced the conditional distribution q(z|x) with q(z). Is this also true for variational autoencoders? For VAEs, if I understand correctly, q(z) is approximated by a neural network that predicts z from x, so I would expect it to be a conditional distribution where z depends on x.
@KapilSachdeva (1 year ago)
In the case of a VAE it will always be a conditional distribution. Your understanding is correct 🙏
@wolfgangpaier6208 (1 year ago)
@@KapilSachdeva ok. Thanks a lot for the fast response 🙏
@mmattb (1 year ago)
Sorry to bother you again, Kapil - is the integral at 5:05 supposed to have d(z|x) instead of dz? If not, I'm certainly confused, haha.
@KapilSachdeva (1 year ago)
No bother at all. Conceptually you can think of it like that, but I have not seen the differential part of an integral written with the conditional (the pipe), so it is just a notational convention here. Your understanding is correct.
@mmattb (1 year ago)
One more question: at 10:11 I can see that the right-hand term looks like a KL divergence between the two distributions, but I'm confused: what would you integrate over if you expanded it? In the KL formulation, the numerator and denominator of the fraction are typically distributions over the same variable. Is it just an intuition to call this KL, or is it literally a KL divergence? If the latter, would you mind writing out the general formula for KL when the numerator and denominator are distributions over different variables (z|x vs z in this case)?
@KapilSachdeva (1 year ago)
Z|X just means that you got Z given X, but it still remains a (conditional) distribution over Z. Hence your statement about the KL divergence being over the same variable is still valid. Hope this makes sense.
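Spelled out (this formula is implied by the reply rather than quoted from the video): both densities are functions of the same integration variable z, with x only acting as a conditioning variable, so the term expands as an ordinary KL integral over z,

    D_{\mathrm{KL}}\!\big(q(z \mid x)\,\|\,p(z)\big)
      = \int q(z \mid x)\,\log \frac{q(z \mid x)}{p(z)}\,dz
      = \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{p(z)}\right].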
@mmattb (1 year ago)
@@KapilSachdeva ohhhh so both of them are defined over the same domain as Z. That makes sense. Thanks again.
@KapilSachdeva (1 year ago)
🙏
@the_akhash (1 year ago)
Thanks for the explanation!
@KapilSachdeva (1 year ago)
🙏
@riaarora3126 (1 year ago)
Wow, clarity supremacy
@KapilSachdeva (1 year ago)
🙏 😀 “clarity supremacy” …. Good luck with your learnings.
@wadewang574 (1 year ago)
At 4:40, how can we see that the third component is an expectation with respect to z rather than x?
@KapilSachdeva (1 year ago)
Because the KL divergence (which in turn is an expected value) is between p(z|x) and q(z|x). You need a good understanding of KL divergence and expected values to see it.
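In symbols (a restatement of the reply, not a quote from the video): that term is an expectation under q(z|x), i.e. an integral over z, with x held fixed as the conditioning variable,

    D_{\mathrm{KL}}\!\big(q(z \mid x)\,\|\,p(z \mid x)\big)
      = \mathbb{E}_{q(z \mid x)}\!\big[\log q(z \mid x) - \log p(z \mid x)\big]
      = \int q(z \mid x)\,\log \frac{q(z \mid x)}{p(z \mid x)}\,dz .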
@MrArtod (2 years ago)
Best explanation, thx!
@KapilSachdeva (2 years ago)
🙏
@AI_ML_DL_LLM (1 year ago)
It is a great one; it would be even greater if you could start with a simple numerical example.
@KapilSachdeva (1 year ago)
Interesting. Will think about it. 🙏
@mammamiachemale (1 year ago)
I love you, great!!!
@KapilSachdeva (1 year ago)
😄🙏
@anshumansinha5874 (1 year ago)
So we have to maximize the ELBO (at 9:28), right? That would bring it closer to the log-likelihood of the data. 1. Does that mean we should find parameters phi that increase the reconstruction term (since it is the first term)? 2. And find phi such that the second term is minimized, which would mean q_phi(z|x) should be as close as possible to the prior p(z)? But don't we need to minimize the reconstruction error while not straying too far from the assumed prior p(z)? How do we read these conclusions off the equation derived at 9:28?
@KapilSachdeva (1 year ago)
We minimize the "negative" ELBO.
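To make "minimize the negative ELBO" concrete, here is a minimal PyTorch sketch of such a loss (not code from the video; it assumes a Gaussian encoder q_phi(z|x) with outputs mu and log_var, a standard normal prior p(z), and a Bernoulli decoder whose logits for a z sampled via the reparameterization trick are passed in as x_logits; the names negative_elbo, x_logits, mu, log_var are illustrative):

    import torch
    import torch.nn.functional as F

    def negative_elbo(x, x_logits, mu, log_var):
        # Reconstruction term: -E_q[log p_theta(x|z)], estimated with the single
        # z sample that produced x_logits (Bernoulli decoder => BCE with logits).
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # Regularization term: KL( N(mu, diag(exp(log_var))) || N(0, I) ),
        # available in closed form for diagonal Gaussians.
        kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
        # Minimizing (recon + kl) is the same as maximizing the ELBO.
        return recon + kl

The reparameterized sampling of z that produces x_logits happens outside this function; minimizing this quantity over phi and theta trades off the reconstruction term against staying close to the prior, which is exactly the tension the question describes.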
@YT-yt-yt-3 (1 month ago)
@@KapilSachdeva The terminology and signs surrounding KL divergence and the ELBO are what make them seem complex; otherwise it is a simple concept. Is it really a "reconstruction error"? I mean, isn't it the likelihood of observing the data given z that needs to be maximized? Why is it called an error?
@user-or7ji5hv8y (3 years ago)
Just one question: at ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-IXsA5Rpp25w.html, when you expanded log p(x), how did you know to use q(z|x) instead of simply q(x)? Thank you.
@KapilSachdeva (3 years ago)
We are after approximating the posterior p(z|x). We do this approximation using q, a distribution we know how to sample from and whose parameters we intend to find with an optimization procedure. So the distribution q is different from p, but it is still a distribution over (or for) z|x; in other words, it is an "assumed" distribution for z|x. The symbol/notation E_q (sorry, I can't write LaTeX in the comments 😟) means an expectation in which the sampling distribution is q; whatever appears in the subscript of E indicates that probability distribution. Since in this entire tutorial q is a distribution of z given x (i.e. z|x), the notations E_q and E_q(z|x) are the same, i.e. q and q(z|x) are the same. This is why, when it was expanded, it was q(z|x) and not q(x). Watch my video on importance sampling (at least the starting portion, where I clarify the expectation notation and symbols). Here is the link: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ivBtpzHcvpg.html
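In notation (a sketch of the convention described in the reply): for any function f of the latent variable,

    \mathbb{E}_{q(z \mid x)}\!\big[f(z)\big] = \int q(z \mid x)\, f(z)\, dz ,

so the subscript of E names the density used for the averaging over z.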
@ericzhang4486 (3 years ago)
@@KapilSachdeva Does that mean the expectation of log p(x) does not depend on the distribution q, since in the end E_q[log p(x)] becomes log p(x)?
@KapilSachdeva (3 years ago)
@@ericzhang4486 Since log p(x) does not contain any z, it is treated as a constant when the sampling distribution used to compute the expectation is q(z) (or even q(z|x)). This is why the equation simplifies by taking this constant out of the integral. Let me know if this helps you understand it.
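Written out (a short derivation implied by these two replies):

    \mathbb{E}_{q(z \mid x)}\!\big[\log p_\theta(x)\big]
      = \int q(z \mid x)\,\log p_\theta(x)\,dz
      = \log p_\theta(x)\int q(z \mid x)\,dz
      = \log p_\theta(x),

since log p_theta(x) contains no z and any density integrates to 1 over z.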
@ericzhang4486 (3 years ago)
@@KapilSachdeva It makes perfect sense. Thank you so much!
@ericzhang4486 (3 years ago)
I came to your video from Equation 1 in the DALL-E paper (arxiv.org/pdf/2102.12092.pdf). If possible, could you give me a little enlightenment on how the ELBO is derived in that case? Feel free to skip this if you don't have time. Thank you!
@anshumansinha5874 (1 year ago)
Why is the first term a reconstruction error? I mean, we are getting back x from the latent variable z, but shouldn't reconstruction be something like x - x', i.e. the initial x versus the final x obtained from p(x|z)? Also, how should I read that expression? E_q[log p(x|z)] = ∫ q(z|x) log p(x|z) dz, i.e. we average the function log p(x|z) with weight q(z|x); what does that mean in the context of a VAE?
@KapilSachdeva (1 year ago)
> Why is the first term a reconstruction error?
One way to see the error in reconstruction is x - x', i.e. the difference or the square of the difference; this is what you are familiar with. Another way is to see it in terms of "likelihood"; that type of objective function is called maximum likelihood estimation (read up on MLE if you are not familiar with it). In other words, what you have is simply another objective/loss function to maximize/minimize. That said, you can indeed replace E[log p(x|z)] with the MSE; it is done in quite a few implementations, and I talk about it in the VAE tutorial as well.
> What does that mean in the sense of a VAE?
For that you will want to watch the VAE tutorial, where I explain why we need to do this. If it is not clear from that tutorial, ask the question in the comments of that video.
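As an illustration of the MSE connection mentioned in the reply (a sketch under the common assumption, not stated in this thread, of a Gaussian decoder with mean x_hat_theta(z) and fixed variance sigma^2):

    \log p_\theta(x \mid z)
      = \log \mathcal{N}\!\big(x;\, \hat{x}_\theta(z),\, \sigma^2 I\big)
      = -\frac{\lVert x - \hat{x}_\theta(z)\rVert^{2}}{2\sigma^{2}} + \text{const},

so maximizing E_q[log p_theta(x|z)] amounts to minimizing the expected squared reconstruction error between x and its reconstruction x_hat_theta(z).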
@heshamali5208 (2 years ago)
At minute 9:20, how is it log p(z|x) / p(z)? It was an addition; shouldn't it be log p(z|x) * p(z)? Please correct me, sir. Thanks.
@KapilSachdeva (2 years ago)
Hello Hesham, I do not see the expression "log p(z|x)/p(z)" anywhere in the tutorial. Could you check again which screen is causing the confusion? You may have a typo in the above comment.
@heshamali5208 (2 years ago)
@@KapilSachdeva Thanks for your kind reply, sir. I mean the third line at minute 9:22: we moved from E_q[log q(z|x)] + E_q[log p(z)] to E_q[log q(z|x) / p(z)], and I don't see why it is a division and not a multiplication, since it was an addition before taking a common log.
@KapilSachdeva (2 years ago)
@@heshamali5208 Here is how you should see it; I did not show one intermediate step, hence your confusion. Looking only at the last two terms of the equation:
-E[log q(z|x)] + E[log p(z)]
= -E[log q(z|x) - log p(z)]   (taking the expectation out, since it is common to both)
= -E[log q(z|x) / p(z)]
Hope this clarifies it now.
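The same steps in equation form (identifying the result with the KL term is standard and is assumed here rather than quoted from the video):

    -\mathbb{E}_q\!\big[\log q(z \mid x)\big] + \mathbb{E}_q\!\big[\log p(z)\big]
      = -\mathbb{E}_q\!\left[\log \frac{q(z \mid x)}{p(z)}\right]
      = -\,D_{\mathrm{KL}}\!\big(q(z \mid x)\,\|\,p(z)\big),

i.e. the combined term is exactly the (negative) regularization term of the ELBO.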
@heshamali5208 (2 years ago)
@@KapilSachdeva OK, thanks sir. It is clear now.
@Pruthvikajaykumar (2 years ago)
Thank you so much
@KapilSachdeva (2 years ago)
🙏
@Maciek17PL (1 year ago)
What is the log p_theta(x) (shown in blue) at 5:40? Is it a pdf or a single number?
@KapilSachdeva (1 year ago)
It would be a density, but when used for optimization you would get a scalar value for a given batch of samples.
@mikhaildoroshenko2169 (2 years ago)
Can we choose the prior distribution of z any way we want, or do we have to estimate it somehow?
@KapilSachdeva (2 years ago)
In Bayesian statistics, choosing/selecting the prior is one of the challenging aspects. The prior distribution can be chosen based on your domain knowledge (when you have small datasets) or estimated from the data itself (when your dataset is large). The method of "estimating" the prior from data is called "Empirical Bayes" (en.wikipedia.org/wiki/Empirical_Bayes_method). There are a few modern research papers that try to "learn" the prior as an additional step in a VAE.
@RAP4EVERMRC96 (1 year ago)
At 4:33, why is it + (plus) the expected value of log p(x) rather than - (minus)?
@RAP4EVERMRC96 (1 year ago)
Never mind, got it.
@KapilSachdeva (1 year ago)
🙏
@yongen5398 (2 years ago)
Haha, that "I have cheated you" at 7:36.
@KapilSachdeva (2 years ago)
😀
@heshamali5208 (2 years ago)
Why, when the first component is maximized, does the second component get minimized automatically?
@KapilSachdeva (2 years ago)
Let's say fixed_amount = a + b. If a increases, then b must decrease in order to respect the equation. The log evidence is fixed: it is the total probability after taking into consideration all parameters and hidden variables. As the tutorial shows, it consists of two components, so if you maximize one component the other must decrease.
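In equation form (a restatement of the reply, using the two-component decomposition the tutorial shows):

    \log p_\theta(x)
      = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\!\big(q(z \mid x)\,\|\,p_\theta(z \mid x)\big),

and since the left-hand side does not depend on q, increasing the ELBO necessarily decreases the KL term by the same amount.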
@heshamali5208 (2 years ago)
@@KapilSachdeva Thanks, sir. My last question is: how do I computationally calculate KL(Q(Z) || P(Z))? How do I know P(Z), when all I can get is the latent variable Z, which in my understanding comes from Q(Z)? How do I make sure the predicted distribution of Z is as close as possible to the actual distribution of Z? I now know how to get P(X|Z); my question is how to calculate the regularization term.
@KapilSachdeva (2 years ago)
I explain this in the tutorial on the variational autoencoder: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-h9kWaQQloPk.html
@heshamali5208 (2 years ago)
@@KapilSachdeva Thanks, sir, for your fast reply.
@anshumansinha5874 (1 year ago)
Why do we even want to know the posterior p(z|x)? I think you could start with that.
@KapilSachdeva (1 year ago)
For that, watch the "Towards Bayesian Regression" series on my channel.
@anshumansinha5874 (1 year ago)
@@KapilSachdeva Oh great, that will be a lot of help! And a great video series!
@UdemmyUdemmy (1 year ago)
This one video is worth a million gold particles.
@KapilSachdeva (1 year ago)
🙏
@NadavBenedek (7 months ago)
Not clear enough. In the first minute you say "intractable", but you need to give an example of why this is intractable and why other terms are not. Also, explain why the denominator is intractable while the numerator is not.