You know that feeling when you find a criminally overlooked channel and you’re about to get in on the ground level of something that’s gonna blow up? This is you now
That is quite nice of you - thank you! I hope you're right, but for now, I'm working on my patience. It can take quite a while to get noticed on YouTube. I'm trying to keep my expectations realistic.
I can't get over how much fun you make learning about stats, ML, and information theory---not to mention that you teach it with skill like Feynman's and a style that is all your own.
That's quite a compliment - Feynman is a total inspiration for many, myself included. His energy about the topics makes you *want* to learn about them.
In the recent 3b1b video about the normal distribution, there's a mention that the normal distribution maximizes entropy. Then immediately I saw it here in your video, showing the normal distribution as the one that maximizes entropy while constraining the mean and variance, which are the only two parameters of the normal distribution. That is very nice.
It blew my mind that those famous distributions arise naturally as the ones that give maximum entropy when we set the domain and constraints in a general way. Now I kind of know why they're special.
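The normal-maximizes-entropy claim is easy to sanity-check numerically. A minimal sketch (the variance value and the comparison distributions are my own choices): compare closed-form differential entropies of a few distributions all matched to the same variance, and the normal comes out on top.

```python
import math

# Closed-form differential entropies (natural log), all with variance s^2:
#   normal:  0.5 * ln(2*pi*e*s^2)
#   uniform: ln(width),   width = s*sqrt(12)
#   laplace: 1 + ln(2*b), b = s/sqrt(2)
s = 1.3
h_normal = 0.5 * math.log(2 * math.pi * math.e * s**2)
h_uniform = math.log(s * math.sqrt(12))
h_laplace = 1 + math.log(2 * (s / math.sqrt(2)))

print(h_normal, h_uniform, h_laplace)  # the normal is the largest
```

The gap is the same for every s, since each entropy is a constant plus ln(s).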
Definitely very cool. In many cases there are also other fascinating characterizations. For example: assume a continuous distribution on positive support with the memoryless property --> solve the differential equation --> find that it MUST be the exponential
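That memoryless characterization can be checked directly from the exponential's survival function; a tiny sketch (the rate and time values are arbitrary):

```python
import math

# For X ~ Exponential(rate), the survival function is S(x) = exp(-rate * x).
# Memorylessness says P(X > s + t | X > s) = P(X > t).
rate = 1.7

def survival(x):
    return math.exp(-rate * x)

s, t = 0.8, 2.3
lhs = survival(s + t) / survival(s)  # P(X > s+t | X > s)
rhs = survival(t)                    # P(X > t)
print(lhs, rhs)  # identical up to float error
```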
woah this is such a good explanation. I just randomly discovered this channel but I'm sure it's bound to blow up. Just a bit of critique: Idk if this is only meant for college students but if you want to get a slightly broader audience you could focus a bit more on giving intuition for the concepts.
Thank you very much! Yea the level of technical background I expect of the audience is an important question. I’m partial to keeping it technical. I think it’s OK not to appeal to everyone. My audience will just be technical and small :)
@@Mutual_Information ++ with Alberto le Fisch : "Prof. Feynman... "If you cannot explain something in simple terms, you don't understand it." And I am sure you understand, so please explain to us! :)
His channel already puts lots of effort into doing exactly that. The final bit of the video, explaining the FUNDAMENTAL difference between discrete and continuous, the breakage of "label invariance", just blew my mind. Seriously, some of the best intuition I've ever received about something.
@@zorro77777 well, he also said something like "If I could summarise my work in a sentence it wouldn't be worth a Nobel prize". Which means that although something can be simplified, this doesn't mean it won't take a long time to explain. It's not that easy and he is not your employee you know. Also in that quote he meant being able to explain to physics undergrads, who you would expect to have some knowledge already.
This channel is so amazing. I had a fuzzy understanding of a lot of these concepts, but this clarifies it. For example, my intuition suggested that for a given mean and variance, the maximum entropy estimate would be a beta-binomial distribution, but I wasn't really able to prove it to myself (7:00).
Thank you! Always means a lot. And yea, now that I've covered the Fisher Information, I can hit that one soon. Appreciate the suggestion - it's on the list!
@@Mutual_Information I do believe (and I hope this is the case) that we are going through a boom in science channels right now. It seems that the youtube algorithm is identifying that sub-audience that loves this content and recommending these types of channels to them sooner and sooner. So I really hope it happens to you very soon!
I think it would be up to the parametrization to care about area or side length depending on the problem case in that example. I'd like my tools to do their own distilled thing in small, predictable, usable pieces.
I don't know anything about set shaping theory, so.. maybe! Whatever it is, I think it could only *extend* information theory. I believe the core of information theory is very much settled.
Btw i dont know if Mutual Information is a good channel name. The term is pretty stacked and I can't just say "do you know Mutual Information" like I can say "do you know 3blue1brown"... It also makes it harder to find your channel, because if someone looks up mutual information on youtube you wont show up at the top. Maybe thats your strategy though, to have people find your channel when they search for Mutual Information on youtube ;) Anyways, im sure you have thought about this, but thats my take.
I fear you may be correct. I've heard a few people say they tried to find my channel but couldn't when they searched. But part of me thinks I've gone too far. There's actually quite a bit of work I'd have to do to make a title change, and if the cost is my channel is a little bit more hidden, I think that's OK. Weirdly, I'm kinda enjoying the small channel stage (being a bit presumptuous that I'll eventually not be in this stage :) ). It's less pressure, gives me time to really nail the feedback and it's easier to have 2-way communication with the viewers. Don't get me wrong, I'd like to grow the channel, but I'm OK with leaving some growth hacks on the table. That said, I'm not totally set on "Mutual Information." I'd like to feel it out a bit more. As always, appreciate the feedback!
Your videos are great! I am curious about the connection between maximum entropy and Bayesian inference. They seem related. Let's think about Bayesian inference in a variational way, where you minimize a KL divergence between the approximate and true posterior, KL(q(z)||p(z|x)), where z is e.g. the vector of all our unknown digits and x the digit mean. Minimizing this KL divergence is equivalent to maximizing the sum of (1) H(q(z)), an entropy maximization objective, (2) -CE(q(z),p(z)), a negative cross-entropy term with the prior distribution p(z); this term is constant for a uniform prior, and (3) E_q(z) log p(x|z), a likelihood term that produces constraints. In our digits case, p(x|z) is deterministically 1 if the condition x=mean(z) is fulfilled and 0 otherwise. All z with log p(x|z) = log 0 = -infty must be given a probability of 0 by q(z) to avoid the objective reaching negative infinity. On the other hand, once this constraint is fulfilled, all remaining choices of q attain E_q(z) log p(x|z) = E_q(z) log 1 = 0, so the entropy term gets to decide among them.
further, if we choose the digits to be i.i.d. ~ q(z_1) (z_1 being the first digit), as the number of digits N goes to infinity, the empirical mean, mean(z), will converge almost surely to the mean of q(z_1), so in the limit, we can put the constraint on the mean of q(z_1) instead of the empirical mean, as done by the maximum entropy principle. Digits being i.i.d. should be an unproblematic restriction due to symmetry (and due to entropy maximization).
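Spelling out the decomposition described above (same symbols, natural logs):

```latex
\begin{align}
-\mathrm{KL}\big(q(z)\,\|\,p(z|x)\big)
&= \mathbb{E}_{q(z)}\big[\log p(z|x) - \log q(z)\big] \\
&= H\big(q(z)\big) - \mathrm{CE}\big(q(z), p(z)\big)
   + \mathbb{E}_{q(z)}\log p(x|z) - \log p(x)
\end{align}
```

Since log p(x) does not depend on q, minimizing the KL is the same as maximizing the entropy term, the negative cross-entropy with the prior, and the expected log-likelihood.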
Wow, yes, you dove right into a big topic. Variational inference is a big way we get around some of the intractability naive Bayesian stats can yield. You seem to know it well - thanks for all the details!
Great video! Thanks a lot. A little feedback: The example you give in 12:00-13:00 is a bit hard to follow without visualization. The blackboard and simulations you use are very helpful in general. It would be great if you do not leave that section still and only talk. Even some bullet points would be nice.
Thanks - useful, specific feedback is in short supply, so this is very much appreciated. I count yours as a "keep things motivated and visual"-type of feedback, which is something I'm actively working on (but not always great about). Anyway, it's a work in progress and hopefully you'll see the differences in upcoming videos. Thanks again!
The invariance bit is something that I really didn't explore well. It's something I only realized while I was researching the video. The way I would think about it is.. the motivating argument for max entropy doesn't apply over the continuous domain b/c you can't enumerate "all possible sequences of random samples".. so if you use the max entropy approach in the continuous domain anyway.. you are doing something which imports a hidden assumption you don't realize. Something like.. minimize the KL-divergence from some reference distribution.. idk.. something weird. As you can tell, I think it's OK to not understand the invariance bit :)
Imagine you have a raw data set. You want to build a histogram. You don't know the bin range, the bin start and end locations, or the number of bins. Can an ideal histogram be built using the max entropy law?
I've heard about this and I've actually seen it used as an effective feature-engineering preprocessing step in a serious production model. Unfortunately, I looked and couldn't find the exact method and I forget the details. But there seems to be a good amount of material on the internet for "entropy based discretization." I'd give those a look
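A guess at what the simplest version of that could look like (my own sketch, not a specific published method): for a fixed number of bins, the entropy of the bin-count distribution is maximized by equal-frequency bins, i.e. placing the bin edges at the data's quantiles.

```python
import numpy as np

# Equal-frequency binning: edges at the quantiles of the data. The resulting
# bin-count distribution is (near) uniform, which maximizes its entropy.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

n_bins = 8
edges = np.quantile(data, np.linspace(0, 1, n_bins + 1))
counts, _ = np.histogram(data, bins=edges)

p = counts / counts.sum()
bin_entropy = -(p * np.log(p)).sum()
print(bin_entropy)  # close to log(n_bins), the max possible for 8 bins
```

Choosing the number of bins itself would need a separate criterion; this only settles the edge locations.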
One thing is bothering me... The justification of using entropy seems circular. In the first case, where no information is added, we are implicitly assuming that the distribution of the digits is discrete uniform, because we are choosing the distribution based on the number of possible sequences corresponding to that distribution. This is only valid if every sequence is equally likely, but that is only true if we assume the distribution is uniform.

Things are a bit more interesting when we add the moment conditions. I guess what we are doing is conditioning on distributions satisfying the moment conditions, and choosing among these the distribution with the most possible sequences. We seem to be using a uniform prior (distribution for the data), in essence.

My question is: why would this be a good idea? What actually is the justification for using entropy? Which right now in my mind is: why should we use the prior assumption that the distribution is uniform when we want to choose a 'most likely' distribution? Don't feel obliged to respond to my rambling. Just wanted to write it down. Thank you for your video!
lol doesn't sound like rambling to me. I see your point about it being circular, but I don't think that's actually the case. Let's say it wasn't uniformly distributed.. maybe odd numbers are more likely. Now make a table of all sequences and their respective probabilities. Still, you'll find that sequences with uniform counts have a relative advantage.. it may not be as strong, depending on the actual distribution.. but the effect of "there are more sequences with nearly even counts" is always there, even if the distribution of each digit isn't uniform. It's that effect we lean on.. and in the absence of assuming anything about the digit distribution, it leads you to the uniform distribution. In other words, the uniform distribution is a consequence, not an assumption.
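A tiny sketch of the counting effect described above (binary digits and N=10 are my own simplification): group strings by their count profile and see which profile contains the most strings.

```python
from math import comb

# Digits {0,1}, strings of length N. Profile k = "k ones, N-k zeros".
# The number of strings with profile k is comb(N, k), and it peaks at
# the even split, no matter how each digit is actually distributed.
N = 10
profile_sizes = {k: comb(N, k) for k in range(N + 1)}
print(max(profile_sizes, key=profile_sizes.get))  # -> 5, the even split

# Even with a biased digit (say P(1) = p), the probability of profile k is
# comb(N, k) * p**k * (1-p)**(N-k): the multiplicity factor is always there.
```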
Great video! Maybe it's just me, but the explanation of this equation at 3:10 is a bit misleading. Specifically the part where you say to transform the counts into probabilities. For a moment I thought you meant that nd/N is the probability of having a string with nd copies of d, and I was very confused. What it's actually saying is: if we have a string of N digits in which there are n0 copies of 0, n1 copies of 1, and so on for all digits (which means n0+n1+…+n9 = N), then the probability of getting the digit d is nd/N. I got confused because the main problem was about strings of size N, and these probabilities just consider a single string of length N with nd copies of each digit d.
Yes, there's a change of perspective on the problem. I tried to communicate that with the table, but I see how it's still confusing. You seem to have gotten through it with just a good think on the matter
@@Mutual_Information it's alright. Having the derivation of the expression helped me a lot. I appreciate you take part of your time to add details like these in your videos 🙂...
12:45, I think taking the log would be useful in the squares scenario since then the squaring would become a linear transformation rather than non-linear
Thank you and I appreciate the feedback. I’ve already shot 2 more vids so I won’t be rolling into those, but I will for the one I’m writing right now. Also working on avoiding the uninteresting details that don’t add to the big picture.
I am not aware of that connection. When researching it, I just discovered that these ideas were intended for the discrete domain. People extended them into the continuous domain, but then certain properties were lost.
Thank you! Looking into it, I don't believe it's an error. I'm not claiming here that these are within the exponential family. I'm saying these are max entropy distributions under certain constraints, which is a different set. You can see the Cauchy distribution listed here: en.wikipedia.org/wiki/Maximum_entropy_probability_distribution But thank you for keeping an eye out for errors. They are inevitable, but extra eyes are my best defense against them.
@@Mutual_Information Thanks for your quick reply! And thanks for the link - I can see that indeed there's a constraint (albeit a very strange one) for which Cauchy is the max entropy distribution. I guess then, I was confused by the examples in the table + those that you then list -- all distributions were exponential family and Cauchy was the odd one out. Also, please correct me if I'm wrong, but I think if you do moment matching for the mean (i.e. you look at all possible distributions that realise a mean parameter \mu), then the max entropy distribution is an exponential family one. And the table was doing exactly that. Now, we can't do moment matching for the Cauchy distribution as none of its moments are defined. So that was the second reason for my confusion.
Thanks, makes a lot of sense. To be honest, I don't understand the max entropy / exponential family connection all that well. There seem to be these bizarre distributions that are max entropy but aren't exponential family. I'm not sure why they're there, so I join you in your confusion!
You're not wrong. It's the same reason. If you optimize NLL and you leave open the function which maps from W'x (coefficients times features), and then maximize entropy.. the function you get is the softmax! So logistic regression comes from maximizing entropy.
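A quick sketch of that link (the score values are arbitrary): the max-entropy mapping from linear scores to class probabilities is the softmax, and in the two-class case it reduces to the logistic sigmoid of the score difference.

```python
import math

def softmax(scores):
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 0.0])
sigmoid = 1 / (1 + math.exp(-(2.0 - 0.0)))   # logistic regression's link
print(p[0], sigmoid)  # same value both ways
```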
What do you mean that the entropy only depends on the variable's probabilities and not its values? You also said that the variance does depend on its values, but I don't see why the variance would while the entropy would not. You say that you can define the entropy as a measure of a bar graph, but so can the variance.
entropy = - sum p(x) log p(x).. notice only p(x) appears in the equation - you never see just "x" in that expression. For (discrete) variance.. it's sum of p(x)(x-E[x])^2.. notice x does appear on its own. When I say the bar graph, I'm only referring to the vertical heights of the bars (which are the p(x)'s).. you can use just that set of numbers to compute the entropy. For the variance, you'd need to know something in addition to those probabilities (the values those probabilities correspond to).
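To make that concrete, here's a minimal sketch (the probabilities and labels are arbitrary): relabeling the outcomes changes the variance but leaves the entropy untouched.

```python
import math

# Entropy needs only the bar heights p(x); variance also needs the values x.
probs = [0.5, 0.3, 0.2]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def variance(values, p):
    mean = sum(v * pi for v, pi in zip(values, p))
    return sum(pi * (v - mean) ** 2 for v, pi in zip(values, p))

print(entropy(probs))              # no values needed at all
print(variance([0, 1, 2], probs))  # depends on the labels...
print(variance([0, 10, 20], probs))  # ...so stretching them changes it
```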
@@Mutual_Information Ah, I see! I don't know what I was thinking. For some reason, I thought probability when you said value. It makes total sense now. Great video by the way! Really insightful!
Hm, I tried to use this method to find the maximum entropy distribution when you know the first three moments of the distribution (the mean, the variance and the skewness), but I end up with an expression that either leads to a distribution completely without skewness, or to one with a PDF that goes to infinity, either as x approaches infinity or as x approaches minus infinity (I have an x^3 term in the exponent), and which therefore can't be normalized. Is that a case this method doesn't work for? Is there some other way to find the maximum entropy distribution when you know the first three moments?
Okay, I think I found the answer to my question. According to Wikipedia, this method works for the continuous case if the support is a closed subset S of the real numbers (which I guess means that S has a minimum and a maximum value?), and it doesn't mention the case where S = R. But presume that S is the interval [-a, +a], where a is very large; then this method works. And I realized that the solution you get when you use this method is a distribution that is very similar to a normal distribution, except for a tiny increase in density just by one of the two endpoints to make the distribution skewed, which is not really the type of distribution I imagined.

I believe the reason this doesn't work if S = R is that there is no maximum entropy distribution satisfying those constraints, in the sense that if you have a distribution that does satisfy the constraints, you can always find another distribution that also satisfies them but has a higher entropy. Similarly, if you let S = [-a, a] again, you can use this method to find a solution, but if you let a → ∞, the limit of the solutions you get with this method is a normal distribution. But as you let a → ∞, the kurtosis of the solution will also approach infinity, which may be undesired. So if you want to prevent that, you may also constrain the kurtosis, maybe by putting an upper limit on it or by choosing a specific value for it. When you do this, all of a sudden the method works again for S = R.
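The bounded-support version can be checked numerically. A rough sketch (the grid, the bound a, and the target moments are my own arbitrary choices): on [-a, a] the max-entropy density with three moment constraints has the form exp(l1*x + l2*x^2 + l3*x^3), which stays normalizable precisely because the support is bounded; solve for the lambdas that match the moments.

```python
import numpy as np
from scipy.optimize import fsolve

a = 4.0
x = np.linspace(-a, a, 801)
dx = x[1] - x[0]
targets = np.array([0.0, 1.0, 0.5])  # E[x], E[x^2], E[x^3]

def density(lams):
    # Max-entropy form on bounded support: p(x) ∝ exp(l1*x + l2*x^2 + l3*x^3)
    logp = lams[0] * x + lams[1] * x**2 + lams[2] * x**3
    logp -= logp.max()                 # numerical stability
    p = np.exp(logp)
    return p / (p.sum() * dx)

def moment_gap(lams):
    p = density(lams)
    moments = np.array([(p * x**k).sum() * dx for k in (1, 2, 3)])
    return moments - targets

# Start from a truncated standard normal (l2 = -0.5) and solve.
lams = fsolve(moment_gap, x0=[0.0, -0.5, 0.0])
p = density(lams)
```

With these targets the solved density looks close to a normal with a small distortion toward one side, matching the description above.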
Honestly your videos get me excited for a topic like nothing else. Reminder to myself not to watch your videos if I need to do anything else that day... Jokes aside, awesome video again!
Thank you very much! I'm glad you like it and I'm happy to hear there are others like you who get excited about these topics like I do. I'll keep the content coming!
6:13 In this case, the Gods have nothing to do with 'e' showing up there haha. Actually, we could reformulate this result in any other proper base b, and the lambdas would just get scaled by a factor of ln(b).
Yea they're terrible. I took some shit advice of "learn to talk with your hands" and it produced some cringe. It makes me want to reshoot everything, but it's hard to justify how long that would take. So, here we are.
@@Mutual_Information 😂😂 don't worry about it man, the videos are great. I think there's no reason for any hand gestures since the visuals are focused on the animations.
@@Mutual_Information Just watched 'How to Learn Probability Distributions' and in that video I didn't find the hand gestures distracting at all, since they were mostly related to the ideas you were conveying. The issue in this video is that they were a bit mechanical and repetitive. This is a minor detail though, I love your videos so far!