You know that feeling when you find a criminally overlooked channel and you’re about to get in on the ground level of something that’s gonna blow up? This is you now
That is quite nice of you - thank you! I hope you're right, but for now, I'm working on my patience. It can take quite a while to get noticed on YouTube. I'm trying to keep my expectations realistic.
I can't get over how much fun you make learning about stats, ML, and information theory---not to mention that you teach it with skill like Feynman's and a style that is all your own.
That's quite a compliment - Feynman is a total inspiration for many, myself included. His energy about the topics makes you *want* to learn about them.
In the recent 3b1b video about the normal distribution, there's a mention that the normal distribution maximizes entropy. Then immediately I saw it here in your video, showing the normal distribution as the one that maximizes entropy while constraining the mean and variance, which are the only two parameters of the normal distribution. That is very nice.
It blew my mind that those famous distributions arise naturally as the ones that give maximum entropy when we set the domain and constraints in a general way. Now I kind of know why they're special.
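The normal-maximizes-entropy claim is easy to sanity-check numerically. A minimal sketch (the variance value and the comparison distributions are my own choices): compare closed-form differential entropies of a few distributions all matched to the same variance, and the normal comes out on top.

```python
import math

# Closed-form differential entropies (natural log), all with variance s^2:
#   normal:  0.5 * ln(2*pi*e*s^2)
#   uniform: ln(width),   width = s*sqrt(12)
#   laplace: 1 + ln(2*b), b = s/sqrt(2)
s = 1.3
h_normal = 0.5 * math.log(2 * math.pi * math.e * s**2)
h_uniform = math.log(s * math.sqrt(12))
h_laplace = 1 + math.log(2 * (s / math.sqrt(2)))

print(h_normal, h_uniform, h_laplace)  # the normal is the largest
```

The gap is the same for every s, since each entropy is a constant plus ln(s).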
Definitely very cool. In many cases there are also other fascinating characterizations. For example: assume a continuous distribution on positive support with the memoryless property --> solve the differential equation --> find that it MUST be the exponential
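That memoryless characterization can be checked directly from the exponential's survival function; a tiny sketch (the rate and time values are arbitrary):

```python
import math

# For X ~ Exponential(rate), the survival function is S(x) = exp(-rate * x).
# Memorylessness says P(X > s + t | X > s) = P(X > t).
rate = 1.7

def survival(x):
    return math.exp(-rate * x)

s, t = 0.8, 2.3
lhs = survival(s + t) / survival(s)  # P(X > s+t | X > s)
rhs = survival(t)                    # P(X > t)
print(lhs, rhs)  # identical up to float error
```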
woah this is such a good explanation. I just randomly discovered this channel but I'm sure it's bound to blow up. Just a bit of critique: Idk if this is only meant for college students but if you want to get a slightly broader audience you could focus a bit more on giving intuition for the concepts.
Thank you very much! Yea the level of technical background I expect of the audience is an important question. I’m partial to keeping it technical. I think it’s OK not to appeal to everyone. My audience will just be technical and small :)
@@Mutual_Information ++ with Alberto le Fisch : "Prof. Feynman... "If you cannot explain something in simple terms, you don't understand it." And I am sure you understand, so please explain to us! :)
His channel already puts lots of effort into doing exactly that. The final bit of the video, explaining the FUNDAMENTAL difference between discrete and continuous, the breakage of "label invariance", just blew my mind. Seriously, some of the best intuition I've ever received about something.
@@zorro77777 well, he also said something like "If I could summarise my work in a sentence it wouldn't be worth a Nobel prize". Which means that although something can be simplified, this doesn't mean it won't take a long time to explain. It's not that easy and he is not your employee you know. Also in that quote he meant being able to explain to physics undergrads, who you would expect to have some knowledge already.
This channel is so amazing. I had a fuzzy understanding of a lot of these concepts, but this clarifies it. For example, my intuition suggested that for a given mean and variance, the maximum entropy estimate would be a beta-binomial distribution, but I wasn't really able to prove it to myself (7:00).
Thank you! Always means a lot. And yea, now that I've covered the Fisher Information, I can hit that one soon. Appreciate the suggestion - it's on the list!
@@Mutual_Information I do believe (and I hope this is the case) that we are going through a boom in science channels right now. It seems that the youtube algorithm is identifying that sub-audience that loves this content and recommending these types of channels to them sooner and sooner. So I really hope it happens to you very soon!
I think it would be up to the parametrization to care about area or side length depending on the problem case in that example. I'd like my tools to do their own distilled thing in small, predictable, usable pieces.
I don't know anything about set shaping theory, so.. maybe! Whatever it is, I think it could only *extend* information theory. I believe the core of information theory is very much settled.
Btw i dont know if Mutual Information is a good channel name. The term is pretty stacked and I can't just say "do you know Mutual Information" like I can say "do you know 3blue1brown"... It also makes it harder to find your channel, because if someone looks up mutual information on youtube you wont show up at the top. Maybe thats your strategy though, to have people find your channel when they search for Mutual Information on youtube ;) Anyways, im sure you have thought about this, but thats my take.
I fear you may be correct. I've heard a few people say they tried to find my channel but couldn't when they searched. But part of me thinks I've gone too far. There's actually quite a bit of work I'd have to do to make a title change, and if the cost is my channel is a little bit more hidden, I think that's OK. Weirdly, I'm kinda enjoying the small channel stage (being a bit presumptuous that I'll eventually not be in this stage :) ). It's less pressure, gives me time to really nail the feedback and it's easier to have 2-way communication with the viewers. Don't get me wrong, I'd like to grow the channel, but I'm OK with leaving some growth hacks on the table. That said, I'm not totally set on "Mutual Information." I'd like to feel it out a bit more. As always, appreciate the feedback!
Your videos are great! I am curious about the connection between maximum entropy and Bayesian inference. They seem related. Let's think about Bayesian inference in a variational way, where you minimize a KL divergence between the approximate and true posterior, KL(q(z)||p(z|x)), where z is e.g. the vector of all our unknown digits and x the digit mean. Minimizing this KL divergence is equivalent to maximizing the sum of (1) H(q(z)), an entropy maximization objective, (2) -CE(q(z),p(z)), a negative cross-entropy term with the prior distribution p(z); this term is constant for a uniform prior, and (3) E_q(z) log p(x|z), a likelihood term that produces constraints. In our digits case, p(x|z) is deterministically 1 if the condition x=mean(z) is fulfilled and 0 otherwise. All z with log p(x|z) = log 0 = -infty must be given a probability of 0 by q(z) to avoid the objective reaching negative infinity. On the other hand, once this constraint is fulfilled, all remaining choices of q attain E_q(z) log p(x|z) = E_q(z) log 1 = 0, so the entropy term gets to decide among them.
further, if we choose the digits to be i.i.d. ~ q(z_1) (z_1 being the first digit), as the number of digits N goes to infinity, the empirical mean, mean(z), will converge almost surely to the mean of q(z_1), so in the limit, we can put the constraint on the mean of q(z_1) instead of the empirical mean, as done by the maximum entropy principle. Digits being i.i.d. should be an unproblematic restriction due to symmetry (and due to entropy maximization).
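Spelling out the decomposition described above (same symbols, natural logs):

```latex
\begin{align}
-\mathrm{KL}\big(q(z)\,\|\,p(z|x)\big)
&= \mathbb{E}_{q(z)}\big[\log p(z|x) - \log q(z)\big] \\
&= H\big(q(z)\big) - \mathrm{CE}\big(q(z), p(z)\big)
   + \mathbb{E}_{q(z)}\log p(x|z) - \log p(x)
\end{align}
```

Since log p(x) does not depend on q, minimizing the KL is the same as maximizing the entropy term, the negative cross-entropy with the prior, and the expected log-likelihood.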
Wow, yes, you dove right into a big topic. Variational inference is a big way we get around some of the intractability naive Bayesian stats can yield. You seem to know it well - thanks for all the details!
Great video! Thanks a lot. A little feedback: The example you give in 12:00-13:00 is a bit hard to follow without visualization. The blackboard and simulations you use are very helpful in general. It would be great if you do not leave that section still and only talk. Even some bullet points would be nice.
Thanks - useful, specific feedback is in short supply, so this is very much appreciated. I count yours as a "keep things motivated and visual"-type of feedback, which is something I'm actively working on (but not always great about). Anyway, it's a work in progress and hopefully you'll see the differences in upcoming videos. Thanks again!
The invariance bit is something that I really didn't explore well. It's something I only realized while I was researching the video. The way I would think about it is.. the motivating argument for max entropy doesn't apply over the continuous domain b/c you can't enumerate "all possible sequences of random samples".. so if you use the max entropy approach in the continuous domain anyway.. you are doing something which imports a hidden assumption you don't realize. Something like.. minimize the KL-divergence from some reference distribution.. idk.. something weird. As you can tell, I think it's OK to not understand the invariance bit :)
Imagine you have a raw data set. You want to build a histogram. You don't know the bin range, the bin start and end locations, or the number of bins. Can an ideal histogram be built using the max entropy law?
I've heard about this and I've actually seen it used as an effective feature-engineering preprocessing step in a serious production model. Unfortunately, I looked and couldn't find the exact method and I forget the details. But there seems to be a good amount of material on the internet for "entropy based discretization." I'd give those a look
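A guess at what the simplest version of that could look like (my own sketch, not a specific published method): for a fixed number of bins, the entropy of the bin-count distribution is maximized by equal-frequency bins, i.e. placing the bin edges at the data's quantiles.

```python
import numpy as np

# Equal-frequency binning: edges at the quantiles of the data. The resulting
# bin-count distribution is (near) uniform, which maximizes its entropy.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

n_bins = 8
edges = np.quantile(data, np.linspace(0, 1, n_bins + 1))
counts, _ = np.histogram(data, bins=edges)

p = counts / counts.sum()
bin_entropy = -(p * np.log(p)).sum()
print(bin_entropy)  # close to log(n_bins), the max possible for 8 bins
```

Choosing the number of bins itself would need a separate criterion; this only settles the edge locations.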
One thing is bothering me... The justification of using entropy seems circular. In the first case, where no information is added, we are implicitly assuming that the distribution of the digits is discrete uniform, because we are choosing the distribution based on the number of possible sequences corresponding to that distribution. This is only valid if every sequence is equally likely, but that is only true if we assume the distribution is uniform.

Things are a bit more interesting when we add the moment conditions. I guess what we are doing is conditioning on distributions satisfying the moment conditions, and choosing among these the distribution with the most possible sequences. We seem to be using a uniform prior (distribution for the data), in essence.

My question is: why would this be a good idea? What actually is the justification for using entropy? Which right now in my mind is: why should we use the prior assumption that the distribution is uniform when we want to choose a 'most likely' distribution? Don't feel obliged to respond to my rambling. Just wanted to write it down. Thank you for your video!
lol doesn't sound like rambling to me. I see your point about it being circular, but I don't think that's actually the case. Let's say it wasn't uniformly distributed.. maybe odd numbers are more likely. Now make a table of all sequences and their respective probabilities. Still, you'll find that sequences with uniform counts have a relative advantage.. it may not be as strong, depending on the actual distribution.. but the effect of "there are more sequences with nearly even counts" is always there, even if the distribution of each digit isn't uniform. It's that effect we lean on.. and in the absence of assuming anything about the digit distribution, it leads you to the uniform distribution. In other words, the uniform distribution is a consequence, not an assumption.
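A tiny sketch of the counting effect described above (binary digits and N=10 are my own simplification): group strings by their count profile and see which profile contains the most strings.

```python
from math import comb

# Digits {0,1}, strings of length N. Profile k = "k ones, N-k zeros".
# The number of strings with profile k is comb(N, k), and it peaks at
# the even split, no matter how each digit is actually distributed.
N = 10
profile_sizes = {k: comb(N, k) for k in range(N + 1)}
print(max(profile_sizes, key=profile_sizes.get))  # -> 5, the even split

# Even with a biased digit (say P(1) = p), the probability of profile k is
# comb(N, k) * p**k * (1-p)**(N-k): the multiplicity factor is always there.
```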
Great video! Maybe it's just me, but the explanation of this equation at 3:10 is a bit misleading. Specifically the part where you say to transform the counts into probabilities. For a moment I thought you meant that nd/N is the probability of having a string with nd copies of d, and I was very confused. What it's actually saying is: if we have a string of N digits in which there are n0 copies of 0, n1 copies of 1, and so on for all digits (which means n0+n1+…+n9 = N), then the probability of getting the digit d is nd/N. I got confused because the main problem was about strings of size N, and these probabilities just consider a single string of length N with nd copies of each digit d.
Yes, there's a change of perspective on the problem. I tried to communicate that with the table, but I see how it's still confusing. You seem to have gotten through it with just a good think on the matter
@@Mutual_Information it's alright. Having the derivation of the expression helped me a lot. I appreciate you take part of your time to add details like these in your videos 🙂...
12:45, I think taking the log would be useful in the squares scenario since then the squaring would become a linear transformation rather than non-linear
Thank you and I appreciate the feedback. I’ve already shot 2 more vids so I won’t be rolling into those, but I will for the one I’m writing right now. Also working on avoiding the uninteresting details that don’t add to the big picture.
I am not aware of that connection. When researching it, I just discovered that these ideas were intended for the discrete domain. People extended them into the continuous domain, but then certain properties were lost.
Thank you! Looking into it, I don't believe it's an error. I'm not claiming here that these are within the exponential family. I'm saying these are max entropy distributions under certain constraints, which is a different set. You can see the Cauchy distribution listed here: en.wikipedia.org/wiki/Maximum_entropy_probability_distribution But thank you for keeping an eye out for errors. They are inevitable, but extra eyes are my best defense against them.
@@Mutual_Information Thanks for your quick reply! And thanks for the link - I can see that indeed there's a constraint (albeit a very strange one) for which Cauchy is the max entropy distribution. I guess then, I was confused by the examples in the table + those that you then list -- all distributions were exponential family and Cauchy was the odd one out. Also, please correct me if I'm wrong, but I think if you do moment matching for the mean (i.e. you look at all possible distributions that realise a mean parameter \mu), then the max entropy distribution is an exponential family one. And the table was doing exactly that. Now, we can't do moment matching for the Cauchy distribution as none of its moments are defined. So that was the second reason for my confusion.
Thanks, makes a lot of sense. To be honest, I don't understand the max entropy / exponential family connection all that well. There seem to be these bizarre distributions that are max entropy but aren't exponential family. I'm not sure why they're there, so I join you in your confusion!
You're not wrong. It's the same reason. If you optimize NLL and you leave open the function which maps from W'x (coefficients times features), and then maximize entropy.. the function you get is the softmax! So logistic regression comes from maximizing entropy.
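A quick sketch of that link (the score values are arbitrary): the max-entropy mapping from linear scores to class probabilities is the softmax, and in the two-class case it reduces to the logistic sigmoid of the score difference.

```python
import math

def softmax(scores):
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 0.0])
sigmoid = 1 / (1 + math.exp(-(2.0 - 0.0)))   # logistic regression's link
print(p[0], sigmoid)  # same value both ways
```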
What do you mean that the entropy only depends on the variable's probabilities and not its values? You also said that the variance does depend on its values, but I don't see why the variance would while the entropy would not. You say that you can define the entropy as a measure of a bar graph, but so can the variance.
entropy = - sum p(x) log p(x).. notice only p(x) appears in the equation - you never see just "x" in that expression. For (discrete) variance.. it's sum of p(x)(x-E[x])^2.. notice x does appear on its own. When I say the bar graph, I'm only referring to the vertical heights of the bars (which are the p(x)'s).. you can use just that set of numbers to compute the entropy. For the variance, you'd need to know something in addition to those probabilities (the values those probabilities correspond to).
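To make that concrete, here's a minimal sketch (the probabilities and labels are arbitrary): relabeling the outcomes changes the variance but leaves the entropy untouched.

```python
import math

# Entropy needs only the bar heights p(x); variance also needs the values x.
probs = [0.5, 0.3, 0.2]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def variance(values, p):
    mean = sum(v * pi for v, pi in zip(values, p))
    return sum(pi * (v - mean) ** 2 for v, pi in zip(values, p))

print(entropy(probs))              # no values needed at all
print(variance([0, 1, 2], probs))  # depends on the labels...
print(variance([0, 10, 20], probs))  # ...so stretching them changes it
```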
@@Mutual_Information Ah, I see! I don't know what I was thinking. For some reason, I thought probability when you said value. It makes total sense now. Great video by the way! Really insightful!
Hm, I tried to use this method to find the maximum entropy distribution when you know the first three moments of the distribution (the mean, the variance and the skewness), but I end up with an expression that either leads to a distribution completely without skewness, or to one with a PDF that goes to infinity, either as x approaches infinity or as x approaches minus infinity (I have an x^3 term in the exponent), and which therefore can't be normalized. Is that a case this method doesn't work for? Is there some other way to find the maximum entropy distribution when you know the first three moments?
Okay, I think I found the answer to my question. According to Wikipedia, this method works for the continuous case if the support is a closed subset S of the real numbers (which I guess means that S has a minimum and a maximum value?), and it doesn't mention the case where S = R. But presume that S is the interval [-a, +a], where a is very large; then this method works. And I realized that the solution you get when you use this method is a distribution that is very similar to a normal distribution, except for a tiny increase in density just by one of the two endpoints to make the distribution skewed, which is not really the type of distribution I imagined.

I believe the reason this doesn't work if S = R is that there is no maximum entropy distribution satisfying those constraints, in the sense that if you have a distribution that does satisfy the constraints, you can always find another distribution that also satisfies them but has a higher entropy. Similarly, if you let S = [-a, a] again, you can use this method to find a solution, but if you let a → ∞, the limit of the solutions you get with this method is a normal distribution. But as you let a → ∞, the kurtosis of the solution will also approach infinity, which may be undesired. So if you want to prevent that, you may also constrain the kurtosis, maybe by putting an upper limit on it or by choosing a specific value for it. When you do this, all of a sudden the method works again for S = R.
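The bounded-support version can be checked numerically. A rough sketch (the grid, the bound a, and the target moments are my own arbitrary choices): on [-a, a] the max-entropy density with three moment constraints has the form exp(l1*x + l2*x^2 + l3*x^3), which stays normalizable precisely because the support is bounded; solve for the lambdas that match the moments.

```python
import numpy as np
from scipy.optimize import fsolve

a = 4.0
x = np.linspace(-a, a, 801)
dx = x[1] - x[0]
targets = np.array([0.0, 1.0, 0.5])  # E[x], E[x^2], E[x^3]

def density(lams):
    # Max-entropy form on bounded support: p(x) ∝ exp(l1*x + l2*x^2 + l3*x^3)
    logp = lams[0] * x + lams[1] * x**2 + lams[2] * x**3
    logp -= logp.max()                 # numerical stability
    p = np.exp(logp)
    return p / (p.sum() * dx)

def moment_gap(lams):
    p = density(lams)
    moments = np.array([(p * x**k).sum() * dx for k in (1, 2, 3)])
    return moments - targets

# Start from a truncated standard normal (l2 = -0.5) and solve.
lams = fsolve(moment_gap, x0=[0.0, -0.5, 0.0])
p = density(lams)
```

With these targets the solved density looks close to a normal with a small distortion toward one side, matching the description above.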
Honestly your videos get me excited for a topic like nothing else. Reminder to myself not to watch your videos if I need to do anything else that day... Jokes aside, awesome video again!
Thank you very much! I'm glad you like it and I'm happy to hear there are others like you who get excited about these topics like I do. I'll keep the content coming!
6:13 In this case, the Gods have nothing to do with 'e' showing up there haha. Actually, we could reformulate this result in any other proper base b, and the lambdas would just get scaled by a factor of ln(b).
Yea they're terrible. I took some shit advice of "learn to talk with your hands" and it produced some cringe. It makes me want to reshoot everything, but it's hard to justify how long that would take. So, here we are.
@@Mutual_Information 😂😂 don't worry about it man, the videos are great. I think there's no reason for any hand gestures since the visuals are focused on the animations.
@@Mutual_Information Just watched 'How to Learn Probability Distributions' and in that video I didn't find the hand gestures distracting at all, since they were mostly related to the ideas you were conveying. The issue in this video is that they were a bit mechanical and repetitive. This is a minor detail though, I love your videos so far!