Awesome stuff. I searched for this video because I was trying to figure out why the scores/sum-of-scores approach wouldn't work, and you addressed it first thing. Great job.
What a great explanation! Thank you very much. The "why do we choose this formula versus that formula" explanation truly makes everything clear. Thank you once again :)
Thank you!!! This is so much clearer and more direct than two 20-minute videos on Softmax from "Machine Learning with Python: From Linear Models to Deep Learning" from MIT! To be fair, the latter explains multiple perspectives and is also good in its own way. But you deliver exactly the most important first bit: what softmax is and what all these terms are about.
Please note that the outputs of Softmax are NOT probabilities, but are interpreted as probabilities. This is an important distinction! The same goes for the Sigmoid function. Thanks
Well, after applying sigmoid you get only one probability p (the other one you can calculate as 1 - p), so in the sigmoid case you actually only need one number.
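The point in the comment above can be sketched in a few lines of Python (the input score 0.8 is just a hypothetical example value):

```python
import math

def sigmoid(x):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

p = sigmoid(0.8)   # interpreted as the probability of class 1
q = 1.0 - p        # the probability of class 0 is just the complement

print(p, q)        # the two values always sum to 1
```

So a binary classifier only needs one output unit with a sigmoid on top; the second "probability" carries no extra information.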
What does dP_i/dS_j = -P_i * P_j mean, and how did you get it? I understand dP_i/dS_i because S_i is a single variable. But dP_i/dS_j seems to involve a whole set of variables (the sum S_1 + S_2 + ... + S_n) rather than a single one. How are you taking a derivative with respect to that?
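One way to get comfortable with the off-diagonal derivative dP_i/dS_j = -P_i * P_j (for i ≠ j, each S_j is still a single variable; the sum in the denominator is what couples them) is to check the analytic formula numerically. A minimal sketch, using an arbitrary example score vector:

```python
import numpy as np

def softmax(s):
    # subtract the max for numerical stability; softmax is shift-invariant
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([1.0, 2.0, 0.5])  # hypothetical scores
p = softmax(s)

i, j = 0, 1  # two *different* indices, so we test the off-diagonal case

# analytic derivative: dP_i/dS_j = -P_i * P_j
analytic = -p[i] * p[j]

# finite-difference check: nudge only S_j and see how P_i responds
eps = 1e-6
s_nudged = s.copy()
s_nudged[j] += eps
numeric = (softmax(s_nudged)[i] - p[i]) / eps

print(analytic, numeric)  # the two values should agree closely
```

The key point: we differentiate with respect to one score S_j at a time, holding the others fixed; S_j appears in P_i only through the denominator's sum, which is where the -P_i * P_j comes from via the quotient rule.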
I am new to Data Science. But why would a model output 100, 101, and 102 as three outputs unless the input had similarity to all three classes? Even in our daily lives, we would ignore a 2-dollar variance on a $100 item, but complain if something that was originally free now costs 2 dollars. The question is: why would we give up the usual practice and use some fancy transformation function here?
Hey, thank you for a great video! I have a question: in your example, you said that the probabilities for scores 0, 1, and 2 should not be different from those for 100, 101, and 102. But in the real world, the scale used to assess students makes a difference and affects probabilities. The difference between 101 and 102 is actually smaller than between 1 and 2, because in the first case the scale is probably much larger, so the difference between scores is less significant. So wouldn't a model need to predict different probabilities depending on the assessment scale?
My point of view is that the softmax scenario is different from the sigmoid scenario. In the sigmoid case, we need to capture changes on a relative scale, because subtle changes around the 1/2 probability point result in significant probability changes (they turn the whole decision around: drop out or not); whereas in the softmax case, there are more outputs and our goal is to select the one case which is most likely to happen, so we are talking about an absolute amount rather than a relative amount (a final judgment). I guess that's why ritvik said "a change by a constant shouldn't change our model".
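The shift-invariance the comments above are debating is easy to verify directly: adding the same constant to every score leaves the softmax output unchanged, because the exp of the constant factors out of the numerator and denominator. A quick sketch using the scores from the video's example:

```python
import numpy as np

def softmax(s):
    # subtracting the max is itself an application of shift-invariance,
    # used here for numerical stability
    e = np.exp(s - np.max(s))
    return e / e.sum()

low  = softmax(np.array([0.0, 1.0, 2.0]))
high = softmax(np.array([100.0, 101.0, 102.0]))

print(low)
print(high)
print(np.allclose(low, high))  # identical outputs: only score *differences* matter
```

Note this means softmax is sensitive to absolute gaps between scores, not to their ratios; rescaling (multiplying) the scores by a constant, as opposed to shifting them, would change the output.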
Oh.. softmax is for multiple classes and sigmoid is for two classes. I get that your i here is the class. In the post below though, is their i observations and k the classes? stats.stackexchange.com/questions/233658/softmax-vs-sigmoid-function-in-logistic-classifier
4:00 I think you expressed it the wrong way. You meant to say that we need to go into the depth, which here is deriving the formula, and not just focus on the application, which is the façade.
Are you crazy? The moment he did that, I knew it would be fun listening to him. He was focused. Like he said, theory is relevant only in the context of practicality.