Nice explanation. Just a few words to highlight some key points on the topic:
- We take the exponential when computing the probabilities f(s)_i because a raw network score can be negative, and the exponential maps every score to a positive value. Dividing by the sum of exponentials then normalizes the values so the probabilities sum to 1.
- The logarithm is used when computing the cross entropy for the given probabilities because products are expensive to work with, and the log converts the product into a sum. The negative sign in front of the sum is needed because the log of a value between 0 and 1 is negative, so negating it gives a positive loss.
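As a quick illustration of both points, here is a minimal NumPy sketch (the names `softmax` and `cross_entropy` are my own, just for this comment):

```python
import numpy as np

def softmax(scores):
    # Exponential maps every (possibly negative) score to a positive value;
    # subtracting the max is a standard trick for numerical stability.
    exps = np.exp(scores - np.max(scores))
    # Dividing by the sum normalizes the values so they add up to 1.
    return exps / exps.sum()

def cross_entropy(probs, one_hot_target):
    # log turns the product over classes into a sum; the minus sign flips
    # the negative logs (probabilities lie in (0, 1]) to a positive loss.
    return -np.sum(one_hot_target * np.log(probs))

scores = np.array([-1.0, 2.0, 0.5])   # raw scores, can be negative
probs = softmax(scores)
print(probs, probs.sum())             # probabilities summing to 1
print(cross_entropy(probs, np.array([0.0, 1.0, 0.0])))
```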
I have only one question: if the network wrongly classifies a 0 1 0 (one-hot) example as (0, 0.56, 0.44), how is the cross entropy loss penalizing the 0 that was predicted as 0.44? Is it just neglecting it?
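For concreteness, here is what the loss evaluates to on the exact vectors from my question (a minimal sketch; with a one-hot target every term except the true class is multiplied by 0, so the sum reduces to -log of the true-class probability, and the zero entry never reaches the log):

```python
import numpy as np

target = np.array([0.0, 1.0, 0.0])    # one-hot ground truth
probs  = np.array([0.0, 0.56, 0.44])  # network output from the question

# Only the true-class term survives the multiplication by the one-hot target.
loss = -np.log(probs[target.argmax()])
print(loss)                           # -log(0.56) ≈ 0.5798
```

(The 0.44 is not entirely ignored, though: since the softmax outputs sum to 1, any mass placed on a wrong class is mass taken away from the true class, which is exactly what -log(0.56) penalizes.)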