
    What are some alternatives to softmax?

    The post doesn’t really explain why softmax works. It basically says “Do this operation, and it will result in probabilities.” That’s fine, but I was hoping to understand why this is true. What is it about e^x and this particular set of operations that results in probabilities?

    According to wikipedia (https://en.wikipedia.org/wiki/Softmax_function) you can also choose a different base other than e. But I’m confused as to why/when you’d want to do this.
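    One way to make the base question concrete: since b^x = e^(x·ln b), softmax with base b is the same as ordinary softmax with the inputs rescaled by ln b, i.e. a "temperature" knob. A minimal Python sketch (my own illustration, not from the post):

```python
import math

def softmax(xs, base=math.e):
    # Since base**x == math.e**(x * math.log(base)), choosing a base b
    # is equivalent to ordinary softmax at temperature 1 / ln(b).
    exps = [base ** x for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

xs = [1.0, 2.0, 3.0]
print(softmax(xs))           # base e
print(softmax(xs, base=2))   # base 2: ln(2) < 1, so the output is flatter
# Identical to base 2: rescale the inputs by ln(2) and use base e.
print(softmax([x * math.log(2) for x in xs]))
```

    So a base b > e sharpens the distribution and a base b < e flattens it, which is why in practice people speak of temperature rather than changing the base.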


      It turns these values into something that looks like a probability, and that distinction matters. Mapping x -> exp(x) restricts the range to exp(x) > 0, so every transformed value is strictly positive.

      With all values positive, dividing by their sum ensures that the softmax-transformed values add up to 1 (a property of probabilities) and that each individual value lies between 0 and 1 (another property of probabilities).
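      The two properties above can be checked directly. A small sketch (the max-subtraction trick is a standard numerical-stability detail, not something the comment mentions):

```python
import math

def softmax(xs):
    # Subtracting max(xs) does not change the result
    # (it cancels in the ratio) but prevents overflow in exp.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-1.0, 0.0, 5.0])
print(probs)       # all entries strictly between 0 and 1
print(sum(probs))  # 1.0 up to floating-point rounding
```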

      It’s not exactly a probability, though. Consider a data set [0, 0, ln 2]: softmax maps it to [0.25, 0.25, 0.5].

      From a maximum-likelihood perspective, however, we would say that p(x = 0) = 2/3 and p(x = ln 2) = 1/3.
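      A quick Python check of that contrast (softmax output vs. empirical frequencies of the same data set):

```python
import math
from collections import Counter

data = [0.0, 0.0, math.log(2)]

# Softmax of the values: exp gives [1, 1, 2], sum 4.
exps = [math.exp(x) for x in data]
sm = [e / sum(exps) for e in exps]
print(sm)  # approximately [0.25, 0.25, 0.5]

# Empirical (maximum-likelihood) frequencies of the same values.
counts = Counter(data)
mle = {v: c / len(data) for v, c in counts.items()}
print(mle)  # 0.0 occurs with frequency 2/3, ln 2 with frequency 1/3
```

      The two distributions disagree, which is the point: softmax is a function of the values themselves, not of how often each value occurs.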

      So what softmax actually generates is a set of values bounded between 0 and 1 that increase monotonically with its inputs, not the empirical probabilities of those inputs.