The post doesn’t really explain why softmax works. It basically says “Do this operation, and it will result in probabilities.” That’s fine, but I was hoping to understand why this is true. What is it about e^x and this particular set of operations that results in probabilities?

It turns these values into something that looks like a probability; that is an important distinction. Mapping x -> exp(x) restricts the range of values to exp(x) > 0.

Because the values are all positive, the normalization ensures that the softmax-transformed values sum to 1 (a property of probabilities) and that each individual value lies between 0 and 1 (another property of probabilities).

It’s not exactly a probability, though: consider a data set consisting of [0, 0, ln 2]. Softmax then yields the values [0.25, 0.25, 0.5].
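The arithmetic of that example is easy to check directly. A minimal softmax sketch in Python (the function name is mine):

```python
import math

def softmax(xs):
    # Exponentiate to make every value strictly positive...
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    # ...then normalize so the values sum to 1.
    return [e / total for e in exps]

# [0, 0, ln 2] -> exponentials [1, 1, 2], sum 4 -> [0.25, 0.25, 0.5]
probs = softmax([0, 0, math.log(2)])
```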

From a maximum likelihood perspective, however, we would say that p(x = 0) = 2/3 and p(x = ln 2) = 1/3.
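For contrast, the empirical (maximum likelihood) frequencies from that same data set can be computed directly; a quick sketch:

```python
import math
from collections import Counter

data = [0, 0, math.log(2)]
counts = Counter(data)
# Relative frequency of each distinct value in the data set.
empirical = {x: c / len(data) for x, c in counts.items()}
# p(0) = 2/3 and p(ln 2) = 1/3, unlike softmax's [0.25, 0.25, 0.5].
```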

So what softmax really generates is a set of values bounded between 0 and 1 that increase monotonically with its inputs.

What are some alternatives to softmax?

According to wikipedia (https://en.wikipedia.org/wiki/Softmax_function) you can also choose a different base other than `e`. But I’m confused as to why/when you’d want to do this.
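Changing the base is equivalent to rescaling the inputs: since b^x = e^(x ln b), a base-b softmax is just the ordinary softmax applied to the inputs multiplied by ln b. That scale factor acts like a temperature, so a larger base sharpens the distribution and a base closer to 1 flattens it. A small sketch of this equivalence (function name and example values are mine):

```python
import math

def softmax(xs, base=math.e):
    # Generalized softmax: exponentiate with an arbitrary base, then normalize.
    exps = [base ** x for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

xs = [1.0, 2.0, 3.0]
b = 2.0
base_b = softmax(xs, base=b)
# Same result as base-e softmax on inputs scaled by ln(b).
scaled_e = softmax([x * math.log(b) for x in xs])
```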
