Understanding Softmax

Last Updated: 2024-06-14 22:49:28 Friday

-- TOC --

What is Softmax?

Softmax is a mathematical function that is often used in machine learning and deep learning to convert a vector of real numbers into a probability distribution. It takes as input a vector of arbitrary real values and normalizes them into a probability distribution in which the sum of all probabilities is equal to 1.

The softmax function is defined as follows:

$$\mathrm{softmax}(z_i) = \cfrac{e^{z_i}}{\sum_{j=1}^{K}{e^{z_j}}}$$

where \(z_i\) is the \(i^{th}\) element of the input vector and \(K\) is the number of elements in the vector.

The softmax function can be interpreted as a way of assigning probabilities to multiple categories based on the input values. For example, if we have a vector of scores for different classes, applying the softmax function to this vector will give us a probability distribution over those classes, where the class with the highest probability is the most likely classification.

Softmax is commonly used in the output layer of neural networks for multiclass classification problems. The output of the neural network is passed through the softmax function to obtain a normalized probability distribution over the possible classes. This allows the model to make predictions about the most likely class for a given input.
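As a concrete illustration of the paragraphs above, here is a minimal numpy sketch (the function name and sample scores are my own) that turns a vector of class scores into a probability distribution; a numerically safer variant is discussed later in this article.

import numpy as np

def softmax(z):
    # Exponentiate each score, then normalize so the outputs sum to 1.
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores
probs = softmax(scores)
print(probs)        # [0.65900114 0.24243297 0.09856589]
print(probs.sum())  # 1.0 (up to floating-point rounding)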

The Origin of the Name "Softmax"

The term "softmax" is derived from the function's exponentiation and normalization steps, which together produce a probability distribution that is soft, in the sense that it has more gradual transitions between probabilities than a hard or deterministic distribution. The function was first introduced in the context of statistics by John S. White in 1889, but its use in machine learning and neural networks became popular in the 1990s.

The Derivative of Softmax

Below we derive the derivative of softmax with respect to \(z\). Write \(a_j = \mathrm{softmax}(z_j) = \cfrac{e^{z_j}}{\sum_k e^{z_k}}\) for the \(j^{th}\) output. When \(i = j\):

$$\begin{align} \frac{\partial a_j}{\partial z_j} &= \frac{e^{z_j} (\sum_k e^{z_k}) - {e^{z_j}}^2}{(\sum_k e^{z_k})^2} \notag \\ &= \frac{e^{z_j}}{\sum_k e^{z_k}} - \left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right)^2 \notag \\ &= a_j - (a_j)^2 \notag \\ &= a_j(1-a_j) \notag \end{align}$$

When \(i \neq j\):

$$\begin{align} \frac{\partial a_j}{\partial z_i} &= \frac{- e^{z_j} e^{z_i}}{(\sum_k e^{z_k})^2} \notag \\ &= - a_i a_j \notag \end{align}$$

Note in particular: the \(a_k\) form a probability distribution, which means that each \(a_i\) is not an independent variable; when one of them changes, all the others have to change with it! Be careful about this when carrying out derivations.
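As a sanity check on the two cases above, here is a small numpy sketch (my own illustration; the names are arbitrary) that builds the full Jacobian as \(\mathrm{diag}(a) - aa^T\), whose diagonal entries are \(a_j(1-a_j)\) and off-diagonal entries are \(-a_i a_j\), and compares it against a central-difference approximation.

import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

def softmax_jacobian(z):
    # Analytic Jacobian: entry (j, i) is da_j/dz_i,
    # i.e. a_j*(1-a_j) on the diagonal and -a_i*a_j off the diagonal.
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

# Numerical check with central differences.
z = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numeric = np.zeros((3, 3))
for i in range(3):
    dz = np.zeros(3)
    dz[i] = eps
    numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), numeric, atol=1e-8))  # True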

A Trick for Computing Softmax

This is a problem I once ran into: while computing the softmax probability distribution at the output layer of a neural network, I hit an overflow. There is a very nice trick for dealing with this overflow.

The softmax formula:

\(a_j=\cfrac{e^{z_j}}{\sum_i e^{z_i}}\)

Computing \(e^n\) overflows very easily.

>>> import numpy as np
>>> np.exp(710)
inf

Once the exponent reaches 710, it overflows and the computation can go no further.

The Exponentiation Trick

We can multiply the numerator and denominator of the softmax formula by the same constant:

$$\cfrac{e^{z_j}}{\sum_i e^{z_i}}=\cfrac{c\cdot e^{z_j}}{c\cdot \sum_i e^{z_i}}=\cfrac{e^{z_j+\ln{c}}}{\sum_i e^{z_i+\ln{c}}}$$

The value of \(c\) can be chosen arbitrarily, as long as it is positive (so that \(\ln{c}\) is well defined).

To keep the exponents of \(e\) from getting too large, we usually choose the negative of the largest number in the vector, i.e. \(\ln{c}=-\max(z_i)\). This is equivalent to shifting every component of the vector down by the maximum value, and the softmax result stays exactly the same.

A quick test:

>>> a = np.array((11,12,13,14,15))
>>> b = a - np.max(a)
>>> a
array([11, 12, 13, 14, 15])
>>> b
array([-4, -3, -2, -1,  0])
>>> np.exp(a)/np.sum(np.exp(a))
array([0.01165623, 0.03168492, 0.08612854, 0.23412166, 0.63640865])
>>> np.exp(b)/np.sum(np.exp(b))
array([0.01165623, 0.03168492, 0.08612854, 0.23412166, 0.63640865])

Applying softmax to a and b gives exactly the same result.
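Putting the trick into a reusable function, a minimal sketch (the function name is my own) might look like this:

import numpy as np

def stable_softmax(z):
    # Subtract the maximum so the largest exponent is 0 and np.exp cannot overflow.
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / np.sum(e)

print(stable_softmax(np.array([11, 12, 13, 14, 15])))
# [0.01165623 0.03168492 0.08612854 0.23412166 0.63640865]  -- same as the REPL above

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))
# [0.09003057 0.24472847 0.66524096]  -- no overflow even for huge scores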

The trick above still cannot avoid the case where, after subtracting the maximum, some component is an extremely large negative number. Its exponential still evaluates to 0, and when that 0 later takes part in an np.log computation, the result is -inf.

>>> np.exp(-800)
0.0

If this 0 appears in the denominator of softmax there is no problem, since other numbers greater than 0 are still present; but if it appears in the numerator...
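One common way to sidestep this (a sketch of my own, not from the original text) is to compute the logarithm of the softmax directly, so that no zero probability is ever formed and then passed to np.log:

import numpy as np

def log_softmax(z):
    # Compute log(softmax(z)) directly, without ever forming a 0 probability.
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([0.0, 800.0])   # an extreme gap between two scores

# The naive route: the first probability underflows to 0, and np.log(0) is -inf.
probs = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
print(np.log(probs))         # [-inf   0.]

# log_softmax keeps a finite (just very negative) value instead of -inf.
print(log_softmax(z))        # [-800.    0.]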

Understanding the Softmax Output

Softmax outputs a probability distribution, and the higher the probability, the higher the confidence. In fact, this interpretation is not entirely accurate...

The hyperparameter \(\lambda\) that controls regularization in the loss function determines how peaky or diffuse the "probability distribution" is. The higher this value, the stronger the "penalty" on the weights, so the model is more likely to end up with smaller weights and therefore smaller scores (a linear classifier, for example, computes its scores as a dot product with the weights). The end result is that the probability distribution output by softmax becomes more uniform, i.e. more diffuse. Conversely, the smaller \(\lambda\) is, the peakier the distribution.

>>> a = np.array((1,2,3,4))
>>> np.exp(a)/np.sum(np.exp(a))
array([0.0320586 , 0.08714432, 0.23688282, 0.64391426])
>>> a = a/2
>>> np.exp(a)/np.sum(np.exp(a))
array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

In the test above, halving (1,2,3,4) yields a probability distribution that is noticeably more diffuse.

Moreover, in the limit where the weights go towards tiny numbers due to very strong regularization strength \(\lambda\), the output probabilities would be near uniform. Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not.

In the probability distribution produced by softmax, the absolute values themselves carry little meaning; it is the ordering among them that matters!

本文链接:https://cs.pynote.net/ag/ml/ann/202210161/

-- EOF --
