The main source of confusion with this function is the dependencies between the elements of its input vector $Z = [z_1, z_2, z_3]$. So, for example, for computing $S(z_1)$, you will need $z_2$ and $z_3$ as well. This is the case because of the common denominator shared by all the $S(z_i)$'s, that is, $\sum_{j} e^{z_j}$. If you look below, you will see these dependencies beautifully shown with colorful arrows!
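For reference, here is the softmax written out explicitly for the three-element input discussed here (this is just the standard definition, restated so the shared denominator is visible):

$$S(z_i) = \frac{e^{z_i}}{e^{z_1} + e^{z_2} + e^{z_3}}, \qquad i = 1, 2, 3.$$

Because the denominator contains every $z_j$, each output $S(z_i)$ changes whenever any single input $z_j$ changes.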
So, for example, if you needed to compute the derivative of the softmax output with respect to just $z_1$: since you have used $z_1$ for computing all of $S(z_1)$, $S(z_2)$, and $S(z_3)$, you will need to compute the derivative of all of $S(z_1)$, $S(z_2)$, and $S(z_3)$ w.r.t. $z_1$ (NOT just the derivative of $S(z_1)$ w.r.t. $z_1$).
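To make this coupling concrete, here is a minimal NumPy sketch (not part of the original post; the names `softmax` and `softmax_jacobian` are just illustrative) that computes the full matrix of partial derivatives, using the standard softmax Jacobian formula $\partial S(z_i)/\partial z_j = S(z_i)\,(\delta_{ij} - S(z_j))$:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector z."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()          # the shared denominator couples all outputs

def softmax_jacobian(z):
    """Jacobian J[i, j] = dS(z_i)/dz_j = S_i * (delta_ij - S_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])   # example inputs z1, z2, z3
print(softmax_jacobian(z))
# Every column j is dense: nudging z_j moves all of S(z_1), S(z_2), S(z_3),
# which is why the derivative of every output w.r.t. z_1 has to be computed.
```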
Below, I have elongated the neurons in our simple neural network and shown the mathematical operations inside each one of them. You can see the dependencies by tracing the colored arrows:
Hi. Thank you for this. I now truly understand the softmax derivation.
I have a question. Say the error of output S(Z1) w.r.t z1 is A, that of S(Z1) w.r.t. z2 is B, and that of S(Z1) w.r.t. z3 is C.
So, what is the total error of the output S(Z1)? Do you add A, B, and C, or multiply them?
Thanks a lot. I am not sure what ‘ the error of output S(Z1) w.r.t z1 is A’ really means! Did you mean the derivative instead of error, perhaps?
Yes, that’s what he meant, and I’m still curious about the answer: is there a summation of derivatives?
Unfortunately I am not sure if I follow. Just work it out manually! The answer should emerge pretty quickly.