The main source of confusion with this function is the dependencies between the elements of its input vector $Z = [z_1, z_2, z_3]$. So, for example, for computing $S(z_1)$, you will need $z_2$ and $z_3$ as well. This is the case because of the common denominator shared by all the $S(z_i)$'s, that is, $\sum_{j} e^{z_j}$. If you look below, you will see these dependencies beautifully shown with colorful arrows!
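For reference, here is the softmax written out explicitly for the three-element input discussed here (this is just the standard definition, restated so the shared denominator is visible):

$$S(z_i) = \frac{e^{z_i}}{e^{z_1} + e^{z_2} + e^{z_3}}, \qquad i = 1, 2, 3.$$

Because the denominator contains every $z_j$, each output $S(z_i)$ changes whenever any single input $z_j$ changes.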
So, for example, if you needed to compute the derivative of the softmax output with respect to just $z_1$: since you have used $z_1$ for computing all of $S(z_1)$, $S(z_2)$, and $S(z_3)$, you will need to compute the derivative of all of $S(z_1)$, $S(z_2)$, and $S(z_3)$ w.r.t. $z_1$ (NOT just the derivative of $S(z_1)$ w.r.t. $z_1$).
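To make this coupling concrete, here is a minimal NumPy sketch (not part of the original post; the names `softmax` and `softmax_jacobian` are just illustrative) that computes the full matrix of partial derivatives, using the standard softmax Jacobian formula $\partial S(z_i)/\partial z_j = S(z_i)\,(\delta_{ij} - S(z_j))$:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector z."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()          # the shared denominator couples all outputs

def softmax_jacobian(z):
    """Jacobian J[i, j] = dS(z_i)/dz_j = S_i * (delta_ij - S_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])   # example inputs z1, z2, z3
print(softmax_jacobian(z))
# Every column j is dense: nudging z_j moves all of S(z_1), S(z_2), S(z_3),
# which is why the derivative of every output w.r.t. z_1 has to be computed.
```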
Below, I have elongated the neurons in our simple neural network and shown the mathematical operations inside each one of them. You can see the dependencies by tracing the colored arrows:
Hi. Thank you for this. I now truly understand the softmax derivation.
I have a question. Say the error of output S(Z1) w.r.t z1 is A, that of S(Z1) w.r.t. z2 is B, and that of S(Z1) w.r.t. z3 is C.
So, what is the total error of the output S(Z1)? Do you add A, B, and C, or multiply them?
Thanks a lot. I am not sure what ‘ the error of output S(Z1) w.r.t z1 is A’ really means! Did you mean the derivative instead of error, perhaps?
Yes, that’s what he meant, and I’m still curious about the answer: is there a summation of derivatives?
Unfortunately I am not sure if I follow. Just work it out manually! The answer should emerge pretty quickly.