Each neuron in an ANN has an Activation Function, which we denote by f(). This mathematical function first receives the input x to the neuron (i.e., the pre-activation), then applies some mathematical manipulation to it (i.e., f()), and finally spits out the result (i.e., f(x)).
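To make this concrete, here is a minimal sketch (in Python/NumPy, with made-up weights, bias, and inputs purely for illustration) of a single neuron computing its pre-activation and then passing it through f():

```python
import numpy as np

def neuron_output(inputs, weights, bias, f):
    """Compute a single neuron's output: apply f() to the pre-activation."""
    pre_activation = np.dot(weights, inputs) + bias  # the x that f() receives
    return f(pre_activation)                         # f(x), the neuron's output

# Made-up numbers, with the identity function standing in for f() for now
print(neuron_output(np.array([0.5, -1.2, 3.0]),
                    np.array([0.8, 0.1, -0.4]),
                    0.2,
                    f=lambda x: x))
```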
The golden question: How should we choose this f()?
You might be slightly tempted to choose a simple linear function (discussed in the previous post) for the neurons in your ANN, where for each neuron: f(x) = x.
Congratulations! Now you have multiple layers of cascaded linear units across your entire ANN! But the composition of many, many nested linear functions is still a linear function! Thus, your entire ANN can still only produce linear functions. Underwhelming indeed!
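Here is a quick numerical check of this collapse, using a toy three-layer "network" whose layers are plain matrix multiplications (no activation, i.e., f(x) = x; the sizes and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Three "layers" that are purely linear: y = W x (biases omitted for brevity)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))

deep_output = W3 @ (W2 @ (W1 @ x))   # pass x through all three layers
single_layer = (W3 @ W2 @ W1) @ x    # one equivalent linear layer

print(np.allclose(deep_output, single_layer))  # True: the stack is still linear
```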
We are interested in ANNs that can represent highly sophisticated non-linear functions, because those are the kinds of relationships we face in real-world problems! So, non-linearity is a desirable feature of an activation function!
Can we use a perceptron unit as our activation function across our ANN? The problem is that the perceptron's step function is discontinuous at 0, and hence non-differentiable at 0; everywhere else its derivative is exactly zero, so it offers gradient descent no signal to work with. This makes it unsuitable for gradient descent.
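A small sketch of this step activation (values chosen just for illustration) makes the problem visible:

```python
import numpy as np

def step(x):
    """Perceptron-style activation: 1 if x >= 0, else 0."""
    return np.where(x >= 0, 1.0, 0.0)

xs = np.array([-2.0, -0.001, 0.0, 0.001, 2.0])
print(step(xs))  # [0. 0. 1. 1. 1.] -- a jump at 0, flat everywhere else

# Numerical derivative: huge spike right at the jump, exactly zero in the
# flat regions, so weight updates based on this gradient go nowhere.
h = 1e-6
print((step(xs + h) - step(xs - h)) / (2 * h))
```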
In conclusion, our expectations from f() are two-fold:
- We want f() to be non-linear. This makes the entire ANN a collection of nested non-linear functions, capable of representing some scary non-linear function.
- We want f() to be continuous and differentiable with respect to its input. This makes the entire ANN trainable using gradient descent. Great stuff! (A concrete example that ticks both boxes follows below.)
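As a sketch of an activation satisfying both requirements, here is tanh used as a stand-in for f() (any smooth non-linear function would do for this check):

```python
import numpy as np

def f(x):
    return np.tanh(x)            # a smooth, non-linear activation

def f_prime(x):
    return 1.0 - np.tanh(x)**2   # its derivative, defined everywhere

# Non-linearity check: f(a + b) differs from f(a) + f(b) in general
a, b = 0.7, -1.3
print(f(a + b), f(a) + f(b))

# Differentiability check: non-zero, finite gradients for gradient descent
print(f_prime(np.array([-2.0, 0.0, 2.0])))
```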