Let’s consider the neural network that we talked about earlier in this post, where we have 1-dimensional input data, and the activation function in the output neuron is linear. This means that the input and the output of this neuron are identical, as if the neuron makes no changes to its input. This network is slightly different from the popular perceptron that we discussed in our previous post. The difference lies simply in their activation functions:
- The perceptron has a step function in its output neuron that outputs only 2 values, namely, -1 and +1. This is an architecture designed for a binary classification dataset. Moreover, we encourage the network to learn the weights that would make the network produce the correct +1 and -1 for the positive and negative examples in our training set.
- Here, however, the activation function is linear. This means that the output of this neural network can be any real number. This is a nice architecture for a regression problem, where we would like the network to produce a real value, predicting a certain metric, measurement, etc. For example, the price of oil next week, or the number of cars sold by tomorrow afternoon.
As a result, the output of the perceptron has changed from:

$$o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$$

to the following:

$$o(\vec{x}) = \vec{w} \cdot \vec{x}$$
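To make the contrast concrete, here is a minimal sketch (in NumPy, with made-up weights and a made-up input vector; all names and numbers are purely illustrative) of the two output rules side by side:

```python
import numpy as np

# Hypothetical weights and a single input example (illustrative values only)
w = np.array([0.5, -1.2, 0.8])
x = np.array([1.0, 2.0, -0.5])

net = np.dot(w, x)  # the weighted sum, identical for both units

perceptron_output = np.sign(net)  # step function: outputs only -1 or +1
linear_output = net               # linear activation: any real number

print(perceptron_output)  # -1.0
print(linear_output)      # -2.3
```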
Finally, in order for our Delta Rule to work, we need a measure that quantifies the performance of our network, that is, how far away the outputs are from the ground truth. This measure will be our error function: for every input and every choice of weight vector (from the hypothesis space), it tells us how far our output is from the ground truth.
One common error function that can be used here is the Sum of Squared Errors (SSE):

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \left( t_d - o_d \right)^2$$
Where:
- $t_d$: The ground truth for the training example $d$
- $o_d$: The output of the linear perceptron for the training example $d$
- $D$: The training set
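Just to make the definition concrete, here is a minimal sketch (in NumPy) that computes the SSE above on a made-up toy training set; all names and numbers are purely illustrative:

```python
import numpy as np

def sse(w, X, t):
    """Sum of squared errors: E(w) = 1/2 * sum over d of (t_d - o_d)^2."""
    o = X @ w                      # linear outputs o_d for every training example
    return 0.5 * np.sum((t - o) ** 2)

# Hypothetical toy training set: 4 examples, bias folded in as a column of 1s
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7],
              [1.0, 0.0, 0.9]])
t = np.array([1.1, -0.4, 2.3, 0.8])   # ground-truth targets t_d
w = np.array([0.1, 0.5, -0.2])        # one candidate weight vector

print(sse(w, X, t))  # the error E(w) for this particular choice of weights
```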
So, if you think about it, this error function measures the difference between the generated output and the ground truth for every example across the whole training set $D$. Note that the error is a function of our weight vector $\vec{w}$. The Delta rule searches through these weight vectors: it uses the current weights to generate the output for a given training example, measures the error, and then updates the weights to new values in a way that brings the outputs for subsequent training examples closer to the ground truth. So, as training goes on, the Delta rule finds better and better weights, and the error for those weights becomes smaller and smaller, until eventually the network has converged and we say that the model has been trained.
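Here is a minimal sketch of that training loop: batch gradient descent on the SSE, again on made-up toy data. The learning rate and epoch count are illustrative assumptions, not values from this post:

```python
import numpy as np

def train_delta_rule(X, t, eta=0.05, epochs=200):
    """Batch Delta rule: gradient descent on E(w) = 1/2 * sum (t_d - o_d)^2."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                 # current outputs for all training examples
        w += eta * X.T @ (t - o)  # delta_w_i = eta * sum_d (t_d - o_d) * x_{i,d}
    return w

# Hypothetical toy regression set (bias folded in as the first column of 1s)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
t = np.array([1.1, -0.4, 2.3, 0.8])

w = train_delta_rule(X, t)
print(w)                              # the learned weight vector
print(0.5 * np.sum((t - X @ w) ** 2)) # the final (much smaller) SSE
```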
You might wonder why we have the $\frac{1}{2}$ bit, or why this particular error function is chosen. I am not going to answer this question in this post, as it has something to do with Bayes’ rule and is beyond our current scope. Having said that, I will leave you with a claim:
From a Bayesian perspective, under certain conditions, it can be shown that the hypothesis that minimizes this particular error function is also the most probable hypothesis (i.e., set of weights) given the training data.
In plain English:
The hypothesis that minimizes this particular error function is the one that maximizes the probability of observing output values from our model that are as close as possible to the ground truth.
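For the mathematically inclined, the claim can be written down compactly. This is only a statement of it, not the derivation, and it rests on the assumed condition that each target is the underlying signal plus zero-mean Gaussian noise:

```latex
% Assuming t_d = f(x_d) + \epsilon_d, with \epsilon_d \sim \mathcal{N}(0, \sigma^2),
% the maximum-likelihood hypothesis coincides with the SSE minimizer:
h_{ML} = \arg\max_{h \in H} \, p(D \mid h)
       = \arg\min_{h \in H} \sum_{d \in D} \left( t_d - o_d \right)^2
```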
Side note: In case you would like to see the actual derivation of this error function from a Bayesian perspective, for a neural network with a linear output, I would recommend our course at MLDawn, linked below: