In this network, we have a one-dimensional input, $x$, together with a bias unit, $b$, that is always equal to 1 by definition. We have two weights, $w_1$ and $w_2$, which we are trying to learn using gradient descent. The big blue circle is a linear neuron and represents the output of the neural network. As you can see, the input and the weights are linearly combined to form $z = w_1 x + w_2 b$, which is also called the pre-activation of the neuron. We will further define an error function, $E$, that we are trying to minimize by learning an appropriate set of values for our weights. Let's define an error function:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{i \in D} (y_i - t_i)^2$$
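The forward pass described above can be sketched in a few lines of Python. The function name `forward` and the default `b=1.0` are my own choices; the linear combination itself follows the definition in the text.

```python
# Minimal sketch of the linear neuron's forward pass.
# x is the one-dimensional input, b is the bias unit (fixed at 1 by
# definition), and w1, w2 are the two weights we want to learn.

def forward(x, w1, w2, b=1.0):
    # Pre-activation: linear combination of the input and the bias unit.
    z = w1 * x + w2 * b
    # A linear neuron's output is simply its pre-activation.
    return z

print(forward(2.0, 0.5, -1.0))  # -> 0.0  (0.5 * 2.0 + (-1.0) * 1.0)
```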
Notice that this error is a function of the weight vector $\mathbf{w}$, which is the vector of all the weights in our neural network to be learned. For every training input $x_i$ in our training set $D$, we generate the output of our model, $y_i$. Then we compute the squared difference between this output and the desired output $t_i$. We compute this squared difference for every training input, sum them all up, and finally divide by 2 to obtain the total error across the entire training set. So you can see that minimizing this error across the entire training set means that, for every input $x_i$, the output of our model gets close to the target value $t_i$, which is our ground truth.
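The error computation just described can be sketched as follows. Here `data` is a list of $(x_i, t_i)$ pairs standing in for the training set $D$, and the model output is the linear neuron from the text; the toy data values are made up for illustration.

```python
# Hedged sketch of the squared-error function over the training set:
# E(w) = 1/2 * sum_i (y_i - t_i)^2, where y_i = w1 * x_i + w2 * 1.

def total_error(data, w1, w2):
    # Sum the squared difference for every (input, target) pair,
    # then divide the total by 2.
    return 0.5 * sum((w1 * x + w2 - t) ** 2 for x, t in data)

data = [(0.0, 1.0), (1.0, 3.0)]     # toy training set (made up)
print(total_error(data, 2.0, 1.0))  # -> 0.0, since y = 2x + 1 fits both points
print(total_error(data, 0.0, 0.0))  # -> 5.0, i.e. (1^2 + 3^2) / 2
```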
On a side note, there is a very good reason why we have the square operation and the division by 2 in this definition of the error. I have explained this in detail in my course "The Birth of Error Functions in Neural Networks". Click here to enroll (it is free!)
So now we need to take the derivative of our error function with respect to every one of the weights in our network. This will be the gradient of the error with respect to the weights. As a result, for every weight $w_k$ in our network, the derivative of the error with respect to that weight, denoted $\frac{\partial E}{\partial w_k}$, is computed as follows:
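As a numerical companion to this derivative, here is a sketch of the gradient of the squared error for the linear neuron, plus a plain gradient-descent update. Applying the chain rule to $E$ gives $\frac{\partial E}{\partial w_1} = \sum_i (y_i - t_i)\,x_i$ and $\frac{\partial E}{\partial w_2} = \sum_i (y_i - t_i)$, since the bias unit is always 1. The learning rate `eta` and the toy data are my own choices.

```python
def gradients(data, w1, w2):
    # dE/dw1 = sum_i (y_i - t_i) * x_i
    # dE/dw2 = sum_i (y_i - t_i) * 1   (the bias unit is always 1)
    g1 = g2 = 0.0
    for x, t in data:
        y = w1 * x + w2      # linear neuron output
        g1 += (y - t) * x
        g2 += (y - t)
    return g1, g2

def step(data, w1, w2, eta=0.1):
    # One gradient-descent update: move each weight against its gradient.
    g1, g2 = gradients(data, w1, w2)
    return w1 - eta * g1, w2 - eta * g2

data = [(0.0, 1.0), (1.0, 3.0)]  # toy training set (made up)
w1, w2 = 0.0, 0.0
for _ in range(500):
    w1, w2 = step(data, w1, w2)
print(round(w1, 2), round(w2, 2))  # -> 2.0 1.0, the exact fit y = 2x + 1
```

Note that for this tiny data set gradient descent recovers the line passing through both points, which drives the error to zero.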