Back-propagation algorithm
This lecture covers the back-propagation algorithm, a method for training fully connected neural networks. This class of neural networks consists of multiple layers of processing elements. The first layer is called the input layer, and the last layer is called the output layer; the layers in between are called hidden layers. Each layer consists of multiple neurons, which are connected to the neurons in the previous and next layers. The connections between the neurons carry weights, which determine how much influence a neuron has on the next layer. These neural networks are called multilayer perceptrons (MLPs) or feedforward neural networks.
Training a neural network involves finding the weights that minimize the error between the predicted output and the desired output. The back-propagation algorithm consists of two main steps: the forward pass and the backward pass. In the forward pass, the input data is passed through the network to compute the output using the current weights, and error signals are computed from the result. In the backward pass, the error signals are propagated back through the network to update the weights.
Multilayer perceptrons have three main features:
- Every neuron has a non-linear activation function, which allows the network to learn complex patterns in the data.
- The network consists of multiple hidden layers which are not directly connected to the input or output layers.
- The network exhibits a high degree of connectivity between the layers, which allows a large number of parameters to be learned.
Notation
The back-propagation algorithm can be mathematically described using the following notation:
- Indexes $i$, $j$, and $k$ are used to denote neurons in consecutive layers, implying that neuron $j$ is in the layer right after neuron $i$, and neuron $k$ is in the layer right after neuron $j$.
- Index $n$ is used to denote the training example.
- $\mathcal{E}(n)$ represents the sum of the squared errors for the $n$-th training example and is called the error energy.
- $\mathcal{E}_{av}$ represents the average error energy over all training examples.
- $e_j(n)$ represents the error signal of neuron $j$ for the $n$-th training example.
- $y_j(n)$ represents the output of neuron $j$ for the $n$-th training example.
- $d_j(n)$ represents the desired output of neuron $j$ for the $n$-th training example.
- $w_{ji}(n)$ represents the weight of the connection from neuron $i$ to neuron $j$.
- $\Delta w_{ji}(n)$ represents the change in the weight of the connection from neuron $i$ to neuron $j$.
- $v_j(n)$ is the induced local field of neuron $j$ for the $n$-th training example, i.e. the weighted sum of the inputs to neuron $j$ plus the bias.
- $\varphi_j(\cdot)$ represents the activation function of neuron $j$.
- $b_j$ is the bias of neuron $j$.
- $x_i(n)$ represents the $i$-th input of the network for the $n$-th training example.
- $o_j(n)$ represents the $j$-th output of the network for the $n$-th training example.
- $\eta$ is the learning rate, which determines how much the weights are updated during training.
- $m_l$ is the dimension (number of nodes) of the $l$-th layer ($l = 0, 1, \dots, L$), i.e. $m_0$ is the number of input nodes and $m_L$ is the number of output nodes.
- $L$ is the depth of the network.
Common schema
Error signal of the output layer:

$$e_j(n) = d_j(n) - y_j(n)$$

Error energy of the output layer:

$$\mathcal{E}(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n),$$

where $C$ is the set of all neurons in the output layer, and the average error energy over all $N$ training examples is:

$$\mathcal{E}_{av} = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n)$$
As the error energy is a function of the weights, the average error energy is a function of the weights as well. Thus, the average error energy for the current training set is a cost function, which serves as a measure of training effectiveness.
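As a small sketch of how this cost function is evaluated, assuming the error signals for a toy training set are already known (the numbers are made up):

```python
import numpy as np

# error signals e_j(n): N = 3 training examples, 2 output neurons
e = np.array([[ 0.2, -0.1],
              [ 0.0,  0.3],
              [-0.4,  0.1]])

energy = 0.5 * np.sum(e ** 2, axis=1)  # error energy E(n) for each example
energy_av = energy.mean()              # average error energy E_av (the cost)
```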
The induced local field of neuron $j$ is defined as the weighted sum of the inputs to neuron $j$ plus the bias:

$$v_j(n) = \sum_{i=1}^{m} w_{ji}(n) \, y_i(n) + b_j,$$

where $m$ is the number of input nodes to neuron $j$.

The functional signal of neuron $j$ is defined as the output of neuron $j$:

$$y_j(n) = \varphi_j(v_j(n))$$
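These two steps for a single neuron can be sketched in NumPy; the logistic activation and all numbers below are illustrative assumptions:

```python
import numpy as np

def phi(v):
    # logistic activation function, one common choice for the activation
    return 1.0 / (1.0 + np.exp(-v))

y_prev = np.array([0.5, -1.0, 0.25])  # outputs y_i of the previous layer
w_j = np.array([0.1, 0.4, -0.2])      # weights w_ji into neuron j
b_j = 0.3                             # bias b_j

v_j = w_j @ y_prev + b_j              # induced local field v_j(n)
y_j = phi(v_j)                        # functional signal y_j(n)
```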
The change in the weight of the connection from neuron $i$ to neuron $j$ is proportional to the partial derivative $\partial \mathcal{E}(n) / \partial w_{ji}(n)$ and may be expanded by using the chain rule as:

$$\Delta w_{ji}(n) = -\eta \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = -\eta \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)},$$

where $\eta$ is the learning rate, which determines how much the weights are updated during training. The negative sign indicates that we want to minimize the error energy by updating the weights in the direction of the negative gradient.
The partial derivatives in the above equation can be calculated as follows:

$$\frac{\partial \mathcal{E}(n)}{\partial e_j(n)} = e_j(n), \qquad \frac{\partial e_j(n)}{\partial y_j(n)} = -1, \qquad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n)), \qquad \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$$

Thus, the change in the weight of the connection from neuron $i$ to neuron $j$ can be calculated as:

$$\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n),$$

where

$$\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial v_j(n)}$$

is called the local gradient of neuron $j$ for the $n$-th training example. The local gradient is a measure of how much the error energy changes with respect to the induced local field of neuron $j$.
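The resulting delta rule is a one-line update. A minimal sketch, assuming the local gradient has already been computed (all values are made up):

```python
import numpy as np

eta = 0.1                             # learning rate
delta_j = 0.8                         # local gradient of neuron j, assumed known
y_prev = np.array([0.5, -1.0, 0.25])  # outputs y_i of the previous layer

delta_w = eta * delta_j * y_prev      # delta rule: eta * delta_j * y_i
w_j = np.array([0.1, 0.4, -0.2])
w_j = w_j + delta_w                   # gradient-descent step on the weights
```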
Notice that it is possible to distinguish two cases for the local gradient of neuron $j$:
- If neuron $j$ is an output neuron, then the local gradient can be calculated directly by using the corresponding desired output $d_j(n)$.
- If neuron $j$ is a hidden neuron, then the local gradient can be calculated by using the local gradients of the neurons in the next layer and the weights of the connections from neuron $j$ to the neurons in the next layer. This is because the error energy is a function of the outputs of all neurons in the network, and the output of neuron $j$ affects the outputs of all neurons in the next layer. This is a credit assignment problem, which is solved by the back-propagation algorithm.
The case of output layer
For the output layer, the local gradient can be calculated by using the corresponding desired output as follows:

$$\delta_j(n) = e_j(n) \, \varphi_j'(v_j(n)) = \left(d_j(n) - y_j(n)\right) \varphi_j'(v_j(n)),$$

where $e_j(n)$ is the error signal of neuron $j$ for the $n$-th training example, and $\varphi_j'(v_j(n))$ is the derivative of the activation function of neuron $j$ with respect to the induced local field.
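For a logistic activation the derivative is $\varphi'(v) = \varphi(v)(1 - \varphi(v))$, so the output-layer local gradient can be sketched as follows (the field and desired output are made-up numbers):

```python
import numpy as np

def phi(v):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

v_j = 0.2                          # induced local field of output neuron j
d_j = 1.0                          # desired output d_j(n)

y_j = phi(v_j)                     # actual output y_j(n)
e_j = d_j - y_j                    # error signal e_j(n)
delta_j = e_j * y_j * (1.0 - y_j)  # local gradient: e_j(n) * phi'(v_j(n))
```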
The case of hidden layer
For a hidden layer, denote the hidden neuron by index $j$; the local gradient can be calculated by using the local gradients of the neurons in the next layer and the weights of the connections from neuron $j$ to the neurons in the next layer.

First, the local gradient of neuron $j$ can be written by using the chain rule as follows:

$$\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial v_j(n)} = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} \, \varphi_j'(v_j(n))$$

Let us denote the output neurons by index $k$; then the error energy can be written as follows:

$$\mathcal{E}(n) = \frac{1}{2} \sum_{k} e_k^2(n),$$

where $e_k(n)$ is the error signal of output neuron $k$ for the $n$-th training example.

Thus, the partial derivative of the error energy with respect to the output of neuron $j$ can be calculated as follows:

$$\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$$

Since $e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n))$, the partial derivative of the error signal with respect to the induced local field of output neuron $k$ is:

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n))$$

Further, since $v_k(n) = \sum_{j} w_{kj}(n) \, y_j(n) + b_k$,

$$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$$

Substituting the above equations into the equation for $\partial \mathcal{E}(n) / \partial y_j(n)$ gives:

$$\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} = -\sum_{k} e_k(n) \, \varphi_k'(v_k(n)) \, w_{kj}(n) = -\sum_{k} \delta_k(n) \, w_{kj}(n)$$

Finally, the local gradient of hidden neuron $j$ can be calculated as follows:

$$\delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n) \, w_{kj}(n)$$

The equation above is called the back-propagation formula; it is used to calculate the local gradient of a hidden neuron by using the local gradients of the neurons in the next layer and the weights of the connections from the hidden neuron to the neurons in the next layer.
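A sketch of the back-propagation formula for a single hidden neuron, assuming the local gradients of the next layer are already known and the activation is logistic (all numbers are made up):

```python
import numpy as np

def phi_prime(v):
    # derivative of the logistic activation: phi(v) * (1 - phi(v))
    y = 1.0 / (1.0 + np.exp(-v))
    return y * (1.0 - y)

v_j = -0.3                           # induced local field of hidden neuron j
delta_next = np.array([0.1, -0.05])  # local gradients delta_k of the next layer
w_next = np.array([0.6, 0.2])        # weights w_kj from neuron j to each k

# back-propagation formula: phi'(v_j) * sum_k delta_k * w_kj
delta_j = phi_prime(v_j) * (delta_next @ w_next)
```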
Forward pass
The forward pass of the back-propagation algorithm involves passing the input data through the network to compute the output using the current weights.
Assume that the input vector for the $n$-th training example is $\mathbf{x}(n) = [x_1(n), x_2(n), \dots, x_{m_0}(n)]^T$, where $m_0$ is the number of input nodes. The output of the input layer is simply the input vector itself, i.e. $y_i(n) = x_i(n)$ for $i = 1, \dots, m_0$.

Further, for each subsequent layer it is necessary to compute the induced local fields and apply the activation function:

$$v_j(n) = \sum_{i} w_{ji}(n) \, y_i(n) + b_j, \qquad y_j(n) = \varphi_j(v_j(n))$$

If the layer is the output layer, then the output of the network for the $n$-th training example is $o_j(n) = y_j(n)$ for $j = 1, \dots, m_L$, where $m_L$ is the number of output nodes. The error signal for output neuron $j$, calculated by using the desired output, is $e_j(n) = d_j(n) - o_j(n)$ for $j = 1, \dots, m_L$.
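The full forward pass repeats the same two steps layer by layer. A compact sketch with a logistic activation and randomly initialized weights (the layer sizes are arbitrary assumptions):

```python
import numpy as np

def phi(v):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # m_0 = 3 input nodes, one hidden layer, m_L = 2 output nodes
W = [rng.normal(0.0, 0.5, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

def forward(x):
    y = x                   # the output of the input layer is the input itself
    for W_l, b_l in zip(W, b):
        v = W_l @ y + b_l   # induced local fields of the layer
        y = phi(v)          # functional signals of the layer
    return y                # network output o(n)

o = forward(np.array([1.0, 0.0, -1.0]))
```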
Backward pass
The backward pass begins at the output layer, which is presented with the error signals. These signals are passed layer by layer from right to left, and the local gradient is calculated recursively for each neuron. This recursive process involves modifying the synaptic weights according to the delta rule $\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)$. For a neuron located in the output layer, the local gradient is equal to the corresponding error signal multiplied by the derivative of the nonlinear activation function.

The local gradients are then used to calculate the changes in the weights feeding the output layer. Knowing the local gradients of all neurons in the output layer, we can calculate the local gradients of all neurons in the previous layer by using the back-propagation formula, and therefore the adjustments to the weights of the connections into that layer. These calculations are repeated for all layers in the backward direction.
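Putting the two passes together, one combined forward/backward step for a single training example might look like the sketch below. Logistic activations, the layer sizes, and the learning rate are all illustrative assumptions; the bias is kept as a separate vector rather than folded into the weights.

```python
import numpy as np

def phi(v):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W, b, x, d, eta=0.5):
    """One forward pass and one backward pass for a single training example."""
    # Forward pass: store the output of every layer.
    ys = [x]
    for W_l, b_l in zip(W, b):
        ys.append(phi(W_l @ ys[-1] + b_l))
    # Output layer: delta = e * phi'(v), with phi'(v) = y * (1 - y) for the logistic.
    e = d - ys[-1]
    delta = e * ys[-1] * (1.0 - ys[-1])
    # Backward pass: delta rule for each layer, back-propagation formula below it.
    for l in range(len(W) - 1, -1, -1):
        dW = eta * np.outer(delta, ys[l])
        db = eta * delta
        if l > 0:
            # back-propagation formula, using the pre-update weights of layer l
            delta = (W[l].T @ delta) * ys[l] * (1.0 - ys[l])
        W[l] += dW
        b[l] += db
    return 0.5 * np.sum(e ** 2)  # error energy E(n) before the update

# Toy usage: repeatedly fit a single example; the error energy should shrink.
rng = np.random.default_rng(1)
sizes = [2, 3, 1]
W = [rng.normal(0.0, 0.5, (sizes[l + 1], sizes[l])) for l in range(2)]
b = [np.zeros(sizes[l + 1]) for l in range(2)]
x, d = np.array([1.0, 0.0]), np.array([1.0])
errors = [backprop_step(W, b, x, d) for _ in range(200)]
```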
However, the back-propagation algorithm itself does not specify how to update the weights, and it can be used with different optimization algorithms to find the optimal weights that minimize the error energy. Several optimization algorithms will be covered in the next lecture.