The generalized delta rule (Rumelhart, Hinton & Williams, 1986) is used to train a multilayer perceptron to mediate a desired input-output mapping. It is a form of supervised learning: a finite set of input-output pairs is presented repeatedly, in random order, during training. Prior to training, a network is a “pretty blank” slate; all of its connection weights, and all of the biases of its activation functions, are initialized as small random numbers. Training consists of repeatedly presenting input-output pairs and then modifying weights so as to reduce overall network error.

A single presentation of an input-output pair proceeds as follows: First, the input pattern is presented, which causes signals to be sent to hidden units, which activate and send signals to the output units, which activate to represent the network’s response to the input pattern. Second, the output unit responses are compared to the desired responses; an error term is computed for each output unit. Third, an output unit’s error is used to modify the weights of its connections. This is accomplished by adding a weight change to the existing weight. The weight change is computed by multiplying four different numbers together: a learning rate, the derivative of the unit’s activation function, the output unit’s error, and the current activity at the input end of the connection. Up to this point, learning is functionally the same as performing gradient descent training on a perceptron (Dawson, 2004).
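The first three steps can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes a single hidden layer and the logistic activation function (whose derivative is `a * (1 - a)` for activity `a`), and the function and parameter names are hypothetical.

```python
import math

def logistic(net):
    """Assumed activation function; the entry does not name one."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(inputs, w_hidden, b_hidden, w_output, b_output):
    """Step one: send signals from input units to hidden units to output units.

    w_hidden[j][i] is the weight from input i to hidden unit j;
    w_output[k][j] is the weight from hidden unit j to output unit k.
    """
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = [logistic(sum(w * h for w, h in zip(ws, hidden)) + b)
              for ws, b in zip(w_output, b_output)]
    return hidden, output

def output_weight_changes(hidden, output, target, rate):
    """Steps two and three: compute each output unit's error, then the
    change to add to each of its incoming weights."""
    changes = []
    for o, t in zip(output, target):
        error = t - o                 # desired response minus actual response
        derivative = o * (1.0 - o)    # derivative of the logistic function
        # Product of four numbers: rate, derivative, error, sending activity.
        changes.append([rate * derivative * error * h for h in hidden])
    return changes
```

Each returned change is added to the corresponding existing weight, for example `w_output[k][j] += changes[k][j]`.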

The fourth step differentiates the generalized delta rule from older rules: each hidden unit computes its error. This is done by treating an output unit’s error as if it were activity, and sending it backwards as a signal through a connection to a hidden unit. As this signal is sent, it is multiplied by the weight of the connection. Each hidden unit computes its error by summing together all of the error signals that it receives from the output units to which it is connected. Fifth, once hidden unit error has been computed, the weights of the hidden units can be modified using the same equation that was used to alter the weights of each of the output units.
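Steps four and five might be sketched as below. The names are hypothetical, and the logistic activation function is assumed. One further assumption should be stated plainly: in the standard derivation, the signal an output unit sends backwards is its error already multiplied by the derivative of its activation function, so `output_signals` here is taken to hold those scaled values.

```python
def hidden_errors(output_signals, output_weights):
    """Step four: each hidden unit sums the error signals it receives.

    output_signals[k] is the signal sent back by output unit k (assumed to be
    its error scaled by its activation derivative);
    output_weights[k][j] is the weight from hidden unit j to output unit k.
    """
    n_hidden = len(output_weights[0])
    return [sum(signal * weights[j]   # signal multiplied by the connection weight
                for signal, weights in zip(output_signals, output_weights))
            for j in range(n_hidden)]

def hidden_weight_changes(rate, hidden_activity, hidden_error, inputs):
    """Step five: the same four-number product, applied to hidden units."""
    changes = []
    for h, err in zip(hidden_activity, hidden_error):
        derivative = h * (1.0 - h)    # derivative of the assumed logistic function
        changes.append([rate * derivative * err * x for x in inputs])
    return changes
```

Note that the only difference from the output-unit case is where the error comes from: it is assembled from signals sent backwards rather than from a comparison with a desired response.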

This procedure can be repeated iteratively if there is more than one layer of hidden units. That is, the error of each hidden unit in one layer can be propagated backwards to the preceding layer as an error signal once the hidden unit weights have been modified. Learning about this one pattern stops once all of the connections have been modified. The next training pattern is then presented to the input units, and the process repeats.
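Putting the pieces together, a whole training run might look like the sketch below. It is an illustration under stated assumptions rather than a reference implementation: one hidden layer, logistic activations, biases stored as a trailing weight on each unit, and all error signals computed before any weight is changed (a common ordering). All names, the learning rate, and the initialization range are hypothetical choices.

```python
import math
import random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def init_layer(n_units, n_inputs, rng, scale=0.1):
    # Small random numbers; each row is one unit's weights plus a trailing bias.
    return [[rng.uniform(-scale, scale) for _ in range(n_inputs + 1)]
            for _ in range(n_units)]

def activate(layer, signals):
    # zip stops at the shorter sequence, so the trailing bias is added separately.
    return [logistic(sum(w * s for w, s in zip(unit, signals)) + unit[-1])
            for unit in layer]

def update(layer, deltas, signals, rate):
    for unit, delta in zip(layer, deltas):
        for i, s in enumerate(signals):
            unit[i] += rate * delta * s   # add the weight change to the weight
        unit[-1] += rate * delta          # bias: weight from a constant input of 1

def present(pattern, target, hidden_layer, output_layer, rate):
    """One presentation of an input-output pair; returns its squared error."""
    hidden = activate(hidden_layer, pattern)
    output = activate(output_layer, hidden)
    # Output error scaled by the logistic derivative.
    out_delta = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
    # Hidden error: weighted sum of signals sent back from the output units.
    hid_delta = [h * (1.0 - h) * sum(d * unit[j]
                                     for d, unit in zip(out_delta, output_layer))
                 for j, h in enumerate(hidden)]
    update(output_layer, out_delta, hidden, rate)
    update(hidden_layer, hid_delta, pattern, rate)
    return sum((t - o) ** 2 for t, o in zip(target, output))

def train(pairs, n_hidden, epochs, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    hidden_layer = init_layer(n_hidden, n_in, rng)
    output_layer = init_layer(n_out, n_hidden, rng)
    history = []                          # summed squared error per sweep
    for _ in range(epochs):
        order = list(pairs)
        rng.shuffle(order)                # random order of presentation
        history.append(sum(present(x, t, hidden_layer, output_layer, rate)
                           for x, t in order))
    return hidden_layer, output_layer, history
```

For example, `train([([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])], n_hidden=2, epochs=2000)` trains a small network on logical OR; the returned `history` lets one check that overall network error falls across sweeps.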

**References:**

- Dawson, M. R. W. (2004). *Minds and Machines: Connectionism and Psychological Modeling*. Malden, MA: Blackwell Publishing.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. *Nature, 323*, 533-536.

(Added April 2011)