


Momentum is a technique used in some modern training procedures to speed up the learning of an artificial neural network. Imagine an error space that is shaped like a trough. The route to minimum error follows the length of the trough. However, the sides of the trough are much steeper than its descending length. As a result, gradient descent learning will move the network mainly down the sides of the trough rather than along its length, which greatly prolongs learning.
Momentum addresses this problem by using the learning rule to define a weight change, and then adding a certain proportion of the previous weight change as well (Rumelhart, Hinton & Williams, 1986). The proportion is defined by a constant; this constant is the momentum. In other words, when momentum is used, the new connection weight is equal to the old connection weight, plus the weight change defined by the learning rule, plus the momentum times the previous weight change. The effect of momentum builds over time, hence its name, and when the error space is trough-like in nature it will greatly accelerate learning. However, there is no theoretical support for the notion of momentum-driven learning in cognitive science or in neuroscience.
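The update described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from Rumelhart, Hinton and Williams (1986): the two-weight quadratic error surface standing in for the trough, the function name `descend`, the starting weights, the learning rate of 0.02, and the momentum of 0.9.

```python
def descend(momentum, learning_rate=0.02, steps=200):
    """Gradient descent with momentum on a hypothetical trough-shaped error.

    Assumed error surface: E = 0.5 * (1 * w0**2 + 50 * w1**2).
    w0 runs along the shallow length of the trough, w1 across its steep sides.
    Returns the error remaining after the given number of steps.
    """
    curvature = [1.0, 50.0]        # shallow direction, steep direction
    weights = [10.0, 1.0]          # arbitrary starting point
    previous_change = [0.0, 0.0]   # no previous weight change at the start

    for _ in range(steps):
        for i in range(2):
            gradient = curvature[i] * weights[i]
            # Weight change from the learning rule (plain gradient descent)
            # plus the momentum times the previous weight change.
            change = -learning_rate * gradient + momentum * previous_change[i]
            weights[i] += change
            previous_change[i] = change

    return 0.5 * sum(c * w * w for c, w in zip(curvature, weights))


error_plain = descend(momentum=0.0)      # ordinary gradient descent
error_momentum = descend(momentum=0.9)   # same rule with momentum added
print(error_plain, error_momentum)
```

Under these assumed settings, the run with momentum should end with a much smaller error than the run without it, because the small but consistent weight changes along the length of the trough accumulate from step to step. Other surfaces or constants may behave differently; a momentum that is too large for the chosen learning rate can make learning unstable.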
References:

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.
(Added January 2010)



