Connectionism, Confusion, and Cognitive Science

Michael R.W. Dawson & Kevin S. Shamanski

This is the HTML version of a paper that appeared in The Journal Of Intelligent Systems, 1994. The full reference to this paper is in the index file (which can be accessed from the end of the document). Please cite the journal reference to this paper if you cite any of the material below.

ABSTRACT

This paper argues that while connectionist technology may be flourishing, connectionist cognitive science is languishing. Parallel distributed processing (PDP) networks can be proven to be computationally powerful, but these proofs offer few useful constraints for developing models in cognitive science. Connectionist algorithms -- that is, the PDP networks themselves -- can exhibit interesting behaviours, but are difficult to interpret and are based upon an incomplete functional architecture. While PDP networks are often touted as being more "biologically plausible" than classical AI models, they do not appear to have been widely endorsed by neurophysiologists because they incorporate many implausible assumptions, and they may not model appropriate physiological processes. In our view, connectionism faces such problems because the design decisions governing connectionist theory are determined by engineering needs -- generating the appropriate output -- and not by cognitive or neurophysiological considerations. As a result, the true nature of connectionist theories, and their potential contribution to cognitive science, is unclear. We propose that the current confusion surrounding connectionism's role in cognitive science could be greatly alleviated by adopting a research programme in which connectionists paid much more attention to validating the PDP architecture.

Connectionists appear to be making great advances in the technology of knowledge engineering, and now feel poised to answer the difficult questions about machine intelligence that seem to have passed classical artificial intelligence by. Parallel distributed processing (PDP) models have been developed for a diverse range of phenomena, as a survey of almost any journal related to cognitive science will show. For example, in recent years Psychological Review has published connectionist models concerned with aspects of reading (Hinton & Shallice, 1991; Seidenberg & McClelland, 1989), classical learning theory (Kehoe, 1988), automatic processing (Cohen, Dunbar, & McClelland, 1991), sentence production (Dell, 1986), apparent motion (Dawson, 1991), and dreaming (Antrobus, 1991). In addition, many basic connectionist ideas are being directly implemented in VLSI hardware (e.g., Jabri & Flower, 1991) under the assumption that increases in computer power and speed require radical new parallel architectures (e.g., Hillis, 1985; Müller & Reinhardt, 1990, p. 17). "The neural network revolution has happened. We are living in the aftermath" (Hanson & Olson, 1991, p. 332).

While connectionist technology may indeed be flourishing, connectionist cognitive science is languishing. PDP networks may generate interesting behaviour, but it is not clear that they do so by emulating the fundamental nature of human cognitive processes. In our view, the design decisions governing connectionist theory are determined by engineering needs -- generating the appropriate output -- and not by cognitive or neurophysiological considerations.

The goal of this paper is to illustrate the gap between connectionist technology and connectionist cognitive science. Connectionist networks can be proven to be computationally powerful, but these proofs offer no meaningful constraints for designing cognitive models. Connectionist algorithms -- that is, the PDP networks themselves -- can exhibit interesting behaviours, but are difficult to interpret and are based upon an insufficient functional architecture. While PDP networks are often touted as being more "biologically plausible" than classical or symbolic artificial intelligence (AI) models, they do not appear to have been widely endorsed by neurophysiologists because they incorporate many implausible and unjustified assumptions, and they may not model appropriate physiological processes.

PDP Networks And Information Processing

The explosion of interest in connectionist systems over the past decade has been accompanied by the development of diverse architectures (for overviews, see Cowan & Sharp, 1988; Hecht-Nielsen, 1990; Müller & Reinhardt, 1990). These have ranged from simulations designed to mimic (with varying detail) specific neural circuits (e.g., Granger, Ambros-Ingerson, & Lynch, 1989; Grossberg, 1991; Grossberg & Rudd, 1989, 1992; Lynch, Granger, Larson, & Baudry, 1989) to new computer designs that have little to do with human cognition, but which use parallel processing to solve problems that are ill-posed or that require simultaneous satisfaction of multiple constraints (e.g., Hillis, 1985; Abu-Mostafa & Psaltis, 1987). Given this diversity, it is important at the outset to identify the branch of connectionism with which we are particularly concerned. This paper examines the characteristics of what has been called generic connectionism (e.g., Anderson & Rosenfeld, 1988, p. xv), because it appears to have had the most impact on cognitive science in general.

A detailed description of the generic connectionist architecture is provided by Rumelhart, Hinton and McClelland (1986). PDP models are defined as networks of simple, interconnected processing units. A single processing unit is characterized by three components: a net input function which defines the total signal to the unit, an activation function which specifies the unit's current "numerical state", and an output function which defines the signal sent by the unit to others. Such signals are sent through connections between processing units, which serve as communication channels that transfer numeric signals from one unit to another. Each connection is associated with a numerical strength, which is used to scale transmitted signals. Connection strengths can be modified by applying a learning rule, which serves to teach a network how to perform some desired task. For instance, the generalized delta rule (Rumelhart, Hinton, & Williams, 1986a, 1986b) computes an error signal using the difference between the observed and desired responses of the network. This error signal is then "propagated backwards" through the network, and used to change connection weights, so that the network's performance will improve.
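The three components of a processing unit, and the way an error signal changes connection strengths, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the formulation given by Rumelhart, Hinton and McClelland: the function names, the logistic activation, the identity output function, the learning rate and the OR task are all our own choices. It also trains a single unit with the simple delta rule, so the backward propagation of error through hidden units that defines the generalized delta rule is deliberately omitted.

```python
import math

def net_input(weights, bias, inputs):
    """Net input function: total weighted signal arriving at the unit."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def activation(net):
    """Activation function (assumed logistic): the unit's numerical state."""
    return 1.0 / (1.0 + math.exp(-net))

def output(act):
    """Output function (assumed identity): the signal sent to other units."""
    return act

def unit_response(weights, bias, inputs):
    return output(activation(net_input(weights, bias, inputs)))

def train_delta(patterns, epochs=2000, rate=0.5):
    """Simple delta rule for one logistic unit; no hidden layer is involved."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            out = unit_response(weights, bias, inputs)
            # Error signal: desired minus observed response, scaled by the
            # slope of the logistic activation function.
            delta = (target - out) * out * (1.0 - out)
            # Each connection strength changes in proportion to the error
            # signal and to the signal that the connection carried.
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

# Hypothetical training set: the logical OR of two binary inputs.
OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_delta(OR_PATTERNS)
responses = [round(unit_response(weights, bias, x)) for x, _ in OR_PATTERNS]
```

After training, rounding each response in `responses` should reproduce the OR truth table, because OR can be separated by a single unit.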

Models created from this generic architecture are usually described as radically different from so-called classical AI systems (e.g., Broadbent, 1985; Churchland & Sejnowski, 1989; Clark, 1989; Fodor & Pylyshyn, 1988; Hawthorne, 1989; Hecht-Nielsen, 1990; McClelland, Rumelhart, & Hinton, 1986; Rumelhart, Smolensky, McClelland, & Hinton, 1986; Schneider, 1987; Smolensky, 1988). However, Fodor and Pylyshyn (1988, pp. 7-11) have argued convincingly that both classical and connectionist theories posit representational or semantic states: in short, both are information processing systems. Information processing systems need to be described at three different levels of analysis -- computational, algorithmic, and implementational -- if they are to be completely understood (e.g., Marr, 1982, Chap. 1; Pylyshyn, 1984). In the following sections, we examine PDP networks from this tri-level perspective in order to ascertain the relationship between connectionist theory and cognitive science. First, we consider the computational power of these networks. Second, we consider the types of algorithms or procedures that these networks define. Third, we consider the relationship between the PDP architecture and neurophysiology. We show, at each level of description, that while current PDP models have some intriguing properties, their potential contribution to cognitive science is uncertain at best.

COMPUTATIONAL DESCRIPTIONS OF PDP NETWORKS

PDP Networks Are Powerful Information Processing Systems

A computational description of an information processor accounts for the system's competence -- it defines the kinds of functions that a system can compute. In cognitive science, descriptions of this sort are generally used to fulfill two different purposes. First, these accounts can be used to rigorously define information processing problems as the first step in a top-down research programme that has as its ultimate goal the creation of a working computer model (e.g., Marr, 1982). Second, computational analyses can be used to assess the potential adequacy of a general class of model, by determining whether the kinds of functions it can compute are sufficiently rich to capture interesting cognitive regularities. Computational analyses of connectionist systems have focused on this second purpose.

Many researchers have argued that, in principle, PDP networks are extremely powerful information processing systems. Below we briefly review the evidence for three different claims of this sort: that PDP networks are functionally equivalent to Universal Turing machines, that PDP networks are arbitrary pattern classifiers, and that PDP networks are universal function approximators.

Connectionist networks are equivalent to Universal Turing Machines. Even a cursory look at connectionist theory indicates that it is very similar to classical associationism (e.g., Bechtel, 1985; Bechtel & Abrahamsen, 1991, pp. 101-103). However, this resemblance is also disconcerting. It can be strongly argued that associationist models are formally equivalent to finite state automata, and as a result are not powerful enough in principle to instantiate human cognition (e.g., Bever, Fodor, & Garrett, 1968). This is why classical AI attempts to design models that are equivalent to Universal Turing Machines (UTMs). If connectionist systems were equivalent to classical associationist models, then their limited computational power would make them extremely unattractive to cognitive science (see also Fodor & Pylyshyn, 1988; Lachter & Bever, 1988).

However, strong arguments have been made that PDP models have the same competence as classical AI systems. In some of the earliest work on neural networks, McCulloch and Pitts (1943/1988) examined finite networks whose components could perform simple logical operations like AND, OR, and NOT. They were able to prove that such systems could compute any function that required a finite number of these operations. From this perspective, the network was only a finite state automaton (see also Hopcroft & Ullman, 1979, p. 47; Minsky, 1972, Chap. 3). However, McCulloch and Pitts went on to show that a UTM could be constructed from such a network, by providing the network the means to move along, sense, and rewrite an external "tape" or memory. "To psychology, however defined, specification of the net would contribute all that could be achieved in that field" (McCulloch & Pitts, 1943/1988, p. 25).
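A small Python sketch may make the finite-network half of this argument concrete. It uses binary threshold units with signed weights, which is a common textbook simplification and not McCulloch and Pitts's original formalism; the helper `mp_unit` and the particular weights and thresholds are assumptions made for illustration. The external tape needed for UTM equivalence is not modelled.

```python
def mp_unit(weights, threshold):
    """Build a binary threshold unit that fires (1) when the weighted
    sum of its inputs reaches the threshold, and is silent (0) otherwise."""
    def fire(*inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= threshold else 0
    return fire

# Units for the three logical operations mentioned in the text.
AND = mp_unit([1, 1], 2)
OR = mp_unit([1, 1], 1)
NOT = mp_unit([-1], 0)

def xor(a, b):
    """A finite composition of the units: (a OR b) AND NOT (a AND b)."""
    return AND(OR(a, b), NOT(AND(a, b)))

truth_table = {(a, b): xor(a, b) for a in (0, 1) for b in (0, 1)}
```

Any function that needs only a finite number of such operations can be assembled in the same way, which is why a network of this kind, taken by itself, remains a finite state device.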

Connectionist networks are arbitrary pattern classifiers. Connectionist networks are also commonly used to classify patterns (for reviews, see Carpenter, 1989; Lippmann, 1987, 1989). Essentially, the set of input activities for a particular stimulus defines the location of a point in a multidimensional pattern space. The network "carves" this pattern space into different decision regions, which (potentially) can have different and complex shapes. The network classifies the input pattern by generating the "name" (i.e., a unique pattern of output unit activity) of the decision region in which the stimulus pattern point is located.

When a network is described as a pattern classifier, claims about its computational power focus on the decision regions that it can "carve", because this will define the complexity of the classifications that it can perform. For example, a network with a monotonic activation function in its output unit, and no intermediate processors, only has the ability to decide whether an input belongs to one of two distinct categories: this network can only "carve" a single hyperplane through the multidimensional pattern space. Stimuli located to one side of the hyperplane are assigned one category label, and stimuli located to the other side are assigned the second category label (see Figure 1). Such systems cannot learn the simple XOR relationship, because it requires a more sophisticated partitioning of the pattern space -- specifically, two parallel hyperplanes. This more complicated partitioning can be accomplished either by adding a single layer of hidden processors (e.g., Rumelhart, Hinton, & Williams, 1986b, pp. 319-320) or by using a nonmonotonic activation function (see Figure 1c).
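The XOR example can be sketched in Python. The code below is illustrative only: the weight grid, the Gaussian activation, its width, and the hand-set hidden-unit weights are our assumptions and are not taken from the cited sources. The grid search shows that no single-hyperplane unit on that grid solves XOR; it is a demonstration over a finite set of candidate weights, not a proof.

```python
import math
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step_unit(w1, w2, bias, x):
    """Monotonic unit: carves one hyperplane w1*x1 + w2*x2 + bias = 0."""
    return 1 if w1 * x[0] + w2 * x[1] + bias > 0 else 0

def gaussian_unit(w1, w2, mu, x, width=0.5):
    """Nonmonotonic unit: fires only when the net input lies near mu, which
    amounts to two parallel hyperplanes on either side of net = mu."""
    net = w1 * x[0] + w2 * x[1]
    return 1 if math.exp(-((net - mu) / width) ** 2) > 0.5 else 0

def solves_xor(unit):
    return all(unit(x) == label for x, label in XOR.items())

# Search a coarse grid of weights and biases (-4.0 to 4.0 in steps of 0.5)
# for a single monotonic unit that classifies all four XOR patterns.
grid = [i / 2 for i in range(-8, 9)]
single_plane_solutions = [
    (w1, w2, b) for w1, w2, b in product(grid, repeat=3)
    if solves_xor(lambda x: step_unit(w1, w2, b, x))
]

def hidden_layer_net(x):
    """One layer of two hidden units (an OR-like and an AND-like detector)
    feeding a monotonic output unit."""
    h_or = step_unit(1, 1, -0.5, x)
    h_and = step_unit(1, 1, -1.5, x)
    return step_unit(1, -1, -0.5, (h_or, h_and))

gaussian_ok = solves_xor(lambda x: gaussian_unit(1, 1, 1.0, x))
hidden_ok = solves_xor(hidden_layer_net)
```

The search returns an empty list, while both the nonmonotonic unit and the network with a hidden layer classify all four patterns correctly.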

How many additional layers of processors are required to partition the pattern space into arbitrary decision regions, and thus make a network capable of any desired classification? Lippmann (1987, p. 16), by considering the shape of decision regions created by each additional layer of monotonic processors, has shown that a network with only two layers of hidden units (i.e., a three-layer perceptron) is capable of "carving" a pattern space into arbitrary decision regions. "No more than three layers are required in perceptron-like feed-forward nets".

Connectionist networks are universal approximators. Historically, PDP networks have been most frequently described as pattern classifiers. Recently, however, with the advent of so-called radial basis function (RBF) networks (e.g., Girosi & Poggio, 1990; Hartman, Keeler, & Kowalski, 1989; Moody & Darken, 1989; Poggio & Girosi, 1989, 1990; Renals, 1989), connectionist systems are now often described as function approximators. Imagine, for example, a mathematical "surface" defined in N-dimensional space. At each location in this space this surface has a definite height. A function approximating network with N input units and one output unit would take as input the coordinates of a location in this space, and would output the height of the surface at this location. How powerful a function approximator can a PDP network be? Rumelhart, Hinton, and Williams (1986, p. 319) have claimed that "if we have the right connections from the input units to a large enough set of hidden units, we can always find a representation that will perform any mapping from input to output". More recently, researchers have attempted to analytically justify this bold claim, and have in fact proven that many different kinds of networks are universal approximators. That is, if there are no restrictions on the number of hidden units in the network and if there are no restrictions on the size of the connection weights, then in principle a network can be created to approximate -- over a finite interval -- any continuous mathematical function to an arbitrary degree of precision. This is true for networks with a single layer of hidden units whose activation function is a sigmoid-shaped "squashing" function (e.g., Cotter, 1990; Cybenko, 1989; Funahashi, 1989), for networks with multiple layers of such hidden units (Hornik, Stinchcombe, & White, 1989), and for RBF networks (Hartman, Keeler, & Kowalski, 1989).
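The idea of a network that returns the "height" of a surface can be sketched for the one-dimensional case (N = 1). The Python code below is a minimal illustration, not the training procedure of any cited paper: it places one Gaussian hidden unit on each of nine sample points of a sine curve and obtains the output weights by exact interpolation, that is, by solving a small linear system. The target function, the number of units, the width value and all function names are our assumptions.

```python
import math

def rbf(x, centre, width):
    """Gaussian radial basis function: response decays with distance from centre."""
    return math.exp(-((x - centre) / width) ** 2)

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution

def fit_rbf_network(xs, heights, width):
    """One hidden unit per sample point; output weights reproduce the samples."""
    design = [[rbf(x, c, width) for c in xs] for x in xs]
    return solve(design, heights)

def network_output(x, centres, weights, width):
    """Single output unit: weighted sum of the hidden-unit responses."""
    return sum(w * rbf(x, c, width) for w, c in zip(weights, centres))

# Sample the "surface" sin(x) at nine points on the finite interval [0, 2*pi].
WIDTH = 0.8
centres = [k * math.pi / 4 for k in range(9)]
heights = [math.sin(c) for c in centres]
weights = fit_rbf_network(centres, heights, WIDTH)
max_training_error = max(
    abs(network_output(c, centres, weights, WIDTH) - h)
    for c, h in zip(centres, heights)
)
```

The fitted network reproduces the sampled heights to within numerical precision. How well it does between the samples depends on the number of hidden units and on the width, which is where the unrestricted resources assumed by the approximation proofs enter.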

How Are Claims About PDP Competence Related To Cognitive Science?

It is obvious from the discussion above that PDP networks can be described as very powerful computational systems. Nevertheless, there are a number of indications that such competence is not by itself sufficient to claim that PDP models are appropriate for cognitive science:

Competence is irrelevant in the absence of performance. Proofs about computational power can be extremely powerful, insofar as they may rule out a proposal for an architecture of mind (e.g., Bever, Fodor, & Garrett, 1968). Nevertheless, the demonstration that a connectionist architecture has the competence of a UTM only suggests its plausibility for psychological modeling; it does not justify adopting such networks as the most plausible candidate. This is because one can have powerful computational competence in machines whose performance characteristics clearly rule them out for modeling purposes. For example, the tape head of a Turing machine is limited to relative memory access, and thus performs painstakingly slow serial manipulations of input data. As a result, researchers do not seriously propose this particular type of machine for cognitive models, even though it has enormous competence. Instead, they explore architectures like production systems that have the same computational power, but have far more powerful performance characteristics (for an introduction, see Haugeland, 1985, Chap. 4). Thus, while computational claims are important, they do little to constrain or motivate the use of connectionist systems in cognitive science (see also Massaro, 1988).

The structure/process distinction and finite state automata. Dawson and Schopflocher (1992a) have argued that a fundamental difference between classical and connectionist models emerges when one considers how the two approaches distinguish between data structures and the processes that manipulate them. In a classical architecture a set of symbols does not itself comprise an autonomous representational system, because full-fledged computation requires that these symbols be manipulated by additional external processes that are sensitive to symbolic structure. A prototypical example is the Turing machine, whose processes are instantiated in the structure of the tape head, which in turn manipulates tokens written on an external tape. In contrast, PDP networks are designed to "exhibit intelligent behaviour without storing, retrieving, or otherwise operating on structured symbolic expressions" (Fodor & Pylyshyn, 1988, p. 5). This rejection of the structure/process distinction -- called autonomy by Dawson and Schopflocher -- is the characteristic that differentiates a PDP network from a classically defined symbol. "Knowledge is not directly accessible to interpretation by some separate processor, but it is built into the processor itself and directly determines the course of processing" (Rumelhart, Hinton, & McClelland, 1986, pp. 75 - 76).

While the absence of a structure/process distinction may define how a PDP network differs from a classical system, it also imposes limitations on the network's computational power. Recall that in order to achieve the computational power of a UTM, an external memory structure had to be provided to the network (McCulloch & Pitts, 1943/1988). This implies that if one does not distinguish between structure and process in a connectionist system -- if this external data store is not provided -- then one may not be able to produce networks of sufficient computational power to be of interest to cognitive science. Furthermore, this problem is not avoided by proofs that networks are universal approximators: Levelt (1990) has argued from such proofs that PDP networks are merely finite state automata, because they can only approximate functions over a finite interval. This limitation is also at the heart of Fodor and Pylyshyn's (1988) criticism that PDP systems are insensitive to the constituent structure of complex tokens.

Do brains and PDP networks have the same competence? That PDP networks are universal approximators is quite interesting in principle, but less compelling in practice. Proofs of universal approximation place no limits on the number of processing units in the network, or on the strengths of its connections. Thus, to approximate some functions, one might require numbers of processors, or connection weight values, that are impossible in finite biological systems (see also Ballard's [1986] discussion of the packing problem). As well, it is not clear that the brain itself is designed to be a universal approximator, because such devices are not without interesting limitations. For example, after being trained to approximate some function, an RBF network can generalize its performance and respond correctly to new instances. However, this ability to generalize requires that the approximated function be smooth and piecewise continuous (e.g., Poggio & Girosi, 1990). This property is not true of Boolean functions (i.e., functions of the form $\{0,1\}^N \rightarrow \{0,1\}^M$) typically used to study pattern classification in neural networks. Indeed, universal approximators like RBF networks have difficulties in learning such functions (e.g., Moody & Darken, 1989). Neuroscientists often describe the brain in a fashion suggesting that its primary function is pattern categorization (e.g., Kuffler, Nicholls, & Martin, 1984). If this is true -- if it is not a function approximator -- then proofs that PDP networks are universal approximators may not be pertinent to cognitive science.

The issue of relating the properties of actual neural networks to the properties of universal approximators may be a moot point if the brain indeed is more aptly described as a pattern classifier. This is because Lippmann's (1987) proof shows that PDP networks may be extremely well-suited as models of this kind of processing. However, the nature of Lippmann's proof -- as well as the nature of the universal approximator proofs -- again raises serious questions about the relationship between PDP models and brain-mediated psychological processes. Specifically, Lippmann shows that only two layers of hidden units are required to mediate arbitrary pattern classification. Similarly, Funahashi (1989) shows that only one layer of hidden units is required to create a universal approximator. However, it is also clear that the human brain can be described as being composed of a very large number of processing layers (e.g., Kuffler, Nicholls & Martin, 1984, Chap. 2). If adequate computational power can be achieved with a small number of processing layers, then why does the brain not have a simpler structure?

Furthermore, the demonstrated competence of relatively simple PDP architectures has discouraged researchers from considering more complicated many-layered architectures, which could bear a stronger relationship to actual brain function. For example, after noting that multi-layer RBF networks are possible in principle, Poggio and Girosi (1989, p. 58) point out that "there is no reason to believe, however, that such 'multilayer' functions represent a large and interesting class". From the view of computational competence, this claim may indeed be true. However, if one of the intents of connectionism is to provide a bridge between psychology and neuroscience, then this view is disturbing.

Computationalism And Behaviourism

In the radical behaviourism proposed by Skinner, psychological theory amounted to accounts of environmental stimuli and of the observable behaviours they produced. Internal processes were not referred to; they could play no role in a psychological science. In many respects, the computational approach to neural networks echoes this endeavour. The primary concern of the computational analyses described above is determining the possible stimulus/response relationships for PDP networks. These proofs pay little attention to the relationship between the structure of these networks and the nature of the processes underlying human cognition. On the one hand, perhaps this state of affairs is to be expected. Descriptions of competence are in essence descriptions of input/output mappings. Further, if a primary focus of connectionism is technological, then there may be more concern about what a system can do than about how it is done. On the other hand, the "behavioural" emphasis of computational analyses appears to move connectionism away from a science of human cognition. By failing to deal with networks that can be strongly related to human mental processes, the proofs cited above say little about the capabilities or limits of human cognition.

ALGORITHMIC DESCRIPTIONS OF PDP NETWORKS

PDP Networks Are Themselves Algorithms

Informally, an algorithm is a completely mechanical procedure for performing some computation -- "an infallible, step-by-step recipe for obtaining a prespecified result" (Haugeland, 1985, p. 65). In cognitive science, an algorithm is viewed by many researchers not only as a fundamentally important theoretical notion, but also as a practical goal for models-as-explanations. "If the long promised Newtonian revolution in the study of cognition is to occur, then qualitative explanations will have to be abandoned in place of effective procedures" (Johnson-Laird, 1983, p. 6).

In PDP connectionism, a network can itself be described as an effective procedure for computing some function, or for categorizing some patterns. Indeed, a tremendous amount of enthusiasm for connectionism has been fueled by specific demonstrations that PDP networks offer practical algorithms for a diverse range of problems. A wide range of connectionist systems have been proposed to model aspects of memory (e.g., Anderson, 1972; Anderson, Silverstein, Ritz, & Jones, 1977; Eich, 1982; Grossberg, 1980; Knapp & Anderson, 1984; Murdock, 1982). Connectionist pattern recognition networks have a long history (e.g., Selfridge, 1956), have become benchmarks to which other methods are compared (e.g., Barnard & Casasent, 1989), and can outperform standard methods for such tasks as speech recognition (e.g., Bengio & de Mori, 1989). Connectionists have successfully used networks to solve problems related to locomotion (e.g., Brooks, 1989; Pomerleau, 1991), and have designed systems to mediate behaviours once thought to be exclusive to classical systems, such as performing logical inferences (Bechtel & Abrahamsen, 1991, pp. 163-174) and performing linguistic transformations or sentence parsing (e.g., Jain, 1991; Lucas & Damper, 1990; Rager & Berg, 1990).

Problems With PDP Algorithms

In spite of these successes, the contribution of connectionist algorithms to cognitive science is somewhat suspect. First, researchers are beginning to challenge the ability of such models to capture the right empirical generalizations. Second, such models are often exceedingly difficult to interpret, which limits their explanatory usefulness as effective procedures. Third, in many cases the functional architecture of these networks is not completely specified. We consider each of these problems below.

PDP networks may fail to capture interesting empirical generalizations. Pylyshyn (1980, 1984) has argued strongly that a fundamental goal of cognitive science's theories is to capture rich sets of empirical generalizations. Indeed, the success of computer simulations of psychological phenomena is often measured in the program's ability not only to make the same correct judgements as humans, but to make similar mistakes as well. Merely generating "intelligent" behaviour does not guarantee a successful niche in cognitive science for an implemented theory, which is why neither computerized chess boards nor pocket calculators are viewed as proposals for how humans play chess or perform mental arithmetic.

One important characteristic of connectionism has been the claim that PDP networks capture the right kinds of generalizations for an empirical cognitive science. For example, one reason for the recognized importance of Rumelhart and McClelland's (1986) network that transforms verbs into the past tense was that during training it produced overgeneralization errors similar to those observed in children. Similarly, much of the interest in distributed connectionist memories is due to the kinds of errors these systems produce (see, for example, Eich, 1982).

The claim that connectionist systems are capable of capturing sufficiently rich empirical generalizations has far-reaching consequences, because the behaviour of PDP networks is putatively mediated by mechanisms that bear little relationship to those proposed in classical models (for an example of strong claims of this sort, see Seidenberg & McClelland, 1989). Connectionists are now challenging the "realistic" status of classical theories -- the view that such accounts reflect actual "theories in the head". PDP researchers are proposing that classical theories are not valid explanations, but are merely instrumentalist descriptions. The proper account of mentality, they argue, is reflected in explanations of the dynamic properties of connectionist models. "Subsymbolic models accurately describe the microstructure of cognition, whereas symbolic models provide an approximate description of the macrostructure" (Smolensky, 1988, p. 12).

This challenge to classical cognitive science requires PDP models to generate the same behaviour as that observed in human subjects. Recently, however, this ability has been strongly contested. Several prominent connectionist models, which have spearheaded the assault on classical models of data, have been carefully examined, and have been found wanting (e.g., Pinker and Prince's [1988] critique of Rumelhart and McClelland's [1986] verb transformation network; Besner, Twilley, McCann & Seergobin's [1990] examination of the Seidenberg and McClelland [1989] grapheme-to-phoneme network). The general theme of these critiques is that PDP networks capture some, but not all, of the empirical regularities thought to be critical to understanding the psychological phenomena being modeled.

The connectionist response to such criticisms is to moderate their claims about the models. They argue that because of practical limitations, the networks that they create should not be expected to capture all of the relevant empirical generalizations (e.g., Seidenberg & McClelland, 1990). However, because these simple systems can account for some interesting data, it is argued that they warrant serious consideration. The suggestion is that as networks become larger and more sophisticated, they will be able to account for a broader range of empirical phenomena. There is certainly merit in this position, but it should be recognized for what it is: a promissory note. The enthusiastic predictions of connectionists about the future performance of larger networks should be tempered by the knowledge that the properties of small PDP networks often disappear when their size is scaled up (e.g., Minsky & Papert, 1969/1988, pp. 261-266).

PDP algorithms are extremely difficult to interpret. In many cases it is extremely difficult to determine how connectionist networks accomplish the tasks that they have been taught. "One thing that connectionist networks have in common with brains is that if you open them up and peer inside, all you can see is a big pile of goo" (Mozer & Smolensky, 1989, p. 3). There are a number of reasons that PDP networks are difficult to understand as algorithms.

First, they are rarely developed a priori -- instead, a generic learning rule is used to develop useful (algorithmic) structures in a network that is initially random. Thus, one does not need a theoretical account of a to-be-learned task before a network is created to do it. Second, general learning procedures can train networks that are extremely large; their sheer size and complexity make them difficult to interpret. For example, Seidenberg and McClelland's (1989) network for computing a mapping between graphemic and phonemic word representations uses 400 input units, up to 400 hidden units, and 460 output units. Determining how such a large network maps a particular function is an intimidating task. Third, most interesting PDP networks incorporate nonlinear activation functions. This nonlinearity makes these models more powerful than those that only incorporate linear activation functions (e.g., Jordan, 1986), but it also requires that descriptions of their behaviour be particularly complex. Indeed, some researchers choose to ignore the nonlinearities in a network, substituting a qualitative account of how it works (e.g., Moorehead, Haig & Clement, 1989, p. 798). Fourth, connectionist architectures offer too many degrees of freedom for the generation of working systems. One learning rule can create many different networks -- for instance, containing different numbers of hidden units -- that each compute the same function. Each of these systems can therefore be described as a different algorithm for computing that function. One does not have any a priori knowledge of which of these possible algorithms might be the most plausible as a psychological theory of the phenomenon being studied. The problems facing researchers who want to explain how their networks actually function become clear when one examines some of the interpretive strategies that are emerging in connectionist research.

One strategy is to develop networks that are (hopefully) maximally interpretable by reducing the number of their processing units to a minimum (e.g., Hagiwara, 1990; Mozer & Smolensky, 1989; Sietsma & Dow, 1988). For example, Mozer and Smolensky propose a measure of the relevance of each processor to a network's overall performance. They advocate a research strategy in which one starts by training a large network to accomplish some task. Then, the relevance of each processor is computed. Processors with sufficiently small relevance values are removed from the network. This procedure is repeated until each network processor has a high relevance value. A second strategy is to perform statistical analyses of the connection weights from a trained network. For example, Hanson and Burr (1990) illustrate a number of techniques for probing network structure, including compiling frequency distributions of connection strengths, quantifying global patterns of connectivity with descriptive statistics, illustrating local patterns of connectivity with "star diagrams", and performing cluster analyses of hidden unit activations. A third strategy is to map out the response characteristics of each processor in the network. For example, Moorehead, Haig and Clement (1989) used the generalized delta rule to train a PDP network to identify the orientation of line segments presented to an array of input units. Their primary research goal was to determine whether the hidden units in this system developed centre-surround receptive fields analogous to those found in the primate lateral geniculate nucleus. They chose to answer this question by stimulating each input element individually, and plotting the resulting activation in each hidden unit.
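The pruning strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Mozer and Smolensky's actual procedure: it assumes that a unit's relevance is the increase in summed squared error when that unit is removed, it removes one unit at a time without retraining, and the small hand-wired network, its weights, and all function names are hypothetical.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def output(x, hidden_w, hidden_b, out_w, active):
    """Weighted sum of the activations of the hidden units still in the network."""
    total = 0.0
    for j in range(len(out_w)):
        if active[j]:
            net = sum(w * xi for w, xi in zip(hidden_w[j], x)) + hidden_b[j]
            total += out_w[j] * logistic(net)
    return total


def error(patterns, hidden_w, hidden_b, out_w, active):
    """Summed squared error over all (input, target) patterns."""
    return sum((t - output(x, hidden_w, hidden_b, out_w, active)) ** 2
               for x, t in patterns)


def relevance(patterns, hidden_w, hidden_b, out_w, active, unit):
    """Assumed measure: error without the unit minus error with it."""
    without = list(active)
    without[unit] = False
    return (error(patterns, hidden_w, hidden_b, out_w, without)
            - error(patterns, hidden_w, hidden_b, out_w, active))


def prune(patterns, hidden_w, hidden_b, out_w, threshold):
    """Remove the least relevant unit until every remaining unit is relevant."""
    active = [True] * len(out_w)
    while any(active):
        scores = {j: relevance(patterns, hidden_w, hidden_b, out_w, active, j)
                  for j in range(len(out_w)) if active[j]}
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break
        active[weakest] = False
    return active


# Hypothetical "trained" network: hidden unit 2 has an output weight of zero,
# so it contributes nothing to performance.
HIDDEN_W = [[4.0, 4.0], [-4.0, -4.0], [1.0, -1.0]]
HIDDEN_B = [-2.0, 6.0, 0.0]
OUT_W = [1.0, 1.0, 0.0]
INPUTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
# Targets are defined as the full network's own outputs, so its error is zero.
PATTERNS = [(x, output(x, HIDDEN_W, HIDDEN_B, OUT_W, [True] * 3)) for x in INPUTS]

KEPT = prune(PATTERNS, HIDDEN_W, HIDDEN_B, OUT_W, threshold=1e-6)
```

In this toy case only the unit with no influence on the output is removed. A smaller network is easier to inspect, but pruning by itself does not say what the surviving units compute.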

The diversity of these methods shows that providing explanations of PDP algorithms is a nontrivial task. "There is a growing suspicion that discovering [how a network does its job] may require an intellectual revolution in information processing as profound as that in physics brought about by the Copenhagen interpretation of quantum mechanics" (Hecht-Nielsen, 1990, p. 10).

The generic connectionist architecture is incomplete. One of the common arguments for using computer simulation methodology in cognitive science is that such models force researchers to be extraordinarily explicit about their assumptions and their theoretical statements. Vague theories do not result in working computer programs.

Connectionist proponents imply that a PDP network -- a program written in the functional architecture of generic connectionism -- defines a particularly explicit theory. The components of generic connectionism are quite simple. As a result, one could imagine using a diagram of a trained network as a circuit diagram; each pictured processor and connection could be emulated by a simple electronic (or biological) component. The result would be a physical device capable of carrying out all of the computations attributed to the original network. Thus, connectionists claim that PDP networks comprise an autonomous representational system -- one need not appeal to external rules or processes to explain how these networks function or learn. "Much of the allure of the connectionist approach is that many connectionist networks program themselves, that is, they have autonomous procedures for tuning their weights to eventually perform some specific computation" (Smolensky, 1988, p. 1, his italics).

However, Dawson and Schopflocher (1992a) have shown that in actuality PDP networks cannot be easily implemented in this sense. In short, if one were to build a diagrammed network by replacing its generic connectionist components with functionally equivalent electronic parts, then the electronic network would not be capable of all the behaviours attributed to the network -- it would not be an autonomous system. This is because the components of the generic connectionist architecture are not by themselves sufficient for the intended task.

Dawson and Schopflocher (1992a) make their case by analyzing in detail an extremely simple associative memory model. Figure 2a illustrates the PDP version of the model; diagrams of this system have a long history in the connectionist literature (e.g., Kohonen, 1977 Fig. 1.9; McClelland & Rumelhart, 1988, Chap. 4 Fig. 3; Rumelhart, McClelland, & the PDP Group, 1986, Chap. 1 Fig. 12, Chap. 9 Fig. 18, Chap. 12 Fig. 1, Chap. 18 Fig. 3; Schneider, 1987 Fig. 1; Steinbuch, 1961 Fig. 2; Taylor, 1956 Figs. 9 & 10). The purpose of this model is to learn the association between pairs of activity patterns presented simultaneously to the two banks of processing units. Under the restriction that all activity patterns are mutually orthogonal, a model with N units in each input bank is capable of storing information about N different pattern pairs in its connections. Because this network is a distributed memory system, and because its mathematical properties are quite easily described, it is often used to introduce the basic ideas of connectionism (e.g., Jordan, 1986).

Nevertheless, Dawson and Schopflocher (1992a) argue that the network in Figure 2a is not capable of learning associations between activity patterns without the help of a controller that is external to the network. They point out that the Figure 2a network requires an external signal to tell whether it should be learning a new association, or whether it should be recalling an old one. Furthermore, it requires additional processing and controlling abilities to modify connection weights -- connections cannot merely be single, numerical values. Thus Dawson and Schopflocher conclude that PDP networks by themselves are not sufficiently powerful to simultaneously represent and manipulate contents.
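A small sketch may make the point about external control concrete. It assumes a Hebbian outer-product learning rule (the text above does not name the rule) and unit-length, mutually orthogonal cue patterns. The class, its method names, and the `mode` argument are hypothetical, and the sketch is not Dawson and Schopflocher's elaborated architecture.

```python
class PatternAssociator:
    """N-by-N linear associator; `weights[i][j]` links cue unit j to target unit i."""

    def __init__(self, n):
        self.weights = [[0.0] * n for _ in range(n)]

    def learn(self, cue, target):
        # Assumed Hebbian rule: each weight grows by the product of the
        # activities of the two units it connects.
        for i, t in enumerate(target):
            for j, c in enumerate(cue):
                self.weights[i][j] += t * c

    def recall(self, cue):
        # Each target unit sums its weighted inputs from the cue bank.
        return [sum(w * c for w, c in zip(row, cue)) for row in self.weights]

    def step(self, mode, cue, target=None):
        # `mode` stands in for the external control signal: nothing inside
        # the weight matrix decides whether to learn or to recall.
        if mode == "learn":
            self.learn(cue, target)
            return None
        if mode == "recall":
            return self.recall(cue)
        raise ValueError("mode must be 'learn' or 'recall'")


# Four mutually orthogonal, unit-length cues for a network with N = 4.
CUES = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
# Arbitrary target patterns chosen for illustration.
TARGETS = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 4.0],
]

memory = PatternAssociator(4)
for cue, target in zip(CUES, TARGETS):
    memory.step("learn", cue, target)
RECALLED = [memory.step("recall", cue) for cue in CUES]
```

With orthogonal cues, all N = 4 pairs are recalled without interference. The weight update and the choice between learning and recalling are both carried out by code that sits outside the matrix of connection strengths.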

Autonomous processing is possible in PDP networks. However, if it is to be achieved, then the kinds of networks typically proposed require substantial elaborations of the PDP architecture. This elaboration must be guided by an explicit statement of a functional architecture capable of solving the various control problems faced by an autonomous system. For example, Dawson and Schopflocher (1992a) propose a slightly elaborated connectionist architecture, and demonstrate how an autonomous pattern associator could be created from it (see Figure 2b). Without such a functional architecture, it is doubtful that connectionism can serve as a viable bridge between computational and physiological descriptions. As a result, it would be a mistake to assume that diagrams created from the generic connectionist architecture provide better, more explicit, or more easily realizable programs than would be available from classical AI.

PDP Networks And Clever Hans

The story of Hans, the clever horse, is well known to Introductory Psychology students (e.g., Santrock, 1988, p. 18). He was trained by a retired math teacher to communicate by tapping his foot and nodding his head. "By the end of his training, Hans could spell words spoken to him, and he excelled in math. He became a hero in Germany -- his picture was on liquor bottles and toys". Although a panel of thirteen scientists originally certified Hans as an authentic reasoner, a series of careful experiments by Oskar von Pfungst revealed that Hans was a horse of a different colour: an equine prankster who paid attention to tacit behavioural cues, provided by his questioners, which had in the past led to rewards of bread and carrots.

Johnson-Laird (1983, p. 4) has noted that "to understand a phenomenon is to have a working model of it". Interestingly, PDP models appear to prove this statement false, because connectionists can easily replace one unknown (e.g., how the brain mediates some psychological phenomenon) with another -- a functioning but unexplained network. The story of Hans shows that it is very easy to be beguiled by interesting behaviour. Researchers must be careful to remember that one can generate interesting behaviour in a PDP network without understanding how the network actually works.

IMPLEMENTATIONAL DESCRIPTIONS OF PDP NETWORKS

An implementational description of an information processing system attempts to relate its representational and formal properties to the causal laws governing its mechanical structure. Classical cognitive science, because of its functionalist nature, has typically placed little emphasis on this type of description. Connectionism, however, is motivated by quite different considerations.

PDP Networks, Biological Plausibility, and Computational Relevance

Why has connectionism been so enthusiastically adopted by some cognitive scientists? One reason is that PDP models are claimed to be biologically plausible algorithms. In other words, when examining a diagram of a connectionist system, one could imagine that it illustrates a sufficient neural circuitry for accomplishing some task. This, it is argued, is not true of classical models. "No serious study of mind (including philosophical ones) can, I believe, be conducted in the kind of biological vacuum to which cognitive scientists have become accustomed" (Clark, 1989, p. 61).

In what way are PDP models intended to fill this "biological vacuum"? Generally speaking, these systems are "neuronally inspired" -- processing units are roughly equivalent to neurons, and connections between processors are roughly equivalent to synapses (see, for example, the visual analogy rendered in Rumelhart, Hinton, & McClelland, 1986, Fig. 1). Neuronal inspiration also colours general assumptions about processing in PDP networks. As in the brain, all processors are assumed to work in parallel and to send signals that are a nonlinear function of their net input. The "knowledge" of the system is encoded in patterns of connectivity, because synaptic modification appears to be a general description of how the brain remembers information (e.g., Dudai, 1989).

In spite of connectionism's implementational intentions, neuroscientists are quite skeptical about the biological plausibility of the PDP architecture. A number of reasons are often cited for this skepticism. First, one can generate long lists of properties that are true of the PDP architecture, but are clearly not true of the brain (e.g., Crick & Asanuma, 1986; Smolensky, 1988, Table 1). As a result, PDP models are often vilified as oversimplifications by neuroscientists; Douglas and Martin (1991, p. 292) refer to them as "stick and ball models". Second, researchers find it extremely unlikely that supervised learning rules like error backpropagation could be physiologically instantiated. This is because it is highly unlikely that the environment could specify a "training pattern" as accurately as is required by such rules (e.g., Barto, Sutton, & Anderson, 1983), and because there is no evidence at all for neural connections capable of feeding an error signal backwards to modify existing connections (e.g., Bechtel & Abrahamsen, 1991, p. 57; Kruschke, 1990). In short, while biological networks are capable of autonomous learning, artificial networks are not (see also Dawson & Schopflocher, 1992a). Reeke and Edelman (1988, p. 144) offer this blunt assessment of the neurophysiological relevance of PDP connectionism: "These new approaches, the misleading label `neural network computing' notwithstanding, draw their inspiration from statistical physics and engineering, not from biology".

However, these criticisms miss the mark. PDP networks are designed to be extreme simplifications, glossing over many of the complex details true of neural systems (for an example, see Braham & Hamblen, 1990). This is because the PDP architecture is itself functionalist in nature. It attempts to capture just those properties of biological networks that are computationally relevant. The intent of this enterprise is to describe neural networks in a vocabulary that permits one to make rigorous claims about what they can do, or about why the brain might have the particular structure that it does. For example, claims about the competence of neural networks only arise when one abstracts over neurophysiological details, and describes important aspects of neuronal function either mathematically or logically (e.g., McCulloch & Pitts, 1943/1988). Furthermore, functional descriptions and computational analyses can often shed light on questions that one would imagine neuroscience has basic answers to, but in fact does not. For example, why do different functions appear to be localized in different regions of the brain? Ballard (1986) argues that this type of organization is to be expected of a connectionist system that evolves to solve the so-called packing problem: how to pack an enormous variety of functions into a network (like the brain) with a finite number of processors.

The functionalist philosophy that guides cognitivism has always argued that cognitive phenomena cannot be reduced to a single level of neurophysiological explanation because such an account cannot capture all of the important empirical generalizations (e.g., Pylyshyn, 1980, 1984). The message to neuroscience has been that in cognitivism, accounts at the three levels of computation, algorithm, and implementation are all equally necessary. Largely in response to theories of the "New Connectionism", the egalitarian emphasis of functionalism is apparently ebbing away. Some critics of connectionism have argued that if connectionism is primarily concerned with implementational issues, then it bears no relationship to cognitive science at all, because the fundamental properties of information processing systems must be captured at more abstract or functional levels of description (e.g., Broadbent, 1985; Fodor & Pylyshyn, 1988, pp. 64-69).

This type of argument is too strong, because it ignores the possibility that PDP research has the potential to build a strong bridge between neuroscience and functionalist theories. Classical theories in cognitive science require this bridge: First, cognitive science's commitment to the physical symbol system hypothesis (e.g., Newell, 1980) necessitates a physical account of information processing as well as a formal account. Second, current views in the philosophy of science (e.g., Cummins, 1983) note that functionalist explanations require that the dispositional properties of the functional architecture -- the primitive "building blocks" of an information processing system -- be subsumed under natural laws. Thus, if one proffers a functionalist explanation, it is not enough to merely state that a function has been subsumed; one must also provide an account of the mechanisms that instantiate the function.

However, there are many factors working against realizing connectionism's potential to subsume cognitive theory. This is because connectionists often make design decisions about their architecture without justifying them as computationally relevant properties of neural circuits. It is perfectly reasonable to propose an architecture that ignores complex properties of neural substrate with the goal of making computationally relevant properties explicit. It is quite another to create an architecture that incorporates properties that make it work, independent of whether these properties bear any relation to neural substrates whatsoever. Below, we review several instances of this latter practice.

Problems Arise Because Of Biologically Undefended Design Decisions

Connectionists adopt monotonic activation functions. To a large extent, changes in conceptualizations of activation functions for processing units have been responsible for the evolution from less powerful, single layer networks of the "Old Connectionism" to the more powerful multiple layer networks of the "New Connectionism". For example, in a perceptron (e.g., Rosenblatt, 1962), the activation function for the output unit is a linear threshold function: If the net input to the unit exceeds a threshold, then it assumes an activation of 1, otherwise it assumes an activation of 0. This kind of activation function is roughly analogous to the "all-or-none law" governing the generation of action potentials in neurons (e.g., Levitan & Kaczmarek, 1991, pp. 37-44). However, linear threshold functions are characterized by mathematical properties that make them difficult to work with.

The learning procedures developed within the New Connectionism were made possible by reconceptualizing the linear threshold function with more tractable mathematical equations. For example, it has been quite common to adopt sigmoid-shaped "squashing" functions, like the logistic depicted in Figure 2b. The mathematical limits of this nonlinear equation are functionally equivalent to the two discrete states of the linear threshold function. However, the function itself is continuous, and therefore has a derivative. Because of this property, one can use calculus to determine rules that will manipulate weights in such a way as to perform a gradient descent in an error space (e.g., Rumelhart et al., 1986b, pp. 322-327). In short, continuous activation functions permit one to derive powerful learning rules.
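The contrast between the two kinds of activation function can be sketched as follows. The function names are hypothetical, and `delta_w` assumes a single output unit with squared error E = 0.5 * (target - activation)^2. It illustrates gradient descent in general and is not a reproduction of the derivation in Rumelhart et al. (1986b).

```python
import math


def linear_threshold(net, threshold=0.0):
    """All-or-none response: 1 if net input exceeds the threshold, else 0."""
    return 1.0 if net > threshold else 0.0


def logistic(net):
    """Sigmoid 'squashing' function whose limits are 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))


def logistic_derivative(net):
    """The logistic is differentiable everywhere: f'(net) = f(net) * (1 - f(net))."""
    a = logistic(net)
    return a * (1.0 - a)


def delta_w(target, net, input_activity, rate=0.5):
    """Gradient-descent weight change for one connection into one output unit.

    The change is proportional to the error, to the slope of the activation
    function at the current net input, and to the activity sent along the
    connection.
    """
    return rate * (target - logistic(net)) * logistic_derivative(net) * input_activity
```

No such rule can be read off the linear threshold function directly. Its slope is zero everywhere except at the threshold, where it is undefined, so the error gives no indication of which way to move a weight.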

However, "squashing" functions have another mathematical property that makes these learning rules practical to apply. Such activation functions are monotonic -- they are nondecreasing in relation to increases in net input. The derivation of the generalized delta rule (Rumelhart et al., 1986b, p. 325), and the derivation of learning rules for stochastic autoassociative networks like Boltzmann machines (e.g., Müller & Reinhardt, 1990, p. 37), stipulate that activation functions be monotonic. If the activation functions are not monotonic, then in practice these learning rules are not guaranteed to work. For example, Dawson and Schopflocher (1992b) found that if processing units that had a particular nonmonotonic activation function (the Gaussian illustrated in Figure 2c) were inserted into a network trained with the standard version of the generalized delta rule, then quite frequently the network settled into a local minimum in which it did not respond correctly to all inputs.

The assumption that activation functions are monotonic appears to be a practical requirement for learning procedures. However, adopting this assumption for this reason alone is dangerous practice, because monotonicity does not appear to be universally true of neural mechanisms. For instance, Ballard (1986) uses a relatively coarse behavioural criterion (i.e., the response of a cell as a function of a range of net inputs) to distinguish between integration devices -- neurons whose behaviour is well described with activation functions like the logistic in Figure 2b -- and value units -- neurons whose behaviour is well described with activation functions like the Gaussian in Figure 2c.
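The distinction between the two kinds of unit can be checked numerically. In this sketch the Gaussian is written as exp(-(net - mu)^2), which is an assumed form chosen for illustration and may differ from the exact function in Figure 2c. The helper `is_nondecreasing` is hypothetical.

```python
import math


def logistic(net):
    """Integration-device style activation: rises steadily with net input."""
    return 1.0 / (1.0 + math.exp(-net))


def gaussian(net, mu=0.0):
    """Value-unit style activation: responds most strongly when net input is near mu."""
    return math.exp(-(net - mu) ** 2)


def is_nondecreasing(activation, nets):
    """True if the activation never falls as net input increases over `nets`."""
    values = [activation(n) for n in nets]
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


# Net inputs from -5.0 to 5.0 in steps of 0.1.
NETS = [i / 10.0 for i in range(-50, 51)]

LOGISTIC_IS_MONOTONIC = is_nondecreasing(logistic, NETS)
GAUSSIAN_IS_MONOTONIC = is_nondecreasing(gaussian, NETS)
```

The logistic passes the check and the Gaussian fails it. A value unit responds to a narrow band of net inputs and falls silent on either side of that band.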

Nonmonotonicity becomes even more apparent at a more detailed level of analysis. It should be apparent that the activation function for a connectionist processor is strongly related to the mechanisms that produce action potentials in neurons. Major components of these mechanisms are voltage gated ion channels in nerve membranes (for an introduction, see Levitan & Kaczmarek, 1991, pp. 51-124). These channels allow ionic currents to pass through them, and as a result affect a neuron's resting membrane potential. In turn, changes in membrane voltage affect the likelihood that these channels are open or closed. In some cases, such as the potassium channel considered by Levitan and Kaczmarek (Figure 3-9), the relationship between this likelihood and membrane voltage is monotonic. In other important cases, it is decidedly nonmonotonic. For instance, as membrane voltage becomes positive, voltage gated sodium channels begin to open. As the voltage continues to increase, the channel becomes inactive. The nonmonotonic character of the sodium channel played an important role in Hodgkin and Huxley's (e.g., 1952) quantitative modeling of the action potential.

PDP learning rules are not limited in principle to monotonic activation functions. For instance, Dawson and Schopflocher (1992b) derived a modified version of the generalized delta rule that is capable of training networks of value units (i.e., processors with Gaussian activation functions). RBF networks (e.g., Moody & Darken, 1989) also have units with nonmonotonic activation functions, although they use a net input function that in effect makes them behave monotonically (see Dawson & Schopflocher, 1992b, Figure 2c). Nevertheless, nonmonotonic activation functions appear to be more the exception than the rule in PDP modeling. Unfortunately, this appears to be due to the fact that monotonic activation functions lead more easily to practical training methods, and not due to the fact that such properties are characteristic of neural substrates.

Connectionists assume modifiable biases. A tenet of the PDP approach is that connection "weights are usually regarded as encoding the system's knowledge. In this sense, the connection strengths play the role of the program in a conventional computer" (Smolensky, 1988, p. 1). Thus the basic goal of a connectionist learning rule is to manipulate the pattern of connectivity in a network. This is in accordance with current understanding of actual neural circuits that are capable of learning. For example, experimental studies of the gill withdrawal reflex in Aplysia californica have indicated that learning alters the efficacy of synapses between neurons (for a review, see Dudai, 1989, Chap. 4).

However, when networks are trained by supervised learning procedures like the generalized delta rule (e.g., Rumelhart, Hinton, & Williams, 1986a, 1986b), the pattern of connectivity in the network is not all that is changed. It is quite typical to also modify the bias values of the activation function as well. For a sigmoid "squashing" function like the logistic, the bias is a parameter that positions the activation function in net input space; changing bias is equivalent to translating the activation function along an axis representing net input. Biases can be modified by construing them as connection weights emanating from a "bias processor" that is always on (e.g., Rumelhart et al., 1986b, footnote 1). A processing unit's bias is manipulated by modifying the strength of the connection between it and its "bias processor". However, it is important to note that this is merely a description which permits unit bias to be learned. "Bias processors" are not presumed to exist in a network. Instead, modifying the bias of a unit's activation function is analogous to directly modifying a neuron's threshold for generating an action potential.
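The "bias processor" description can be illustrated with a short sketch. All function names and numerical values here are hypothetical. The sketch shows only that adding a bias term to the net input gives the same result as adding one extra connection from a unit whose activity is always 1, and that changing the bias slides the logistic along the net input axis.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def net_with_bias(weights, inputs, bias):
    """Net input with the bias treated as a parameter of the unit itself."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def net_with_bias_unit(weights, inputs, bias):
    """Same net input, with the bias recast as the weight of a connection
    from an extra 'bias processor' whose activity is always 1."""
    augmented_weights = list(weights) + [bias]
    augmented_inputs = list(inputs) + [1.0]
    return sum(w * x for w, x in zip(augmented_weights, augmented_inputs))


def activation(weights, inputs, bias):
    return logistic(net_with_bias(weights, inputs, bias))


WEIGHTS = [0.5, -1.0]
INPUTS = [1.0, 1.0]
BIAS = 0.25

SAME_NET_INPUT = (net_with_bias(WEIGHTS, INPUTS, BIAS)
                  == net_with_bias_unit(WEIGHTS, INPUTS, BIAS))

# With bias b, the logistic reaches its midpoint (0.5) when the weighted
# input sum equals -b, so a change in b translates the curve.
MIDPOINT = activation([1.0], [-BIAS], BIAS)
```

Because the bias can be written as an ordinary weight, a learning rule that adjusts weights will adjust biases too unless it is explicitly prevented from doing so.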

The assumption that bias can be modified violates the connectionist tenet that all that matters are patterns of connectivity. This in itself is not problematic. The problem arises in justifying this design decision -- in defending the existence of modifiable biases in the connectionist architecture. In point of fact, there is little evidence that threshold membrane potentials in real neural networks are modifiable. For example, Kupfermann, Castellucci, Pinsker, and Kandel (1970) demonstrated that in the neural circuits that mediate the gill withdrawal reflex in Aplysia, the thresholds of motor neurons to constant external current did not change as a function of learning. It was concluded that learning only modified synaptic properties in this neural circuit. Similarly, neuroscientists concerned with learning in the mammalian brain have focused on a particular mechanism, the long term potentiation of synapses (for reviews, see Cotman, Monaghan, & Ganong, 1988; Massicotte & Baudry, 1991). To our knowledge, neuroscientists do not believe that neuron thresholds are themselves plastic.

Why does PDP connectionism use modifiable bias terms, when this manoeuvre does not appear to be supported by extant neurophysiological evidence? The answer appears to be that without modifiable biases, some PDP networks are extremely difficult to train. For example, Dawson, Schopflocher, Kidd, and Shamanski (1992) trained standard backpropagation networks on the encoder problem. In a control condition, typical learning procedures were used, and processor biases were modified. In an experimental condition, the networks were identical with the exception that after being assigned initial random values, all biases were fixed during the training session. While the control networks had little difficulty in learning solutions to the encoder problem, none of the experimental networks did -- even under a variety of training conditions (i.e., different learning rates, momentums, starting states), and even with an extremely relaxed definition of learning to criterion.

This is not to say that connectionist research in general is fundamentally flawed because it requires modifiable biases. Many architectures can use fixed thresholds in their processing units, including Hopfield nets (e.g., Hopfield, 1982), Boltzmann machines (e.g., Ackley, Hinton, & Sejnowski, 1985), and value unit networks (e.g., Dawson et al., 1992). The critical issue is that modifiable biases are adopted in some architectures without being neurophysiologically justified. If it is indeed the case that plastic neural circuits do not have directly modifiable thresholds, then such justification is important, because without it certain architectures may be deemed uninteresting to cognitive science.

Connectionists adopt massively parallel patterns of connectivity. The history of connectionism can be presented in capsule form as follows: In the beginning, connectionist networks had no hidden units. Minsky and Papert (1969/1988) then proved that such networks had limited competence, and were thus not worthy of further study. The New Connectionism was born when learning rules for networks with hidden units were discovered (e.g., Ackley, Hinton, & Sejnowski, 1985; Rumelhart et al., 1986a). These rules provided researchers the ability to teach networks that were powerful enough to overcome the Minsky/Papert limitations. (Detailed versions of this history are provided in Hecht-Nielsen, 1990, pp. 14-19; Papert, 1988).

What is interesting about this history is that it pins the blame for the limitations of networks created by Old Connectionism -- perceptrons -- on the number of processing layers. It neglects the fact that Minsky and Papert (1969/1988) were extremely concerned with a different type of limitation, the limited-order constraint. Under this constraint, the neural network is restricted to being local -- there is no single processing unit that can directly examine every input unit's activity. For example, Minsky and Papert (pp. 56-59) prove that to compute the parity predicate (i.e., to assert "true" if an odd number of input units has been activated), a network requires at least one processor to be directly connected to every input unit. Such proofs still hold for modern multilayer perceptrons (see Minsky & Papert, 1969/1988, pp. 251-252).

There is no doubt that the New Connectionist models are more powerful than their antecedents. However, this increased power is not only due to the addition of layers of hidden units, but is also due to the violation of the limited-order constraint: these new models also permit some processors to have direct connections to every input unit. For example, Rumelhart, Hinton & Williams (1986b, Fig. 6) have trained a multi-layer network to compute the parity predicate. However, the network requires each of its hidden units to be directly connected to every input unit.
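The role of full connectivity in computing parity can be seen in a hand-wired sketch. This is a textbook-style construction offered for illustration, with hypothetical function names. It is not the network that Rumelhart, Hinton and Williams obtained by training.

```python
from itertools import product


def threshold_unit(weights, inputs, threshold):
    """Linear threshold unit: 1 if the weighted input sum exceeds the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0


def parity_network(inputs):
    n = len(inputs)
    # Hidden unit k (k = 1..n) is connected to EVERY input unit with weight 1
    # and turns on when at least k inputs are active.
    hidden = [threshold_unit([1] * n, inputs, k - 0.5) for k in range(1, n + 1)]
    # Output weights alternate +1, -1, +1, ... so that pairs of active hidden
    # units cancel and one is left over only when the count is odd.
    output_weights = [1 if k % 2 == 0 else -1 for k in range(n)]
    return threshold_unit(output_weights, hidden, 0.5)


# Check the network against the definition of parity on all 4-bit inputs.
ALL_CORRECT = all(parity_network(bits) == sum(bits) % 2
                  for bits in product([0, 1], repeat=4))
```

Every hidden unit in this construction examines the whole input array. None of them could count active inputs if it were restricted to a local subset, which is the kind of connectivity that the limited-order constraint rules out.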

The fact that PDP models permit such massively parallel patterns of connectivity between input units and hidden units is unfortunate. While it is true that this design decision will increase the network's competence, it does this at the expense of both biological and empirical plausibility. With respect to the former issue, there is no evidence to indicate that, in human sensory systems, massively parallel connections exist between receptor cells and the next layer of neurons. Indeed, computational modelers of visual processing attempt to increase the biological plausibility of their models by enforcing spatially local connections among processing units (e.g., Ullman, 1979). With respect to the latter issue, humans may indeed be subject to computational limits due to the limited-order constraint. For example, Minsky and Papert (1969/1988, p. 13) have used a small set of very simple figures to prove that a perceptron of limited order cannot determine whether all the parts of any geometric figure are connected to one another. Psychophysical experiments have shown that preattentive visual processes involved in texture perception (e.g., Julesz, 1981) and motion perception (Dawson, 1990) are insensitive to this property as well. It is quite plausible to suppose that this insensitivity is related to the fact that the neural circuitry responsible for registering these figures is not massively parallel.

Connectionists assume homogenous processing units. One of the interesting properties of PDP models is their homogeneity (for an exception, see the hybrid networks described by Dawson & Schopflocher, 1992b). It is typically the case that all of the units in PDP networks are of the same type, and that all of the changes that occur during learning in the network are governed by a single procedure. "The study of Connectionist machines has led to a number of striking and unanticipated findings; it's surprising how much computing can be done with a uniform network of simple interconnected elements" (Fodor & Pylyshyn, 1988, p. 6).

In some sense, the homogenous structure of PDP networks can be construed as "neuronally inspired". At a macroscopic level, neurons themselves appear to be relatively homogenous; Kuffler, Nicholls and Martin (1984, Chap. 1) note that the nervous system uses only two basic types of signals, which are virtually identical in all neurons. Furthermore, these signals appear to be common to an enormous range of animal species; much of our molecular understanding of neuronal mechanisms comes from the study of invertebrate systems. "The brain, then, is an instrument, made of 10^10 to 10^12 components of rather uniform materials, that uses a few stereotyped signals. What seems so puzzling is how the proper assembly of the parts endows the instrument with the extraordinary properties that reside in the brain" (Kuffler, Nicholls, & Martin, p. 7). The answer to this puzzle, according to both neurophysiologists and connectionists, lies in understanding the complex and specific patterns of connectivity between these homogenous components. Getting (1989, p. 186) has noted that in the neuroscience of the late 1960's "the challenge of uncovering the secrets to brain function lay in the unravelling of neural connectivity".

However, a more microscopic analysis of (apparently) relatively simple plastic neural circuits has revealed neural networks have properties that are far more diverse and complicated than was anticipated. "No longer can neural networks be viewed as the interconnection of many like elements by simple excitatory or inhibitory synapses" (Getting, 1989, p. 187). For example, Getting notes that there is an enormous variety of properties of neurons, synapses, and patterns of connectivity. These serve as the building blocks of neural circuits, and importantly can change as a function of both intracellular and extracellular contexts. As a result, a detailed mapping of the connectivity pattern in a neural network is not sufficient to understand its function. The functional connectivity in the network -- the actual effects of one cell on another -- can change as the properties of the network's "building blocks" are modulated, even though the anatomical connections in the network are fixed (see Getting, 1989, Figure 2 for a striking example).

Getting (1989, p. 199) has painted quite a different picture of neural networks than would appear to be reflected in the PDP architecture: "The comparative study of neural networks has led to a picture of neural networks as dynamic entities, constrained by their anatomical connectivity but, within these limits, able to be organized and configured into several operational modes". The dynamic changes in biological networks would appear to be computationally relevant. Thus, if connectionism is to make good its promise to provide a more biologically feasible architecture than is found in classical systems, it would appear that the generic architecture must be elaborated extensively. The homogeneity of processors assumption must be abandoned, and in its place should be processing units and connections that have diverse and dynamic properties.

FROM CONFUSION TO COGNITIVE SCIENCE

Without a doubt, connectionist researchers have developed a number of successful and interesting models for an extremely diverse range of psychological phenomena. These successes have resulted in connectionism's current popularity and credibility. Nevertheless, connectionism currently occupies an uncertain -- if not downright mysterious -- position in mainstream cognitive science.

For instance, what aspects of cognition are connectionist models intended to be about? Some arguments would suggest that PDP systems are implementational models: "In our view, people are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at" (McClelland, Rumelhart, & Hinton, 1986, p. 1). However, when connectionism is described (and, in particular, criticised) as being implementational (e.g., Broadbent, 1985; Fodor & Pylyshyn, 1988), connectionists beg to differ. "Our primary concern is with the computations themselves, rather than the detailed neural implementation of these computations" (Rumelhart & McClelland, 1986, p. 138). Should connectionism thus be viewed as being primarily concerned with issues of competence? Not necessarily -- it has been claimed that PDP models are essentially procedural or algorithmic (e.g., Rumelhart & McClelland, 1985). This diversity of opinions gives the impression that connectionists are blithely ignoring the sage advice that you can't please all of the people all of the time!

Why is there this striking uncertainty about the nature of connectionism, and its role in cognitive science? We believe that confusion emerges because one can create interesting and sophisticated behaviours in connectionist systems without requiring detailed a priori information about tasks to be solved, or about internal network structure. Consider Pylyshyn's (e.g., 1980, 1984) position on comparing two information processing systems. If the two systems are weakly equivalent, they compute the same mapping from input to output, but do so using quite different procedures. If the two systems are strongly equivalent, not only do they compute the same input/output mapping, but they do so because they use identical procedures -- specifically, the same program running on functionally equivalent architectures. Within this framework, cognitive science must strive for strongly equivalent models of human psychological processes to account for human mentality. We are concerned that connectionists do not strive in this direction. Instead, they develop systems that are at best weakly equivalent to human processes (for a detailed practical example of this type of argument, see Pinker & Prince, 1988).
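The difference between weak and strong equivalence can be illustrated with a deliberately simple sketch. Nothing below comes from Pylyshyn's work; the task (summing the integers from 1 to n), the function names, and the use of a step count as a stand-in for evidence about the underlying procedure are all assumptions made for illustration.

```python
def sum_by_counting(n):
    """Add 1..n one term at a time; return (result, number_of_additions)."""
    total, steps = 0, 0
    for k in range(1, n + 1):
        total += k
        steps += 1
    return total, steps


def sum_by_formula(n):
    """Use the closed form n(n+1)/2; return (result, arithmetic_steps)."""
    # One addition, one multiplication, one division, whatever the size of n.
    return n * (n + 1) // 2, 3


def same_mapping(inputs):
    """Weak equivalence check: do both procedures give the same outputs?"""
    return all(sum_by_counting(n)[0] == sum_by_formula(n)[0] for n in inputs)


def step_profile(procedure, inputs):
    """Record how the amount of work grows with the input (hypothetical measure)."""
    return [procedure(n)[1] for n in inputs]


INPUTS = [1, 10, 100]
```

Here same_mapping(INPUTS) holds, so the two procedures are weakly equivalent on these inputs. Their step profiles differ, however: one grows with n while the other stays constant. Input/output agreement alone would therefore not license the claim that they are the same procedure, and some additional evidence about how the output is produced is needed before strong equivalence can be asserted.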

What is required of researchers if connectionist models are to be developed that at least have the potential to be strongly equivalent to human information processing? The same that is required of any modelers in cognitive science: they must provide evidence that the functional architecture for their networks is functionally equivalent to that of the systems they model. The preceding sections of this paper show that current connectionist research does not provide this type of evidence. We envision two directions in which a future research programme could develop.

In one direction, connectionists would continue to distance themselves from Classical theorists by focusing on the biological plausibility of their networks. However, as we have shown above, such a focus requires substantial elaboration of proposals for PDP architectures. Many more computationally relevant properties must be considered, including nonmonotonic activation functions, fixed biases, and limited patterns of connectivity. These considerations have motivated our own research on the value unit architecture (Dawson & Schopflocher, 1992b; Dawson, Schopflocher, Kidd, & Shamanski, 1992). Specific proposals are required for incorporating learning rules directly into the architecture (see also Dawson & Schopflocher, 1992a). Different levels of processing may also need to be explored. For instance, the implicit assumption underlying much of connectionism is that processing units are analogous to neurons, and connections are analogous to synapses. However, the recent development of the silicon neuron resulted from describing processing at the level of ion gates in nerve membranes (Mahowald & Douglas, 1991). We are currently exploring connectionist systems in which processing unit activation indicates whether ion gates are open or closed, and in which connections represent ion currents.
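To indicate why the choice of activation function is computationally relevant, the following sketch contrasts a monotonic unit with a nonmonotonic one. It is an illustration under stated assumptions, not the value unit implementation cited above: the Gaussian form, its centre mu = 1, the weights, and the 0.5 decision threshold are all chosen for this example.

```python
import math


def logistic(net):
    """Monotonic activation: output only rises as net input rises."""
    return 1.0 / (1.0 + math.exp(-net))


def gaussian(net, mu=1.0):
    """Nonmonotonic activation (assumed form): peaks at net == mu, falls on both sides."""
    return math.exp(-math.pi * (net - mu) ** 2)


def unit_output(activation, inputs, weights):
    """Single processing unit: weighted sum of inputs passed through an activation."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return activation(net)


# Exclusive-or: the unit should turn on for exactly one active input.
XOR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
WEIGHTS = (1.0, 1.0)  # hypothetical weights, identical for both units


def classify(activation, threshold=0.5):
    """Return the unit's on/off response to each of the four XOR patterns."""
    return [int(unit_output(activation, x, WEIGHTS) > threshold)
            for x, _ in XOR_PATTERNS]
```

With these settings the Gaussian unit responds only when the net input is near 1, so a single unit separates the exclusive-or patterns. The logistic unit with the same weights cannot, because its response to two active inputs is always at least as large as its response to one.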

In the other direction, connectionists could move towards an integration with the classical approach, by abandoning the notion that they offer a paradigm shift (cf. Schneider, 1987), and by treating their networks as active data structures or dynamic symbols capable of being manipulated in serial by rules (cf. Bechtel, 1988; Hawthorne, 1989). Such an integration requires substantial development of principles governing interactions between networks, and a willingness to reject the uniformity hypothesis that all cognition is explicable in terms of the "generic" connectionist architecture (see Clark, 1989, p. 128). From this perspective, PDP networks would fulfill the same theoretical role in cognitive science as have other proposals for representational primitives, such as schemas and images. There is an emerging consensus among vision researchers that hybrid models are required to account for existing data on human perception (e.g., Hurlbert & Poggio, 1985; Pylyshyn, 1989; Treisman, 1986). Perhaps dynamic properties of networks-as-symbols could be used to provide a rigorous framework for such models.

Conclusion

As it is currently practiced, connectionism produces systems that generate interesting and sophisticated behaviours. However, connectionism has little to offer in terms of theoretical accounts of the internal structures, procedures, and representations that produce this behaviour. In many respects, the New Connectionism is becoming the New Behaviourism: to echo Hillis' (1988, p. 176) concern, connectionist networks allow "for the possibility of constructing intelligence without first understanding it". This is perfectly legitimate if a primary goal is merely to build artifacts that generate useful behaviour. Unfortunately, the theories of cognitive science must meet additional criteria, which require connectionists to exert a great deal more effort relating the specific properties of their networks to the implementational or architectural properties of the systems that provide their "neuronal inspiration".

References

Abu-Mostafa, Y.S., & Psaltis, D. (1987). Optical neural computers. Scientific American, 256(3), 88-95.
Ackley, D.H., Hinton, G.E., & Sejnowski, T.J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147-169.
Anderson, J.A. (1972). A simple neural network generating an interactive memory. Mathematical Biosciences, 14, 197-220.
Anderson, J.A., & Rosenfeld, E. (1988). Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
Anderson, J.A., Silverstein, J.W., Ritz, S.R., & Jones, R.S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413-451.
Antrobus, J. (1991). Dreaming: cognitive processes during cortical activation and high afferent thresholds. Psychological Review, 98, 96-121.
Ballard, D.H. (1986). Cortical connections and parallel processing: Structure and function. Behavioural and Brain Sciences, 9, 67-120.
Barnard, E., & Casasent, D. (1989). A comparison between criterion functions for linear classifiers, with an application to neural nets. IEEE Transactions on Systems, Man, and Cybernetics, 19, 834-846.
Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.
Bechtel, W. (1985). Contemporary connectionism: Are the new parallel distributed processing models cognitive or associationist? Behaviourism, 13, 53-61.
Bechtel, W. (1988). Connectionism and rules and representation systems: are they compatible? Philosophical Psychology, 1, 5-16.
Bechtel, W., & Abrahamsen, A. (1991). Connectionism and the mind. Cambridge, MA: Basil Blackwell.
Bengio, Y., & de Mori, R. (1989). Use of multilayer networks for the recognition of phonetic features and phonemes. Computational Intelligence, 5, 134-141.
Besner, D., Twilley, L., McCann, R.S., & Seergobin, K. (1990). On the association between connectionism and data: Are a few words necessary? Psychological Review, 97, 432-446.
Bever, T.G., Fodor, J.A., & Garrett, M. (1968). A formal limitation of associationism. In T.R. Dixon and D.L. Horton (Eds.) Verbal behavior and general behavior theory. Englewood Cliffs, N.J.: Prentice-Hall.
Braham, R., & Hamblen, J.O. (1990). The design of a neural network with a biologically motivated architecture. IEEE Transactions on Neural Networks, 1, 251-262.
Broadbent, D. (1985). A question of levels: Comment on McClelland and Rumelhart. Journal of Experimental Psychology: General, 114, 189-192.
Brooks, R.A. (1989). A robot that walks; emergent behaviours from a carefully evolved network. Neural Computation, 1, 253-262.
Carpenter, G.A. (1989). Neural network models for pattern recognition and associative memory. Neural Networks, 2, 243-257.
Churchland, P.S., & Sejnowski, T. (1989). Neural representation and neural computation. In L. Nadel, L.A. Cooper, P. Culicover, & R.M. Harnish (Eds.) Neural connections, mental computation. Cambridge, MA: MIT Press.
Clark, A. (1989). Microcognition. Cambridge, MA: MIT Press.
Cohen, J.D., Dunbar, K., & McClelland, J.L. (1991). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332-361.
Cotman, C.W., Monaghan, D.T., & Ganong, A.H. (1988). Excitatory amino acid neurotransmission: NMDA receptors and Hebb-type synaptic plasticity. Annual Review of Neuroscience, 11, 61-80.
Cotter, N.E. (1990). The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1, 290-295.
Cowan, J.D., & Sharp, D.H. (1988). Neural nets and artificial intelligence. In S. Graubard (Ed.) The artificial intelligence debate. Cambridge, MA: MIT Press.
Crick, F., & Asanuma, C. (1986). Certain aspects of the anatomy and physiology of the cerebral cortex. In J. McClelland, D. Rumelhart, & the PDP Group (Eds.) Parallel Distributed Processing, V.2. Cambridge, MA: MIT Press.
Cummins, R. (1983). The nature of psychological explanation. Cambridge, MA: MIT Press.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303-314.
Dawson, M.R.W. (1990). Apparent motion and element connectedness. Spatial Vision, 4, 241-251.
Dawson, M.R.W. (1991). The how and why of what went where in apparent motion: Modeling solutions to the motion correspondence problem. Psychological Review, 98, 569-603.
Dawson, M.R.W., & Schopflocher, D.P. (1992a). Autonomous processing in PDP networks. Philosophical Psychology, in press.
Dawson, M.R.W., & Schopflocher, D.P. (1992b). Modifying the generalized delta rule to train networks of nonmonotonic processors for pattern classification. Connection Science, in press.
Dawson, M.R.W., Schopflocher, D.P., Kidd, J., & Shamanski, K.S. (1992). Training networks of value units. Proceedings of the Ninth Canadian Artificial Intelligence Conference, in press.
Dell, G.S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283-321.
Douglas, R.J., & Martin, K.A.C. (1991). Opening the grey box. Trends In Neuroscience, 14, 286-293.
Dudai, Y. (1989). The neurobiology of memory. New York: Oxford University Press.
Eich, J.M. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-661.
Fodor, J.A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture. Cognition, 28, 3-71.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.
Getting, P.A. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185-204.
Girosi, F., & Poggio, T. (1990). Networks and the best approximation property. Biological Cybernetics, 63, 169-176.
Granger, R., Ambros-Ingerson, J., & Lynch, G. (1989). Derivation of encoding characteristics of layer II cerebral cortex. Journal of Cognitive Neuroscience, 1, 61-87.
Grossberg, S. (1980). How does the brain build a cognitive code? Psychological Review, 87, 1-51.
Grossberg, S. (1991). Why do parallel cortical systems exist for the perception of static form and moving form? Perception & Psychophysics, 49, 117-141.
Grossberg, S., & Rudd, M. (1989). A neural architecture for visual motion perception: Group and element apparent motion. Neural Networks, 2, 421-450.
Grossberg, S., & Rudd, M. (1992). Cortical dynamics of visual motion perception: Short-range and long-range apparent motion. Psychological Review, 99, 78-121.
Hagiwara, M. (1990). Novel backpropagation algorithm for reduction of hidden units and acceleration of convergence using artificial search. Proceedings of the IEEE Joint Conference On Neural Networks, Vol. I, 625-630.
Hanson, S.J., & Burr, D.J. (1990). What connectionist models learn: Learning and representation in connectionist networks. Behavioural and Brain Sciences, 13, 471-518.
Hanson, S.J., & Olson, C.R. (1991). Neural networks and natural intelligence: Notes from Mudville. Connection Science, 3, 332-335.
Hartman, E., Keeler, J.D., & Kowalski, J.M. (1989). Layered neural networks with Gaussian hidden units as universal approximation. Neural Computation, 2, 210-215.
Haugeland, J. (1985). Artificial intelligence: The very idea. Cambridge, MA: MIT Press.
Hawthorne, J. (1989). On the compatibility of connectionist and classical models. Philosophical Psychology, 2, 5-15.
Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison-Wesley.
Hillis, W.D. (1985). The connection machine. Cambridge, MA: MIT Press.
Hillis, W.D. (1988). Intelligence as emergent behaviour, or, the songs of Eden. In S.R. Graubard (Ed.) The artificial intelligence debate. Cambridge, MA: MIT Press.
Hinton, G.E., & Shallice, T. (1991). Lesioning an attractor network: Investigations of acquired dyslexia. Psychological Review, 98, 74-95.
Hodgkin, A.L., & Huxley, A.F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500-544.
Hopcroft, J.E., & Ullman, J.D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554-2558.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
Hurlbert, A., & Poggio, T. (1985). Spotlight on attention. Trends In Neurosciences, 8, 309-311.
Jabri, M., & Flower, B. (1991). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Computation, 3, 546-565.
Jain, A.N. (1991). Parsing complex sentences with structured connectionist networks. Neural Computation, 3, 110-120.
Johnson-Laird, P.N. (1983). Mental models. Cambridge, MA: Harvard University Press.
Jordan, M.I. (1986). An introduction to linear algebra in parallel distributed processing. In D. Rumelhart, J. McClelland & the PDP Group (Eds.) Parallel Distributed Processing, V.1. Cambridge, MA: MIT Press.
Julesz, B. (1981). Textons, the elements of texture perception, and their interactions. Nature, 290, 91-97.
Kehoe, E.J. (1988). A layered network model of associative learning: Learning to learn and configuration. Psychological Review, 95, 411-433.
Knapp, A., & Anderson, J.A. (1984). A signal averaging model for concept formation. Journal of Experimental Psychology: Learning, Memory and Cognition, 10, 616-637.
Kohonen, T. (1977). Associative memory: A system-theoretical approach. New York: Springer-Verlag.
Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.
Kruschke, J.K. (1990). How connectionist models learn: The course of learning in connectionist networks. Behavioural and Brain Sciences, 13, 498-499.
Kuffler, S.W., Nicholls, J.G., & Martin, A.R. (1984). From neuron to brain, 2nd edition. Sunderland, MA: Sinauer Associates.
Kupfermann, I., Castellucci, V., Pinsker, H., & Kandel, E.R. (1970). Neuronal correlates of habituation and dishabituation of the gill-withdrawal reflex in Aplysia. Science, 167, 1743-1745.
Lachter, J., & Bever, T.G. (1988). The relation between linguistic structure and associative theories of language learning -- A constructive critique of some connectionist learning models. Cognition, 28, 195-247.
Levelt, W.J.M. (1990). Are multilayer feedforward networks effectively Turing machines? Psychological Research, 52, 153-157.
Levitan, I.B., & Kaczmarek, L.K. (1991). The neuron: Cell and molecular biology. New York: Oxford University Press.
Lippmann, R.P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, April, 4-22.
Lippmann, R.P. (1989). Pattern classification using neural networks. IEEE Communications Magazine, November, 47-64.
Lucas, S.M., & Damper, R.I. (1990). Syntactic neural networks. Connection Science, 2, 195-221.
Lynch, G., Granger, R., Larson, J., & Baudry, M. (1989). Cortical encoding of memory: Hypotheses derived from analysis and simulation of physiological learning rules in anatomical structures. In L. Nadel, L.A. Cooper, P. Culicover, & R.M. Harnish (Eds.) Neural connections, mental computation. Cambridge, MA: MIT Press.
Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354, 515-518.
Marr, D. (1982). Vision. San Francisco: W.H. Freeman.
Massaro, D.W. (1988). Some criticisms of connectionist models of human performance. Journal of Memory and Language, 27, 213-234.
Massicotte, G., & Baudry, M. (1991). Triggers and substrates of hippocampal synaptic plasticity. Neuroscience & Biobehavioural Reviews, 15, 415-423.
McClelland, J.L., & Rumelhart, D.E. (1988). Explorations in parallel distributed processing. Cambridge, MA: MIT Press.
McClelland, J.L., Rumelhart, D.E., & Hinton, G.E. (1986). The appeal of parallel distributed processing. In D. Rumelhart, J. McClelland, & the PDP Group (Eds.) Parallel Distributed Processing, V.1. Cambridge, MA: MIT Press.
McCulloch, W.S., & Pitts, W. (1988). A logical calculus of the ideas immanent in nervous activity. In J. Anderson and E. Rosenfeld (Eds.) Neurocomputing: Foundations of research. Cambridge, MA: MIT Press. (Originally published in 1943).
Minsky, M. (1972). Computation: Finite and infinite machines. London: Prentice-Hall.
Minsky, M., & Papert, S. (1988). Perceptrons. Cambridge, MA: MIT Press. (Originally published in 1969.)
Moody, J., & Darken, C.J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281-294.
Moorhead, I.R., Haig, N.D., & Clement, R.A. (1989). An investigation of trained neural networks from a neurophysiological perspective. Perception, 18, 793-803.
Mozer, M.C., & Smolensky, P. (1989). Using relevance to reduce network size automatically. Connection Science, 1, 3-16.
Müller, B., & Reinhardt, J. (1990). Neural networks. Berlin: Springer-Verlag.
Murdock, B.B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135-183.
Papert, S. (1988). One AI or many? In S.R. Graubard (Ed.) The artificial intelligence debate. Cambridge, MA: MIT Press.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.
Poggio, T., & Girosi, F. (1989). A theory of networks for approximation and learning. MIT AI Lab Memo No. 1140.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978-982.
Pomerleau, D.A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3, 88-97.
Pylyshyn, Z.W. (1980). Computation and cognition: Issues in the foundations of cognitive science. Behavioural and Brain Sciences, 3, 111-169.
Pylyshyn, Z.W. (1984). Computation and cognition. Cambridge, MA: MIT Press.
Pylyshyn, Z.W. (1989). The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition, 32, 65-97.
Rager, J., & Berg, G. (1990). A connectionist model of motion and government in Chomsky's government-binding theory. Connection Science, 2, 35-52.
Reeke, G.N., & Edelman, G.M. (1988). Real brains and artificial intelligence. In S.R. Graubard (Ed.) The artificial intelligence debate. Cambridge, MA: MIT Press.
Renals, S. (1989). Radial basis function network for speech pattern classification. Electronics Letters, 25, 437-439.
Rosenblatt, F. (1962). Principles of neurodynamics. Washington: Spartan Books.
Rumelhart, D.E., McClelland, J.L., & the PDP Group. (1986). Parallel Distributed Processing, V.1. Cambridge, MA: MIT Press.
Rumelhart, D.E., Hinton, G.E., & McClelland, J.L. (1986). A general framework for parallel distributed processing. In D. Rumelhart, J. McClelland, & the PDP Group (Eds.) Parallel Distributed Processing, V.1. Cambridge, MA: MIT Press.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986a). Learning representations by back-propagating errors. Nature, 323, 533-536.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986b). Learning internal representations by error backpropagation. In D. Rumelhart, J. McClelland, & the PDP Group (Eds.) Parallel Distributed Processing, V.1. Cambridge, MA: MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986). On learning the past tenses of English verbs. In J. McClelland, D. Rumelhart, & the PDP Group (Eds.) Parallel Distributed Processing, V.2. Cambridge, MA: MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1985). Levels indeed! A response to Broadbent. Journal of Experimental Psychology: General, 114, 193-197.
Rumelhart, D.E., Smolensky, P., McClelland, J.L., & Hinton, G.E. (1986). Schemata and sequential thought processes in PDP models. In J. McClelland, D. Rumelhart, & the PDP Group (Eds.) Parallel Distributed Processing, V.2. Cambridge, MA: MIT Press.
Santrock, J.W. (1988). Psychology: The science of mind and behaviour. Dubuque, Iowa: Wm. C. Brown Publishers.
Schneider, W. (1987). Connectionism: Is it a paradigm shift for psychology? Behavior Research Methods, Instruments, & Computers, 19, 73-83.
Seidenberg, M.S., & McClelland, J.L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523-568.
Seidenberg, M.S., & McClelland, J.L. (1990). More words but still no lexicon: Reply to Besner et al. (1990). Psychological Review, 97, 447-452.
Selfridge, O.G. (1956). Pattern recognition and learning. In C. Cherry (Ed.) Information theory. London: Butterworths Scientific Publications.
Sietsma, J., & Dow, R.J.F. (1988). Neural net pruning -- why and how. Proceedings of the IEEE Joint International Conference On Neural Networks, Vol. I, 325-333.
Smolensky, P. (1988). On the proper treatment of connectionism. Behavioural and Brain Sciences, 11, 1-74.
Steinbuch, K. (1961). Die lernmatrix. Kybernetik, 1, 36-45.
Taylor, W.K. (1956). Electrical simulation of some nervous system functional activities. In C. Cherry (Ed.) Information theory. London: Butterworths Scientific Publications.
Treisman, A. (1986). Features and objects in visual processing. Scientific American, 255(5), 114B-125.
Ullman, S. (1979). The interpretation of visual motion. Cambridge, MA: MIT Press.

Acknowledgements

This paper was supported by Natural Sciences and Engineering Research Council of Canada operating grant 2038 and equipment grant 46584, both awarded to the first author. We would like to thank the following members of the Biological Computation Project for their helpful comments: Istvan Berkeley, Matthew Duncan, Tim Gannon, David Hall, James Kidd, and Don Schopflocher. Thanks as well to Nancy Digdon and Dallas Treit. Address reprint requests to: Dr. Michael Dawson, Biological Computation Project, Department of Psychology, University of Alberta, Edmonton, AB CANADA T6G 2E9. Electronic mail: mike@psych.ualberta.ca.
