1
|
- Building Associations
- Hebb Learning
- Delta Learning
- Making Decisions
- Linear Activation Function
- Nonlinear Activation Functions
- Perceptrons, Pros and Cons
|
2
|
- “When two elementary brain-processes have been active together or in
immediate succession, one of them, on reoccurring, tends to propagate
its excitement into the other” (James, 1890)
- “When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or
metabolic change takes place in one or both cells such that A’s
efficiency, as one of the cells firing B, is increased” (Hebb, 1949)
|
3
|
- Modern views of neural association involve the strengthening of synapses
(both excitatory and inhibitory) as well as the weakening of synapses
- These two processes have been combined to create many interesting models
of distributed associative memory
|
4
|
- Hebb rule has many problems
- Only learns orthogonal patterns
- Produces errors when overtrained
- Unable to deal with linear dependence
- The delta rule overcomes many of these problems
- Can deal with some correlated patterns
- Only modifies weights when errors exist
- Still cannot deal with linear dependence
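The contrast can be sketched in code. Below is a minimal sketch for a linear associator, assuming made-up 4-D cue vectors and illustrative training constants (the function names and values are not from the lecture):

```python
import numpy as np

def hebb(inputs, targets):
    # Hebb rule: accumulate the outer products t * c^T in a single pass.
    W = np.zeros((targets.shape[1], inputs.shape[1]))
    for c, t in zip(inputs, targets):
        W += np.outer(t, c)
    return W

def delta(inputs, targets, eta=0.1, epochs=200):
    # Delta rule: change weights only in proportion to the error (t - o).
    W = np.zeros((targets.shape[1], inputs.shape[1]))
    for _ in range(epochs):
        for c, t in zip(inputs, targets):
            o = W @ c
            W += eta * np.outer(t - o, c)
    return W

# Two correlated (non-orthogonal) but linearly independent cue vectors.
cues = np.array([[1.0, 0.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0, 0.0]])
targets = np.array([[1.0], [0.0]])

Wh = hebb(cues, targets)
Wd = delta(cues, targets)
print("Hebb recall:", (Wh @ cues.T).ravel())   # cross-talk: [2. 1.], not [1. 0.]
print("Delta recall:", (Wd @ cues.T).ravel())  # converges to [1. 0.]
```

Because the cues overlap, the Hebb rule produces cross-talk at recall; the delta rule keeps correcting until the error vanishes.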
|
5
|
- One possibility for overcoming these problems would be to build a more
powerful network
- For example, perhaps a chain of distributed associative memories would
serve the purpose
- In this chain, the output of one DAM would be passed along as input to
another, so that layers of connections would be exploited
|
6
|
- Linear algebra shows that these sequences can be reduced to a memory
with one layer of connections
- In other words, the sequences don’t add power
- r = W1(W2c) = (W1W2)c
- r = Xc, where X = W1W2
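The collapse is easy to verify numerically. A minimal sketch (the matrix shapes and random seed are arbitrary illustrative choices):

```python
import numpy as np

# Two "layers" of linear connections, W1 and W2, applied to a cue c.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5))
W2 = rng.standard_normal((5, 4))
c = rng.standard_normal(4)

r_chain = W1 @ (W2 @ c)   # r = W1(W2 c): output of the chained memories
X = W1 @ W2               # collapse the chain into one matrix
r_single = X @ c          # r = X c: one layer of connections

print(np.allclose(r_chain, r_single))  # True
```

The single matrix X produces exactly the same responses, so the chain adds no computational power.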
|
7
|
- Why won’t these sequences add power?
- It is because unit activation is a linear function of net input
- For layers to add something that can’t be removed by linear algebra, a
nonlinear transformation of net input must be provided
- In short, we need to use a nonlinear activation function in our
processors
- Fortunately, many are available
|
8
|
|
9
|
|
10
|
|
11
|
|
12
|
|
13
|
- A perceptron can be viewed as a distributed memory whose output units
use nonlinear activation functions
- It is used to associate an input pattern with a category name
- A perceptron was a trainable pattern classifier!
|
14
|
- We would like very much for a perceptron to learn how to categorize
patterns
- This is exactly what Frank Rosenblatt (e.g., 1958) was able to provide:
a learning rule that was guaranteed to train a perceptron to represent
the solution to any problem that a perceptron could solve
|
15
|
- Assume that the perceptron uses a threshold activation function in its
output unit. Rosenblatt used this rule:
- Wij(new) = Wij(old) + η(tj – oj)ai
- Compare this learning rule to the delta rule for DAM. Why does this rule make sense?
- ΔWt+1 = η((t – o) · cT)
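Rosenblatt's rule is simple to implement. A minimal sketch, assuming a single threshold output unit with the bias treated as an extra weight (the function name and the OR demonstration are illustrative, not from the lecture):

```python
import numpy as np

def train_perceptron(patterns, targets, eta=0.5, epochs=50):
    X = np.hstack([patterns, np.ones((len(patterns), 1))])  # bias as extra input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for a, t in zip(X, targets):
            o = 1.0 if a @ w > 0 else 0.0   # threshold activation function
            w += eta * (t - o) * a          # weights change only when t != o
    return w

# OR is linearly separable, so the rule is guaranteed to converge on it.
OR = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t_or = np.array([0.0, 1.0, 1.0, 1.0])
w = train_perceptron(OR, t_or)
X = np.hstack([OR, np.ones((4, 1))])
print([1.0 if a @ w > 0 else 0.0 for a in X])  # [0.0, 1.0, 1.0, 1.0]
```

Like the delta rule for DAM, the rule only modifies weights when the output differs from the target; unlike the DAM rule, the output passes through a nonlinear threshold first.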
|
16
|
- Assume that the perceptron uses a sigmoid activation function
- Calculus can be used to determine a gradient descent rule that moves the
network downhill in error space as fast as possible
- The calculus is only possible because the sigmoid is a continuous
approximation of the threshold function
|
17
|
- Define a “least squares” error term
- E = Σ(t – o)²
- Use calculus to determine how this error term is changed by a weight
change
- Use this information to define the fastest decrease in error possible
- For f(net) = 1/(1 + exp(–net)):
- Wij(new) = Wij(old) + η(tj – oj)f′(netj)ai
- Since f′(netj) = oj(1 – oj):
- Wij(new) = Wij(old) + η(tj – oj)(oj)(1 – oj)ai
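One step of this rule can be sketched directly. A minimal sketch (the starting weights, single training pattern, and learning rate are made-up values for illustration):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def gradient_step(w, a, t, eta=0.5):
    # w(new) = w(old) + eta * (t - o) * o * (1 - o) * a,
    # using the sigmoid derivative f'(net) = o(1 - o).
    o = sigmoid(a @ w)
    return w + eta * (t - o) * o * (1.0 - o) * a

w = np.array([0.2, -0.4])   # made-up starting weights
a = np.array([1.0, 1.0])    # one training pattern
t = 1.0                     # its target
for _ in range(1000):
    w = gradient_step(w, a, t)
print(sigmoid(a @ w))       # output has climbed toward the target 1.0
```

Each step moves the weights in the direction that decreases the squared error fastest; the o(1 − o) factor shrinks the steps as the output saturates.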
|
18
|
- The gradient descent rule treats error as a surface
- It tries to find the lowest point of this surface
- It uses calculus to find the steepest slope downhill from its current
location on the space
|
19
|
- The perceptron is an improvement over the linear distributed associative
memories that we have already discussed
- It can solve psychologically interesting problems
- It can learn to store associations between linearly dependent vectors
|
20
|
- Gallistel has argued that choice behaviour in animals mirrors
reinforcement contingencies
- “Every day two naturalists go out to a pond where some ducks are
overwintering and station themselves about 30 yards apart. Each carries
a sack of bread chunks. Each day a randomly chosen one of the
naturalists throws a chunk every 5 seconds; the other throws every 10
seconds. After a few days experience with this drill, the ducks divide
themselves in proportion to the throwing rates; within 1 minute after
the onset of throwing, there are twice as many ducks in front of the
naturalist that throws at twice the rate of the other. One day, however,
the slower thrower throws chunks twice as big. At first the ducks
distribute themselves two to one in favor of the faster thrower, but
within 5 minutes they are divided fifty-fifty between the two
“foraging patches.” … Ducks and other foraging animals can
represent rates of return, the number of items per unit time multiplied
by the average size of an item” (Gallistel, 1990).
|
21
|
- Perceptron responses are literally a probability judgment about being
reinforced, as shown by Dawson et al. (2009)
- Responses show probability matching
- Probability matching quickly adapts to changing contingencies
|
22
|
- Draw a figure on the retina
- Choose any two points on the retina (p and r)
- Choose any point q on the line segment connecting p and r
- For any one of these triplets, if p and r are in the figure, but q is
not, then the figure is not convex
- P ___________Q___________ R
|
23
|
|
24
|
- φpqr is a mask: it equals 1 if p and r are in the figure but q is not,
and 0 otherwise
- Let there be one such mask for every combination of 3 points, and let
each mask have a weight of 1
- Calculate Σφpqr
- If Σφpqr = 0 the figure is convex
- If Σφpqr ≥ 1 the figure is not convex
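A minimal sketch of this predicate on a 1-D retina (the retina size and example figures are made up; the mask sum counts convexity violations):

```python
from itertools import combinations

def convexity_violations(figure):
    """Sum of the masks phi_pqr over all triplets where p and r are in the
    figure but a point q between them is not. Convex iff the sum is 0."""
    total = 0
    for p, r in combinations(sorted(figure), 2):
        for q in range(p + 1, r):   # every retinal point between p and r
            if q not in figure:     # phi_pqr = 1 for this triplet
                total += 1
    return total

print(convexity_violations({1, 2, 3, 4}))  # 0 -> convex
print(convexity_violations({1, 2, 5, 6}))  # 8 -> not convex (gaps at 3 and 4)
```

Each mask is a simple local detector, yet summing them with unit weights decides a global property of the figure.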
|
25
|
- In their book Perceptrons, Minsky and Papert used mathematics to
investigate what perceptrons could and could not learn to do
- They discovered some interesting, and serious, limitations to the
capabilities of perceptrons
- The result was an extreme decline in neural network research
|
26
|
- Networks are frequently used to classify patterns
- They carve a pattern space into decision regions
- Patterns are classified according to these decision regions
|
27
|
|
28
|
|
29
|
- A single, straight cut through the pattern space solves the AND problem
- This means this problem is linearly separable
- The networks of Old Connectionism could learn to solve such problems
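For AND, the single cut can be written down directly. A minimal sketch (the weights 1, 1 and threshold 1.5 are just one of many cuts that work):

```python
# AND pattern space: only the point (1, 1) lies above the line a1 + a2 = 1.5.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [1 if a1 + a2 > 1.5 else 0 for a1, a2 in patterns]
print(outputs)  # [0, 0, 0, 1]
```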
|
30
|
|
31
|
|
32
|
- XOR is not a linearly separable problem
- This is because more than 1 cut is required
- As a result, Old Connectionism could not train networks to deal with
this problem
- XOR is a problem for New Connectionism
- Or, a problem for a perceptron with a more sophisticated activation
function!
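Both claims can be checked by brute force. A minimal sketch (the weight grid is an arbitrary illustrative search; the value unit uses a Gaussian activation G(net) = exp(−π(net − μ)²) with μ = 1):

```python
import numpy as np

patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([0, 1, 1, 0])

# Threshold unit: a coarse search over weights and thresholds finds no
# single cut that solves XOR (none exists, since XOR is not separable).
solved = False
grid = np.linspace(-2, 2, 21)
for w1 in grid:
    for w2 in grid:
        for theta in grid:
            o = (patterns @ [w1, w2] > theta).astype(int)
            if (o == xor).all():
                solved = True
print("threshold solves XOR:", solved)   # False

# Value unit: fires when net input falls near mu = 1, i.e. inside a band.
net = patterns @ [1.0, 1.0]
o = (np.exp(-np.pi * (net - 1.0) ** 2) > 0.5).astype(int)
print("value unit on XOR:", o)           # [0 1 1 0]
```

The Gaussian carves a band (two parallel cuts) through pattern space, which is exactly what XOR requires.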
|
33
|
- Value units are named after Ballard (1986)
- They use a Gaussian activation function
- G(netpj) = exp[–π(netpj – μj)²]
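A minimal sketch of this activation function (μ defaults to 0 here; the sample net inputs are arbitrary):

```python
import math

def gaussian_activation(net, mu=0.0):
    # G(net) = exp(-pi * (net - mu)^2): maximal (1.0) when net equals mu,
    # falling toward 0 on either side.
    return math.exp(-math.pi * (net - mu) ** 2)

print(round(gaussian_activation(0.0), 3))   # 1.0
print(round(gaussian_activation(0.5), 3))   # 0.456
print(round(gaussian_activation(-0.5), 3))  # 0.456
```

Unlike the monotonic sigmoid, this function is tuned: it responds to a band of net inputs centred on μ rather than to everything above a threshold.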
|
34
|
- Standard error term in gradient descent rule
- Dawson & Schopflocher error term
- This second term keeps some of the patterns in the middle of the
distribution!
|
35
|
- For G(netpj) = exp[–π(netpj)²]:
- Wij(new) = Wij(old) + η(tj – oj)G′(net)ai + η(tj · net)G′(net)ai
- Using the Gaussian, and the Rumelhart, Hinton & Williams chain rule
procedure, one can derive a learning rule for value units:
- Δwij = η(δpi – εpi)apj
- Essentially the same as the gradient descent rule, with the exception of
an elaborated (two-component) error term
|
36
|
|
37
|
- Let’s use a perceptron program to explore some of the issues raised in
this lecture
- Ability to perform beyond DAM
- Ability to deal with most of Boolean logic
- Integration device vs. value unit power in terms of small, linearly
nonseparable problems
- Limitations still exist – we will need to add layers of nonlinear
processors to deal with them – and will talk about how to do this next
week
|