We recognize visual objects quickly and effortlessly. How?
"Object recognition is the activation in memory of a representation of a stimulus class -- a chair, a giraffe, or a mushroom -- from an image projected by an object to the retina." Why is this difficult?
First, there is an infinite number of different 2D patterns that correspond to the same 3D object, to say nothing of the additional patterns produced when parts of the 3D object are (e.g.) occluded. "It is precisely this variation -- and the apparent success of our visual system and brain at achieving recognition in the face of it -- that makes the problem of pattern recognition so interesting."
Second, there are problems relating visual information to semantic information. What is a class? Biederman compares and contrasts basic level vs subordinate level vs. superordinate level a la Rosch. "Most of our knowledge of the visual world can be accessed through the basic level." In the literature, this notion of level is refined into the "entry level", to accommodate the finding that sometimes atypical exemplars (e.g., ostrich, duck) are classified more slowly.
Third, there is the problem of capacity/processing speed, as related to the number of objects that can be classified. "There are approximately three thousand entry-level terms for familiar concrete objects that can be identified on the basis of their shape rather than on surface properties of color or texture or on their position in a scene." Perhaps, too, there are 10 different perceptual models required for each of these 3000 classes of objects.
Basic question: "How is the activity of individual photoreceptors employed by the brain to create a representation of an object that allows it to be recognized under such highly varied conditions?"
V1 is described as the frontline of shape perception, because of its orientation-selective simple cells, as well as end-stopped cells that might serve as contour or vertex detectors. "Activation of simple and end-stopped cells are generally believed to provide the initial cortical activity of shape representation." Issue: how do we get more abstract representations that generalize over different V1 activity patterns?
Biederman considers two aspects of this issue: (1) grouping separate features into more abstract parts, and (2) generating an invariant description: "it is particularly useful to have a representation that is the same whatever the viewpoint." How are these problems solved?
Biederman focuses on the detection of viewpoint-invariant properties. Some edge properties exist independent of viewpoint, and are thus "nonaccidental properties". Identifying certain features in an image lets us assume, with confidence, the existence of certain features of an object. "For example, a straight edge in the image is perceived as being a projection of a straight edge in the three-dimensional world. The visual system ignores the possibility that a (highly unlikely) accidental alignment of eye and a curved edge is projecting the image." Biederman presents a number of such properties in Figure 4.2. (NB: Are these invariants natural constraints?)
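Two of the nonaccidental properties from Figure 4.2 -- parallelism and collinearity -- can be made concrete with a small sketch. This is an illustrative toy (the functions, segment encoding, and tolerance are my choices, not anything from the chapter): image edges are 2D segments, and we test whether the relevant property holds within a small angular tolerance.

```python
# Illustrative sketch of testing two nonaccidental properties (parallelism,
# collinearity) on 2D line segments given as endpoint pairs. Toy code, not
# any published implementation.
import math

def angle(seg):
    (x0, y0), (x1, y1) = seg
    return math.atan2(y1 - y0, x1 - x0) % math.pi  # orientation, direction-free

def parallel(a, b, tol=0.05):
    """Nonaccidental: parallel image edges imply parallel edges in 3D."""
    d = abs(angle(a) - angle(b))
    return min(d, math.pi - d) < tol

def collinear(a, b, tol=0.05):
    """Nonaccidental: collinear image edges imply a single straight 3D edge."""
    if not parallel(a, b, tol):
        return False
    # the segment bridging a's start to b's start must share the orientation
    bridge = (a[0], b[0])
    d = abs(angle(a) - angle(bridge))
    return min(d, math.pi - d) < tol

s1 = ((0, 0), (1, 0))
s2 = ((2, 0.0), (3, 0.0))   # collinear with s1
s3 = ((0, 1), (1, 1.0))     # parallel to s1 but offset
print(parallel(s1, s3), collinear(s1, s2), collinear(s1, s3))  # True True False
```

The point of the sketch is the inference direction: detecting the 2D property licenses a confident conclusion about the 3D edge, because an accidental viewpoint producing it is highly unlikely.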
"Complex visual entities almost always invite a decomposition of their elements into simple parts. ... Whenever there is a pair of matched cusps (discontinuities at minima of negative curvature), people will express a strong intuition that the object should be segmented at this region." This is called the transversality principle. (NB: Why might we consider this a natural constraint??)
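The transversality idea can be sketched on a polygonal contour: concavities (vertices where the contour turns the "wrong" way) are the discrete analogue of minima of negative curvature, and so are candidate segmentation points. This is my illustrative reduction, not a published algorithm:

```python
# Sketch of transversality-based segmentation candidates: on a closed,
# counterclockwise polygonal contour, vertices with a negative turn are
# concavities -- where matched cusps would invite part segmentation.
def concave_vertices(poly):
    """Return indices of concave vertices of a counterclockwise polygon."""
    n = len(poly)
    out = []
    for i in range(n):
        (ax, ay), (bx, by), (cx, cy) = poly[i - 1], poly[i], poly[(i + 1) % n]
        # z-component of the cross product of the two incident edge vectors;
        # negative means a right turn, i.e. a concavity, on a CCW contour
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross < 0:
            out.append(i)
    return out

# a square with a notch cut into its top edge
contour = [(0, 0), (4, 0), (4, 4), (2.5, 4), (2, 3), (1.5, 4), (0, 4)]
print(concave_vertices(contour))  # [4] -- the bottom of the notch
```

On a smooth contour the same role is played by curvature minima; the cusp pair bounding the notch is where the intuition says "cut here."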
In general, researchers face two related issues: (1) how to represent image information, and (2) how to match the represented image with information in memory. These problems define a continuum of models. The more complex the image representation, the easier the recognition; the simpler the image representation, the harder the recognition. So there is always a tradeoff between the work spent detecting image features and the work spent recognizing objects. Biederman now proceeds to review three different object recognition models, each placed at a different point along this continuum.
The first is the Lades et al. face recognition system. Input is processed by columns of Gabor filters called Gabor jets. "A column of these filters, each tuned to different orientations and scales but with a maximum responsiveness on the same region of the visual field, is termed a Gabor jet. It roughly corresponds to the simple cells of a V1 hypercolumn." Image input is accomplished by a lattice of such filters. "A particular image results in activation of the different filters to various extents." Matching essentially involves similarity of Gabor jet representations --> match the image lattice to a stored set of lattices; the best match produces the recognition label. This is a view-based model, "because activation values are dependent on the specific view or aspect of the object." Related to this kind of model is the Poggio and Edelman radial basis function model.
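A minimal numerical sketch of the jet idea follows. The filter counts, wavelengths, sigma values, and the cosine-similarity match are illustrative choices of mine, not Lades et al.'s actual parameters; the point is only that a jet is a vector of responses of differently tuned filters centered on one location, and matching compares such vectors.

```python
# Toy Gabor jet: several orientations and scales centered on one patch;
# matching compares jets by cosine similarity. Parameters are illustrative.
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gabor_jet(patch, orientations=4, scales=2):
    """Responses of all filters in the jet to one image patch."""
    responses = []
    for s in range(scales):
        for o in range(orientations):
            k = gabor_kernel(patch.shape[0], wavelength=4 * (s + 1),
                             theta=np.pi * o / orientations,
                             sigma=2.0 * (s + 1))
            responses.append(float(np.sum(patch * k)))
    return np.array(responses)

def jet_similarity(j1, j2):
    return float(np.dot(j1, j2) / (np.linalg.norm(j1) * np.linalg.norm(j2)))

# a vertical-edge patch matches itself better than a horizontal-edge patch
patch_v = np.zeros((15, 15)); patch_v[:, 8:] = 1.0
patch_h = patch_v.T
jv, jh = gabor_jet(patch_v), gabor_jet(patch_h)
print(jet_similarity(jv, jv) > jet_similarity(jv, jh))  # True
```

Because the responses depend on the particular image projected onto the lattice, this kind of representation is inherently view based, which is exactly the property the notes flag.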
The second model reviewed by Biederman is Lowe's SCERPO model. "A number of theorists have proposed schemes that reduce the degree of matching required by considering only those objects in memory that share certain features, which are initially extracted from the image, and only those poses of the object that are consistent with those features." In this case, the input is matched to a stored 3D model. It basically works as follows -- 1) give it the image; 2) it detects edges; 3) it groups edges "according to the viewpoint-invariant properties of collinearity, parallelism, and cotermination"; 4) a preliminary match between some of these features and stored models is attempted; 5) preliminary matches are used to guide the search for further features in a top-down fashion.
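Step 3 of that pipeline can be illustrated with a toy stand-in for one of the grouping relations, cotermination (edges sharing an endpoint). This is my sketch, not SCERPO's actual code; real edge detection and matching are far more involved.

```python
# Illustrative sketch of grouping by cotermination: find pairs of segments
# whose endpoints (nearly) coincide, as at the corners of a junction.
def coterminations(segments, tol=1e-6):
    """Return index pairs of segments that share an endpoint within tol."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if any(abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
                   for p in segments[i] for q in segments[j]):
                pairs.append((i, j))
    return pairs

# an "L" junction (segments 0 and 1 meet at the origin) plus a stray edge
segs = [((0, 0), (1, 0)), ((0, 0), (0, 1)), ((5, 5), (6, 5))]
print(coterminations(segs))  # [(0, 1)]
```

Groupings like this prune the search: only stored models containing a matching junction, in a pose consistent with it, need be considered.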
The last model Biederman considers is his own geon-based system. It "assumes that a given view of an object is represented as an arrangement of simple, viewpoint-invariant, volumetric primitives called geons." There are 24 of these primitives. "The geons have two particularly desirable properties: they can be distinguished from each other from almost any viewpoint, and their identification is highly resistant to visual noise." The basic move is to segment the image into regions, and then represent each segmented region with a geon, which is similar in function to Marr's notion of a generalized cone.
"The set of geons is defined so that they can be differentiated on the basis of dichotomous or trichotomous contrasts of viewpoint-invariant properties to produce twenty-four types of geons." That is, each geon is defined by a set of binary and ternary feature values, where each feature is an invariant property that can be detected in an image. This means that you can go from detected image features to the specification of a geon. "Deriving the geons from contrasts in viewpoint-invariant properties renders the geons themselves largely invariant under changes in viewpoint."
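The count of twenty-four falls out of the contrasts directly. The particular 2 x 2 x 2 x 3 split and attribute labels below are my reconstruction of one decomposition that yields that count (the chapter itself does not list them in this note), so treat the dimension names as illustrative:

```python
# One way 24 geon types arise from dichotomous/trichotomous contrasts of
# viewpoint-invariant properties. Dimension labels are my paraphrase.
from itertools import product

contrasts = {
    "cross-section edge": ["straight", "curved"],               # dichotomous
    "cross-section symmetry": ["symmetric", "asymmetric"],      # dichotomous
    "axis": ["straight", "curved"],                             # dichotomous
    "size sweep": ["constant", "expanding", "expand-contract"], # trichotomous
}

geons = list(product(*contrasts.values()))
print(len(geons))  # 24 = 2 * 2 * 2 * 3
```

Whatever the exact attribute set, the structure is the same: each geon is one cell of a small cross-classification, so detecting the handful of contrast values in the image pins down the geon.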
Objects are defined by relating geons to one another -- i.e., an object is a set of related parts, where each part is a geon. "There are eighty-one combinations of pairwise relations and fifteen attributes. A representation that specifies parts (geons), attributes, and relations independently and explicitly is termed a structural description." Over 10,000,000 2-geon objects can be defined, and over 306 billion 3-geon objects. Therefore, there is a big discrepancy between the number of objects in your vocabulary (e.g., 30,000) and the number of possible objects -- implication: you should need at most 3 geons to specify an object!
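A quick back-of-the-envelope check reproduces both totals from the chapter's own figures (24 geons, 15 attributes, 81 pairwise relations); the counting scheme -- one relation per additional geon -- is my reconstruction of the arithmetic that matches:

```python
# Reproducing the structural-description combinatorics from the chapter's
# figures. The "one relation per added geon" scheme is inferred from the
# fact that it reproduces the stated totals exactly.
GEONS, ATTRIBUTES, RELATIONS = 24, 15, 81
part = GEONS * ATTRIBUTES            # one attributed geon: 360 variants

two_geon = part**2 * RELATIONS       # 10,497,600  -- "over 10,000,000"
three_geon = part**3 * RELATIONS**2  # 306,110,016,000 -- "over 306 billion"
print(two_geon, three_geon)
```

The combinatorial explosion is the argument: a vocabulary of ~30,000 objects is vanishingly small next to the space of possible 2- and 3-geon descriptions, so two or three geons already individuate any familiar object.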
This in turn implies the principle of geon recovery: "if an arrangement of two or three geons can be recovered from the image, objects can be quickly recognized even when they are occluded, rotated in depth, novel, extensively degraded, or lacking customary detail, color, and texture." Experimental results confirm this. Furthermore, single-geon objects are appropriate for entry-level classification, which also requires color and texture information.
Issue -- how does vision deliver a structural description? Biederman has a neural network model in which different layers correspond to different stages of biological extraction of geons. The only really interesting part of the model is that the variable binding problem is solved by using temporal synchrony to associate different kinds of features with the same object.
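The binding-by-synchrony idea can be caricatured in a few lines. This toy (names and spike times invented for illustration) just groups feature units whose firing times coincide; the actual model uses oscillatory dynamics, not set comparison:

```python
# Toy sketch of binding by temporal synchrony: feature units that fire at
# the same times are bound to the same object; units with disjoint firing
# times belong to different objects. Purely illustrative.
def bind_by_synchrony(spike_times):
    """Group feature names whose spike-time sets coincide."""
    groups = {}
    for feature, times in spike_times.items():
        groups.setdefault(frozenset(times), []).append(feature)
    return list(groups.values())

spikes = {
    "curved-edge": {10, 30, 50},
    "vertex-A": {10, 30, 50},       # in phase with curved-edge -> same object
    "straight-edge": {20, 40, 60},  # out of phase -> different object
}
print(bind_by_synchrony(spikes))  # [['curved-edge', 'vertex-A'], ['straight-edge']]
```

The payoff is that "which features belong to which geon" is carried by *when* units fire, not by dedicated connections, so the same feature units can serve many simultaneous objects.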
The two main assumptions of geon theory are that objects are represented in terms of their simple parts, and that parts are characterized by differences in viewpoint-invariant properties. What evidence supports these assumptions?
One test involves the use of complementary images. These are image pairs created by "deleting every other edge and vertex from each geon to create the two images of each object shown in Figure 4.12. The two images, when superimposed, form the intact picture shown in the far left column with no overlap in contour." Biederman has used these images in a priming study, showing that priming occurs, and suggesting that the priming is part-based.
"If object recognition is mediated by a representation of an object's parts, recognition should be particularly difficult if contour is deleted in locations that reduce the recoverability of the parts from the image." This is tested by deleting the same amount of contour in figures, but in some cases the deleted information is required to compute geons, while in other cases the geons can still be computed. Basic finding -- removing information required for geon detection hurts recognition, but deleting other contours while leaving that information intact does not.
Are the parts that produce priming in the complementary image studies really geons? This was tested by distorting images. In one case, distortion affects an invariant property, in the other case it does not. Basic finding -- changing a viewpoint-invariant property slows down recognition, which is consistent with the notion that these parts are geons.
Future wrinkles include extending this theory to scene perception, by defining scenes as "geon clusters".