Computer Vision and Machine Learning

Friday, February 5, 2010

Shape Context Vs FORMS

References:

[1] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape Matching and Object Recognition Using Shape Contexts. PAMI, 24(4):509-522, April 2002.

[2] Zhu and Yuille. FORMS: A Flexible Object Recognition and Modelling System.

D’Arcy Thompson’s vision of modeling the shapes of objects and comparing related forms in the precise language of mathematics is visible, more or less, in both papers. The “Shape Context” (SC) approach represents a shape as a discrete set of points sampled from the internal and external contours of the object, whereas FORMS argues that object shapes are hierarchical in nature and can be modeled at three levels of granularity.

The first thing to note is that we are interested in the “comparison of related forms”, so this vision is not unrealistic: mathematics also compares families of similar curves, and the analogy holds. The difference (and hence the problem) is that curves and basic shapes such as circles and ellipses have well-defined equations whose parameters determine their properties, whereas the contours of natural objects have no well-defined local structure. The good news is that, globally, the shapes of related forms do look similar and can therefore be transformed into one another.

Any model that relaxes the “rigidity” imposed by “very well defined equations of exact shapes” and allows small local deformations will match better. FORMS does this by first decomposing the object into primitives and allowing deformations at that level (bottom-up). In my opinion, though, SC will be more robust to variation and deformation across related forms, because it allows this flexibility at the level of individual points, which is the most basic level possible. SC also uses many more descriptors than FORMS, so it has a better chance of working under occlusion. The price is that matching takes longer with SC.
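
To make the point-level flexibility concrete, here is a minimal sketch of a shape context descriptor: for every sampled point, a log-polar histogram of where the other points lie relative to it, plus the chi-square cost used to compare two histograms. The 5 radial and 12 angular bins follow the paper; the radial limits `r_inner` and `r_outer`, the function names, and the brute-force pairwise computation are my own assumptions and not the authors' implementation.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Return an (N, n_r * n_theta) array: one log-polar histogram per point."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    off_diag = ~np.eye(n, dtype=bool)
    diff = pts[None, :, :] - pts[:, None, :]      # diff[i, j] = pts[j] - pts[i]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    dist = dist / dist[off_diag].mean()           # scale normalisation
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Log-spaced upper edges of the radial bins (assumed limits).
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r)
    r_bin = np.searchsorted(r_edges, dist)        # n_r means "outside r_outer"
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)

    # Count every other point that falls inside the outer radius.
    valid = off_diag & (r_bin < n_r)
    hist = np.zeros((n, n_r * n_theta))
    rows = np.nonzero(valid)[0]
    np.add.at(hist, (rows, (r_bin * n_theta + t_bin)[valid]), 1)
    return hist

def chi2_cost(h1, h2):
    """Chi-square distance between two histograms after normalising each."""
    a = h1 / max(h1.sum(), 1.0)
    b = h2 / max(h2.sum(), 1.0)
    s = a + b
    nz = s > 0
    return 0.5 * np.sum((a[nz] - b[nz]) ** 2 / s[nz])
```

With N sampled points this produces N descriptors of 60 bins each, and matching two shapes means comparing all pairs of descriptors, which is where the extra matching time of SC comes from.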

It initially appealed to me that SC would be better suited to modeling the shapes of plant leaves and FORMS to matching animate objects (animals etc.), because the primitives for a leaf may be very coarse and may not capture the finer variations along its periphery. Comparisons between leaves (or, in general, shapes with minute variations) may be better captured by SC. The notion of primitives makes more sense to me for animate objects, and even if we neglect the slight variations, the matching will still be consistent.

Clothes or external occlusion may be a problem for both methods, but I would expect SC to perform better. It seems we need a different approach to model the foldable parts of animate objects: the model should capture the continuous trace of the folding, or at least be aware of it during matching, because an image may show the part in any state. Images of foldable objects should also contain inner edges, otherwise they will be difficult to match. Both methods apply only to objects that are well represented by their 2-D silhouettes. Different views can also generate very different 2-D silhouettes, so data for different views should be available.

In sum, both approaches seem to work really well in specific scenarios (handwritten digits for SC, well-defined 2-D silhouettes of animate objects for FORMS) but may fail in other settings. Both methods have their own limitations, so I would consider the problem of modeling shapes “perfectly” (as is usual in the mathematical sciences, though here in an approximate sense) and matching them to be still open.

Conditional Independence Assumptions - How Far Can We Take Them?

References:

[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model,” CVPR ’92, pages 379-385.

[2] Crandall, Felzenszwalb and Huttenlocher. Object Recognition by Combining Appearance and Geometry.

The paper by Crandall et al. [2] proposes a class of models called k-fans for part-based object recognition. They discuss the trade-offs that arise from conditional independence assumptions on the parts of an object. At one extreme are models that capture the spatial dependencies between all pairs of parts; accurate detection and localization then become computationally intractable and have to rely on search heuristics. At the other extreme are models that assume no spatial dependence between parts, which makes detection and localization much easier. But while such a model yields computationally tractable recognition and learning procedures, it cannot accurately represent multi-part objects because it captures no relative spatial information. They settle on a family of models between the two extremes, defined by making certain conditional independence assumptions.
The other paper [1] also relies on conditional independence, using an HMM to recognize human actions.
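
As a rough illustration of what the HMM's conditional independence assumptions buy, here is a sketch of the forward algorithm for a discrete-observation HMM, plus a classifier that picks the action model with the highest likelihood. The idea of one HMM per action over quantized frame symbols follows [1]; the function names, the `models` dictionary layout, and the per-step rescaling are my own assumptions.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """log P(obs | model) for a discrete HMM, via the scaled forward recursion.

    obs: sequence of symbol indices
    pi:  (S,) initial state probabilities
    A:   (S, S) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B:   (S, V) emissions,   B[i, k] = P(symbol k | state i)
    """
    log_lik = 0.0
    alpha = pi * B[:, obs[0]]
    for t in range(len(obs)):
        if t > 0:
            # The Markov assumption: the past matters only through alpha.
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale          # rescale to avoid numerical underflow
    return log_lik

def classify(obs, models):
    """models: dict mapping an action name to its (pi, A, B) parameters."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
```

Because each frame depends on the past only through the current hidden state, the likelihood costs O(T S^2) instead of a sum over all S^T state sequences. That tractability is exactly what is lost if the frames of an action do not follow such a chain structure.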

But in my opinion, even with this intermediate model, not all object categories can be recognized. If the object parts have many degrees of freedom with respect to each other, they will still be difficult to localize. The authors propose that reference parts with spatial priors capture the geometric relationships, while the conditional independence assumption on the non-reference parts keeps the model tractable for search. The first problem I see is that learning such a model will be difficult: the search space for the maximum likelihood model can become large for high values of k. Finding the optimal value of k also seems difficult to me for some objects. They show results on the motorbike and airplane datasets, where the object parts are fixed, but none for categories whose parts are not fixed. For human action recognition, the conditional independence assumptions between time-sequential images raise similar issues of learning and accurate localization. For actions in which the time-sequential frames have few dependencies, this model will not work.

In general, I think conditional independence assumptions and the models based on them are appropriate for some kinds of problems (speech recognition, online handwriting recognition), but how far the assumption holds depends on the nature of the problem. Such models can work in some scenarios and fail completely in others, so the assumption should be evaluated properly before it is applied to any recognition problem. For example, the assumed spatial or temporal dependencies between object parts may not be correct for every object or action recognition task.

Thursday, August 20, 2009

Psychology and Vision: Relevancy

References:
[1] E. Rosch, C. Mervis, W. Gray, D. Johnson, and P. Boyes-Braem. Basic Objects in Natural Categories. Cognitive Psychology, 8:382-439.
[2] I. Biederman. Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 94(2):115-147, 1987.

Which paper is more relevant to Vision?

The paper by Rosch et al. defines and argues for the existence of “basic” categories, which carry the most information, have the highest category cue validity, and are the most differentiated from one another. Categories one level more abstract are called superordinate categories; their members share only a few attributes. Categories below the basic level are called subordinate categories; they contain many attributes that overlap with those of other categories. Their experiments on visual detection and priming of classification further showed that the basic level is the most abstract level at which the perceptual identification of an object can be aided.

The second paper, by Biederman, proposes a theory of human image understanding, Recognition-by-Components (RBC), based on geometrical ions (geons): simple volumetric primitives, modeled as generalized cones, that can be derived from edge properties in the image. He further argues that the human visual system first parses the image at regions of concavity to determine the primitive components and then matches their arrangement against a stored representation to identify the object. Contour-based features are also more efficient than color and texture for most categories.

Both papers provide results and conclusions that give insight into the perception and recognition processes of the human brain. One of the central problems in Computer Vision (CV) is recognizing objects in an image. Since the human visual system and the reasoning behind human object recognition are highly developed and efficient, it makes sense to find out how they work and what steps are involved. At the same time, a vision system is mostly interested in recognizing a particular category or object type, not everything that exists in the world. Because the purpose of a CV system is to help humans automate a certain process, we are interested in building a detector or recognizer for the particular objects specific to that process. So knowing that a basic level exists does not help much, except to tell us that recognizing a basic category will be easier than recognizing a superordinate or subordinate one. For example, recognizing “furniture” or “a sleeping chair” will be harder than recognizing simply a chair. This does not reduce the complexity of a given problem; it only explains why some object recognition tasks are more difficult than others.

In my opinion the second paper is, to some extent, more relevant to CV, because the proposed theory can be used, if not everywhere then at least in some contexts, to build an object recognition system. Building a recognizer for primitive components and then identifying a particular object from their arrangement and edge properties may work well in some cases. But matching a particular component and finding the relationships between components is itself a difficult problem that has to be solved first. While the 3-D to 2-D transformation produces a unique image, the 2-D to 3-D direction may admit multiple 3-D arrangements. RBC can succeed if we are able to recover the full arrangement of, and relations among, the components, and solving this has itself been challenging. A given image can arise from different object arrangements and viewpoints. Although we can build some of this knowledge into a vision system, it may fail when an exception occurs. We humans perform well even in exceptional cases, thanks to our other senses and our ability to use and relate past experience, but for a vision system this is not trivial. More interestingly, even for humans, recognizing objects in images is harder than in the 3-D world. For example, there are many image-based optical illusions that confuse even the human brain, whereas I can recall only a few 3-D illusions. We have to admit that 2-D image formation loses information, and any attempt to infer objects in the 3-D world from it can be tricked. But we are interested in the average performance of a vision system, mostly in non-exceptional scenarios, and achieving that should not be impossible. The second paper also proposes that cues based on the primal sketch are more important than color and texture in many cases. This is useful when designing a vision system, because we can prioritize the features or cues it uses.
But the bottom line is that unless a CV system has sufficient knowledge of the 3-D world and of the physical principles governing image formation, it will be difficult for it to mimic the human visual system and perception.
In sum, both papers discuss basic questions in psychology and vision: why is the recognition of certain objects difficult, and how do humans perceive visual information? But unless we have a system with the other capabilities of the human brain, it will be difficult to use these theories to their full extent.

Monday, August 10, 2009

Locally Linear Embedding (LLE) Vs Locality Preserving Projections (LPP)

References:

[1] Saul and Roweis. Think Globally, Fit Locally: Unsupervised Learning of Nonlinear Manifolds (U. Penn. Tech Report CIS-02-18).

[2] He, Yan, Hu, Niyogi, and Zhang. Face Recognition Using Laplacianfaces.

Saul and Roweis [1] present a method for computing a low-dimensional embedding of high-dimensional data assumed to lie on a nonlinear manifold. It tries to preserve the property that points which are nearby in the high-dimensional space remain nearby, and similarly co-located with respect to one another, in the low-dimensional space.
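
A compact sketch of the two LLE steps may help here: first reconstruct each point from its neighbors with weights that sum to one, then find low-dimensional coordinates that are reconstructed by the same weights. The brute-force neighbor search, the regularization constant `reg`, and the function names are my own assumptions; this is an illustration of the idea, not the authors' code.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights (summing to 1) that best reconstruct x from its neighbors."""
    Z = neighbors - x                       # shift so that x is the origin
    C = Z @ Z.T                             # local Gram matrix, shape (k, k)
    # Regularise: C is singular when k exceeds the input dimension.
    C += (reg * np.trace(C) + 1e-12) * np.eye(len(C))
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()

def lle(X, n_neighbors, n_components, reg=1e-3):
    """Embed the rows of X into n_components dimensions."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:n_neighbors]
        W[i, nbrs] = lle_weights(X[i], X[nbrs], reg)
    # Coordinates Y minimising sum_i |Y_i - sum_j W_ij Y_j|^2.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]      # skip the constant eigenvector
```

Note that the output is just one row of coordinates per training point. Nothing in it tells us where a new point should go, which is the limitation discussed further down.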

The Laplacianfaces paper [2] extends the idea of preserving local neighborhood distances by finding a subspace of the given high-dimensional data, which may lie on a manifold. More specifically, the manifold structure is modeled by a nearest-neighbor graph that preserves the local structure of the image space, and a face subspace is obtained by Locality Preserving Projections (LPP). Each face image in the image space is mapped to a low-dimensional face subspace characterized by a set of feature images called Laplacianfaces. The authors claim that theirs is the first approach to face analysis that explicitly considers the nonlinear manifold structure of the data.
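
For comparison, here is a minimal sketch of LPP as I understand it: build a nearest-neighbor graph with heat-kernel weights S, form the graph Laplacian L = D - S, and solve the generalized eigenproblem X^T L X a = lambda X^T D X a for the smallest eigenvalues (rows of X are samples here). The small ridge `eps` stands in for the PCA step the paper uses to keep the right-hand matrix non-singular; that substitution, the parameter defaults, and the function name are my own assumptions.

```python
import numpy as np

def lpp(X, n_neighbors=5, n_components=2, t=1.0, eps=1e-8):
    """Return (P, vals): a (d, n_components) projection and its eigenvalues."""
    n, d = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d2_off = d2.copy()
    np.fill_diagonal(d2_off, np.inf)        # exclude self-neighbors
    nbrs = np.argsort(d2_off, axis=1)[:, :n_neighbors]

    # Heat-kernel weights on the k-nearest-neighbor graph, symmetrised.
    S = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nbrs.ravel()
    S[rows, cols] = np.exp(-d2[rows, cols] / t)
    S = np.maximum(S, S.T)
    D = np.diag(S.sum(axis=1))
    L = D - S

    A = X.T @ L @ X
    B = X.T @ D @ X + eps * np.eye(d)       # ridge instead of the PCA step

    # Reduce A a = lambda B a to an ordinary symmetric eigenproblem.
    Lc_inv = np.linalg.inv(np.linalg.cholesky(B))
    C = Lc_inv @ A @ Lc_inv.T
    vals, U = np.linalg.eigh((C + C.T) / 2)
    P = Lc_inv.T @ U[:, :n_components]
    return P, vals[:n_components]
```

Since the result is a linear map, a face that was not in the training set is embedded simply as `x_new @ P`.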

The main contribution of the Laplacianfaces approach is that it can handle supervised learning tasks such as face recognition, and that even novel data can be represented in the computed subspace. Methods based on LLE yield maps that are defined only on the training points, and how to evaluate them on novel test points is less clear.
Saul and Roweis do suggest ways to generalize LLE to novel points using a non-parametric model or a parametric model, but the former has the big disadvantage of requiring access to the entire set of previously analyzed inputs and outputs, which potentially demands a lot of storage. A parametric model based on mixture models is also suggested, but obtaining a global coordinate system by patching together the local coordinate systems of the individual mixture components is difficult and unclear.

LLE does not explicitly consider the structure of the manifold on which the data or images possibly reside. Kernel-based techniques for face recognition can discover the nonlinear structure of face images, but they are computationally expensive.

Hence, in my opinion, the Laplacianfaces method has the advantage in supervised learning and in applications where knowing the structure of the manifold is important, while LLE is good for dimensionality reduction and visualization of a given data set.

Machine learning in Vision: Challenges and Issues

References:

[1] O. Boiman, E. Shechtman and M. Irani. In Defense of Nearest-Neighbor Based Image Classification.

[2] P. Viola and M. Jones. Robust real-time object detection. Technical Report 2001/01, Compaq CRL, February 2001.

When it comes to handling data from vision problems, the first problem most machine learning techniques face is the high dimensionality of the feature space. Much of the effort goes into reducing the dimension to one where computation is feasible and tractable. Such dimensionality reduction is essential for many learning-based classifiers, but it reduces discriminative power and degrades classification accuracy. As the paper by Boiman et al. [1] points out, dimensionality reduction is especially harmful for non-parametric classification, because there is no training phase to compensate for the loss of information. They further explain that quantizing descriptors with a long-tailed distribution hurts NN-based classification.

Coming up with a good distance measure is sometimes difficult. Boiman et al. argue that while the image-to-image distance is central to kernel-based methods, it is not a good choice for non-parametric classifiers such as NN. This limitation is more severe for classes with large diversity. In addition, when the number of classes is huge (e.g. image classification), many learning algorithms that were originally designed as binary classifiers (for example, Support Vector Machines) need to be extended or applied multiple times to produce multiclass results. This becomes infeasible as the number of classes grows.
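
The alternative Boiman et al. propose is an image-to-class distance: every local descriptor of the query image is matched to its nearest neighbor among all descriptors of a class, pooled over that class's training images, with no quantization. Here is a minimal brute-force sketch; the function names and the dictionary layout for the class pools are my own assumptions.

```python
import numpy as np

def image_to_class_distance(query_desc, class_pool):
    """Sum over query descriptors of the squared distance to the nearest
    descriptor in the class pool.

    query_desc: (n, d) local descriptors of the query image
    class_pool: (m, d) descriptors from all training images of one class
    """
    d2 = ((query_desc[:, None, :] - class_pool[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def nbnn_classify(query_desc, class_pools):
    """class_pools: dict mapping a label to its (m, d) descriptor pool."""
    return min(class_pools,
               key=lambda label: image_to_class_distance(query_desc, class_pools[label]))
```

Because the pool mixes descriptors from many images, a query can be explained by pieces of different training images, which is what helps for classes with large diversity. Adding a class only adds a pool, so no binary classifier has to be retrained. The cost is one nearest-neighbor search per descriptor per class, and this sketch does not address how to make that search fast.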

The paper by Viola and Jones [2] addresses the problem of detecting faces in an image at a very fast rate (15 frames per second). Extra effort is usually needed to make a system work in real time. They came up with a novel image representation called the “Integral Image”, with which features can be computed and evaluated very quickly. Frequently, an additional amount of work on representing the data or handling its scale is required before a suitable machine learning technique can be applied. The features used for training should be robust to rotation, translation and scale changes, which are very common in vision problems. They further use a combination of weak classifiers (AdaBoost) to pick the important features out of the large number available. This makes me think about another issue: most learning-based classifiers that give equal weight to all features cannot by themselves select the best features to improve their speed or accuracy. An explicit cascade is required to get the best features available.
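
The integral image idea is simple enough to sketch in a few lines: store, for each pixel, the sum of everything above and to the left of it, after which the sum over any rectangle costs four lookups regardless of its size. The zero-padded layout, the exclusive bottom/right convention, the left-minus-right sign of the feature, and the function names are my own choices for this illustration.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; padded with a leading zero row/column."""
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four table lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window (a two-rectangle feature)."""
    half = width // 2
    left_sum = box_sum(ii, top, left, top + height, left + half)
    right_sum = box_sum(ii, top, left + half, top + height, left + 2 * half)
    return left_sum - right_sum
```

The table is built once per image in a single pass, so evaluating thousands of rectangle features per window stays cheap.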

Another problem, which is not specific to vision applications but inherent to machine learning techniques, is over-fitting, which can happen frequently. Because of the high dimensionality, it is not easy to deal with in vision problems; several experiments are needed to obtain good generalization.

In sum, I would say that although machine learning has greatly influenced vision applications through its power to learn model parameters automatically, some preprocessing of the data is always required to make it suitable for the chosen technique, along with some workaround for the issues mentioned above.