Friday, February 5, 2010

Shape Context Vs FORMS

References:

[1] Serge Belongie, Jitendra Malik and Jan Puzicha. Shape Matching and Object Recognition Using Shape Contexts. PAMI, 24(4):509-522, April 2002.

[2] Zhu and Yuille. FORMS: A Flexible Object Recognition and Modelling System.

D’Arcy Thompson’s vision of modeling the shapes of objects and comparing related forms in the precise language of mathematics can be seen, more or less, in both papers. While the “Shape Context” (SC) approach defines a shape as a discrete set of points sampled from the internal and external contours of the object, FORMS argues that the shapes of objects are hierarchical in nature and can be modeled at three levels of granularity.

One of the first things to note is that we are interested in the “comparison of related forms”, so the above vision does not seem unrealistic: even in mathematics we compare similar families of curves, so there is an analogy. The only difference (and hence the problem) is that while curves (or existing families of basic shapes such as circles and ellipses) have well-defined equations and parameters that determine their properties, the contours of objects in nature have no well-defined local structure. The good thing is that, globally, the shapes of related forms do look similar and hence can be transformed into each other.

Any model that relaxes the “rigidity” imposed by “very well-defined equations of exact shapes” and allows small local deformations will have better matching capability. FORMS does this by first decomposing objects into primitives and allowing deformations at that level (bottom-up). But in my opinion SC will be more robust to variation across related forms, or to deformation, because it allows this flexibility at the point level, which is the most basic level possible. Also, the number of descriptors is much higher for SC than for FORMS, so its chances of working under occlusion are good. The price is that matching will be more time-consuming in SC.
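
To make the point-level description concrete, here is a minimal sketch of a shape context descriptor in the spirit of [1]: for each sampled point, a log-polar histogram of where the remaining points lie, compared between two points with a chi-squared cost. The function names, the bin edges (r_min, r_max) and the pure-Python layout are my own assumptions for illustration, not the authors' code, and the final step of finding point correspondences is left out.

```python
import math

def mean_pairwise_distance(points):
    """Average distance over all point pairs, used to normalise for scale."""
    n = len(points)
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def shape_context(points, index, scale, n_r=5, n_theta=12,
                  r_min=0.125, r_max=2.0):
    """Normalised log-polar histogram of all other points seen from points[index].

    n_r, n_theta, r_min and r_max are illustrative choices (assumptions).
    """
    hist = [0.0] * (n_r * n_theta)
    px, py = points[index]
    log_span = math.log(r_max / r_min)
    for j, (x, y) in enumerate(points):
        if j == index:
            continue
        r = math.hypot(x - px, y - py) / scale
        if r == 0:
            continue  # skip coincident points
        r = min(max(r, r_min), r_max)  # clamp into the histogram range
        r_bin = min(int(n_r * math.log(r / r_min) / log_span), n_r - 1)
        theta = math.atan2(y - py, x - px) % (2 * math.pi)
        t_bin = min(int(n_theta * theta / (2 * math.pi)), n_theta - 1)
        hist[r_bin * n_theta + t_bin] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def shape_contexts(points):
    """One descriptor per sampled point, so n points give n histograms."""
    scale = mean_pairwise_distance(points)
    return [shape_context(points, i, scale) for i in range(len(points))]

def chi2_cost(g, h):
    """Chi-squared distance between two normalised histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(g, h) if a + b > 0)
```

Because distances are divided by the mean pairwise distance and only relative positions are used, a translated and uniformly scaled copy of a point set gets the same descriptors. Comparing every point of one shape against every point of the other takes a quadratic number of chi2_cost calls, which is where the extra matching time comes from.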

It initially appealed to me that SC would be more suitable for modeling the shapes of plant leaves and FORMS for matching animate objects (animals etc.), because in my opinion the primitives considered for leaves may be very coarse, so finer variations along the periphery may not be captured properly. Comparison of plant leaves (or, in general, shapes with minute variations) may be better handled by SC. The notion of primitives for animate objects makes more sense to me, and even if we neglect the slight variations, matching will still be consistent.

Clothes or external occlusion may also be a problem for both methods, but I would expect SC to perform better. It seems we need a different approach to model foldable parts of animate objects: the model should capture the continuous trace of folding, or at least be aware of this phenomenon and account for it while matching, because an image may contain any state of that part. Images of foldable objects should include inner edges, otherwise they will be difficult to match. Both methods are applicable only to objects that have well-represented 2-D silhouettes. Also, different views can generate very different 2-D silhouettes, so data for different views should be available.

In sum, both approaches seem to work really well in some specific scenarios (handwritten digits for SC, well-defined 2-D silhouettes of animate objects for FORMS) but may fail in different settings. Both methods have their own limitations, and hence I would consider the problem of modeling shapes “perfectly” (as is usually done in the mathematical sciences, but here in an approximate sense) and matching them to be still open.

Conditional Independence Assumptions - How far can we take them?

References:

[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model,” CVPR ’92, pages 379-385.

[2] Crandall, Felzenszwalb and Huttenlocher. Object Recognition by Combining Appearance and Geometry.


The paper by Crandall et al. [2] proposes a family of models called k-fans for part-based object recognition. They discuss the trade-offs that arise from conditional independence assumptions on the parts of an object. At one extreme are models that capture the spatial dependencies between all pairs of parts, but accurate detection and localization then become computationally intractable and must rely on search heuristics. At the other extreme are models that assume the parts are independent of each other, which makes detection and localization much easier. While such a model yields computationally tractable recognition and learning procedures, it cannot accurately represent multi-part objects since it captures no relative spatial information. The authors resort to a model between the two extremes, defined by making certain conditional independence assumptions.
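
The two ends of this trade-off can be sketched with a toy example for k = 0 (fully independent parts) and k = 1 (one reference part, with every other part conditionally independent given its location). Everything below is an illustrative assumption of mine rather than the method of [2]: the function names, the dictionary of per-location appearance log-likelihoods, the isotropic Gaussian over offsets from the reference part, and the choice to maximize over part locations.

```python
import math

def log_gaussian(dx, dy, sigma):
    """Log density of an isotropic 2-D Gaussian at offset (dx, dy)."""
    return (-(dx * dx + dy * dy) / (2 * sigma * sigma)
            - math.log(2 * math.pi * sigma * sigma))

def best_zero_fan(appearance):
    """k = 0: every part picks its best location on its own."""
    return {part: max(scores, key=scores.get)
            for part, scores in appearance.items()}

def best_one_fan(appearance, reference, offsets, sigma=1.0):
    """k = 1: parts are placed independently given the reference location.

    appearance: {part: {(x, y): appearance log-likelihood}}
    offsets:    {part: (dx, dy)} expected offset from the reference part
    """
    best_score, best_config = -math.inf, None
    for ref_loc, ref_score in appearance[reference].items():
        score = ref_score
        config = {reference: ref_loc}
        for part, (ox, oy) in offsets.items():
            # With the reference fixed, each part can be optimised separately.
            part_score, part_loc = max(
                (s + log_gaussian(x - ref_loc[0] - ox,
                                  y - ref_loc[1] - oy, sigma), (x, y))
                for (x, y), s in appearance[part].items())
            score += part_score
            config[part] = part_loc
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Hypothetical appearance scores for a two-part "motorbike".
appearance = {
    "body": {(0, 0): -1.0, (10, 0): -1.2},
    "wheel": {(2, 0): -1.0, (12, 0): -0.5, (5, 5): -0.4},
}
independent = best_zero_fan(appearance)
score, with_geometry = best_one_fan(appearance, "body", {"wheel": (2, 0)})
```

In this toy input the independent model picks the individually strongest body and wheel even though they are not where a wheel should sit relative to the body, while the 1-fan trades a little appearance score for a geometrically consistent pair. The nested loop also shows the cost: with one reference part, the work grows with the product of the reference locations and the candidate locations of each other part.
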
The other paper [1] also relies on conditional independence: it uses an HMM-based model for human action recognition.
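
As a rough sketch of how an HMM uses this assumption for recognition: each frame is reduced to a discrete symbol, one HMM is kept per action, and a sequence is assigned to the action whose model gives it the highest likelihood, computed with the forward algorithm. The two toy models, the symbol alphabet and the function names below are assumptions made up for illustration, not the trained models or features of [1].

```python
def sequence_likelihood(observations, start, trans, emit):
    """Forward algorithm: P(observations | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    for symbol in observations[1:]:
        # The next state depends only on the current state, and the symbol
        # depends only on the state that emits it.
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
                 for s in range(n)]
    return sum(alpha)

def classify(observations, models):
    """Pick the action whose HMM assigns the sequence the highest likelihood."""
    return max(models,
               key=lambda name: sequence_likelihood(observations, *models[name]))

# Two hypothetical left-to-right models over a two-symbol alphabet.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "action_a": (start, trans, [[0.9, 0.1], [0.1, 0.9]]),
    "action_b": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),
}
label = classify([0, 0, 1, 1], models)
```

For long sequences the products in alpha underflow, so a practical version would rescale alpha at each step or work in log space.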


But in my opinion, even with this intermediate model, not all categories of objects can be recognized. If the object parts have a large degree of freedom with respect to each other, they will still be difficult to localize. The authors propose that reference parts with spatial priors can capture the geometric relationships, while non-reference parts with conditional independence assumptions make the model more tractable for search. The first problem I see is that learning such a model will be difficult: the search space for the maximum likelihood model can become large for high values of k. Also, finding the optimal value of k itself seems a difficult task to me for some objects. They show results on the motorbike and airplane datasets, where the object parts are fixed; there are no results for other categories where the parts are not fixed. For human action recognition, the conditional independence assumptions between time-sequential images raise similar issues of learning and accurate localization. For actions in which the time-sequential frames have weaker dependencies, this model will not work.

In general I think conditional independence assumptions, and models based on them, are appropriate for some kinds of problems (speech recognition, online handwriting recognition), but how valid the assumption is, and to what extent, depends on the nature of the problem. Such models can work in some scenarios and fail completely in others. The assumption should be evaluated properly before being applied to any recognition problem. For example, modeling spatial/temporal dependencies between object parts in this way may not be correct for every object/action recognition task.