BibTeX entry
@phdthesis{200512Robert_Fergus,
  author = {Robert Fergus},
  title  = {Visual Object Category Recognition},
  school = {University of Oxford},
  month  = dec,
  year   = {2005},
  url    = {http://www.bmva.org/theses/2005/2005-fergus.pdf},
}
Abstract
We investigate two generative probabilistic models for category-level object recognition. Both schemes are designed to learn categories with a minimum of supervision, requiring only a set of images known to contain the target category from a similar viewpoint. In both methods, learning is translation- and scale-invariant, does not require alignment or correspondence between the training images, and is robust to clutter and occlusion. The schemes are also robust to heavy contamination of the training set with unrelated images, enabling them to learn directly from the output of Internet image search engines.
In the first approach, category models are probabilistic constellations of parts, and their parameters are estimated by maximizing the likelihood of the training data. The appearance of the parts, as well as their mutual position, relative scale and probability of detection, are explicitly represented. Recognition takes place in two stages. First, a feature-finder identifies promising locations for the model’s parts. Second, the likelihood that the observed features were generated by the category model is compared with the likelihood that they were generated by background clutter.
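To make the two-stage recognition decision concrete, the Python sketch below scores a set of detected feature descriptors by a likelihood ratio against a background model. It is an illustrative assumption rather than the thesis's implementation: it uses isotropic Gaussian part appearances and a single broad background Gaussian, scores each feature independently, and omits the joint shape, relative-scale and occlusion terms of the full constellation model.

import numpy as np

# Drastically simplified, hypothetical sketch of the likelihood-ratio decision
# described above. The full constellation model sums over joint part-assignment
# hypotheses and includes shape, relative-scale and occlusion terms; here each
# detected feature is scored independently against Gaussian part appearances
# versus a single broad background Gaussian, purely to illustrate the ratio test.

def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian with scalar variance."""
    d = x.shape[-1]
    return -0.5 * (np.sum((x - mean) ** 2, axis=-1) / var + d * np.log(2 * np.pi * var))

def log_likelihood_ratio(descriptors, part_means, part_var, bg_var):
    """Sum over features of log [ max_part p(f | part) / p(f | background) ]."""
    total = 0.0
    for f in descriptors:
        best_part = max(log_gauss(f, m, part_var) for m in part_means)
        total += best_part - log_gauss(f, np.zeros_like(f), bg_var)
    return total

# Toy usage: accept the category when the log likelihood ratio exceeds a threshold.
rng = np.random.default_rng(0)
part_means = rng.normal(size=(5, 16))                      # 5 parts in a 16-D appearance space
descriptors = part_means + 0.1 * rng.normal(size=(5, 16))  # features near the part appearances
print(log_likelihood_ratio(descriptors, part_means, part_var=0.1, bg_var=10.0) > 0.0)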
The second approach is a visual adaptation of the “bag of words” models used to extract topics from a text corpus. We extend the approach to incorporate spatial information in a scale- and translation-invariant manner. The model represents each image as a joint histogram of visual word occurrences and their locations, relative to a latent reference frame of the object(s). The model is entirely discrete, making no parametric assumptions such as uni-modality. The parameters of the multi-component model are estimated in a maximum-likelihood fashion over the training data. In recognition, the relative weighting of the different model components is computed along with the reference frame of highest likelihood, enabling object instances to be localized.
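The sketch below illustrates, under assumed simplifications, how candidate reference frames could be scored with a joint histogram over (visual word, quantized relative location) and the most likely frame kept for localization. The table p_word_loc, the coarse 4x4 location grid and all names are hypothetical placeholders standing in for the learned model, not the thesis's actual formulation.

import numpy as np

# Hypothetical sketch of scoring candidate object reference frames with a joint
# histogram over (visual word, quantized relative location), in the spirit of the
# spatial bag-of-words model described above. The table p_word_loc would be
# learned from training data; here it is random, and the 4x4 location grid is an
# assumption made purely for illustration.

def quantize_location(xy, frame_centre, frame_scale, bins=4):
    """Map an image location into a coarse grid relative to the reference frame."""
    rel = (np.asarray(xy, dtype=float) - frame_centre) / frame_scale  # translation/scale invariance
    cell = np.clip(((rel + 1.0) / 2.0 * bins).astype(int), 0, bins - 1)
    return cell[0] * bins + cell[1]  # flatten the 2-D cell to a single bin index

def frame_log_likelihood(words, locations, frame_centre, frame_scale, p_word_loc):
    """Sum of log p(word, relative location) over all features, for one candidate frame."""
    ll = 0.0
    for w, xy in zip(words, locations):
        ll += np.log(p_word_loc[w, quantize_location(xy, frame_centre, frame_scale)] + 1e-12)
    return ll

# Localization: evaluate a small grid of candidate frames and keep the most likely one.
rng = np.random.default_rng(0)
vocab, loc_bins = 100, 16
p_word_loc = rng.dirichlet(np.ones(vocab * loc_bins)).reshape(vocab, loc_bins)
words = rng.integers(0, vocab, size=50)
locations = rng.uniform(0, 256, size=(50, 2))
candidates = [(np.array([x, y], dtype=float), 64.0) for x in (64, 128, 192) for y in (64, 128, 192)]
best = max(candidates, key=lambda c: frame_log_likelihood(words, locations, c[0], c[1], p_word_loc))
print("most likely frame centre:", best[0])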
The flexible nature of both models is demonstrated by experiments on 28 datasets containing 12 diverse object categories, including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals). Together, the datasets provide a thorough evaluation of both methods in classification, categorization, localization and learning from contaminated data.