BibTeX entry
@phdthesis{200608Josef_Sivic,
  author = {Josef Sivic},
  title  = {Efficient visual search of images and videos},
  school = {University of Oxford},
  month  = aug,
  year   = {2006},
  url    = {http://www.bmva.org/theses/2006/2006-sivic.pdf},
}
Abstract
This thesis investigates visual search of videos and image collections, where the query is specified by one or more images of the object. We study efficient retrieval of particular objects, human faces, and object classes.

Particular objects are represented by a set of viewpoint-invariant region descriptors, so that recognition can proceed successfully despite changes in viewpoint, illumination, and partial occlusion. Efficient retrieval is achieved by employing methods from statistical text retrieval, including inverted file systems and term and document frequency weightings. This requires a visual analogy of a word, a 'visual word', which is provided by vector quantizing the region descriptors.

We also develop a representation for 3D and deforming objects, suitable for retrieval, based on multiple exemplars naturally spanning (i) different visual aspects of a 3D object, thereby implicitly representing its 3D structure, or (ii) different appearances of a deforming object. Multiple-exemplar models are built automatically from real-world videos, using novel tracking and motion segmentation techniques.

For retrieval of faces of a particular person in video, we focus on close-to-frontal faces delivered by a face detector, and develop a specialized visual vocabulary for faces by vector quantizing the appearance of facial features. Faces in the video are associated into face sets by tracking, and the multiple-exemplar representation naturally models different appearances (such as closed and open eyes) within a set. This representation is also compact, enabling efficient retrieval.

To retrieve visual object classes, we build a new visual vocabulary of quantized local regions that is tolerant of some intra-class deformation. We employ a probabilistic latent variable model from statistical text analysis. In text, this model is used to discover topics in a corpus using the bag-of-words document representation; here, we treat object categories as topics.
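The text-retrieval machinery described above (vector quantization into visual words, tf-idf weighting, an inverted file) can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: plain k-means stands in for whatever quantizer was actually used, the descriptors are toy 2-D points rather than real region descriptors, and all function names are invented for this sketch.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Vector-quantize region descriptors into k 'visual words' (plain k-means)."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def quantize(descriptors, centroids):
    """Map each descriptor to the index of its nearest visual word."""
    dists = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
    return dists.argmin(axis=1)

def tfidf_index(docs_words, k):
    """Build tf-idf document vectors and an inverted file (word -> doc ids)."""
    n = len(docs_words)
    tf = np.zeros((n, k))
    for i, words in enumerate(docs_words):
        for w in words:
            tf[i, w] += 1
        tf[i] /= max(len(words), 1)          # term frequency
    df = (tf > 0).sum(axis=0)                # document frequency per word
    idf = np.log(n / np.maximum(df, 1))      # inverse document frequency
    inverted = {w: np.nonzero(tf[:, w])[0].tolist() for w in range(k)}
    return tf * idf, inverted
```

At query time one would quantize the query regions with `quantize`, then use the inverted file to touch only documents sharing at least one visual word with the query, which is what makes the retrieval efficient.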
We apply the probabilistic model in the visual domain, and show that models of visual object classes can be learnt from an unlabelled image collection without supervision. The learnt models are then applied to object class retrieval. We demonstrate rapid retrieval in entire feature-length movies despite a significant amount of background clutter, variations in camera viewpoint and lighting conditions, and partial occlusion.
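The probabilistic latent variable model referred to above can be illustrated with a bare-bones pLSA fitted by EM, where each image is a "document" over visual-word counts and each latent topic plays the role of an object category. This is a generic textbook sketch under that assumption, not the model configuration used in the thesis.

```python
import numpy as np

def plsa(counts, n_topics, iters=50, seed=0):
    """EM for pLSA on a (documents x words) count matrix.

    counts[d, w] = number of times visual word w occurs in image d.
    Returns P(w|z) with shape (topics, words) and P(z|d) with shape (docs, topics).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w), shape (docs, words, topics)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = counts[:, :, None] * resp
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(axis=1)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_z, p_z_d
```

Fitting this to an unlabelled collection and reading off the dominant topic per image is the unsupervised discovery step; ranking images by P(z|d) for a chosen topic gives a simple form of object class retrieval.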