BibTeX entry
@PHDTHESIS{200509Antonio_Micilotta,
AUTHOR={Antonio Micilotta},
TITLE={Detection and Tracking of Humans for Visual Interaction},
SCHOOL={University of Surrey},
MONTH=Sep,
YEAR=2005,
URL={http://www.bmva.org/theses/2005/2005-micilotta.pdf},
}
Abstract
This thesis contributes, in essence, four developments to the field of computer vision. The first two present independent methods of locating and tracking human body parts, where the main interest is not 3D biometric accuracy, but rather a sufficiently discriminatory representation for visual interaction. Using a single uncalibrated camera, the first algorithm employs background suppression and a general approximation to body shape, applied within a particle filter framework. To maintain real-time performance, integral images are used for rapid computation of particles.
The second method presents a probabilistic framework for assembling detected human body parts into a full 2D human configuration. The face, torso, legs and hands are detected in cluttered scenes using body part detectors trained by AdaBoost. Coarse heuristics are applied to eliminate obvious outliers, and body configurations are assembled from the remaining parts using RANSAC. An a priori mixture model of upper-body configurations is used to provide a pose likelihood for each configuration, after which a joint-likelihood model is determined by combining the pose, part detector and corresponding skin model likelihoods; the assembly with the highest likelihood is selected.
The third development is applied in conjunction with either of the aforementioned human body part detection and tracking techniques. Once the respective body parts have been located, the a priori mixture model of upper-body configurations is used to disambiguate the hands of the subject. Furthermore, the likely elbow positions are statistically estimated, thereby completing the upper body pose.
The final development presents a method of estimating the 3D pose of the upper human body from a single camera. A database consisting of a variety of human movements is constructed from human motion capture data. This motion capture data is then used to animate a generic 3D human model, which is rendered to produce a database of frontal view images. From this image database, three subsidiary databases consisting of hand positions, silhouettes and edge maps are extracted. The candidate image is then matched against these databases in real time. The index corresponding to the subsidiary database triplet that yields the highest matching score is used to extract the corresponding 3D configuration from the motion capture data. This motion capture frame is then used to extract the 3D positions of the hands for use in HCI, or to render a 3D model.
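The final lookup step described above (pick the database index whose hand position, silhouette and edge map triplet scores highest, then read the motion capture frame at that index) can be illustrated with a minimal Python sketch. The abstract does not say how the three scores are combined, so the plain sum used here is an assumption, and the function name `best_matching_frame`, the score lists and the `mocap_hands_3d` data are hypothetical placeholders rather than anything taken from the thesis.

```python
from typing import List, Sequence, Tuple


def best_matching_frame(
    hand_scores: Sequence[float],
    silhouette_scores: Sequence[float],
    edge_scores: Sequence[float],
) -> Tuple[int, float]:
    """Return (index, combined score) of the best-matching database entry.

    Each input holds one matching score per database entry, higher is better.
    ASSUMPTION: the three scores are combined by a plain sum; the thesis may
    weight or combine them differently.
    """
    n = len(hand_scores)
    if n == 0:
        raise ValueError("score lists must not be empty")
    if len(silhouette_scores) != n or len(edge_scores) != n:
        raise ValueError("all three score lists must have the same length")

    # One combined score per database entry (per triplet).
    combined: List[float] = [
        h + s + e
        for h, s, e in zip(hand_scores, silhouette_scores, edge_scores)
    ]
    # Index of the highest combined score; ties resolve to the first entry.
    best_index = max(range(n), key=combined.__getitem__)
    return best_index, combined[best_index]


# Hypothetical toy database with four entries. Each motion capture frame is
# reduced here to 3D positions of the left and right hand (made-up numbers).
mocap_hands_3d = [
    ((-0.30, 1.00, 0.10), (0.30, 1.00, 0.10)),
    ((-0.45, 1.40, 0.20), (0.45, 1.40, 0.20)),
    ((-0.20, 0.80, 0.30), (0.50, 1.20, 0.05)),
    ((-0.10, 0.90, 0.40), (0.10, 0.90, 0.40)),
]

# Made-up matching scores of one candidate image against each database entry.
index, score = best_matching_frame(
    hand_scores=[0.2, 0.9, 0.4, 0.1],
    silhouette_scores=[0.3, 0.8, 0.7, 0.2],
    edge_scores=[0.1, 0.7, 0.9, 0.3],
)
hands_3d = mocap_hands_3d[index]  # 3D hand positions for the matched frame
print(index, hands_3d)
```

Because the three subsidiary databases share one index with the motion capture data, a single index is enough to go from a 2D match to a 3D configuration. A real-time system would likely need a faster search than this linear scan over all entries.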