BMVA 
The British Machine Vision Association and Society for Pattern Recognition 

BibTeX entry

@PHDTHESIS{200502Timothy_Roberts,
  AUTHOR={Timothy Roberts},
  TITLE={Efficient Human Pose Estimation from Real World Images},
  SCHOOL={University of Dundee},
  MONTH=Feb,
  YEAR=2005,
  URL={http://www.bmva.org/theses/2005/2005-roberts.pdf},
}

Abstract

Reliable, efficient human pose estimation from images is a precursor to many useful applications, including advanced human-computer interfaces, surveillance systems, image archive analysis and smart environments. Whilst progress has been made on human pose estimation, the research often makes strong assumptions about appearance. In particular, assumptions are often made regarding the background, the foreground, self-occlusion, occlusion by other objects and the number of viewpoints. These assumptions limit the application of computer-based human pose estimation systems. In contrast, the focus of this thesis is pose estimation from single real world images or monocular sequences of poorly constrained scenes. Furthermore, it aims to accomplish this efficiently. The body of the thesis is structured into three layers: formulation, likelihood and estimation.
First, the popular generative, part-based approach is extended to allow pose hypotheses that have different numbers of parts to be compared. This partial configuration formulation allows pose estimation in the presence of occlusion by other objects, enables efficient estimation and automatic (re)initialisation, and gives robustness to body parts with a non-contrasting or poorly modelled appearance. The problem of comparing partial configurations is stated as a Bayesian decision problem of discriminating between the class of people and the class of backgrounds. To describe body part shape, a probabilistic model is learnt from manually segmented and aligned training data of multiple subjects in various poses. In order to obtain a low dimensional model, variations due to intra-person differences and clothing, as well as difficult-to-observe degrees of freedom and differences between certain similar body parts, are marginalised over. The resulting model allows uncertainty in measurements to be quantified and improves estimation efficiency. Finally, a prior is developed to encode inter-part constraints, and it is shown that, due to these constraints, smaller configurations contain much of the information of larger configurations.
Although a strong likelihood model is critical in determining the success of human pose estimation, many existing models have limitations in terms of discrimination and efficiency when applied to real world images. Therefore, two novel techniques are developed to discriminate people with complex, textured appearance from cluttered backgrounds. A boundary model is developed based upon the divergence between the appearance distribution of the foreground region and that of its adjacent background. In particular, the distribution of the divergence between the joint colour histograms of these regions is learnt for correct and incorrect configurations. In order to provide a quantitative empirical evaluation, the statistics of intensity edges on and off human boundaries are also learnt. It is shown that the new boundary model is both more discriminatory and more searchable. This is particularly important as early identification of body parts focuses the estimation. Next, a model is proposed that encodes the spatial structure of human appearance. In particular, the statistics of the similarity between regions on the surface of correct and incorrect configurations are learnt. Encoding inter-part similarity is important in discriminating larger incorrect configurations and, due to the combinatorial growth in the number of large configurations, is key to efficient estimation in real world images. In addition to these likelihood models, a foreground model is developed that encodes the expectation of temporal consistency in appearance, for use in human tracking applications. It builds upon previous techniques by matching feature distributions and using clothing structure to improve estimation of the adapting foreground appearance.
Once the model and likelihood have been defined, pose estimation can be performed. Two approaches to pose estimation can be identified in the literature. The combinatorial approach identifies body part candidates in the image and then combines the results, for example using dynamic programming, to estimate the overall body pose. Whilst such methods are efficient, they rely heavily upon body part detection, which is particularly difficult in the presence of occlusion and clutter. In contrast, the full state space approach searches for whole body configurations and thereby models the complex self-occluding appearance. However, due to the high dimensional space, such methods use local rather than global sampling and require manual initialisation. By taking advantage of the partial configuration formulation and the strong likelihood model, a straightforward deterministic search algorithm is able to recover many of the body parts, and results of applying such a search to challenging scenes are presented.
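
As a rough illustration of the boundary cue described in the abstract, the sketch below compares the joint colour histogram of a candidate foreground region with that of its adjacent background, then scores the resulting divergence against distributions learnt for correct and incorrect configurations. It is not the thesis implementation. The abstract does not say which divergence is used, so the Jensen-Shannon divergence here is an assumption, and the function names, the bin count and the toy values in the usage note are all hypothetical.

```python
import math
from collections import Counter


def joint_colour_histogram(pixels, bins_per_channel=4):
    """Normalised joint RGB histogram of (r, g, b) tuples with values in 0..255.

    The bin count is an illustrative choice, not a value from the thesis.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {colour_bin: n / total for colour_bin, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two sparse histograms.

    Assumed divergence measure: 0 for identical histograms, 1 for histograms
    with no bins in common.
    """
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def log_likelihood_ratio(divergence, correct_hist, incorrect_hist, eps=1e-6):
    """Log ratio of p(divergence | correct) to p(divergence | incorrect).

    Both arguments are equal-length lists of bin probabilities over [0, 1],
    standing in for distributions learnt from labelled training data.
    """
    n_bins = len(correct_hist)
    index = min(int(divergence * n_bins), n_bins - 1)
    return math.log((correct_hist[index] + eps) / (incorrect_hist[index] + eps))
```

For example, with placeholder distributions `correct = [0.1, 0.9]` and `incorrect = [0.9, 0.1]` (invented for illustration only), a region whose colours differ strongly from its surroundings yields a positive log ratio, favouring the hypothesis that the boundary belongs to a correctly placed body part.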