BMVA 
The British Machine Vision Association and Society for Pattern Recognition 

BibTeX entry

@PHDTHESIS{201512Xiatian_Zhu,
  AUTHOR={Xiatian Zhu},
  TITLE={Semantic Structure Discovery in Surveillance Videos},
  SCHOOL={Queen Mary, University of London},
  MONTH=Dec,
  YEAR=2015,
  URL={http://www.bmva.org/theses/2015/2015-zhu.pdf},
}

Abstract

Developing autonomous vision systems is essential for automatically processing and interpreting the enormous amount of video data generated by the rapid expansion of surveillance cameras. One generic mechanism for automated visual content analysis is to discover and understand the intrinsic, meaningful structures in the data. Nonetheless, semantic structure discovery for large-scale surveillance video data remains challenging due to inherent visual ambiguity and uncertainty, potentially unreliable high-dimensional feature representations containing noisy and irrelevant data, and large, unknown cross-camera variations in viewing conditions. This thesis proposes approaches to several critical video surveillance problems by deriving advanced machine learning algorithms that more accurately quantify and mine the underlying data structure semantics. More specifically, it develops new methods for addressing four different problems, as follows:

Chapter 3. The first problem is unsupervised visual data structure discovery, i.e. estimating the underlying data group memberships from visual observations. This is inherently challenging as visual signals are inevitably ambiguous and noisy, e.g. due to uncontrollable sources of variation such as illumination and background clutter, particularly so in typical surveillance videos. Moreover, visual features are often high-dimensional, with many, but unknown, less-reliable feature dimensions. To that end, this thesis proposes to identify and exploit discriminative features, rather than the whole feature space, when measuring pairwise relationships between noisy data samples, so as to accurately uncover the semantic data neighbourhood structures. Specifically, a random forest based data similarity inference framework is designed, characterised by accumulating weak and subtle similarity over informative feature subspaces. This method can be combined with a graph based clustering algorithm for clustering visual data.
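To give a rough flavour of this general idea (a minimal sketch, not the thesis's exact clustering forest formulation), the code below builds an unsupervised forest with scikit-learn's RandomTreesEmbedding, derives a pairwise affinity from how often two samples fall into the same leaf across trees (each tree contributing a weak similarity vote), and passes that affinity to a graph-based (spectral) clustering step. The feature dimensions, tree count and cluster count are illustrative assumptions.

# Sketch: forest-induced affinity + graph-based clustering (illustrative only,
# not the thesis's exact Clustering Random Forest algorithm).
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.cluster import SpectralClustering

def forest_affinity(X, n_trees=200, random_state=0):
    """Pairwise similarity = fraction of trees in which two samples share a leaf,
    accumulating weak, subtle similarity votes over the trees' feature subspaces."""
    forest = RandomTreesEmbedding(n_estimators=n_trees, random_state=random_state)
    leaves = forest.fit(X).apply(X)                 # (n_samples, n_trees) leaf indices
    n = X.shape[0]
    A = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        A += leaves[:, t][:, None] == leaves[:, t][None, :]
    return A / leaves.shape[1]

if __name__ == "__main__":
    X = np.random.rand(300, 50)                     # toy noisy high-dimensional features
    A = forest_affinity(X)
    labels = SpectralClustering(n_clusters=5, affinity="precomputed").fit_predict(A)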

Chapter 4. The second problem is semi-supervised visual data structure discovery, where pairwise constraints/relationships over data samples (i.e. must-link, cannot-link) are accessible. It is non-trivial to exploit pairwise constraints to help disclose meaningful data structure. This is because (1) often only sparse constraints are available, providing very limited information; and (2) constraints are not necessarily accurate, so misleading guidance may be imposed on the discovery process if they are all trusted blindly. In this thesis, a Constraint Propagation Clustering Random Forest model is formulated specifically to leverage sparse pairwise links for more reliably measuring similarities between data pairs, whether constrained a priori or not. Moreover, this semi-supervised model is characterised by favourable robustness against invalid pairwise constraints.
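To convey how sparse, possibly noisy constraints can be combined with a data-driven similarity (a simplified stand-in, not the Constraint Propagation Clustering Random Forest itself), the sketch below softly blends a forest-derived affinity with must-link/cannot-link evidence rather than trusting every constraint absolutely; the blending weight is a hypothetical parameter.

# Sketch: softly injecting sparse must-link / cannot-link constraints into an
# affinity matrix (simplified; not the thesis's constraint propagation model).
import numpy as np

def constrained_affinity(A, must_links, cannot_links, trust=0.7):
    """Blend the data-driven affinity A with constraint evidence.
    trust < 1 keeps some weight on the observed similarity, giving robustness
    when a supplied constraint is actually invalid."""
    A = A.copy()
    for i, j in must_links:                         # pull constrained pairs together
        A[i, j] = A[j, i] = (1 - trust) * A[i, j] + trust * 1.0
    for i, j in cannot_links:                       # push constrained pairs apart
        A[i, j] = A[j, i] = (1 - trust) * A[i, j] + trust * 0.0
    return A

# Usage (with forest_affinity from the earlier sketch):
#   A = forest_affinity(X)
#   A_c = constrained_affinity(A, must_links=[(0, 3)], cannot_links=[(0, 42)])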

Chapter 5. The third problem is multi-source video data structure discovery, significantly different from the single-source cases above. Specifically, semantic video structure analysis is investigated given heterogeneous visual and non-visual source data. It is inherently challenging to jointly learn from such multi-source data, which differ significantly in representation, scale and covariance, let alone when both the visual and the non-visual data can in isolation be inaccurate or incomplete. To overcome these challenges, this thesis formulates a Multi-Source Clustering Forest capable of correlating visual data and independent non-visual auxiliary information to better describe the underlying relationships among the data and thereby facilitate the revelation of video clusters. The discovered clusters can be exploited to precisely summarise subtle physical events in complex scenes.
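As a toy illustration of the multi-source setting (again a sketch under stated assumptions, not the Multi-Source Clustering Forest), one simple baseline is to compute a per-source affinity and fuse the affinities before a single clustering step; the per-source weights and matrix sizes below are hypothetical.

# Sketch: fusing per-source affinities (e.g. visual features plus non-visual
# auxiliary signals) before clustering; a simplified baseline, not the thesis's
# joint multi-source forest.
import numpy as np
from sklearn.cluster import SpectralClustering

def fuse_affinities(affinities, weights=None):
    """Weighted combination of per-source affinity matrices of identical shape.
    Sources may differ in representation and scale, so each affinity is computed
    per source (e.g. with forest_affinity above) and only fused here."""
    if weights is None:
        weights = [1.0 / len(affinities)] * len(affinities)
    return sum(w * A for w, A in zip(weights, affinities))

if __name__ == "__main__":
    n = 100                                          # toy symmetric affinities
    A_visual = np.random.rand(n, n); A_visual = (A_visual + A_visual.T) / 2
    A_nonvis = np.random.rand(n, n); A_nonvis = (A_nonvis + A_nonvis.T) / 2
    A_joint = fuse_affinities([A_visual, A_nonvis], weights=[0.6, 0.4])
    labels = SpectralClustering(n_clusters=4, affinity="precomputed").fit_predict(A_joint)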

Chapter 6. The last problem is to discover person identity structure distributed across non-overlapping camera views, also known as person re-identification (ReID). Visual data are drawn from multiple camera views, as opposed to the single-camera data involved in the three problems above. Therefore, visual ambiguity may be significant because of cross-view illumination variations, viewpoint differences, cluttered backgrounds and inter-object occlusions. Different from most existing appearance based models, wherein ReID is achieved by matching single or multiple person images, the proposed Discriminative Video Ranking method is unique in instead learning a robust space-time ReID model from person image sequences of arbitrary starting/ending frames, random length, and unknown background clutter and occlusion. Moreover, the joint learning of both spatial appearance and space-time features in this model demonstrates significant advantages over existing ReID methods.
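To make the sequence-matching setting concrete (a minimal sketch, not the Discriminative Video Ranking model), the code below maps variable-length image sequences to fixed-size space-time descriptors by temporal pooling of per-frame features, then ranks gallery sequences by distance to a probe. The per-frame feature extractor, descriptor dimensionality and Euclidean distance are placeholder assumptions.

# Sketch: ranking gallery sequences against a probe using pooled per-frame
# features; a minimal stand-in for video-based ReID matching, not the thesis's
# Discriminative Video Ranking learning procedure.
import numpy as np

def sequence_descriptor(frame_features):
    """Temporal pooling (mean + max over frames) maps a sequence of arbitrary
    length to one fixed-size space-time descriptor."""
    F = np.asarray(frame_features)                  # (n_frames, feat_dim)
    return np.concatenate([F.mean(axis=0), F.max(axis=0)])

def rank_gallery(probe_seq, gallery_seqs):
    """Return gallery indices sorted by ascending distance to the probe sequence."""
    p = sequence_descriptor(probe_seq)
    dists = [np.linalg.norm(p - sequence_descriptor(g)) for g in gallery_seqs]
    return np.argsort(dists)

if __name__ == "__main__":
    probe = np.random.rand(37, 128)                 # 37 frames of 128-D toy features
    gallery = [np.random.rand(np.random.randint(20, 60), 128) for _ in range(10)]
    print(rank_gallery(probe, gallery))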