BibTeX entry
@PHDTHESIS{200607Neil_Robertson,
AUTHOR={Neil Robertson},
TITLE={Automatic Causal Reasoning for Video Surveillance},
SCHOOL={University of Oxford},
MONTH=jul,
YEAR=2006,
URL={http://www.bmva.org/theses/2006/2006-robertson.pdf},
}
Abstract
This thesis is concerned with producing high-level descriptions and explanations of human activity in video from a single, static camera. The scenarios we focus on in this work are urban surveillance and sports video where the person is in the medium scale, around 150 pixels high. The final output is in the form of text descriptions which not only describe what is happening but also explain the interactions which take place. In order to achieve this goal, some significant issues pertinent to action recognition and human behaviour estimation have been addressed. In particular, we have developed novel solutions for estimating where an imaged person is looking even when the face image is low-resolution. We have extended the Bayesian fusion techniques used to solve the gaze recognition problem to activity recognition in general. By computing non-static descriptors based on instantaneous target motion and combining them with position and velocity via an efficient non-parametric database search, we compute distributions over spatio-temporal actions. Probabilistic distributions over behaviour are further estimated from a set of Hidden Markov Models which encode stochastic sequences of actions. Automatic commentaries of most likely action sequences and/or higher-level behaviour at a human-readable level can be derived by computing the Maximum Likelihood or Maximum a Posteriori estimate at any time step, respectively. In the latter case we use domain knowledge as a smoothing prior to refine the estimates. Finally, we draw these components together to achieve the main objective of this thesis: causal reasoning in video. Using an extensible, rule-based architecture we compute explanations of observed activity. The input to this reasoning process is the information obtained at the action/behaviour recognition stage, which represents an abstraction from the image data. 
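The recognition stage described above can be sketched as follows: the Viterbi algorithm yields the most likely (ML) action sequence for the commentary, while the forward algorithm combined with a domain-knowledge prior yields the MAP behaviour estimate. This is a minimal toy sketch, not the thesis's implementation; the number of actions, the transition matrices, and the behaviour models are invented for illustration.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely hidden action sequence (ML commentary).

    log_emit has shape (T, S): per-frame log-likelihoods of each
    of the S spatio-temporal actions, e.g. from a non-parametric
    database search over motion descriptors."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace back the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def forward_loglik(log_init, log_trans, log_emit):
    """log p(observations | behaviour HMM), via the forward pass."""
    alpha = log_init + log_emit[0]
    for t in range(1, len(log_emit)):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans,
                                    axis=0) + log_emit[t]
    return float(np.logaddexp.reduce(alpha))

def map_behaviour(models, log_emit, log_prior):
    """MAP behaviour: per-model HMM likelihood plus a smoothing prior
    encoding domain knowledge."""
    scores = [forward_loglik(li, lt, log_emit) + lp
              for (li, lt), lp in zip(models, log_prior)]
    return int(np.argmax(scores))
```

With three toy actions, a "sticky" behaviour model (self-transition 0.8) and a uniform one, emissions favouring the sequence 0, 0, 1, 1 make Viterbi recover exactly that path, and the sticky model wins the MAP comparison under a uniform prior.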
Best explanations of global scene activity, particularly where interesting events have occurred, are thus produced.
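The extensible, rule-based reasoning step can be illustrated as pattern matching over the recognised action stream: each rule pairs an action pattern with an explanation template. The rules, event names, and agent labels below are invented for illustration and are not the thesis's actual rule set.

```python
# Each rule: (pattern over recognised actions, explanation template).
# Adding a rule extends the reasoner without changing the matcher.
RULES = [
    (("approach", "collide", "fall"),
     "{b} fell because {a} collided with {b}"),
    (("run", "chase"),
     "{a} is running because {b} is chasing {a}"),
]

def explain(events, a="person A", b="person B"):
    """Return explanations whose pattern occurs as a contiguous
    subsequence of the recognised event stream."""
    out = []
    for pattern, template in RULES:
        n = len(pattern)
        for i in range(len(events) - n + 1):
            if tuple(events[i:i + n]) == pattern:
                out.append(template.format(a=a, b=b))
    return out
```

The input here stands in for the abstraction produced by the action/behaviour recognition stage; the matcher turns it into human-readable causal explanations.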