A system for detecting and tracking faces was previously described [8]. It combined motion detection by spatio-temporal filtering with an appearance-based face model in the form of a neural net. Multiple person tracking was performed using time-symmetric matching and Kalman filtering. In this section, the use of colour as a cue for detection and tracking is described. Colour provides a computationally efficient yet effective method which is robust under rotations in depth and partial occlusions. It can be combined with motion and appearance-based face detection.
Human skin forms a relatively tight cluster in colour space even when different races are considered [5]. Figure 3 shows the colour distribution of three faces in hue-saturation (H-S) space. Face colour distributions were modelled as Gaussian mixtures of the form:
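
$$p(\boldsymbol{\xi}) = \sum_{j=1}^{M} p(\boldsymbol{\xi} \mid j)\, P(j)$$

The mixing parameter $P(j)$ corresponds to the prior probability that the data, $\boldsymbol{\xi}$, was generated by component $j$. Each mixture component, $p(\boldsymbol{\xi} \mid j)$, is a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance matrix $\boldsymbol{\Sigma}_j$. Given $n$ face pixels, $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$, Expectation-Maximisation (EM) provides an effective maximum-likelihood algorithm for learning a Gaussian mixture model [9]. An expectation (E) step consists of evaluating the posterior probabilities $P(j \mid \boldsymbol{\xi}_i)$ for each mixture component. Let the sum of these probabilities be $S_j = \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)$. A maximisation (M) step then updates the mixture components as follows:

$$\boldsymbol{\mu}_j^{\mathrm{new}} = \frac{1}{S_j} \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)\, \boldsymbol{\xi}_i$$

$$\boldsymbol{\Sigma}_j^{\mathrm{new}} = \frac{1}{S_j} \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)\, (\boldsymbol{\xi}_i - \boldsymbol{\mu}_j^{\mathrm{new}})(\boldsymbol{\xi}_i - \boldsymbol{\mu}_j^{\mathrm{new}})^{T}$$

$$P(j)^{\mathrm{new}} = \frac{S_j}{n}$$
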
The E and M steps are iterated until convergence. If $M = 1$, the model is a single Gaussian and its parameters are estimated directly.
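
The E and M steps can be illustrated with a minimal sketch for two-dimensional (H-S) data. It is not the implementation used in the system described here: the function names, the random initialisation, the fixed iteration count (used in place of a convergence test) and the small regularisation term `reg` are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, cov):
    """Density of a 2-D Gaussian at the colour point x = (h, s)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def mixture_pdf(x, priors, means, covs):
    """Mixture density: sum over components of p(x | j) P(j)."""
    return sum(p * gaussian_pdf(x, mu, c) for p, mu, c in zip(priors, means, covs))


def em_step(data, priors, means, covs, reg=1e-6):
    """One E step followed by one M step; returns the updated parameters."""
    n, m = len(data), len(priors)
    # E step: posterior P(j | x_i) for every pixel and every component.
    post = []
    for x in data:
        joint = [priors[j] * gaussian_pdf(x, means[j], covs[j]) for j in range(m)]
        total = sum(joint) or 1e-300  # guard against numerical underflow
        post.append([w / total for w in joint])
    # M step: re-estimate each component from its posterior-weighted pixels.
    new_priors, new_means, new_covs = [], [], []
    for j in range(m):
        s_j = sum(post[i][j] for i in range(n))  # sum of posteriors for j
        mu = [sum(post[i][j] * data[i][k] for i in range(n)) / s_j for k in range(2)]
        cov = [[0.0, 0.0], [0.0, 0.0]]
        for i in range(n):
            diff = (data[i][0] - mu[0], data[i][1] - mu[1])
            for r in range(2):
                for c in range(2):
                    cov[r][c] += post[i][j] * diff[r] * diff[c] / s_j
        cov[0][0] += reg  # assumed regulariser, keeps the matrix invertible
        cov[1][1] += reg
        new_priors.append(s_j / n)
        new_means.append(mu)
        new_covs.append(cov)
    return new_priors, new_means, new_covs


def fit_mixture(data, m=2, iters=50, seed=0):
    """Fit an m-component mixture to a list of (h, s) points."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(data, m)]  # start at random pixels
    covs = [[[0.01, 0.0], [0.0, 0.01]] for _ in range(m)]
    priors = [1.0 / m] * m
    for _ in range(iters):
        priors, means, covs = em_step(data, priors, means, covs)
    return priors, means, covs
```

With `m=1` every posterior equals one, so a single call to `em_step` returns the sample mean and the maximum-likelihood covariance (plus the assumed `reg` term), which corresponds to the direct estimate mentioned above. The sketch treats hue as an ordinary linear coordinate and ignores its circular nature.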
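In practice, an H-S model of a single person functions well with other races. The mixture model is used to assign a probability to each pixel in an image and faces are detected by grouping suitably sized areas of high probability. A face is tracked by estimating the position as the mean $\mathbf{m}$ and the spatial extent as the vertical and horizontal standard deviations $\boldsymbol{\sigma}$ of the local colour probability distribution in the image plane. For a given frame $t$, the box position $\mathbf{m}_t$ is estimated as an offset from the position $\mathbf{m}_{t-1}$:

$$\mathbf{m}_t = \mathbf{m}_{t-1} + \frac{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})\, (\mathbf{x} - \mathbf{m}_{t-1})}{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})}$$

where $\mathbf{x}$ ranges over all image coordinates in the region of interest and $\boldsymbol{\xi}_{\mathbf{x}}$ is the colour point at image position $\mathbf{x}$. To improve accuracy, probabilities $p(\boldsymbol{\xi}_{\mathbf{x}})$ are thresholded. Values lower than the threshold are taken to be background and are consequently set to zero in order to nullify their influence on the estimation of $\mathbf{m}_t$ and $\boldsymbol{\sigma}_t$. The size of the bounding box is estimated by computing the standard deviation weighted by the pixel probabilities:

$$\boldsymbol{\sigma}_t = \sqrt{\frac{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})\, (\mathbf{x} - \mathbf{m}_t)^2}{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})}}$$

where the square and the square root are applied separately to the horizontal and vertical components.
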
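
The position and size estimates can be sketched as a single per-frame update on a map of pixel probabilities. This is an illustrative sketch rather than the system's actual code: the function name `track_step`, the rectangular region of interest given by `half_size`, and the choice to keep the previous position when no pixel exceeds the threshold are assumptions.

```python
import math


def track_step(prob, prev_mean, half_size, threshold=0.0):
    """Update the box centre and extent from a per-pixel probability map.

    prob      : 2-D list, prob[y][x] is the colour probability at pixel (x, y)
    prev_mean : (mx, my) box centre estimated in the previous frame
    half_size : (wx, wy) half-width and half-height of the region of interest
                (hypothetical parameter; the region is centred on prev_mean)
    threshold : probabilities below this value are treated as background
    """
    height, width = len(prob), len(prob[0])
    mx, my = prev_mean
    x0, x1 = max(0, int(mx - half_size[0])), min(width, int(mx + half_size[0]) + 1)
    y0, y1 = max(0, int(my - half_size[1])), min(height, int(my + half_size[1]) + 1)

    # Keep only pixels at or above the threshold, so that background
    # pixels contribute nothing to the estimates.
    weights = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            p = prob[y][x]
            if p >= threshold:
                weights.append((x, y, p))
    total = sum(p for _, _, p in weights)
    if total == 0.0:
        return prev_mean, None  # assumed behaviour: keep the old position

    # New mean = previous mean + probability-weighted average offset.
    new_mx = mx + sum(p * (x - mx) for x, _, p in weights) / total
    new_my = my + sum(p * (y - my) for _, y, p in weights) / total

    # Probability-weighted standard deviations give the box extent.
    sx = math.sqrt(sum(p * (x - new_mx) ** 2 for x, _, p in weights) / total)
    sy = math.sqrt(sum(p * (y - new_my) ** 2 for _, y, p in weights) / total)
    return (new_mx, new_my), (sx, sy)
```

A caller would typically derive the next frame's region of interest from the returned standard deviations, for example as a fixed multiple of them; that multiple is a tuning choice not specified here.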
Figure 2 shows a sequence of a face being tracked with a moving camera against a cluttered background. The sequence demonstrates the tracker's ability to deal with changes in scale, large rotations in depth and partial occlusion.
The colour-based tracking system has been implemented on a 200 MHz Pentium PC equipped with a Matrox Meteor frame grabber and a Sony EVI-D31 active camera. The camera can be driven so as to keep the mean position, $\mathbf{m}$, at the centre of the image. Tracking is performed at approximately 15 frames per second. Some problems are inevitably caused by large changes in the spectral composition of scene illumination. It has been found necessary to use at least two colour models, one for interior lighting and one for exterior natural daylight.
Shaogang Gong