Robust Recursive Structure and Motion Recovery under Affine Projection

 

Miroslav Trajkovic and Mark Hedley

Department of Electrical Engineering

University of Sydney, NSW 2006, Australia

miroslav@ee.usyd.edu.au

 

Abstract

In this paper we present an algorithm for structure and motion (SM) recovery under affine projection from video sequences. The algorithm tracks the motion of a single structure, be it an object or the entire scene, allowing for any type of camera motion. It could be used, for example, to track the motion of a vehicle in a warehouse (single object, static camera) or for visual navigation from a moving platform (tracking the scene from a moving camera). The algorithm requires a set of features to be detected in each frame, and that at least four features are correctly matched between each three consecutive frames. Compared to previous algorithms it has a lower computational cost, making it attractive for real time applications. It also provides dynamic detection of outliers and allows previously lost features to reappear in the sequence.

 

1 Introduction

Structure and motion recovery is one of the fundamental problems in computer vision, owing to a large spectrum of applications ranging from mobile robot navigation to multimedia and various tracking problems.

One of the most widely used approaches to structure and motion (SM) recovery is the factorization algorithm proposed by Tomasi and Kanade [9]. Image feature points are tracked over several frames and the SM parameters are computed using a singular value decomposition (SVD). This algorithm cannot be used for real time applications, as it is a batch method and can only compute the parameters once all measurements have been made. The algorithm is also computationally very expensive (due to the use of the SVD). Morita and Kanade [6] developed a sequential version that has almost the same accuracy but a much lower computational burden. The issues not addressed in these papers were how to prevent outliers from biasing the result and how to handle a set of feature points that changes dynamically.

Held [1] partially extended the Morita and Kanade algorithm to include new features. The video sequence was divided into subsequences and the features tracked over each subsequence; tracking features across subsequences required common features between consecutive subsequences. The algorithm was verified using synthetic measurements only, with added Gaussian noise. McLauchlan et al. [3,4] used the variable state-dimension filter (VSDF) to handle the problem of missing and new features. They posed the structure from motion problem as a parameter estimation problem and solved it using the Extended Kalman Filter (EKF). They updated the structure and only the last motion estimate at each time step, and therefore achieved a low computational cost of O(k), compared to updating the complete motion matrix. The authors mentioned outlier rejection, but gave neither a method nor a reference for handling the problem. Another limitation of their algorithm is that features that have been obscured and reappear are treated as new feature points rather than being recognised as previously tracked features.

In this paper we present an algorithm that iteratively computes the structure and motion parameters at each time step. It has the same computational cost, O(n + k), as McLauchlan et al. [3,4], but it updates both the structure and the complete motion parameters, so it converges to the true solution. This is made possible by noting that the structure should remain constant over time, and hence we refine our estimate at each time step. Motivated by the work of Reid and Murray [7], we have extended their algorithm so that it can easily include new features and recover the complete structure and motion, provided that in each set of three consecutive frames there are at least four successfully matched feature points. This is a reasonable condition in practice and is almost always fulfilled when a good corner detector is used as the feature point detector. Further, we detect outliers, feature points that are inconsistent with the recovered motion model, at each time step, identifying them as points that have a large influence on the estimated parameters and their standard deviation. Finally, we have developed a procedure that checks whether newly appeared features are genuinely new, or old features that have reappeared.

2 Problem Statement

In this paper we assume an affine camera model which has the form:

p = MS+t

(1)

where S is the 3 × 1 world coordinate vector of a feature point, p is the 2 × 1 image projection of S, M is the 2 × 3 projection matrix and t is the 2 × 1 translation vector. This model is a generalisation of the orthographic camera model [6,9] and is a good approximation to perspective projection when the change in depth is small compared to the average distance from the optical centre, a condition that is almost always satisfied for independently moving objects. If the object of interest undergoes rigid motion, the image projections will not change if we fix the world coordinates of the object and change the camera parameters M, t accordingly. Hence, without loss of generality, we may assume that the object of interest is static and that the camera is moving.
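As a concrete illustration of equation (1), the following minimal NumPy sketch projects a world point under an affine camera; the numerical values of M, t and S are arbitrary placeholders rather than calibrated parameters.

```python
import numpy as np

# Affine camera model of equation (1): p = M S + t.
# The values below are illustrative placeholders, not calibrated parameters.
M = np.array([[0.8, 0.0, 0.1],
              [0.0, 0.8, 0.2]])    # 2 x 3 projection matrix
t = np.array([320.0, 240.0])       # 2 x 1 translation vector
S = np.array([1.0, 2.0, 5.0])      # 3 x 1 world point

p = M @ S + t                      # 2 x 1 image projection
print(p)                           # [321.3  242.6]
```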

Given k projections (images) of n scene points we want to recover the camera motion parameters M(j), t(j), j = 1, 2, …, k, and the structure parameters S_i, i = 1, 2, …, n (the world vectors of the tracked scene points). These parameters are related by the measurement equation:

p_i(j) = M(j) S_i + t(j)

(2)

where p_i(j) is the projection of the i-th point onto the j-th image, and i and j vary from 1…n and 1…k respectively. To compute the SM parameters using all the measurement equations, one must minimise the objective function:

C = Σ_{j=1}^{k} Σ_{i=1}^{n} || p_i(j) - M(j) S_i - t(j) ||^2

(3)

but for a real time implementation the parameters must be computed following the acquisition of each frame. It may be shown that the translational component t(j) at each frame is given by the centroid of the feature locations, and equation (2) can now be rewritten as q_i(j) = M(j) S_i, where q_i(j) = p_i(j) - t(j) are the registered measurements, or in matrix form:

W = MS

(4)

where W = [ q_i(j) ] is the 2k × n registered measurement matrix, M is the 2k × 3 motion matrix obtained by stacking the M(j), and S = [ S_1 … S_n ] is the 3 × n structure matrix; this is the equation obtained (although in a different way) by Tomasi and Kanade [9]. To solve it, they employed the singular value decomposition W = U Σ V^T of the measurement matrix and showed that M = U′ (Σ′)^{1/2} and S = (Σ′)^{1/2} (V′)^T, where U′, Σ′ and V′ are the submatrices of U, Σ and V corresponding to the three largest singular values of W. In addition, the motion and structure parameters can be determined separately without computing the full SVD, since

W W^T = U Σ^2 U^T and W^T W = V Σ^2 V^T.

(5)

Hence the structure and motion are given by the eigenvalue decompositions of W^T W and W W^T respectively.
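As an illustration of the factorisation just described, the following NumPy sketch builds a registered measurement matrix from synthetic affine cameras and recovers M and S from the three largest singular values; it shows the Tomasi and Kanade decomposition only, not the recursive algorithm of Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 8                          # n scene points, k frames
S_true = rng.normal(size=(3, n))      # true structure (3 x n)

# Stack two rows per frame to form the 2k x n measurement matrix.
rows = []
for j in range(k):
    M_j = rng.normal(size=(2, 3))     # synthetic affine camera M(j)
    t_j = rng.normal(size=(2, 1))     # synthetic translation t(j)
    rows.append(M_j @ S_true + t_j)
W = np.vstack(rows)

# Register: t(j) is the centroid of the feature locations in frame j.
W = W - W.mean(axis=1, keepdims=True)

# Rank-3 factorisation W = M S from the three largest singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
M = U[:, :3] * np.sqrt(s[:3])             # 2k x 3 motion matrix
S = np.sqrt(s[:3])[:, None] * Vt[:3]      # 3 x n structure matrix

print(np.linalg.norm(W - M @ S))          # ~0 for noise-free data
```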

Note that equation (4) does not have a unique solution. In fact, if A is an arbitrary invertible 3 × 3 matrix, the matrices MA and A^{-1}S are valid motion and structure matrices respectively [9]. To ensure a unique solution, we fix the first two rows of M, so that j in equation (3) will range from 2…k.

3 Complete algorithm

3.1 Recursive structure and motion recovery

Unlike Morita and Kanade, who computed motion and structure by iteratively updating equation (4), and McLauchlan, who applied the EKF to the measurement equation (2), we employ direct minimisation of the cost function (3). For some initial number of frames (usually 3) we initialise the motion and structure matrices by solving equation (4). After a new frame has been acquired, the cost function is iteratively minimised taking the previously computed structure and motion parameters as initial values. Here we describe three methods, the first two of which we developed.

First, let us suppose that the cost function C(k) has been minimised, i.e. that the values of M, t and S at time k are known. Our goal is to minimise the cost function at the next time step:

C(k+1) = Σ_{j=1}^{k+1} Σ_{i=1}^{n} λ_i(j) || p_i(j) - M(j) S_i - t(j) ||^2

(6)

where λ_i(j) is a binary weight which expresses the presence (λ_i(j) = 1) or absence (λ_i(j) = 0) of the i-th feature in the j-th frame. Since it does not affect the rest of the derivation it will be omitted for clarity. Equation (6) may be rewritten in a more concise form as:

C(k+1) = Σ_{j=1}^{k+1} Σ_{i=1}^{n} || v_i(j) ||^2 ,   v_i(j) = p_i(j) - M(j) S_i - t(j)

(7)

It will be seen later that it is helpful to group the motion parameters into a motion vector rather than using M and t separately, and we define the motion vector m(j) as the vector formed from the elements of M(j) and t(j). To minimise the cost function we differentiate (6) to obtain

 

(8)

where D_i and E(j) denote the coefficient matrices resulting from the differentiation.

There are three ways to solve this equation; we present each below and give a brief comparison.

1. Linearizing (7) around the previously estimated structure and motion, assuming that D_i and E(j) are constant. Equation (7) can then be written as a linear system of n + k + 1 equations

 

(9)

where

(10)

and this is the same set of equations as obtained in [4]. The authors have shown that this set can be solved in O(min(n^3, k^3)) computational time because of certain decomposable properties (otherwise the computational load would be O((n + k)^3)). We have investigated the convergence properties of this algorithm and have shown that it has a linear rate of convergence (the proof is omitted due to lack of space).

2. Linearizing equation (7) around the previously found structure and motion parameters. In this case we linearize not only v_i(j) but the whole expression. Interestingly, we obtain a set of linear equations of the same form as (8), the only difference being that the coefficient matrices acquire an additional correction term. Clearly, the computational cost remains the same (the cost of computing the correction term is negligible), but because of this correction the rate of convergence is quadratic.

3. Applying the method of simple iteration. In terms of m and S, h_i(j) can be written as

and thus equation (7) can be rewritten as

(11a)

(11b)

where A(j) and C_i are defined in (10). This method also has linear convergence, but its computational cost is O(n + k). We have also found that it converges more stably for a small number of features (5 to 15) than the previous two methods, and that in practice three to five iterations suffice.

We used the second method because it has quadratic convergence, and usually found three to five iterations to be sufficient. For the initial estimates of the parameters at each stage of the algorithm we used the parameters determined in the previous stage; a schematic implementation is sketched below.
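Since the explicit coefficient matrices of equations (8)-(11) are not reproduced here, the sketch below illustrates the general idea only: an alternating least-squares refinement of W ≈ MS on registered measurements, re-solving motion and structure in turn from the previous estimates. It is a stand-in for, not a transcription of, the update equations above, and the binary weights λ_i(j) for missing features are omitted for brevity.

```python
import numpy as np

def refine(W, M, S, n_iter=5):
    """Alternating least-squares refinement of W ~ M S.

    W: 2k x n registered measurement matrix
    M: 2k x 3 initial motion matrix, S: 3 x n initial structure matrix
    A sketch in the spirit of Section 3.1; in practice three to five
    iterations suffice. Weights for missing features are omitted.
    """
    for _ in range(n_iter):
        # Motion step: with S fixed, each frame's two rows of M solve a
        # small linear least-squares problem; done jointly via lstsq.
        M = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T
        # Structure step: with M fixed, each column of S is re-solved.
        S = np.linalg.lstsq(M, W, rcond=None)[0]
    return M, S
```

Seeding M and S with the estimates from the previous frame set reproduces the warm-start behaviour described above.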

3.2 Matching and outlier rejection

In order to obtain the measurement matrix we have to correctly match corners from C_1 to C_2, from C_2 to C_3, and so on, where C_1, C_2, …, C_n refer to the sets of corners in images I_1, I_2, …, I_n respectively. Usually, matching two sets of corners involves the use of cross correlation between small patches around each corner, as described elsewhere [10]. This procedure, however, will almost always produce some (usually small) number of incorrect matches (outliers), which can, and usually will, significantly bias the structure and motion estimates. It is therefore necessary to detect incorrect matches and disregard them when solving for structure and motion. For this task we have developed a two-step technique based on robust statistics, using the Least Median of Squares (LMedS) estimator. This works provided that fewer than 50% of the points are outliers, a reasonable assumption that is almost always fulfilled in practice. For more details on LMedS the interested reader is referred to [5].

1) Matching between two images.

Having matched the corners between two consecutive images, we randomly select four matched corners to obtain a 4 × 4 sample matrix whose rows are x_1, y_1, x_2 and y_2, where x_i is a vector containing the x coordinates of the four selected corners in image i, and y_i is likewise defined. We can decompose this matrix using the SVD as

W_S = U_S Σ_S V_S^T,

where M_S and t_S can be found using equation (5). We then find the structure vector S_R for the remaining set of matched corner points C_R by solving the equation

W_R = M_S S_R + t_S 1^T,

where W_R is the measurement matrix corresponding to C_R and 1 is a vector of ones. The position error of each point from the model (M_S, t_S) is given by

d_i = || p_i - M_S S_i - t_S ||.

The standard deviation of the error vector e = [ |d_i| ] is computed using the robust estimate [5]

σ = 1.4826 (1 + 5/(n - p)) sqrt( med_i d_i^2 )

where n is the number of points in C_R and p is the number of sample points (4 in our case). The number of samples of four matched corners which have to be taken to ensure that, with probability higher than γ, at least one sample consists only of inliers is given by

N = log(1 - γ) / log(1 - (1 - ε)^4)

where ε is the expected fraction of outliers in the data. We select the sample with the lowest standard deviation estimate and use the corresponding error vector e.

The next step is outlier rejection. For each point we determine the initial weight

w_i = 1 if |d_i| ≤ 2.5σ, and w_i = 0 otherwise.

We then compute the robust standard deviation estimate as

σ*^2 = Σ_i w_i d_i^2 / ( Σ_i w_i - p ).

Finally, outliers are found as the points whose residual error lies outside the confidence interval 2.5σ*.
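The two-image LMedS procedure above can be summarised schematically as follows; residual_fn is a hypothetical caller-supplied helper that fits the four-point affine model (M_S, t_S) and returns the residuals |d_i|, standing in for the SVD fit described in the text, and the constants follow the standard LMedS formulas [5].

```python
import numpy as np

def num_samples(gamma=0.99, eps=0.3, p=4):
    # Number of p-point samples so that, with probability > gamma,
    # at least one sample is outlier free (eps = outlier fraction).
    return int(np.ceil(np.log(1 - gamma) / np.log(1 - (1 - eps) ** p)))

def lmeds_reject(residual_fn, n_points, p=4, gamma=0.99, eps=0.3, seed=None):
    """Schematic LMedS outlier rejection for two-image matches.

    residual_fn(idx) fits the model to the p sampled correspondences and
    returns the n_points residuals |d_i| (a hypothetical helper).
    Returns a boolean inlier mask.
    """
    rng = np.random.default_rng(seed)
    best_sigma, best_d = np.inf, None
    for _ in range(num_samples(gamma, eps, p)):
        idx = rng.choice(n_points, size=p, replace=False)
        d = residual_fn(idx)
        # Preliminary robust scale from the median of squared residuals.
        sigma = 1.4826 * (1 + 5.0 / (n_points - p)) * np.sqrt(np.median(d**2))
        if sigma < best_sigma:
            best_sigma, best_d = sigma, d
    w = best_d <= 2.5 * best_sigma                 # initial weights w_i
    sigma_star = np.sqrt(np.sum(best_d[w] ** 2) / (w.sum() - p))
    return best_d <= 2.5 * sigma_star              # inliers survive the cut
```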

2) Matching between three images.

By matching feature points over three images we can make use of the rigidity constraint. For each set of three frames, the corners are matched between consecutive frames and the outliers rejected. Then, given the position of a feature in any two of the images and the motion parameters M and t, its position in the third image is uniquely determined. Hence, using this procedure, we can detect those outliers which are "correctly" matched over each pair of consecutive images (they satisfy the epipolar constraint) but do not satisfy the rigidity constraint over three frames.
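Assuming the motion parameters of the third frame have been estimated, the rigidity check amounts to thresholding the reprojection error of the structure recovered from the first two frames. A minimal sketch (the pixel tolerance is an assumed value, not from the paper):

```python
import numpy as np

def rigidity_outliers(p3, M3, t3, S, tol=3.0):
    """Flag matches violating the three-frame rigidity constraint.

    p3:     2 x n observed feature positions in the third frame
    M3, t3: motion parameters of that frame (2 x 3 matrix, 2-vector)
    S:      3 x n structure estimated from the first two frames
    tol:    pixel tolerance (assumed value)
    """
    pred = M3 @ S + t3[:, None]              # predicted projections
    err = np.linalg.norm(p3 - pred, axis=0)  # per-point reprojection error
    return err > tol                         # boolean outlier mask
```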

The whole corner matching procedure is now repeated to match the corners from C_2 to C_3.

3.3 Inclusion, Reappearance and Deletion of Features

New feature points will be included in the measurement matrix if they appear and are matched in three consecutive frames. Consider frames n-2, n-1, n and n+1, and let C_1 denote the set of features which are matched over frames n-2 to n, while C_2 denotes the set of features which have been matched only over frames n-1 to n+1. Let P be an arbitrary feature from C_2 and let p(n) denote its image projection in the n-th frame. Since the motion (and structure) parameters are estimated over the first n frames, the structure parameter S of this point is found by solving the set of equations

p(j) = M(j) S + t(j),   j = n-1, n.

Hence the initial parameters for the algorithm in Section 3.1 have been found.
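In code, this initialisation is a small stacked least-squares problem; a sketch under the notation above:

```python
import numpy as np

def init_structure(ps, Ms, ts):
    """Least-squares structure vector for a newly matched feature.

    ps: list of observed projections p(j) (2-vectors)
    Ms: corresponding motion matrices M(j) (2 x 3)
    ts: corresponding translations t(j) (2-vectors)
    Solves p(j) = M(j) S + t(j) for the single 3-vector S.
    """
    A = np.vstack(Ms)                                  # stack the 2 x 3 blocks
    b = np.concatenate([p - t for p, t in zip(ps, ts)])
    S, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S
```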

Before we add a point from C_2 to the measurement matrix and perform an update, we first check whether this point is genuinely new or whether it is an old feature that has reappeared. Let P_i be a previous feature that had disappeared and P_l be an arbitrary feature from C_2. We say that P_i and P_l represent the same feature if the world distance between them is small enough, i.e. if

|| S_i - S_l || < a

(12)

where a is an unknown threshold to be determined.

While we do not know what is small in the structure space (especially because the structure is unique only up to an affine transformation), we can say what is small in the image space (typically three to five pixels). If two features are close in structure space then the distance between their projections in the image plane must be small, but the reverse need not be true. We use the criterion that two points are the same if the distance between their image projections is less than a threshold b for all camera locations. Mathematically, this condition can be expressed as

|| p_i(j) - p_l(j) || < b,   j = 1, 2, …, k.

(13)

The difference between the image projections is given by Δp = M ΔS, where ΔS = S_i - S_l and the index j has been dropped for clarity. Applying the Euclidean norm to both sides we obtain ||Δp|| ≤ ||M||_S ||ΔS||, where the index S denotes the spectral norm, defined as the square root of the maximum eigenvalue of M^T M, i.e. ||M||_S = sqrt( λ_max(M^T M) ). Since the spectral norm is subordinate to the Euclidean norm, there exists at least one ΔS for which equality holds, and from equation (13) we get ||ΔS|| ≤ b / ||M||_S. If λ_j = ||M(j)||_S, then by combining (12) and (13) the threshold is a = b / λ, where λ is the largest value of λ_j.
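The resulting test is cheap to implement: compute λ as the largest spectral norm of the estimated M(j) and compare the structure-space distance against b/λ. A minimal sketch, with b playing the role of the three-to-five pixel image threshold:

```python
import numpy as np

def same_feature(S_i, S_l, Ms, b=4.0):
    """Reappearance test of Section 3.3.

    S_i, S_l: structure vectors of the old and the candidate feature
    Ms:       list of estimated 2 x 3 motion matrices M(j)
    b:        image-space threshold in pixels (assumed value)
    Accepts if ||S_i - S_l|| < a with a = b / lambda, lambda being the
    largest spectral norm ||M(j)||_S over all frames.
    """
    lam = max(np.linalg.norm(M, ord=2) for M in Ms)   # spectral norms
    return np.linalg.norm(S_i - S_l) < b / lam
```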

4 Experimental results

Our algorithm has been verified by testing on several video sequences.

POV image sequence

This sequence consists of 90 frames of an object rotating around a fixed axis (see Fig. 1a-c). The rotation is a constant four degrees per frame. Corners are detected and matched using an algorithm similar to [10]. This is a difficult sequence because no single corner appears in all the frames, and because the number of points is small, varying from 6 to 13 per frame. For this reason, neither the algorithm of Tomasi and Kanade [9] nor that of Morita and Kanade [6] is applicable, as the number of features is not constant. The VSDF algorithm may be used, but in its original form (without the reappearance of features) the number of features grows as old features reappear, leading to slower computation and greater errors in the calculated parameters.

 


Fig. 1: POV sequence: 1st (a), 30th (b) and 60th (c) image in the sequence of 90 images; (d) consistent features and recovered structure (the lines are shown for visualisation only).

Using our algorithm we obtained a total of 22 features; disregarding those that appeared in fewer than 20 frames left a total of 17 features, which is quite close to the 16 distinguishable corners on the object. The corners that have been effectively tracked over all the frames are shown in Fig. 1d. As can be seen, they are all close to the true locations, including those which are currently occluded or not detected. Some distortion is present due to the imperfection of the affine model, but the structure has been correctly recovered.

Hotel Sequence

This sequence consists of 197 points tracked over 181 frames. Corners were detected in the first image and their positions in subsequent images were determined using an optical flow technique [8] (see Fig. 2a, b). The SM parameters were calculated using both the SVD algorithm and our algorithm, which produced the same results, although the former took about an order of magnitude longer to compute. As in [4], we have found that updating the motion parameters over all previous frames at each step is unnecessary, but unlike them we performed multiple iterations for accuracy. A very important step here is outlier detection. Since we have matches over all frames, the rigidity constraint was checked not over the last three frames only, but over the whole sequence. The comparison of the mean squared error with and without outlier removal is given in Fig. 2c.

Fig. 2: 1st (a) and 181st (b) image from the hotel image sequence; c) plots of the mean square error (cost function) with (lower curve) and without (upper curve) outlier rejection.

5 Conclusion

This paper addresses practical problems in structure and motion recovery for real image sequences. The first problem is the speed required for real time operation; we presented a fast recursive algorithm for structure and motion parameter computation and have shown that it gives the same results as the Tomasi and Kanade method. Furthermore, we have developed a technique for detecting feature points inconsistent with the motion model; disregarding these enhances the performance of the algorithm. This paper is the first (to our knowledge) that deals with the problem of feature dropout and reappearance: we developed a procedure that allows previously lost features to reappear in the sequence and have experimentally verified its operation.

Acknowledgments

The hotel image sequence (along with corners and displacements) was provided by the Modeling by Videotaping group in the Robotics Institute, Carnegie Mellon University. Financial support was provided by the Australian Research Council.

References

[1] Held A., "Piecewise Shape Reconstruction by Incremental Factorization", Proc. 7th British Machine Vision Conference, vol. I, Edinburgh, UK, 1996, pp. 333-342.

[2] Koenderink J. J. and van Doorn A. J., "Affine structure from motion", J. Opt. Soc. Am. A, vol 8, pp. 377-385, 1991.

[3] McLauchlan P. et al., "Recursive affine structure and motion from image sequences", Proc. 3rd ECCV, vol. I, Stockholm, 1994, pp. 217-224.

[4] McLauchlan P. and Murray D., "A unifying framework for structure and motion recovery from image sequences", Proc. 5th ICCV, MIT, 1995, pp. 314-320.

[5] Meer P. et al., "Robust regression methods for computer vision: a review", Intl. Journal Computer Vision, vol. 6, pp. 59-70, 1991.

[6] Morita T. and Kanade T., "A sequential factorization method for recovering shape and motion from image streams", Proc. 1994 ARPA Image Understanding Workshop, vol. II, Monterey, CA, 1994, pp. 1177-1188.

[7] Reid I. D. and Murray D. W., "Active Tracking of Foveated Feature Clusters Using Affine Structure", Intl. Journal Computer Vision, vol. 18, pp. 41-60, 1996.

[8] Shi J. and Tomasi C., "Good Features to Track", Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 1994, pp. 593-600.

[9] Tomasi C. and Kanade T., "Shape and motion from image streams under orthography: A factorization approach", Intl. Journal Computer Vision, vol. 9, pp. 137-154, 1991.

[10] Trajković M. and Hedley M., "Fast feature detection and matching for machine vision", Proc. 7th British Machine Vision Conference, vol. I, Edinburgh, UK, 1996, pp. 91-100.