Before two images can be mosaiced, it is necessary to find the transformation which aligns one image with the other. This is the well known ``motion estimation'' or ``registration'' problem [2]. Techniques for solving this problem can be divided into two broad categories [2,7]: feature-based and featureless techniques. When the motion between the two images is small, for instance when the two images are from successive frames in a video sequence, the motion parameters can be estimated using optical flow [7]. When the motion is not small, as in this application, the motion could be estimated using generalised cross-correlation [10]. However, this requires a computationally expensive combination of coarse-to-fine and iterative gradient-based search [7]. The final option is to use a feature-based approach [4], estimating the motion from point correspondences. Such techniques are often criticized on the grounds that it is difficult to find the features automatically, or that the features are sensitive to noise and occlusion [7]. In this particular application, however, there is a plentiful supply of stable, detectable features. We therefore adopt a feature-based approach.
Instead of using generic ``corner'' features, we exploit knowledge of the domain to extract a more organised set of features. Each image is automatically segmented into a hierarchy of columns, lines and words: it is relatively straightforward to match such organised sets of features across images. The algorithm for producing the hierarchical segmentation is summarised below.
The first step is to estimate the angle that the rows of text make with the image raster lines (the skew angle), which is assumed to lie within a small range. To achieve this, a small patch of text is selected randomly, and then rotated in small steps over this range until the variance of the pixel intensities summed along raster lines is maximised [1] -- see Figure 2.
The exhaustive search is not computationally expensive since small rotations can be well approximated by vertical shears, which are fast to
compute. To ensure that an accurate skew angle is found, the calculation
is performed at several image patches, and the final estimate is the
average of the individual angles weighted by the variance of the pixel
intensities at each patch. All subsequent feature detection operations
are performed on the de-skewed images.
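As an illustration, the search might be implemented along the following lines in Python/NumPy. The patch size, number of patches, angular range and step size are illustrative assumptions rather than values from the paper, and the small rotations are approximated by per-column vertical shifts (a vertical shear):

```python
import numpy as np

def row_sum_variance(patch, angle_rad):
    """Approximate a small rotation by a vertical shear and return the
    variance of the pixel intensities summed along raster lines."""
    h, w = patch.shape
    sheared = np.empty_like(patch)
    for c in range(w):
        # np.roll wraps pixels around the patch edge; acceptable for the
        # small shifts involved in this sketch.
        sheared[:, c] = np.roll(patch[:, c], int(round(c * np.tan(angle_rad))))
    return float(np.var(sheared.sum(axis=1)))

def estimate_patch_skew(patch, max_deg=5.0, step_deg=0.1):
    """Exhaustive search for the angle maximising the row-sum variance."""
    angles = np.deg2rad(np.arange(-max_deg, max_deg + step_deg, step_deg))
    variances = np.array([row_sum_variance(patch, a) for a in angles])
    best = int(np.argmax(variances))
    return angles[best], variances[best]   # best angle and its variance

def estimate_skew(image, patch_size=128, n_patches=8, rng=None):
    """Average the per-patch estimates, weighted by the row-sum variance
    achieved at each patch."""
    if rng is None:
        rng = np.random.default_rng()
    angles, weights = [], []
    for _ in range(n_patches):
        r = rng.integers(0, image.shape[0] - patch_size)
        c = rng.integers(0, image.shape[1] - patch_size)
        a, v = estimate_patch_skew(image[r:r + patch_size, c:c + patch_size])
        angles.append(a)
        weights.append(v)
    return float(np.average(angles, weights=weights))
```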
Figure 2:
Finding the skew angle.
The text is rotated from its original orientation (a) until the variance
of the grey levels summed along the image raster lines is maximized (b).
(c) shows the results on a real image, where the central portion of text
has been de-skewed.
De-skewed document images can be reliably and intuitively segmented into a hierarchy of columns, lines and words. To remove any sensitivity to illumination and page coloration, we work on binary gradient images. These are obtained by applying a Sobel operator to the de-skewed greyscale image and thresholding the output [11] -- see Figure 3.
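A minimal sketch of this step, assuming the gradient magnitude is thresholded at a fixed value (the threshold here is an illustrative assumption, not a value from the paper):

```python
import numpy as np
from scipy import ndimage

def binary_gradient(grey, threshold=50.0):
    """Binary gradient image: Sobel gradient magnitude, thresholded."""
    g = grey.astype(float)
    gx = ndimage.sobel(g, axis=1)    # horizontal derivative
    gy = ndimage.sobel(g, axis=0)    # vertical derivative
    magnitude = np.hypot(gx, gy)     # combined absolute gradient response
    return (magnitude > threshold).astype(np.uint8)
```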
Figure 3:
Binary gradient images. To reduce sensitivity to lighting conditions and page coloration, a Sobel operator is applied to the original images (a). The absolute value of the output is thresholded to produce the binary gradient image (b).
Columns are easily segmented from the de-skewed, binary gradient images by summing pixels vertically -- see Figure 4 (a). A similar horizontal process is used to find the baselines of each row within each column (we do not assume the columns are aligned). Finally, a vertical process is applied within each row to segment individual words. A complete segmentation of a real image is shown in Figure 4 (b).
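The column-finding step (described in more detail in the caption of Figure 4) might be sketched as follows. The strip width d, the threshold T and the sign convention for the neighbouring-strip difference are assumptions made for illustration:

```python
import numpy as np

def find_columns(bin_grad, d=20, T=500):
    """Sweep a pair of neighbouring vertical strips of width d across the
    binary gradient image.  Significant local minima of (left strip sum -
    right strip sum) mark column start points; significant local maxima
    mark column end points."""
    col_sums = bin_grad.sum(axis=0).astype(np.int64)
    cs = np.concatenate(([0], np.cumsum(col_sums)))
    strip = cs[d:] - cs[:-d]                  # strip[x] = pixels in columns [x, x+d)
    xs = np.arange(d, bin_grad.shape[1] - d + 1)
    diff = strip[xs - d] - strip[xs]          # left strip minus right strip
    starts, ends = [], []
    for i in range(1, len(diff) - 1):
        if diff[i] <= diff[i - 1] and diff[i] <= diff[i + 1] and diff[i] < -T:
            starts.append(int(xs[i]))         # column start point
        elif diff[i] >= diff[i - 1] and diff[i] >= diff[i + 1] and diff[i] > T:
            ends.append(int(xs[i]))           # column end point
    return starts, ends
```

The same idea, applied horizontally within each detected column and then vertically within each row, yields the baselines and word boundaries.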
The mosaic will be built up by matching the lower right hand corners of words in pairs of overlapping images. The segmentation scheme finds these points reliably, and also places them in the context of a hierarchy of rows and columns: in effect, the points have been grouped [6]. As we shall see in Section 4, the grouping can be exploited by an intuitive and efficient matching algorithm.
The segmentation process involves a considerable amount of summing in the binary gradient image, which could be computationally expensive if not implemented efficiently. To this end, we construct a matrix of partial sums [9] whose elements are given by
$$S(r,c) = \sum_{i \le r} \sum_{j \le c} B(i,j),$$
where $B(i,j) \in \{0,1\}$ denotes the binary gradient image.
The matrix of partial sums can be calculated in one pass through the binary gradient image. Summing pixels within any rectangular area of the binary gradient image is then very efficient [9]:
$$\sum_{i=r_1}^{r_2} \sum_{j=c_1}^{c_2} B(i,j) = S(r_2,c_2) - S(r_1-1,c_2) - S(r_2,c_1-1) + S(r_1-1,c_1-1).$$
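A sketch of the partial-sum table and the four-look-up rectangle query; a padded row and column of zeros is added purely for implementation convenience, so that queries touching the image border need no special cases:

```python
import numpy as np

def partial_sums(bin_grad):
    """One-pass matrix of partial sums: S[r, c] is the number of set pixels
    in the rectangle spanning rows 0..r-1 and columns 0..c-1."""
    S = np.zeros((bin_grad.shape[0] + 1, bin_grad.shape[1] + 1), dtype=np.int64)
    S[1:, 1:] = np.cumsum(np.cumsum(bin_grad, axis=0), axis=1)
    return S

def rect_sum(S, r1, c1, r2, c2):
    """Sum of pixels in rows r1..r2 and columns c1..c2 (inclusive),
    obtained from four table look-ups."""
    return S[r2 + 1, c2 + 1] - S[r1, c2 + 1] - S[r2 + 1, c1] + S[r1, c1]
```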
Figure 4:
Segmentation of a page of text.
Columns can be found by summing the binary gradient images in vertical strips and looking at the difference between neighbouring sums (a). The strips are swept from left to right across the image, and local minima with magnitudes exceeding the threshold T indicate column start points. Likewise, significant local maxima indicate column end points. Since the columns are so prominent, the procedure is not sensitive to the parameter T. The optimal width d of the strips depends on the inter-column separation in pixels, which is largely a function of the imaging resolution (dpi). Thus d needs adjusting only when the camera is moved or the lens is changed. A similar process can be used to find rows within columns and words within rows: a full segmentation is superimposed on the original (skewed) greyscale image in (b). The lower right hand corner of each word provides a suitable point for matching across images.