Before two images can be mosaiced, it is necessary to find the transformation which aligns one image with the other. This is the well known ``motion estimation'' or ``registration'' problem [2]. Techniques for solving this problem can be divided into two broad categories [2,7]: feature-based and featureless techniques. When the motion between the two images is small, for instance when the two images are from successive frames in a video sequence, the motion parameters can be estimated using optical flow [7]. When the motion is not small, as in this application, the motion could be estimated using generalised cross-correlation [10]. However, this requires a computationally expensive combination of coarse-to-fine and iterative gradient-based search [7]. The final option is to use a feature-based approach [4], estimating the motion from point correspondences. Such techniques are often criticized on the grounds that it is difficult to find the features automatically, or that the features are sensitive to noise and occlusion [7]. In this particular application, however, there is a plentiful supply of stable, detectable features. We therefore adopt a feature-based approach.
Instead of using generic ``corner'' features, we exploit knowledge of the domain to extract a more organised set of features. Each image is automatically segmented into a hierarchy of columns, lines and words: it is relatively straightforward to match such organised sets of features across images. The algorithm for producing the hierarchical segmentation is summarised below.
The first step is to estimate the angle that the rows of text make with the image raster lines (the skew angle), which is assumed to lie within a small range. To achieve this, a small patch of text is selected randomly, and then rotated in small steps over this range until the variance of the pixel intensities summed along raster lines is maximised [1] -- see Figure 2.
The exhaustive search is not computationally expensive since small rotations can be well approximated by vertical shears, which are fast to
compute. To ensure that an accurate skew angle is found, the calculation
is performed at several image patches, and the final estimate is the
average of the individual angles weighted by the variance of the pixel
intensities at each patch. All subsequent feature detection operations
are performed on the de-skewed images.
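As an illustration, the search might be implemented along the following lines in Python/NumPy. The patch size, number of patches, angular range and step size are illustrative assumptions rather than values from the paper, and the small rotations are approximated by per-column vertical shifts (a vertical shear):

```python
import numpy as np

def row_sum_variance(patch, angle_rad):
    """Approximate a small rotation by a vertical shear and return the
    variance of the pixel intensities summed along raster lines."""
    h, w = patch.shape
    sheared = np.empty_like(patch)
    for c in range(w):
        # np.roll wraps pixels around the patch edge; acceptable for the
        # small shifts involved in this sketch.
        sheared[:, c] = np.roll(patch[:, c], int(round(c * np.tan(angle_rad))))
    return float(np.var(sheared.sum(axis=1)))

def estimate_patch_skew(patch, max_deg=5.0, step_deg=0.1):
    """Exhaustive search for the angle maximising the row-sum variance."""
    angles = np.deg2rad(np.arange(-max_deg, max_deg + step_deg, step_deg))
    variances = np.array([row_sum_variance(patch, a) for a in angles])
    best = int(np.argmax(variances))
    return angles[best], variances[best]   # best angle and its variance

def estimate_skew(image, patch_size=128, n_patches=8, rng=None):
    """Average the per-patch estimates, weighted by the row-sum variance
    achieved at each patch."""
    if rng is None:
        rng = np.random.default_rng()
    angles, weights = [], []
    for _ in range(n_patches):
        r = rng.integers(0, image.shape[0] - patch_size)
        c = rng.integers(0, image.shape[1] - patch_size)
        a, v = estimate_patch_skew(image[r:r + patch_size, c:c + patch_size])
        angles.append(a)
        weights.append(v)
    return float(np.average(angles, weights=weights))
```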
Figure 2:
Finding the skew angle.
The text is rotated from its original orientation (a) until the variance
of the grey levels summed along the image raster lines is maximized (b).
(c) shows the results on a real image, where the central portion of text
has been de-skewed.
De-skewed document images can be reliably and intuitively segmented into a hierarchy of columns, lines and words. To remove any sensitivity to illumination and page coloration, we work on binary gradient images. These are obtained by applying a Sobel operator to the de-skewed greyscale image and thresholding the output [11] -- see Figure 3.
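A minimal sketch of this step, assuming the gradient magnitude is thresholded at a fixed value (the threshold here is an illustrative assumption, not a value from the paper):

```python
import numpy as np
from scipy import ndimage

def binary_gradient(grey, threshold=50.0):
    """Binary gradient image: Sobel gradient magnitude, thresholded."""
    g = grey.astype(float)
    gx = ndimage.sobel(g, axis=1)    # horizontal derivative
    gy = ndimage.sobel(g, axis=0)    # vertical derivative
    magnitude = np.hypot(gx, gy)     # combined absolute gradient response
    return (magnitude > threshold).astype(np.uint8)
```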
Figure 3:
Binary gradient images. To reduce sensitivity to lighting conditions and page coloration, a Sobel operator is applied to the original images (a). The absolute value of the output is thresholded to produce the binary gradient image (b).
Columns are easily segmented from the de-skewed, binary gradient images by summing pixels vertically -- see Figure 4 (a). A similar horizontal process is used to find the baselines of each row within each column (we do not assume the columns are aligned). Finally, a vertical process is applied within each row to segment individual words. A complete segmentation of a real image is shown in Figure 4 (b).
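The column-finding step (described in more detail in the caption of Figure 4) might be sketched as follows. The strip width d, the threshold T and the sign convention for the neighbouring-strip difference are assumptions made for illustration:

```python
import numpy as np

def find_columns(bin_grad, d=20, T=500):
    """Sweep a pair of neighbouring vertical strips of width d across the
    binary gradient image.  Significant local minima of (left strip sum -
    right strip sum) mark column start points; significant local maxima
    mark column end points."""
    col_sums = bin_grad.sum(axis=0).astype(np.int64)
    cs = np.concatenate(([0], np.cumsum(col_sums)))
    strip = cs[d:] - cs[:-d]                  # strip[x] = pixels in columns [x, x+d)
    xs = np.arange(d, bin_grad.shape[1] - d + 1)
    diff = strip[xs - d] - strip[xs]          # left strip minus right strip
    starts, ends = [], []
    for i in range(1, len(diff) - 1):
        if diff[i] <= diff[i - 1] and diff[i] <= diff[i + 1] and diff[i] < -T:
            starts.append(int(xs[i]))         # column start point
        elif diff[i] >= diff[i - 1] and diff[i] >= diff[i + 1] and diff[i] > T:
            ends.append(int(xs[i]))           # column end point
    return starts, ends
```

The same idea, applied horizontally within each detected column and then vertically within each row, yields the baselines and word boundaries.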
The mosaic will be built up by matching the lower right hand corners of words in pairs of overlapping images. The segmentation scheme finds these points reliably, and also places them in the context of a hierarchy of rows and columns: in effect, the points have been grouped [6]. As we shall see in Section 4, the grouping can be exploited by an intuitive and efficient matching algorithm.
The segmentation process involves a considerable amount of summing in the binary gradient image, which could be computationally expensive if not implemented efficiently. To this end, we construct a matrix of partial sums [9] whose elements are given by
$$S(r,c) = \sum_{i \le r} \sum_{j \le c} B(i,j),$$
where $B(i,j) \in \{0,1\}$ denotes the binary gradient image.
The matrix of partial sums can be calculated in one pass through the binary gradient image. Summing pixels within any rectangular area of the binary gradient image is then very efficient [9]:
$$\sum_{i=r_1}^{r_2} \sum_{j=c_1}^{c_2} B(i,j) = S(r_2,c_2) - S(r_1-1,c_2) - S(r_2,c_1-1) + S(r_1-1,c_1-1).$$
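A sketch of the partial-sum table and the four-look-up rectangle query; a padded row and column of zeros is added purely for implementation convenience, so that queries touching the image border need no special cases:

```python
import numpy as np

def partial_sums(bin_grad):
    """One-pass matrix of partial sums: S[r, c] is the number of set pixels
    in the rectangle spanning rows 0..r-1 and columns 0..c-1."""
    S = np.zeros((bin_grad.shape[0] + 1, bin_grad.shape[1] + 1), dtype=np.int64)
    S[1:, 1:] = np.cumsum(np.cumsum(bin_grad, axis=0), axis=1)
    return S

def rect_sum(S, r1, c1, r2, c2):
    """Sum of pixels in rows r1..r2 and columns c1..c2 (inclusive),
    obtained from four table look-ups."""
    return S[r2 + 1, c2 + 1] - S[r1, c2 + 1] - S[r2 + 1, c1] + S[r1, c1]
```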
Figure 4:
Segmentation of a page of text.
Columns can be found by summing the binary gradient images in vertical strips and looking at the difference between neighbouring sums (a). The strips are swept from left to right across the image, and local minima with magnitudes exceeding the threshold T indicate column start points. Likewise, significant local maxima indicate column end points. Since the columns are so prominent, the procedure is not sensitive to the parameter T. The optimal width d of the strips depends on the inter-column separation in pixels, which is largely a function of the imaging resolution (dpi). Thus d needs adjusting only when the camera is moved or the lens is changed. A similar process can be used to find rows within columns and words within rows: a full segmentation is superimposed on the original (skewed) greyscale image in (b). The lower right hand corner of each word provides a suitable point for matching across images.