Script and Language Identification from Document Images

 

G. S. Peake and T. N. Tan

Dept. of Computer Science, University of Reading

England, RG6 6AY

G.S.Peake@reading.ac.uk

 

 

Abstract

 

In this paper we present a review of current script and language identification techniques. The main criticism of the existing techniques is that most of them rely on either connected component analysis or character segmentation. We go on to present a new method for script identification, based on texture analysis, which does not require character segmentation. A uniform text block on which texture analysis can be performed is produced from a document image via simple processing. Multiple-channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified against the features of training documents using the K-NN classifier. Initial results of over 95% accuracy on the classification of 105 test documents from 7 scripts are very promising. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.

1 Introduction

 

The world we live in is becoming increasingly multilingual and, at the same time, increasingly automated. In the optical character recognition (OCR) and document image processing (DIP) communities this is beginning to present a problem. Almost all existing work on OCR makes an important implicit assumption that the language of the document to be processed is known beforehand. Individual OCR tools have been developed to deal best with only one specific language. An English OCR package will deal very well with English, and may be able to cope passably with some other Roman alphabet languages such as French or German. It will not, however, be very helpful if given a Chinese document to process. In an automated environment, document processing systems relying on OCR would clearly need human intervention to select the appropriate OCR package, which is obviously not desirable. A pre-OCR language identification system would enable the correct OCR system to be selected in order to achieve the best character interpretation of the document. This area has not been widely researched to date, despite its growing importance to the document image processing community and the progression towards the "paperless office".

In the first part of the paper, we present a review of the existing work. We discuss the principles, merits and weaknesses of each approach. Our review of existing work has motivated us to develop a novel approach to script identification. The new approach is described in detail in the second part of the paper. Experimental results are included to illustrate the performance of the new method.

 

2 Previous work

 

In this section, we present a review of existing work on script and language identification from document images. Each of the research groups currently involved in this area has taken a different approach. These approaches are discussed in the following.

 

2.1 Text symbol templates

 

Hochberg et al. present a method for automatically identifying script from a binary document image using cluster-based text symbol templates [1]. Text symbols are rescaled to 30 × 30 pixels. In the case of scripts where the characters of whole words are connected (such as Arabic), the entire word is rescaled to this size. Symbols are grouped into clusters based on Hamming distances, and the centroid (template) of each cluster is calculated. For classification, N textual symbols from the test image are compared with all template symbols having a reliability of at least R%; here also the comparison is based on the Hamming distance. The mean matching score for each script is calculated and the script with the best mean match is chosen as the script of the document. A very high level of accuracy (up to 98%) can be obtained for both test images and "challenge" images (containing graphics and foreign characters).
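As a rough illustration only (not the authors' implementation), the classification step described above might be sketched as follows, assuming binary symbols already rescaled to 30 × 30 and a set of per-script templates; the R% template-reliability filter is omitted for brevity.

```python
import numpy as np

def hamming_distance(symbol, template):
    """Number of differing pixels between two 30 x 30 binary arrays."""
    return np.count_nonzero(symbol != template)

def classify_script(symbols, templates_by_script):
    """Choose the script whose templates best match the test symbols.

    symbols            : list of 30 x 30 binary arrays from the test page
    templates_by_script: dict mapping script name -> list of 30 x 30 templates
    """
    n_pixels = 30 * 30
    scores = {}
    for script, templates in templates_by_script.items():
        # similarity of each symbol to its closest template for this script
        sims = [1.0 - min(hamming_distance(s, t) for t in templates) / n_pixels
                for s in symbols]
        scores[script] = float(np.mean(sims))   # mean matching score
    return max(scores, key=scores.get)          # best mean match wins
```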

Misclassifications generally arise when test images contain markedly different fonts from the training images. This problem may be solved by including training images with a wider variety of fonts. It is claimed that the method is insensitive to page skew angles of up to 10° (and to the presence of multiple skew angles in the document), without any requirement for separate skew correction. In the dataset used, all the images were scanned at a resolution of 200 dpi or higher. In images scanned at lower resolutions, the text symbols will be much coarser, which may present problems.

 

2.2 Projection profiles

 

Wood et al. present some observations about the characteristics of various scripts, with particular reference to the effect such characteristics have on the horizontal and vertical projections of the document image [2]. Here is a summary of the main observations: Roman alphabet languages will show dominant peaks at the top and bottom of the x zone (the vertical extent and position of small characters such as x) in the horizontal projection profile. Russian text has a dominant midline. Arabic text has a strong baseline, but nothing corresponding to the top of the x zone in Roman text. Similar profiles can be seen in handwritten texts and skewed texts from other languages. The right edge of Korean characters is usually stronger than the left, and dominates the vertical projection profile. Chinese characters are roughly the same size and have the same aspect ratio, are generally arranged in a regular array and have very irregularly placed horizontal lines. These characteristics can be observed in the images in Fig. 1.
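The profiles themselves are simple to compute; a minimal sketch, assuming a binary image with text pixels equal to 1, is given below.

```python
import numpy as np

def projection_profiles(binary_image):
    """Return (horizontal, vertical) projection profiles of a binary image.

    The horizontal profile counts text pixels in each row (used to look for
    baselines, the top and bottom of the x zone, etc.); the vertical profile
    counts text pixels in each column (used e.g. for the strong right edges
    of Korean characters).
    """
    horizontal = binary_image.sum(axis=1)   # one value per row
    vertical = binary_image.sum(axis=0)     # one value per column
    return horizontal, vertical
```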

Wood et al. comment that other parts of the text (e.g. foreign characters), which have different characteristics, interfere with the definitive profiles. In order to remove these interfering effects, enhance the desired characteristics and increase robustness with respect to skew, they apply a number of filtering methods to the original image, including medial axis transforms, erosion, dilation and run length filtering. However, it is not clear how (or if) the projection profiles can be analyzed automatically to determine the script, and no general system is suggested. The method would require no character segmentation, may be insensitive to point size, and would cope with a small amount of page skew. It is possible that particularly flamboyant fonts and italicized text may present some problems.

 

2.3 Upward concavities

 

Spitz proposes a method for distinguishing between Asian and European languages by examining the upward concavities [3] of connected components. Upward concavities are described as:

" Where two runs of black pixels appear on a single scan line of the raster image, if there is a run on the line below which spans the distance between these two runs, an upward concavity is formed on the line. "

Spitz's observation is that European and Asian scripts have very different distributions of upward concavities. In European scripts, there are usually at most 2 or 3 upward concavities in any one character, generally located at the top or bottom of the x zone and in the curve of descenders. Asian scripts, on the other hand, have on the whole many more upward concavities per character (with no consistent arrangement) due to the higher level of character complexity.
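A direct, unoptimized reading of this definition might be coded as below; the run extraction and the spanning test are my own interpretation of the quoted description, assuming a binary array with black pixels equal to 1.

```python
import numpy as np

def black_runs(row):
    """Return (start, end) column indices of maximal runs of black pixels."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs

def count_upward_concavities(binary_image):
    """Count scan-line positions matching the quoted definition."""
    count = 0
    for y in range(binary_image.shape[0] - 1):
        runs_here = black_runs(binary_image[y])
        runs_below = black_runs(binary_image[y + 1])
        for (s1, e1), (s2, _) in zip(runs_here, runs_here[1:]):
            # two runs on this line; is there a run below spanning the gap?
            if any(s <= e1 and e >= s2 for s, e in runs_below):
                count += 1
    return count
```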

The position of each upward concavity is related, by its vertical distance, to a "stable fiducial point" - namely the centre of mass of the connected component or, if text line information has been obtained, the component baseline.

The vertical distribution of the upward concavities is shown to be markedly different between Asian and European languages. Gross script identification is performed by analyzing the variance of the distribution. In [5] this analysis is described as a "simple heuristically determined value of variance".

The work is then extended to discriminate between Chinese, Japanese and Korean when the script is classified as Asian. An optical density function is computed whereby the total number of black pixels in each character (its density) is tabulated in reading order across the text. Study of the distributions of these optical density functions shows distinct differences between Korean, Chinese and Japanese, although some difficulties in actually performing the classification are discussed. Sibun and Spitz go on to classify European languages on the basis of character shape codes in [4]. Character shape codes relate to the dimensions of characters rather than the actual characters themselves. Language identification of several Roman alphabet languages is performed by statistical analysis of frequent combinations of character shape codes in the languages investigated. The work in [3] and [4] is combined in [5].
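As a tiny illustration, the optical density function itself reduces to a black-pixel count per already-segmented character; the sketch below assumes that segmentation has been done elsewhere.

```python
import numpy as np

def optical_density_function(char_images):
    """Black-pixel count of each character, tabulated in reading order.

    char_images: list of binary arrays, one per character, in reading order.
    Character segmentation is assumed to have been performed already.
    """
    return np.array([np.count_nonzero(c) for c in char_images])
```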

 

2.4 Neural networks

 

Lee and Kim use a self-organizing neural network [6] in order to determine not the script of the entire document, but the script of individual characters within the document. The only scripts they actually discuss are Chinese, Korean and English.

Initially, a non-linear normalization of character shapes (based on character density and character dimension information) is performed. There is no discussion of how the characters themselves are obtained. Zero, first, second and third order features are calculated using a mesh feature system, overlapping contour direction codes, or Kirsch masks (explained in depth in [6]). There are then two classification stages: a coarse classifier (a self-organizing feature map which clusters characters into groups of all English, all Chinese, all Korean or mixed characters), followed by a fine classifier which classifies the characters in the mixed groups and presumably performs actual character identification.

Different combinations of these classification methods are combined with the different methods of obtaining features. The results they present show above 95% accuracy for all experiments, with as high as 98.27% being achieved for classification based on mesh features.

These results are obviously limited to English, Korean, Chinese, numerals and a handful of special characters. The method also seems rather complicated and not ideally suited to the problem of script classification of a whole document image. It would, however, be very useful for recognizing small areas of foreign or special characters in a larger document.

 

2.5 Remarks

 

Some of the above approaches rely on accurate character segmentation or connected component analysis. The problem of character segmentation presents a paradox similar to that presented by OCR, namely that character segmentation can best be performed when the script of the document is known. Some scripts, such as Chinese, have the characters laid out in a regular array, making character segmentation a relatively simple matter. Korean and Japanese tend not to have overlapping characters, but, at the other extreme, contain horizontally disjoint characters. Spitz has developed a method for dealing with these scripts in [7]. Mono-spaced Roman fonts present little problem, but the use of proportional fonts, italics and kerning produces characters which are conjoined (such as the f joining to the i in "fi"), and which persistently overlap in terms of character bounding box space. The Arabic scripts are even more difficult to segment due to the deliberate overlapping and conjoining of cursive characters during the typesetting process. A method has been proposed, however, by Hashemi et al. for the segmentation and recognition of Persian and Arabic characters [8]. The process of scanning can also contribute to these artifacts, and furthermore can cause characters to become unintentionally split.

 

Fig. 1: Examples of typical documents used: a) Chinese, b) English, c) Greek, d) Korean, e) Malayalam, f) Persian, g) Russian.

 

As can be seen from the above discussion, there is a need to apply different processing methods depending on the script of the document so as to achieve the best results possible from character segmentation. Performing character segmentation before the script of the document is known may prove to be inefficient. What we propose here is a method for identifying the script or language family of a document, without requiring character segmentation, or placing any emphasis on the information provided by individual characters themselves.

 

3 The new algorithm

 

The new algorithm is inspired by the simple observation that a uniform block of text (where the line and word spacing are normalized), written in any language, can clearly be seen to have a texture [12]. Different scripts produce different textures, the differences being due to variations in character density and stroke orientation. Previous researchers have commented on the spatial characteristics of different scripts but do not really exploit them: Hochberg's templates rely on general differences in character shapes, Spitz's upward concavities take advantage of different character densities and stroke positions, Lee and Kim use the orientations of contours, but Wood is the only one so far to regard the text as a whole rather than examining individual characters.

Texture analysis is quite difficult to apply directly to documents, since uniform blocks of text rarely occur in them: there are usually interfering aspects such as variable word spacing in fully justified documents and variable line spacing between paragraphs. We have found, however, that with some simple processing a uniform block of text can be extracted from document images, to which established texture analysis methods can be applied. This represents a major extension of our previous work [12], where Gabor multichannel filtering was used to produce rotation-invariant features to which texture classification was applied. The text blocks used there, however, were unrealistic as they did not have uniform spacing. Here we have extended the language set from 6 to 7 languages (now including Korean), used multiple frequencies instead of one, and removed the rotation invariance requirement, as this is irrelevant once the text has been skew-compensated. We have also compared the Gabor filter method with the use of grey level co-occurrence matrices (GLCMs).

 

3.1 Creating a uniform block of text

 

The input to this stage is a binary image of a section of a document which has been skew-compensated [9] and from which graphics and pictures have been removed (at present the removal of non-textual information is performed manually, though page segmentation algorithms (e.g. [13]) could readily be employed to perform this automatically). The text may contain lines with different point sizes and variable spaces between lines, words and characters. Punctuation symbols and foreign characters (such as Arabic numerals in Chinese text) may appear.

There are four main steps to obtain a uniform block of text from an arbitrarily formatted document:

 

3.1.1 Text line location

 

The horizontal projection profile (HPP) of the document is computed and smoothed. The peaks correspond to the centre of the text lines, and the valleys correspond to the blank areas between lines. Only a limited amount of smoothing over a small window can be applied here, as too much smoothing will cause the required peaks and valleys to be lost due to the generally very close spacing of peaks.

The height of each text line is then taken to be the width of the corresponding peak, extending on either side of the peak maximum up to the point where the HPP flattens out (where there is a large white space above or below the line) or to the lowest point of a sharp valley between two peaks. Text lines are then checked to ensure they do not overlap.
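A rough sketch of this step is given below, using SciPy's find_peaks; the smoothing window size and the walk-outwards valley test are illustrative choices, not values taken from the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def locate_text_lines(binary_image, window=5):
    """Locate text lines from the smoothed horizontal projection profile.

    Returns a list of (top, bottom) row indices, one pair per text line.
    """
    hpp = binary_image.sum(axis=1).astype(float)
    # light smoothing only: heavy smoothing merges closely spaced lines
    kernel = np.ones(window) / window
    smooth = np.convolve(hpp, kernel, mode="same")

    peaks, _ = find_peaks(smooth)
    lines = []
    for p in peaks:
        top, bottom = p, p
        # walk outwards until the profile flattens out or a valley bottom is hit
        while top > 0 and smooth[top - 1] < smooth[top]:
            top -= 1
        while bottom < len(smooth) - 1 and smooth[bottom + 1] < smooth[bottom]:
            bottom += 1
        lines.append((top, bottom))
    return lines
```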

 

3.1.2 Outsize text line removal

 

At present there is no method implemented for standardizing the height of the text lines, so all lines whose height is much greater or much smaller than the mean line height are removed. The necessary thresholds are computed statistically: the mean and standard deviation of the text line heights are calculated, and all text lines whose heights fall outside the range defined by the mean and standard deviation are removed. This process is repeated (recalculating the mean and standard deviation each time), provided that the repetition will not cause all remaining text lines to be removed.
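A sketch of this iterative pruning follows; the width of the accepted band (k standard deviations) is left as an explicit, assumed parameter, since the exact threshold is not stated above.

```python
import numpy as np

def remove_outsize_lines(lines, k=1.0):
    """Iteratively drop text lines whose height is far from the mean height.

    lines: list of (top, bottom) row-index pairs from the line-location step.
    k    : width of the accepted band in standard deviations (illustrative).
    """
    lines = list(lines)
    while lines:
        heights = np.array([bottom - top + 1 for top, bottom in lines], float)
        mean, std = heights.mean(), heights.std()
        keep = np.abs(heights - mean) <= k * std
        if keep.all():
            break                    # nothing left to remove
        if not keep.any():
            break                    # would remove every line, so stop instead
        lines = [ln for ln, ok in zip(lines, keep) if ok]
    return lines
```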

 

3.1.3 Spacing normalization

 

This process comprises 3 steps:

  1. Eliminate white space between lines: the vertical bounds of each line are known, so it is a simple matter to adjust the lines so that all line spacings are set to be the same predefined value.
  2. Left justification: this step is important with respect to the padding stage (see 3.1.4). The x (horizontal) position of each text line is updated so that the leftmost point of the first character on the line is at x=0.
  3. Normalize inter-word spacing to a maximum of 5 pixels: the pixels of each text line are projected vertically and examined from left to right, and all runs of white pixels wider than 5 pixels are reduced to 5. This forces the maximum gap between two characters to be 5 pixels. Smaller gaps are left untouched because, depending on the script, they may well be gaps between characters rather than words (a sketch of this step follows the list).
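The sketch below is my own column-wise reading of the third step, clipping runs of empty columns within a text line to at most 5 pixels; text pixels are assumed to equal 1.

```python
import numpy as np

def clip_word_gaps(line_image, max_gap=5):
    """Reduce runs of empty columns in a text line to at most max_gap columns.

    line_image: 2-D binary array for one text line (text pixels == 1).
    Columns whose vertical projection is zero are treated as white space.
    """
    is_empty = line_image.sum(axis=0) == 0
    keep, run = [], 0
    for x, empty in enumerate(is_empty):
        if empty:
            run += 1
            if run <= max_gap:          # keep only the first max_gap columns
                keep.append(x)
        else:
            run = 0
            keep.append(x)
    return line_image[:, keep]
```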

 

3.1.4 Padding

 

The text is then padded to a block of a predefined size (here 128 × 128 pixels) in 2 stages. First, the space between the last filled pixel on each text line and the right-hand side of the image is calculated, and an inter-word space of 5 pixels is subtracted from the gap size. This number of pixels is then copied from the start of each scan line in the text line into the space. The gap between the end of the existing text and the beginning of the copied text is always known to be 5 pixels because of the left justification described in 3.1.3. Padding may be performed in this way because the texture is content-independent. Secondly, the block is padded to the required height (if necessary) in the same manner, by copying the appropriate number of scan lines from the top of the image. The block is then extracted from the top left-hand corner of the image, and texture features may then be extracted from it.
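A sketch of this wrap-around padding is given below, assuming the block passed in is already left-justified and spacing-normalized, with text pixels equal to 1; it is an illustration of the description above, not the authors' code.

```python
import numpy as np

def pad_block(block, size=128, gap=5):
    """Wrap text around to fill a size x size block (content-independent)."""
    h, w = block.shape
    hh, ww = min(h, size), min(w, size)
    out = np.zeros((size, size), dtype=block.dtype)
    out[:hh, :ww] = block[:hh, :ww]

    # pad each scan line by copying pixels from its own start,
    # leaving the standard inter-word gap
    for y in range(hh):
        filled = np.flatnonzero(out[y, :])
        if filled.size == 0:
            continue
        end = filled[-1] + 1 + gap
        if end < size:
            out[y, end:] = out[y, :size - end]

    # pad to the required height by repeating scan lines from the top
    for y in range(hh, size):
        out[y, :] = out[y - hh, :]
    return out
```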

 

3.2 Feature extraction

 

In principle, any texture analysis technique (see [14] for a recent survey) can be applied to the uniform text blocks created in the way discussed above. Here, two established methods are implemented to obtain texture features: Gabor filtering and GLCMs. The former is becoming very popular and the latter is widely recognized as the benchmark technique in texture analysis.

 

3.2.1 Gabor filtering

 

Gabor filters have been shown to be a good model of the processing that takes place in the human visual cortex, and have been used successfully in both texture segmentation [10] and texture classification [11]. The mathematical details of Gabor filters can be found in [15] and elsewhere (e.g. [16]).

The Gabor filter requires as input an N × N pixel image (where N is a power of 2), an angle (θ) and a central frequency (f). The parameters θ and f specify the location of the Gabor filter on the frequency plane. The filtering is performed in the frequency domain using the FFT. Commonly used frequencies are powers of 2. In [11] it has been shown that, for an image of size N × N, the important frequency components are likely to be found within N/4 cycles/degree, so here we use frequencies of 4, 8, 16 and 32 cycles/degree. The width of the filter is determined by the sigma (σ) of the modulating Gaussian function, which varies in inverse proportion to f.

For each central frequency, filtering is performed at 0°, 45°, 90° and 135°. This results in 16 output images (4 per frequency) from which the texture features are extracted as follows: the mean and the standard deviation of each output image are calculated, giving 32 features per input image. Testing was conducted using all 32 features and various subsets of them (e.g. the 16 mean features, the 8 features from a single frequency, etc.).
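As an illustration only (not the authors' implementation), a frequency-domain filter bank along these lines might look as follows; the Gaussian parameterization (a frequency-plane sigma of f/2, corresponding to a spatial sigma inversely proportional to f) is an assumption for the sake of the sketch.

```python
import numpy as np

def gabor_features(block, frequencies=(4, 8, 16, 32), angles=(0, 45, 90, 135)):
    """Mean and standard deviation of Gabor-filtered outputs of a text block.

    block: N x N array (N a power of 2, e.g. the 128 x 128 block above).
    Each filter is a Gaussian centred at (f, theta) on the frequency plane.
    Returns 32 values: (mean, std) for each of the 16 channels.
    """
    n = block.shape[0]
    spectrum = np.fft.fftshift(np.fft.fft2(block))
    u = np.arange(n) - n // 2                      # cycles per image
    uu, vv = np.meshgrid(u, u)

    features = []
    for f in frequencies:
        sigma = f / 2.0                            # assumed bandwidth choice
        for angle in angles:
            theta = np.deg2rad(angle)
            u0, v0 = f * np.cos(theta), f * np.sin(theta)
            gauss = np.exp(-((uu - u0) ** 2 + (vv - v0) ** 2) / (2 * sigma ** 2))
            filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * gauss))
            magnitude = np.abs(filtered)
            features.extend([magnitude.mean(), magnitude.std()])
    return np.array(features)
```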

 

3.2.2 Grey level co-occurrence matrices

 

GLCMs are in general very expensive to compute because the size of each matrix is N × N, where N is the number of grey levels in the image. In our case, however, there are only 2 grey levels, so it is reasonable to use GLCMs.

GLCMs were constructed for five distances (d = 1, ..., 5) and four directions: 0°, 45°, 90° and 135°. This gives, for each input image, 20 matrices of dimension 2 × 2. In other applications, where the size of the GLCM is too large to allow the matrix element values to be used directly, measures such as energy, entropy, correlation and so on are computed from the matrix and used as features [17]. Here, however, there are only 4 elements in each matrix and, due to diagonal symmetry, 2 of those values are identical, giving 3 independent values from each matrix. This gives 60 features per input image. Again, testing was conducted using various subsets of these features.
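Because the blocks are binary, the matrices are tiny; the brute-force sketch below is my own unoptimized implementation of the description above, assuming text pixels equal 1 and symmetric matrices (pairs counted in both directions).

```python
import numpy as np

def binary_glcm_features(block, distances=(1, 2, 3, 4, 5),
                         angles=(0, 45, 90, 135)):
    """Texture features from 2 x 2 grey level co-occurrence matrices.

    block: 2-D array of 0s and 1s. For each of the 20 distance/direction
    combinations a symmetric 2 x 2 GLCM is built; diagonal symmetry leaves
    3 independent values per matrix, i.e. 60 features per block.
    """
    offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
    h, w = block.shape
    features = []
    for d in distances:
        for angle in angles:
            dy, dx = offsets[angle][0] * d, offsets[angle][1] * d
            glcm = np.zeros((2, 2), dtype=float)
            for y in range(h):
                for x in range(w):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < h and 0 <= x2 < w:
                        a, b = int(block[y, x]), int(block[y2, x2])
                        glcm[a, b] += 1
                        glcm[b, a] += 1      # count both directions: symmetric
            glcm /= glcm.sum()
            # independent entries: P(0,0), P(0,1) (== P(1,0)) and P(1,1)
            features.extend([glcm[0, 0], glcm[0, 1], glcm[1, 1]])
    return np.array(features)    # 5 distances x 4 directions x 3 = 60 values
```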

 

4 Experimental results

 

The images used were scanned at 150 dpi greyscale from newspapers, magazines, journals and books. 25 examples were chosen for each of Chinese, English, Greek, Korean, Malayalam, Persian and Russian. These languages represent most of the major scripts of the world (more languages could have been included had they been available). Each image was selected from an entire page scan so that it contained no graphics and resembled reasonably closely the output of a document segmentation system (sections with multiple paragraphs were included, because these are easily dealt with by the uniform block system, and images with multiple columns were also used). Foreign characters, numerals and italicized text were present in many of the images. A small amount of page skew was inevitably introduced during the scanning process; this was compensated for using the method outlined in [9]. Images were scaled to have approximately the same average text height so that, in the absence of point size normalization, vastly different point sizes between documents would not affect the texture. Fig. 1 shows some examples of typical document images used in the experiments.

The images were divided first into 15 training and 10 test images per script (Set A), followed by 10 training and 15 test images (Set B). Images in the training sets did not appear in the test sets. Testing was conducted using different combinations of the features. All classification was performed using the K-NN classifier, with K=5. Table 1 shows the results from the Gabor filter method.
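Since all classification uses a K-NN vote with K = 5, the classifier itself is very small; a minimal sketch is given below (the Euclidean distance and simple majority vote are illustrative choices, as the metric is not specified here).

```python
import numpy as np

def knn_classify(test_features, train_features, train_labels, k=5):
    """Classify one feature vector with a K-NN vote (K = 5, as in the text).

    train_features: array of shape (n_train, n_features)
    train_labels  : list of script names, one per training document
    """
    distances = np.linalg.norm(train_features - test_features, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)   # majority vote among neighbours
```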

These results show that certain combinations of features produce a very high level of accuracy. The single most useful frequency is 32, but combining it with a frequency of 16 produces the best results. For the entry "f=16 & 32 (means only)" under Set A, the total number of features used was 8: there were four output images at f=16 and another four at f=32, and only the mean of each output image was used in this test. The asterisked cells in Table 1 mark the highest score for each set of tests. For Set A (15 training and 10 test images per language, 70 test images in total), 95.71% (67/70) of the images were classified correctly. For Set B (10 training and 15 test images per language, 105 test images in total), 95.23% (100/105) of the images were classified correctly. Table 1 also shows that, in general, the use of more training samples leads to higher classification accuracy.

 

The best result obtained with the GLCMs was only 77.14% accuracy (54/70 images from Set A classified correctly), using all 60 texture features ((5 values of d) × (4 directions) × (3 matrix values)). Tests were not performed on Set B using the GLCM features because the Set A results (which should be more accurate than the Set B results) were so poor.

features                   Set A     Set B
all                        94.29     90.48
means only                 94.29     90.48
std. devs. only            84.29     80.95
all at f=4                 50.00     48.57
all at f=8                 77.14     63.81
all at f=16                78.57     72.38
all at f=32                94.29     92.24
all at f=16 & 32           94.29     95.23*
f=16 & 32 (means only)     95.71*    92.24

Table 1: Results of the Gabor filter method - % of documents classified correctly. The highest score for each set is marked with *.

 

5 Conclusions

 

We have reviewed the current work in the area of document script and language identification, and concluded that, although progress is being made in the field, much of it relies on character segmentation, which we would like to avoid at the pre-language-determination stage. We have presented a novel method for script identification based on texture analysis which does not require character segmentation. Two texture analysis methods have been implemented: Gabor filters and grey level co-occurrence matrices. In tests conducted on exactly the same sets of data (documents taken from 7 languages), the Gabor filters proved far more accurate than the GLCMs, producing results which are over 95% accurate (comparable to results obtained in existing work). The key points of this method are: it requires no character segmentation; it is robust to noise and to the presence of foreign characters or numerals; and it can be applied to very small amounts of text.

Future work will involve the extensions outlined above (such as automatic removal of non-textual regions and automatic normalization of text line heights), plus research into discriminating between languages which are written in the same script, such as the family of Roman alphabet languages (e.g. English, French, Spanish and so on).

 

References

 

  1. J. Hochberg, L. Kerns, P. Kelly and T. Thomas, Automatic Script Identification from Images Using Cluster-based Templates, IEEE PAMI, Vol. 19, No. 2, February 1997, pp. 176-181.
  2. S. L. Wood, X. Yao, K. Krishnamurthi, L. Dang, Language Identification For Printed Text Independent of Segmentation, Proc. of IEEE ICIP '95, pp. 428-431.
  3. A. L. Spitz, Script and Language Determination from Document Images, Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 11-13 April 1994, pp. 229-235.
  4. P. Sibun and A. L. Spitz, Language Determination: Natural Language Processing from Scanned Document Images, Proc. of ANLP '94, pp. 15-21.
  5. A. L. Spitz, Determination of the Script and Language Content of Document Images, IEEE PAMI, Vol. 19, No. 3, March 1997, pp. 235-245.
  6. S.-W. Lee and J.-S. Kim, Multi-lingual, Multi-font, Multi-size Large-set Character Recognition using Self-Organizing Neural Network, Proc. of ICDAR '95, pp. 23-33.
  7. A. L. Spitz, Text Characterization by Connected Component Transformations, SPIE Proceedings, Vol. 2181, 1994, pp. 97-105.
  8. M. R. Hashemi, O. Fatemi and R. Safavi, Persian Cursive Script Recognition, Proc. of ICDAR '95, pp. 869-873.
  9. G. S. Peake and T. N. Tan, A General Algorithm For Document Skew Angle Estimation, submitted to IEEE ICIP '97.
  10. T. N. Tan, Texture Edge Detection by Modelling Visual Cortical Channels, Pattern Recognition, Vol. 28, No. 9, 1995, pp. 1283-1298.
  11. T. N. Tan, Texture Feature Extraction via Visual Cortical Channel Modelling, Proc. 11th IAPR Inter. Conf. Pattern Recognition, Vol. III, 1992, pp. 607-610.
  12. T. N. Tan, Written Language Recognition Based on Texture Analysis, Proc. of ICIP '96, Lausanne, Switz., September 1996, Vol. 2, pp. 185-188.
  13. A. K. Jain and Y. Zhong, Page Segmentation using Texture Analysis, Pattern Recognition, Vol. 29, 1996, pp. 743-770.
  14. T. Reed and J. M. Hans Du Buf, A review of recent texture segmentation and feature extraction techniques, CVGIP: Image Understanding, Vol. 57, 1993, pp. 358-372.
  15. D. Gabor, Theory of Communication, J. Inst. Elec. Engng. 93, 1946, pp. 429-459.
  16. J. G. Daugman, Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-Dimensional Visual Cortical Filters, J. Opt. Soc. Am. A, Vol. 2, 1985, pp. 1160-1169.
  17. R. M. Haralick, Statistical and Structural Approaches to Texture, Proc. of IEEE, Vol. 67, 1979, pp. 786-804.