Improving the OCR of Low Contrast, Small Fonts, Dark Background Forms Using Correlated Zoom and Resolution Technique (CZRT)

Many formal institutions, companies, hospitals, and laboratories sometimes need to exchange hand-signed reports through modern communication means such as fax and e-mail. The quality of both the scanned documents and the original paper causes problems when converting such images to text. In addition, font type and size, contrast, and background darkness adversely affect the accuracy of the resulting text. This work therefore investigates the relationship between scanned document zoom and scanning resolution in Dots per Inch (DPI) for a special case and type of scanned form, to enable the design of an algorithm that takes such cases into account. It is found that a much higher level of zooming and resolution is needed to achieve acceptable recognition for the special case of dark, low contrast, small font forms. It is also found that the optimum zooming level is set by the number of recognized words, as words are more difficult to learn and analyze than numbers.


Introduction
The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps, including segmentation, feature extraction, and classification. Applications of OCR range from scanning a document so that its text is available in a word processor to recognizing license plate numbers and zip codes [1]-[5].
The best OCR results are achieved when the following conditions are met:
1) A clean printed copy.
2) A scanned document in which the horizontal lines are tilted little or not at all.
3) A scanned document that is free of smearing marks and blurring.
4) A document whose characters are distinguishable, with distinct edges.
5) A document without underlined characters, especially the letters g, j, p, q and y.
6) A document with no handwritten notes.
7) A document with no colored text or dark backgrounds.
The most common use of OCR is to convert text documents to some sort of digital representation. OCR can reach 98% accuracy. However, accuracy decreases depending on the quality of the scanned documents and the type of algorithm used to interpret the resulting scanned files. The scanning quality of a document can be expressed by its resolution in DPI (Dots per Inch). Usually, 300 DPI is the standard quality, since such a resolution gives good accuracy without sacrificing speed and file size [6]-[10].
In this paper, the effect of dark, low contrast, small font forms on the level of resolution and zoom required to obtain acceptable results is related and explained. In addition, it is shown that higher zoom and resolution values are required at a fixed contrast to obtain satisfactory results for such forms. An attempt is made to quantify this approach in order to enable an intelligent selection of resolution and zooming levels according to the type of scanned document.

Methodology
A low contrast, dark, small font form is used to establish the best scanning parameters for such a special case. The form is scanned at various zooming levels for each resolution level. The two matrices used are:
1) Zooming Matrix (%): [100, 120, 140, 160, 180].
2) Resolution Matrix (DPI): [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000].
The algorithm is designed to find an intersection at which a maximum recognition rate value or values are obtained for both words and numbers. The final correlated value for zoom as a function of resolution is selected for the type and properties of the scanned form and fed back into a knowledge base (KB). The experiment is repeated for other types and variations of forms to enrich the KB and to obtain a proper automatic system that switches between parameters to achieve the optimum recognition rate.
The algorithm implements three expressions: Words Recognition (1), Numbers Recognition (2), and a correlation function (3). Equations (1) and (2) represent words and numbers recognition over the domains of resolution (R) and zoom (Z) for all intervals and values of contrast encountered in practical cases. Here, the algorithm cycles through combinations of the R and Z arrays and logs the counts of correctly recognized words and numbers into the Knowledge Base (KB) as part of the learning curve of the system. Equation (3) applies a correlation function to establish the optimum cross correlation between the Z and R values for the case under consideration. The KB uses the expression in (3) to test various combinations of the stated parameters and form a learning curve, such that the system automatically adjusts its scanning values to obtain the optimum recognition.
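The cycling and logging step described above can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the function names (`sweep`, `best_settings`, `toy_recognize`), the dictionary-based KB, and the toy scoring function are all assumptions, and the peak of the toy function is arbitrary rather than a result from this work. The "intersection" is interpreted here as the set of (R, Z) pairs that maximize both counts at once.

```python
from itertools import product

# Values taken from the Zooming and Resolution matrices in the text.
ZOOM_LEVELS = [100, 120, 140, 160, 180]        # percent
RESOLUTIONS = list(range(100, 1001, 100))      # DPI

def sweep(recognize, resolutions=RESOLUTIONS, zooms=ZOOM_LEVELS):
    """Scan every (R, Z) pair and log (words, numbers) counts into a dict-based KB."""
    kb = {}
    for r, z in product(resolutions, zooms):
        words, numbers = recognize(r, z)
        kb[(r, z)] = {"words": words, "numbers": numbers}
    return kb

def best_settings(kb):
    """Return the (R, Z) pairs that maximize both the word and the number counts."""
    max_words = max(entry["words"] for entry in kb.values())
    max_numbers = max(entry["numbers"] for entry in kb.values())
    best_words = {k for k, e in kb.items() if e["words"] == max_words}
    best_numbers = {k for k, e in kb.items() if e["numbers"] == max_numbers}
    return best_words & best_numbers

def toy_recognize(r, z):
    """Hypothetical stand-in for scan + OCR; the peak at (600, 160) is arbitrary."""
    words = 100 - abs(r - 600) // 10 - abs(z - 160)
    numbers = 50 - abs(r - 600) // 20 - abs(z - 160) // 2
    return words, numbers

kb = sweep(toy_recognize)
print(best_settings(kb))
```

In practice the two maxima need not coincide, in which case the intersection is empty and a relaxed criterion (such as a threshold on the recognition rate) would be needed.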

Results
Real cases of formal, dark background, low contrast forms, exchanged as scanned images through e-mail and fax machines, are collected and re-scanned at various resolution and zoom values. The resulting scanned files are then passed to the recognition and interpretation algorithm, which has learning capabilities. The recognized files are then produced as text documents, with statistical analysis of the correctly recognized numbers and words versus resolution and zooming levels.
Figures 1-5 show recognition curves for both words and numbers as a function of both resolution and zooming parameters. These values are displayed in Table 1 and Table 2.

Discussion
From Figures 1-5 and Table 1 and Table 2, the following is deduced:
1) The recognition rate for words increases steadily up to a certain value of zooming and resolution, beyond which it decreases due to blurring in the case of the zoom parameter and line thickening in the case of resolution.
2) The recognition rate for numbers is higher than that for words at the same parameters. This is expected, as learning printed numbers is much easier than learning letter and word variations.
Table 3 and Table 4 show the percentage recognition of both words and numbers in relation to the resolution and zoom parameters. From the tables, it is clear that numbers have higher percentages of recognition per zoom value than words, and hence a higher recognition rate at smaller zoom levels [11]-[13].
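As a small illustration of how a table of recognition rates can be derived from a table of recognized counts, the sketch below divides each count by the total number of items on the form. The function name `recognition_rates`, the totals, and all counts are hypothetical and are not taken from the tables of this work.

```python
def recognition_rates(counts, total):
    """Convert correctly recognized counts per (R, Z) into fractions of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {key: count / total for key, count in counts.items()}

# Illustrative counts only, keyed by (resolution in DPI, zoom in percent).
word_counts = {(300, 100): 40, (300, 160): 90}
number_counts = {(300, 100): 15, (300, 160): 19}

word_rates = recognition_rates(word_counts, total=100)
number_rates = recognition_rates(number_counts, total=20)

for key in word_counts:
    print(key, word_rates[key], number_rates[key])
```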
Figure 6 and Figure 7 show the dominance of the 160% zooming level over the other zooming levels used in the experimental work. This level effectively divides the plane into two main sub-planes:
1) Plane 1: Contains recognition rates for zoom values [100%, 120%, 140%, 180%].
2) Plane 2: Contains recognition rates for zoom value [160%].
However, the spread of Plane 1 is larger for numbers than for words, as numbers are easier to recognize.
Based on the results in Figure 6 and Figure 7, Table 5 and Table 6 show the results of applying an acceptable threshold value of 80% recognition to both words and numbers. From the tables, it is clear that both words

Table 5. Recognition rate for words as a function of resolution and zoom at 80% threshold (only the rows below survive; the first row is incomplete and a row label is preserved only for the last row).
      0.4  0.5  0.9  0.5
      0.5  0.5  0.6  0.9  0.5
      0.5  0.5  0.6  0.9  0.5
      0.6  0.5  0.7  0.9  0.5
      0.6  0.5  0.7  0.9  0.5
      0.6  0.5  0.7  0.9  0.5
      0.5  0.5  0.7  0.9  0.5
1000  0.5  0.4  0.7  0.9  0.5

To establish common working criteria for both words and numbers, as any document will usually have a mix of both types, correlation is carried out between the contents of Table 5 and Table 6, resulting in the following operating parameters for this special condition of dark, low contrast, small font, formal forms:
1) Zoom: [160%].
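The thresholding and correlation step can be illustrated with a short sketch. It assumes, as one possible reading, that "correlation" between the two thresholded tables amounts to keeping the (R, Z) cells that pass the 80% threshold for both words and numbers. The function names and the rate values are hypothetical and are not taken from Table 5 or Table 6.

```python
def passing_cells(rates, threshold=0.8):
    """Return the (R, Z) keys whose recognition rate meets the threshold."""
    return {key for key, rate in rates.items() if rate >= threshold}

def common_operating_points(word_rates, number_rates, threshold=0.8):
    """Keep only the (R, Z) cells that pass the threshold for words and numbers."""
    return passing_cells(word_rates, threshold) & passing_cells(number_rates, threshold)

# Illustrative rates only, keyed by (resolution in DPI, zoom in percent).
word_rates = {(400, 140): 0.6, (400, 160): 0.9, (400, 180): 0.5}
number_rates = {(400, 140): 0.85, (400, 160): 0.95, (400, 180): 0.7}

common = common_operating_points(word_rates, number_rates)
zooms = sorted({z for _, z in common})
print(common, zooms)
```

Because words pass the threshold in fewer cells than numbers in this toy data, the word table is what limits the common operating points, which mirrors the observation that word recognition sets the operating levels.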

Conclusions
The obtained recognition curve displays interesting characteristics resembling those of a pass band. The curve shows low recognition at the normal resolution used for standard forms, due to darkness, low contrast, and small fonts, and low recognition at very high resolution values, which is explained by increased line thickening that brings letters closer to each other and thus reduces the recognition rate. In addition, a low recognition rate is observed at high zooming values due to blurring. Numbers suffer much less than words in terms of recognition, which is expected, as the learning curve is much simpler for numbers than for letters and words.
In conclusion, the quality of scanned forms and the character properties in terms of size and font affect both the resolution levels and the zooming levels required. It is also shown that word recognition determines the ultimate operating levels for the interpreting software, as it is more affected by optical and digital properties than number recognition. It is critical to enable an algorithm to properly select all parameters based on the type and quality of the scanned documents for best results. The complexity of interpretation arises with forms that contain different types and sizes of fonts with low contrast and dark backgrounds. In such cases, the array of resolution values versus the array of zooming values is very useful, as multiple optimum scanning levels are possible using the obtained pass band curve.

Figure 1. Recognition of words and numbers at 100% Zoom.

Figure 4. Recognition of words and numbers at 160% Zoom.

Figure 5. Recognition of words and numbers at 180% Zoom.

Figure 7. Recognition of numbers at 160% Zoom.

Table 1. Recognized words as a function of resolution and zoom.

Table 2. Recognized numbers as a function of resolution and zoom.

Table 3. Recognition rate for words as a function of resolution and zoom.

Table 4. Recognition rate for numbers as a function of resolution and zoom.

Table 6. Recognition rate for numbers as a function of resolution and zoom at 80% threshold.