pdf conversion

Optical Character Recognition, referred to as "OCR", is a technology that analyzes and recognizes image files containing text in order to obtain the text and its layout information.

This usually involves the following processes.

1. Image Input

Different image formats use different storage layouts and compression methods. Open source projects currently used for reading images include OpenCV and CxImage.

2. Preprocessing

Preprocessing mainly includes binarization, noise removal, and skew correction, as follows.

Binarization: In most cases, pictures taken by a camera are color images, which contain a lot of information and need to be simplified. The content of a picture can be divided into foreground and background. For a computer to recognize text faster and more accurately, the color image is processed first so that only foreground and background information remains, i.e. the foreground is defined as black and the background as white, which yields a binary image. A comparison of the color image and the binary image before and after processing is shown in Figure 1.
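As a minimal sketch of this step, the following pure-Python code converts a color image to gray levels and picks a threshold with Otsu's method. Otsu's method is one common way to choose the threshold, not necessarily the one a given OCR product uses. The image representation (nested lists of RGB tuples) and the function names are assumptions made for illustration. Here foreground is encoded as 1 and background as 0.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples into rows of 0-255 gray levels."""
    # ITU-R BT.601 luma weights, a common choice for grayscale conversion.
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb_image]


def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_between = 0, -1.0
    w0 = sum0 = 0  # pixel count and gray-level sum at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(gray, threshold):
    """Dark pixels (ink) become foreground 1, light pixels become background 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


# Hypothetical 2x2 color image: light paper on the left, dark ink on the right.
image = [[(255, 255, 255), (0, 0, 0)],
         [(250, 250, 250), (10, 10, 10)]]
gray = to_gray(image)
binary = binarize(gray, otsu_threshold(gray))
```

A global threshold like this assumes fairly even lighting. Photos with shadows usually need a locally adaptive threshold instead.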

Noise removal: The definition of noise differs from one kind of document to another. Eliminating noise according to its characteristic features is called noise removal.
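One simple illustration, assuming the noise consists of isolated specks in a binary image (1 = foreground, 0 = background), is a 3x3 median filter. This is only a sketch of one possible technique. The function name and the nested-list image format are assumptions.

```python
def median_filter_3x3(binary):
    """Replace each pixel by the median of its 3x3 neighborhood.

    Neighborhoods are clipped at the image border, so edge pixels use
    fewer than nine samples.
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            window = [binary[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out


# A 5x5 blank page with a single isolated speck in the middle.
noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 1
cleaned = median_filter_3x3(noisy)
```

A median filter also erodes thin strokes and sharp corners, so the window size is a trade-off between removing specks and preserving fine detail.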

Skew correction: Users generally take photos casually, so the photographed document is likely to be skewed. In that case, the text recognition software needs to correct the skew.
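Before the skew can be corrected, its angle has to be estimated. The sketch below uses a projection-profile search, one common approach that the original text does not name. It tries a range of candidate angles and keeps the one for which the foreground pixels concentrate into the fewest rows. The function names, the angle range, and the step size are illustrative assumptions. Applying the rotation itself is left to an image library.

```python
import math


def profile_score(binary, angle_deg):
    """Sum of squared row counts after rotating foreground pixels by angle_deg.

    Well-aligned text lines pile many pixels into few rows, which makes
    this score large.
    """
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    counts = {}
    for y, row in enumerate(binary):
        for x, v in enumerate(row):
            if v:
                r = round(x * sin_a + y * cos_a)  # row index after rotation
                counts[r] = counts.get(r, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(binary, max_angle=10, step=1):
    """Return the candidate angle (degrees) that best levels the text rows.

    The sign follows this sketch's own rotation convention; the returned
    value is the rotation to apply, not the skew itself.
    """
    candidates = range(-max_angle, max_angle + 1, step)
    return max(candidates, key=lambda a: profile_score(binary, a))


# A perfectly horizontal stroke needs no rotation.
flat = [[0] * 60 for _ in range(30)]
flat[10] = [1] * 60
angle = estimate_skew(flat)
```

An exhaustive search over integer degrees is coarse. A real system would typically refine the estimate with a smaller step around the best candidate.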

3. Layout Analysis

The process of dividing a document image into paragraphs and lines is called layout analysis. Because real documents are diverse and complex, there is no fixed, optimal segmentation model.
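For a simple single-column page, one way to find text lines is a horizontal projection profile: count the foreground pixels in each row and treat every run of non-empty rows as a line. The sketch below assumes a deskewed binary image (1 = foreground), and the function names are hypothetical. It would not handle multi-column layouts, tables, or figures.

```python
def find_runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries, end exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def split_lines(binary):
    """Return the row ranges of text lines in a binary page image."""
    row_profile = [sum(row) for row in binary]
    return find_runs(row_profile)


# Two "lines" of ink separated by one blank row.
page = [[0, 0],
        [1, 1],
        [1, 0],
        [0, 0],
        [0, 1]]
lines = split_lines(page)
```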

4. Character Cutting

Because of the limitations of the photo, characters are often stuck together or have broken strokes, which greatly limits the performance of the recognition system. Text recognition software therefore needs a character cutting function.
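A basic form of character cutting can be sketched with connected-component labeling on a binary line image (1 = foreground). Each 8-connected blob is treated as one character candidate. This also illustrates the difficulty described above: touching characters merge into a single blob, and a broken stroke splits one character into several. The function name and box format are assumptions.

```python
from collections import deque


def connected_components(binary):
    """Return inclusive bounding boxes (top, left, bottom, right) of
    8-connected foreground blobs, sorted from left to right."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: box[1])


# A tiny line image containing two separate glyph-like blobs.
line = [[1, 0, 0, 1, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 0, 1, 1]]
boxes = connected_components(line)
```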

5. Character Recognition

Early systems used template matching; later ones are mainly based on feature extraction. Factors such as text displacement, stroke thickness, broken strokes, adhesion, and rotation greatly increase the difficulty of feature extraction.
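The template-matching idea can be illustrated with a toy example: compare a glyph bitmap against stored templates of the same size and pick the label with the highest fraction of agreeing pixels. The 3x3 templates and function names below are invented for illustration. A real system would first normalize glyph size and position.

```python
def match_score(glyph, template):
    """Fraction of pixels that agree between two equal-sized binary bitmaps."""
    total = agree = 0
    for g_row, t_row in zip(glyph, template):
        for g, t in zip(g_row, t_row):
            total += 1
            agree += (g == t)
    return agree / total


def recognize(glyph, templates):
    """Return the label of the template that best matches the glyph."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# Hypothetical 3x3 templates for three letters.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A "T" whose bottom stroke pixel is missing (a broken stroke).
noisy_t = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
label = recognize(noisy_t, TEMPLATES)
```

Because the comparison is pixel by pixel, a shifted or rotated glyph scores poorly even when its shape is correct. This is one reason feature-based methods took over.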

6. Layout Recovery

Usually, people want the recognized text to remain arranged as in the original document image, with the paragraph structure, positional relationships, and order unchanged, and then to be output to a Word or PDF document. This process is called layout recovery.
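A much simplified sketch of the ordering part of layout recovery: given recognized words with their positions, group words with similar vertical positions into lines and sort each line from left to right. The `(x, y, text)` tuple format, the `line_tolerance` parameter, and the function name are assumptions. Real layout recovery also has to restore fonts, columns, and spacing.

```python
def recover_layout(words, line_tolerance=5):
    """Rebuild reading order from (x, y, text) tuples.

    Words are processed top to bottom; a word joins the current line when
    its y is within line_tolerance of that line's first word, otherwise
    it starts a new line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda word: word[1]):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return "\n".join(" ".join(text for _, text in sorted(items))
                     for _, items in lines)


# Recognized words in arbitrary order, with hypothetical pixel positions.
words = [(60, 12, "world"), (10, 10, "Hello"),
         (10, 40, "Second"), (70, 41, "line")]
text = recover_layout(words)
```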

7. Post-processing and Verification

The logical order of a language differs between language environments. The recognition results therefore need to be corrected according to the contextual features of the language; this process is called post-processing.
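As a simplified stand-in for context-based correction, the sketch below replaces each unknown word with its closest match in a vocabulary, using `difflib.get_close_matches` from the Python standard library. A dictionary lookup is much weaker than a real language model. The vocabulary, the cutoff value, and the function names are illustrative assumptions.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.7):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(word, vocabulary) for word in text.split())


# Typical OCR confusions: "0" for "o" and "o" for "c".
vocabulary = ["optical", "character", "recognition"]
fixed = correct_text("0ptical charaoter recognition", vocabulary)
```

Words with no sufficiently close match are left unchanged, which avoids "correcting" names and numbers that are simply missing from the vocabulary.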