Tesseract OCR for Mac OS X

2/25/2023

Digitization of historical documents is an important task for preserving our cultural heritage. During the last few decades, the amount of digitized archival material has increased rapidly, and an efficient method to convert these document images into text form has therefore become essential to allow information retrieval and knowledge extraction on such data. State-of-the-art methods are usually not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents, which are very expensive and time-consuming to acquire. The presented complete OCR system includes two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in the relevant fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim at determining the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.

As the number of digitized historical documents has increased rapidly during the last few decades, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods depend on optical character recognition (OCR), which converts the document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data.
