Survey of Post-OCR Processing Approaches.
In: ACM Computing Surveys, vol. 54 (2022-07-01), no. 6, pp. 1-37
Online
academicJournal
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies current trends and outlines some research directions in this field. [ABSTRACT FROM AUTHOR]
Title: | Survey of Post-OCR Processing Approaches. |
---|---|
Author / Contributor: | THI TUYET HAI, NGUYEN ; JATOWT, ADAM ; COUSTATY, MICKAEL ; DOUCET, ANTOINE |
Journal: | ACM Computing Surveys, vol. 54 (2022-07-01), no. 6, pp. 1-37 |
Published: | 2022 |
Media type: | academicJournal |
ISSN: | 0360-0300 (print) |
DOI: | 10.1145/3453476 |