PdfToTextViaOCR at MIR@MU

PdfToTextViaOCR

PdfToTextViaOCR is an open-source tool written in Java (released under the Apache License 2.0) that extracts images from PDF documents and applies OCR to them to obtain text for further use, such as indexing.

PdfToTextViaOCR was developed as part of the EuDML workflow and is used as a fallback solution when other text extraction tools fail.
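The fallback role described above can be illustrated with a minimal sketch. It is not taken from the PdfToTextViaOCR sources: the class `FallbackExtraction`, the method `extractText`, and the two stand-in extractors are hypothetical names chosen for this example, and the assumption is only that each tool either yields text or fails.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: try extractors in order and keep the first usable text. */
public class FallbackExtraction {

    /**
     * Applies each extractor to the PDF path in turn and returns the first
     * non-blank result. An extractor signals failure by returning an empty
     * Optional, returning blank text, or throwing a RuntimeException.
     */
    public static Optional<String> extractText(
            String pdfPath, List<Function<String, Optional<String>>> extractors) {
        for (Function<String, Optional<String>> extractor : extractors) {
            try {
                Optional<String> text = extractor.apply(pdfPath);
                if (text.isPresent() && !text.get().trim().isEmpty()) {
                    return text;
                }
            } catch (RuntimeException e) {
                // Treat a crash like any other failure and try the next extractor.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for real tools (assumptions for illustration only):
        // a text-layer extractor that finds nothing, then OCR as the last resort.
        Function<String, Optional<String>> textLayer = path -> Optional.empty();
        Function<String, Optional<String>> ocr = path -> Optional.of("text recognized by OCR");

        Optional<String> result = extractText("paper.pdf", List.of(textLayer, ocr));
        System.out.println(result.orElse("(no text)"));
    }
}
```

Placing the OCR-based extractor last in the list reflects the fallback ordering: OCR is usually the slowest option, so it runs only when cheaper tools return nothing.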

Images are extracted with PDFBox, an open-source library published under the Apache License 2.0. The OCR module currently uses Tesseract, an open-source tool with broad language support, including languages written in Cyrillic script.
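As a rough illustration of the OCR step, the sketch below builds a command line for the `tesseract` executable, which takes an input image, an output base name, and an optional `-l` language list joined with `+`. How PdfToTextViaOCR actually calls Tesseract is not stated here, so invoking the command-line tool is an assumption, and the class `TesseractCommand`, its methods, and the file names are hypothetical. The image extraction step is left out because it depends on the external PDFBox library.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: build and optionally run a Tesseract command for one page image. */
public class TesseractCommand {

    /**
     * Builds: tesseract IMAGE OUTPUTBASE [-l lang1+lang2...].
     * Tesseract writes the recognized text to OUTPUTBASE.txt.
     */
    public static List<String> build(String imagePath, String outputBase, List<String> languages) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(imagePath);
        command.add(outputBase);
        if (!languages.isEmpty()) {
            command.add("-l");
            // Several languages are combined with '+', e.g. "eng+rus".
            command.add(String.join("+", languages));
        }
        return command;
    }

    /** Runs the command and returns its exit code; requires tesseract on the PATH. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command, so the example works without Tesseract installed.
        List<String> command = build("page-001.png", "page-001", List.of("eng", "rus"));
        System.out.println(String.join(" ", command));
        // prints: tesseract page-001.png page-001 -l eng+rus
    }
}
```

The language codes `eng` and `rus` stand for English and Russian trained data; the matching data files must be installed for Tesseract to use them.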


Cite as

Text

WOJCIECHOWSKI, Krzys, Petr SOJKA, Nicolas HOUILLON, Michal RŮŽIČKA, Radim HATLAPATKA, Vlastimil KREJČÍŘ, Miroslav HRDINA, Jiří SOCHOR, Pavel RYCHLÝ, Aleš HORÁK, Alan SEXTON, Gilberto PEDROSA, Franck LONTIN, Thierry BOUCHE and Maciej KOŁUDA. Toolset for Image and Text Processing and Metadata Enhancements — Final Release: Deliverable 7.4 of project EuDML. As of 9th February 2013. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 24 pp. Deliverable D7.4.

BibTeX

@misc{eudml:d7.4,
  author = "Krzys Wojciechowski and Petr Sojka and Nicolas Houillon
            and Michal Růžička and Radim Hatlapatka
            and Vlastimil Krejčíř and Miroslav Hrdina
            and Jiří Sochor and Pavel Rychlý and Aleš Horák
            and Alan Sexton and Gilberto Pedrosa
            and Franck Lontin and Thierry Bouche",
  title  = "{Toolset for Image and Text Processing and Metadata
            Enhancements -- Final Release}",
  year   = 2012,
  month  = Mar,
  note   = {Deliverable D7.4 of EU CIP-ICT-PSP project 250503
            \href{http://project.eudml.eu/}
            {EuDML: The European Digital Mathematics Library}},
  url    = {https://project.eudml.org/sites/default/files/D7.4.pdf},
}
