PdfToTextViaOCR
PdfToTextViaOCR is an open-source tool written in Java and released under the Apache License 2.0. It extracts images from PDF documents and runs OCR on them to produce text for further use, such as indexing.
PdfToTextViaOCR was developed as part of the EuDML workflow, where it serves as a fallback when other text extraction tools fail.
Images are extracted with PDFBox, an open-source library published under the Apache License 2.0. The OCR module currently used is Tesseract, an open-source tool with broad language support, including languages written in Cyrillic script.
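The OCR half of this pipeline can be sketched as follows. This is only an illustration, not the tool's actual code: the class and method names are hypothetical, the page images are assumed to have been extracted from the PDF already (the PDFBox step is omitted), and the `tesseract` executable is assumed to be on the PATH with the requested language data installed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: OCR already-extracted page images with the Tesseract CLI.
final class OcrFallbackSketch {

    // Builds the command line: tesseract <image> <outputBase> -l <lang>
    static List<String> tesseractCommand(Path image, Path outputBase, String lang) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs Tesseract on one page image and returns the recognized text.
    static String ocrPage(Path image, String lang) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr-page", "");
        Process process = new ProcessBuilder(tesseractCommand(image, outputBase, lang))
                .inheritIO()
                .start();
        if (process.waitFor() != 0) {
            throw new IOException("tesseract failed on " + image);
        }
        // Tesseract appends ".txt" to the output base name.
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        return Files.readString(textFile, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: OcrFallbackSketch <lang> <page-image>...");
            return;
        }
        StringBuilder text = new StringBuilder();
        // args[0] is a Tesseract language code such as "eng" or "rus".
        for (int i = 1; i < args.length; i++) {
            text.append(ocrPage(Path.of(args[i]), args[0])).append('\n');
        }
        System.out.print(text);
    }
}
```

Passing a language code such as `rus` selects Tesseract's Russian model, which is how Cyrillic documents would be handled in a setup like this one.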
Cite as
Text
WOJCIECHOWSKI, Krzys, Petr SOJKA, Nicolas HOUILLON, Michal RŮŽIČKA, Radim HATLAPATKA, Vlastimil KREJČÍŘ, Miroslav HRDINA, Jiří SOCHOR, Pavel RYCHLÝ, Aleš HORÁK, Alan SEXTON, Gilberto PEDROSA, Franck LONTIN, Thierry BOUCHE and Maciej KOŁUDA. Toolset for Image and Text Processing and Metadata Enhancements — Final Release: Deliverable 7.4 of project EuDML. As of 9th February 2013. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 24 pp. Deliverable D7.4.
BibTeX
@misc{eudml:d7.4,
  author = "Krzys Wojciechowski and Petr Sojka and Nicolas Houillon and Michal Růžička and Radim Hatlapatka and Vlastimil Krejčíř and Miroslav Hrdina and Jiří Sochor and Pavel Rychlý and Aleš Horák and Alan Sexton and Gilberto Pedrosa and Franck Lontin and Thierry Bouche",
  title = "{Toolset for Image and Text Processing and Metadata Enhancements -- Final Release}",
  year = 2012,
  month = Mar,
  note = {Deliverable D7.4 of EU CIP-ICT-PSP project 250503 \href{http://project.eudml.eu/} {EuDML: The European Digital Mathematics Library}},
  url = {https://project.eudml.org/sites/default/files/D7.4.pdf},
}
Selected Publications
- LEE, Mark, Petr SOJKA, Radim ŘEHŮŘEK, Radim HATLAPATKA, Maroš KUCBEL, Thierry BOUCHE, Claude GOUTORBE, Romeo ANGHELACHE and Krzysztof WOJCIECHOWSKI. Toolset for Entity and Semantic Associations – Final Release: Deliverable 8.4 of project EuDML. Version 1.0, as of 8th February 2013. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 13 pp. Deliverable D8.4.
- WOJCIECHOWSKI, Krzys, Petr SOJKA, Nicolas HOUILLON, Michal RŮŽIČKA, Radim HATLAPATKA, Vlastimil KREJČÍŘ, Miroslav HRDINA, Jiří SOCHOR, Pavel RYCHLÝ, Aleš HORÁK, Alan SEXTON, Gilberto PEDROSA, Franck LONTIN, Thierry BOUCHE and Maciej KOŁUDA. Toolset for Image and Text Processing and Metadata Enhancements – Final Release: Deliverable 7.4 of project EuDML. Version 1, as of 9th February 2013. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 24 pp. Deliverable D7.4.
- HATLAPATKA, Radim. JBIG2 Supported by OCR. In Petr Sojka, Michael Kohlhase. DML 2012: Towards a Digital Mathematics Library. Brno: Masaryk University, 2012.
- SOJKA, Petr and Radim HATLAPATKA. Toolset for Image and Text Processing and Metadata Editing – Initial release: Deliverable 7.2 of project EuDML. Version 1.0, as of 1st March 2011. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2011. 25 pp. Deliverable D7.2.
- HATLAPATKA, Radim. JBIG2 Supported by OCR. CEUR Workshop Proceedings, Aachen: publisher not stated, 2012, vol. 921, October, pp. 82-90. ISSN 1613-0073.
- SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of MEMICS 2010 conference. Znojmo, Czech Republic: NOVPRESS s.r.o., 2010. p. 205. ISBN 978-80-87342-10-7.
- SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of DocEng 2010 conference. Manchester, UK: ACM, 2010. pp. 3-12, 10 pp. ISBN 978-1-4503-0231-9. doi:10.1145/1860559.1860563.
- HATLAPATKA, Radim and Petr SOJKA. PDF Enhancements Tools for a Digital Library: pdfJbIm and pdfsign. In DML 2010 Towards a Digital Mathematics Library. First edition. Brno, Czech Republic: Masaryk University, 2010. pp. 45-55, 11 pp. ISBN 978-80-210-5242-0.