PdfJbIm
PdfJbIm is an open-source tool written in Java for reducing the size of PDF documents. It builds on the JBIG2 standard, which is very effective at compressing images of scanned text. The tool is therefore well suited to digital libraries such as DML-CZ or EuDML, which contain many PDF documents with scanned text.
Optimizing a PDF document with pdfJbIm consists of several steps, which are shown in the workflow diagram below.
As the workflow diagram shows, the images extracted from the PDF are compressed according to the JBIG2 standard using the open-source encoder jbig2enc.
To improve the achieved compression ratio, improvements were made to the jbig2enc encoder. These improvements increase the perceptually lossless compression ratio by adding an extra comparison step for representative symbols.
The older improvement looks for differences between two representative symbols in the form of lines (horizontal, vertical or diagonal) and points. If no such difference is found, the two symbols are considered equivalent and are unified: all pointers to them are redirected to one of the two, and the other is removed.
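The unification step can be sketched in Java as follows. This is only an illustration of the pointer redirection described above, not code taken from jbig2enc: the class `SymbolUnifier`, the method `unify`, and the representation of pointers as an array of indices are assumptions made for the example, and the equivalence test is passed in as a placeholder predicate rather than implementing the actual line and point comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of pdfJbIm or jbig2enc.
final class SymbolUnifier {

    /**
     * Merges representative symbols that the given predicate treats as equivalent.
     *
     * @param symbols    all representative symbols found so far
     * @param refs       one entry per symbol occurrence, holding an index into symbols;
     *                   rewritten in place to index into the returned list
     * @param equivalent placeholder for the real comparison (assumed, supplied by caller)
     * @return the representatives that remain after unification
     */
    static <S> List<S> unify(List<S> symbols, int[] refs, BiPredicate<S, S> equivalent) {
        List<S> kept = new ArrayList<>();
        int[] remap = new int[symbols.size()];

        for (int i = 0; i < symbols.size(); i++) {
            S candidate = symbols.get(i);
            int target = -1;
            // Look for an already kept representative that is equivalent.
            for (int k = 0; k < kept.size(); k++) {
                if (equivalent.test(kept.get(k), candidate)) {
                    target = k;
                    break;
                }
            }
            if (target < 0) {
                // No equivalent found: the candidate stays as its own representative.
                kept.add(candidate);
                target = kept.size() - 1;
            }
            remap[i] = target;
        }

        // Redirect every pointer to the surviving representative.
        for (int r = 0; r < refs.length; r++) {
            refs[r] = remap[refs[r]];
        }
        return kept;
    }
}
```

Each candidate is compared only against the representatives already kept, so the first symbol of every equivalence group survives and later duplicates are dropped. The result depends on the predicate behaving like a true equivalence; a looser similarity test could give order-dependent results.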
A newer jbig2enc improvement, currently in development, uses the results of an OCR engine; Tesseract is currently used for this purpose. Because running OCR is not a cheap operation, recognition is run only once for each representative symbol.
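Running recognition once per representative amounts to caching the OCR result. The following Java sketch shows that idea under stated assumptions: `OcrCache` and its methods are hypothetical names, and the recognizer is an arbitrary function supplied by the caller, standing in for a call to an OCR engine such as Tesseract. It does not reflect how the jbig2enc improvement is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache, not part of pdfJbIm or jbig2enc.
final class OcrCache<S> {
    private final Function<S, String> recognizer; // stand-in for the OCR engine call
    private final Map<S, String> results = new HashMap<>();
    private int recognizerCalls = 0;

    OcrCache(Function<S, String> recognizer) {
        this.recognizer = recognizer;
    }

    /** Returns the recognized text, invoking the recognizer only on first request. */
    String textOf(S symbol) {
        if (!results.containsKey(symbol)) {
            recognizerCalls++;
            results.put(symbol, recognizer.apply(symbol));
        }
        return results.get(symbol);
    }

    /** Number of times the expensive recognizer was actually invoked. */
    int recognizerCalls() {
        return recognizerCalls;
    }
}
```

The key type `S` must have consistent `equals` and `hashCode`; in practice an integer identifier of the representative symbol would be a natural choice.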
Cite as
Text
SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of DocEng 2010 conference. Manchester, UK: ACM, 2010. p. 3–12. ISBN 978-1-4503-0231-9. doi:10.1145/1860559.1860563.
BibTeX
@inproceedings{doi:10.1145:1860559.1860563,
  author = "Petr Sojka and Radim Hatlapatka",
  title = "{Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents}",
  booktitle = "Proceedings of the ACM Conference on Document Engineering, DocEng 2010",
  publisher = "Association for Computing Machinery",
  address = "Manchester",
  year = 2010,
  month = Sep,
  isbn = "978-1-4503-0231-9",
  pages = "3--12",
  url = {http://portal.acm.org/citation.cfm?id=1860563},
  doi = {10.1145/1860559.1860563},
}
Selected Publications
- LEE, Mark, Petr SOJKA, Radim ŘEHŮŘEK, Radim HATLAPATKA, Maroš KUCBEL, Thierry BOUCHE, Claude GOUTORBE, Romeo ANGHELACHE and Krzysztof WOJCIECHOWSKI. Toolset for Entity and Semantic Associations – Final Release: Deliverable 8.4 of project EuDML. 1.0 as of 8th February 2013. : EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 13 pp. Deliverable D8.4.
- WOJCIECHOWSKI, Krzys, Petr SOJKA, Nicolas HOUILLON, Michal RŮŽIČKA, Radim HATLAPATKA, Vlastimil KREJČÍŘ, Miroslav HRDINA, Jiří SOCHOR, Pavel RYCHLÝ, Aleš HORÁK, Alan SEXTON, Gilberto PEDROSA, Franck LONTIN, Thierry BOUCHE and Maciej KOŁUDA. Toolset for Image and Text Processing and Metadata Enhancements – Final Release: Deliverable 7.4 of project EuDML. 1 as of 9th February 2013. : EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 24 pp. Deliverable D7.4.
- HATLAPATKA, Radim. JBIG2 Supported by OCR. In Petr Sojka, Michael Kohlhase. DML 2012: Towards a Digital Mathematics Library. Brno: Masaryk University, 2012.
- SOJKA, Petr and Radim HATLAPATKA. Toolset for Image and Text Processing and Metadata Editing – Initial release: Deliverable 7.2 of project EuDML. 1.0 as of 1st March 2011. : EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2011. 25 pp. Deliverable D7.2.
- HATLAPATKA, Radim. JBIG2 Supported by OCR. CEUR Workshop Proceedings, Aachen: publisher not stated, 2012, vol. 921, October, p. 82-90. ISSN 1613-0073.
- SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of MEMICS 2010 conference. Znojmo, Czech Republic: NOVPRESS s.r.o., 2010. p. 205. ISBN 978-80-87342-10-7.
- SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of DocEng 2010 conference. Manchester, UK: ACM, 2010. p. 3-12, 10 pp. ISBN 978-1-4503-0231-9. doi:10.1145/1860559.1860563.
- HATLAPATKA, Radim and Petr SOJKA. PDF Enhancements Tools for a Digital Library: pdfJbIm and pdfsign. In DML 2010 Towards a Digital Mathematics Library. First edition. Brno, Czech Republic: Masaryk University, 2010. p. 45-55, 11 pp. ISBN 978-80-210-5242-0.