
PdfJbIm

PdfJbIm is an open-source tool written in Java for reducing the size of PDF documents. It takes advantage of the JBIG2 standard, which is extremely powerful for compressing images of scanned text. The tool is therefore well suited to digital libraries such as DML-CZ or EuDML, which contain many PDF documents with scanned text.

The process of optimizing a PDF document with pdfJbIm consists of several steps, shown in the workflow diagram below.

[Figure: pdfJbIm workflow diagram]

As the workflow diagram shows, the open-source encoder jbig2enc is used to compress the extracted images according to the JBIG2 standard.

To improve the achieved compression ratio, improvements were made to the jbig2enc encoder. These improvements increase the perceptually lossless compression ratio by adding a further comparison process for representative symbols.

The older improvement looks for differences between two representative symbols in the form of lines (horizontal, vertical, or diagonal) and points. If no such difference is found, the two images are considered equivalent and the two representative symbols are unified: all pointers to these representatives are redirected to one of them, and the other is removed.
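The unification step can be illustrated with a minimal Java sketch. It is not the jbig2enc code: the class name SymbolUnifier, the bitmap representation (boolean[][]), and the pointer array are assumptions made for illustration. The equivalence test is passed in as a parameter, and pixelDiffAtMost is only a simple stand-in for the line-and-point comparison described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of unifying equivalent representative symbols.
final class SymbolUnifier {

    // Stand-in equivalence test (an assumption, not the real comparison):
    // same dimensions and at most maxDiff differing pixels.
    static BiPredicate<boolean[][], boolean[][]> pixelDiffAtMost(int maxDiff) {
        return (a, b) -> {
            if (a.length != b.length) return false;
            int diff = 0;
            for (int y = 0; y < a.length; y++) {
                if (a[y].length != b[y].length) return false;
                for (int x = 0; x < a[y].length; x++) {
                    if (a[y][x] != b[y][x] && ++diff > maxDiff) return false;
                }
            }
            return true;
        };
    }

    // pointers[k] is the index of the representative used by symbol
    // occurrence k. Returns the surviving representatives and rewrites
    // pointers in place so that they index into the returned list.
    static List<boolean[][]> unify(List<boolean[][]> reps, int[] pointers,
                                   BiPredicate<boolean[][], boolean[][]> equivalent) {
        int n = reps.size();
        int[] target = new int[n]; // target[j] = representative that replaces j
        for (int i = 0; i < n; i++) target[i] = i;

        for (int i = 0; i < n; i++) {
            if (target[i] != i) continue; // i was already merged into another
            for (int j = i + 1; j < n; j++) {
                if (target[j] == j && equivalent.test(reps.get(i), reps.get(j))) {
                    target[j] = i; // j is unified with i and will be removed
                }
            }
        }

        // Compact the list: keep only representatives that map to themselves.
        int[] newIndex = new int[n];
        List<boolean[][]> kept = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (target[i] == i) {
                newIndex[i] = kept.size();
                kept.add(reps.get(i));
            }
        }

        // Redirect every pointer to the surviving representative.
        for (int k = 0; k < pointers.length; k++) {
            pointers[k] = newIndex[target[pointers[k]]];
        }
        return kept;
    }
}
```

In this sketch each candidate is compared only against representatives that have already been kept, so the result depends on the order of the list; a real encoder may choose which of two equivalent symbols survives differently.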

A newer jbig2enc improvement, which uses the results of an OCR engine, is now in development. Tesseract is currently used for this purpose. Because running OCR is not a cheap operation, recognition of each representative is run only once.
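The "run just once" rule amounts to caching the OCR result per representative. The following Java sketch shows the idea under stated assumptions: CachedRecognizer is a hypothetical class, representatives are identified by an integer index, and the OCR engine is abstracted as a plain function, so no actual Tesseract API is called here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: memoize OCR output so each representative
// is recognized at most once.
final class CachedRecognizer {
    private final IntFunction<String> ocr;              // stand-in for the OCR engine call
    private final Map<Integer, String> cache = new HashMap<>();
    private int engineCalls = 0;                         // how often the engine really ran

    CachedRecognizer(IntFunction<String> ocr) {
        this.ocr = ocr;
    }

    // Returns the recognized text for a representative, invoking the
    // (expensive) engine only on the first request for that index.
    String textOf(int representativeId) {
        if (!cache.containsKey(representativeId)) {
            engineCalls++;
            cache.put(representativeId, ocr.apply(representativeId));
        }
        return cache.get(representativeId);
    }

    int engineCalls() {
        return engineCalls;
    }
}
```

With such a cache, a representative that takes part in many pairwise comparisons still costs a single OCR pass.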

Cite as

Text

SOJKA, Petr and Radim HATLAPATKA. Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents. In Proceedings of DocEng 2010 conference. Manchester, UK: ACM, 2010. p. 3–12. ISBN 978-1-4503-0231-9. doi:10.1145/1860559.1860563.

BibTeX

@inproceedings{doi:10.1145:1860559.1860563,
     author = "Petr Sojka and Radim Hatlapatka",
      title = "{Document Engineering for a Digital Library: PDF recompression 
                using JBIG2 and other optimization of PDF documents}",
  booktitle = "Proceedings of the ACM Conference on Document Engineering,
	       DocEng 2010",
  publisher = "Association for Computing Machinery",
    address = "Manchester",
       year = 2010,
      month = Sep,
       isbn = "978-1-4503-0231-9",
      pages = "3--12",
        url = {http://portal.acm.org/citation.cfm?id=1860563},
	doi = {10.1145/1860559.1860563},
}
		

Selected Publications


Relevant projects
