Wednesday, March 18, 2015

Quick and dirty PDF OCR in Fedora 21

Every so often you wind up with a scanned PDF of mostly text. Performing OCR allows you to both search and select text in the document. For one-off scans, a quick OCR is usually sufficient (no training, no skew correction, etc.). For those cases, I usually recommend pdfocr. Unfortunately for Fedora users, pdfocr uses PDFtk (which Fedora does not package due to its dependence on gcj) and ExactImage, which no longer has a maintainer. Read on for a quick workaround.


The first issue is PDFtk. Since pdfocr already uses Poppler for converting PDF files to PPM before feeding them to an OCR engine, I forked pdfocr to use Poppler in place of PDFtk for its PDF split and join operations.
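
As a rough illustration of what splitting and joining with Poppler looks like, here is a minimal sketch using the pdfseparate and pdfunite tools that ship with Poppler. The function name split_and_join and the page-N.pdf naming are made up for this example; this is not the code from the fork itself.

```shell
#!/bin/sh
# Sketch: split a PDF into single-page files with pdfseparate, then join
# them back together with pdfunite. Names here are illustrative only.
split_and_join() {
    input=$1
    output=$2
    workdir=$(mktemp -d) || return 1

    # pdfseparate replaces %d in the pattern with the page number.
    if ! pdfseparate "$input" "$workdir/page-%d.pdf"; then
        rm -rf "$workdir"
        return 1
    fi

    # Collect the pages in numeric order. A plain glob would sort
    # page-10.pdf before page-2.pdf, so count upwards instead.
    n=1
    set --
    while [ -f "$workdir/page-$n.pdf" ]; do
        set -- "$@" "$workdir/page-$n.pdf"
        n=$((n + 1))
    done

    # pdfunite takes the source files first and the destination last.
    pdfunite "$@" "$output"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Calling split_and_join scan.pdf rejoined.pdf should produce a copy of scan.pdf that has been split into pages and reassembled, assuming both Poppler tools are installed.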

The second issue is the packaging of ExactImage, which pdfocr uses to merge the hOCR files generated by the OCR engine with the PPM files generated from the PDFs. I suspect ExactImage no longer has a package maintainer since it needs to be patched to support the libpng and giflib versions used by Fedora. I ported (stole wholesale) the patches used by Debian, and put them together to build the necessary RPM.

I put the RPMs for my pdfocr fork and ExactImage in my pdftools copr repository. Simply enable the repo, install pdfocr, then you can quickly OCR a PDF with:

pdfocr -i [input].pdf -o [output].pdf

I find the CuneiForm OCR tool does a better job than Tesseract (likely due to the default configuration rather than the actual quality of the OCR engine). Add the '-c' option if you want to try it. As with all things, your mileage may vary!
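
If you want to compare the two engines on your own scan, something like the following sketch may help. It runs pdfocr once without '-c' and once with it, then dumps the text layer of each result using pdftotext from Poppler so the two can be diffed. The helper names output_name and ocr_compare, and the "-default" and "-cuneiform" file suffixes, are invented for this example.

```shell
#!/bin/sh
# Build an output name such as "scan-cuneiform.pdf" from "scan.pdf".
output_name() {
    printf '%s-%s.pdf\n' "${1%.pdf}" "$2"
}

# Sketch: OCR the same input with and without -c, then extract the text
# layer of each result for a side-by-side comparison.
ocr_compare() {
    input=$1
    # Without -c, pdfocr uses whichever engine is its default.
    pdfocr -i "$input" -o "$(output_name "$input" default)" || return 1
    # With -c, pdfocr uses CuneiForm.
    pdfocr -c -i "$input" -o "$(output_name "$input" cuneiform)" || return 1

    for engine in default cuneiform; do
        pdftotext "$(output_name "$input" "$engine")" "${input%.pdf}-$engine.txt"
    done
}
```

After ocr_compare scan.pdf, running diff on scan-default.txt and scan-cuneiform.txt gives a quick feel for which engine suits the document better.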

7 comments:

  1. Hi there! I found your repo quite useful. Seems to work OK for Fedora 24. I ended up hacking pdfocr to get decent results.
    Some notes you may find helpful:

    pdfseparate crashed for me sometimes. I just got rid of that call and used pdftoppm's built-in page args. This is slower, but seemed more reliable.

    As you did, I found that CuneiForm worked better than Tesseract, though not by very much. After looking at the hOCR generated by the tools, I decided that hocr2pdf (the ExactImage program) was not interpreting the hOCR from either tool in a sane way. After poking around a bit, I found that Tesseract now seems to have a PDF output mode that works and behaves pretty much like you would expect. So I just output the PDF from Tesseract and skipped hocr2pdf. The results were quite good.

    The pdf I was trying to index was a scanned HP16C manual, in case it matters.
    Replies
    1. Interesting. When I have some time, maybe I'll update the scripts with your changes. Thanks for the heads up!

      Nice calculator BTW, think I have one around here somewhere...
    2. I pushed pdfocr 0.2.0, which has the changes you mention above (direct PDF output for Tesseract, pdftoppm page splitting). It also has a workaround for the hOCR output, which I took from here: https://bugs.launchpad.net/cuneiform-linux/+bug/623438/comments/60

      Finally, if delete files is enabled, it cleans up as it goes so it doesn't use so much RAM when OCRing big files.

      Thanks again for finding all this!
  2. I should be so lucky. I'm just using an emulator for now.
  3. This comment has been removed by the author.
  4. This comment has been removed by the author.
  5. Great! Thanks for submitting it. I'll try it out when I get a chance. (sorry for the multiple comments).
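
For readers curious about the approach discussed in the comments (page-by-page rendering with pdftoppm, direct PDF output from Tesseract, and cleaning up as you go), here is a minimal sketch. It is not the pdfocr 0.2.0 code. The function name ocr_pages, the 300 DPI resolution, and the file naming are assumptions made for this example, and it needs a Tesseract build that supports the pdf output config.

```shell
#!/bin/sh
# Sketch: OCR a PDF one page at a time and join the results.
# Assumes pdfinfo, pdftoppm and pdfunite (Poppler) plus tesseract.
ocr_pages() {
    input=$1
    output=$2
    workdir=$(mktemp -d) || return 1

    # pdfinfo prints a "Pages:" line with the page count.
    pages=$(pdfinfo "$input" 2>/dev/null | awk '/^Pages:/ {print $2}')
    if [ -z "$pages" ]; then
        rm -rf "$workdir"
        return 1
    fi

    n=1
    set --
    while [ "$n" -le "$pages" ]; do
        # -f/-l select a single page; -singlefile keeps the output name
        # free of page-number suffixes, so we get page-N.ppm exactly.
        pdftoppm -r 300 -f "$n" -l "$n" -singlefile "$input" "$workdir/page-$n" &&
            tesseract "$workdir/page-$n.ppm" "$workdir/page-$n" pdf ||
            { rm -rf "$workdir"; return 1; }
        # Delete the large PPM as soon as its page has been processed.
        rm -f "$workdir/page-$n.ppm"
        set -- "$@" "$workdir/page-$n.pdf"
        n=$((n + 1))
    done

    pdfunite "$@" "$output"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Rendering each page separately is slower than splitting the PDF once, as noted in the comments, but only one uncompressed page image exists on disk at any time.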