Akom's Tech Ruminations

Various tech outbursts - code and solutions to practical problems

OCR on a large PDF using tesseract and pdftk

Posted by Admin • Thursday, January 19. 2017 • Category: Linux

This turned out to be harder than I thought. I found a large (50MB) PDF with about 50 pages, and none of the tesseract GUIs seemed to be able to handle it without crashing. The obvious fix was to convert the PDF to TIFF so that command-line tesseract could handle it directly, but ImageMagick couldn't manage that conversion either: it ran out of memory (even with the limit settings). So the only option was to reduce the load on all the moving parts by splitting the PDF into pages.

Even after splitting the PDF and running each page through the PDF->TIFF->Tesseract->PDF chain, I was still having issues:

Error in pixReadFromTiffStream: spp not in set {1,3,4}

Huh? It turns out that you can sometimes wind up with an alpha channel in your TIFF, and tesseract can't handle this. Fortunately, there is a solution: flatten the image and drop the alpha channel during the conversion. So finally, I put all of these steps together into a script.

This script takes the input PDF and:
  1. splits it into pages
  2. converts each page to a TIFF format that tesseract can actually handle, removing alpha channels, etc
  3. performs OCR
  4. outputs each individual page as PDF
  5. reassembles the PDF
Surprisingly (or not?) my resulting PDF turned out to be 10 times smaller than the original! I guess the DPI was insane.
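
To check whether a particular TIFF would trigger the spp error mentioned earlier, one option is to look at its samples-per-pixel count (which is what spp stands for) before handing it to tesseract. The sketch below assumes tiffinfo from the libtiff tools is installed, which is not among the prerequisites for the main script; the function names read_spp and spp_supported are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Print the Samples/Pixel value found in `tiffinfo` output read from stdin.
read_spp() {
    awk '/Samples\/Pixel/ {print $NF; exit}'
}

# Succeed only for the sample counts the error message lists: 1, 3 or 4.
spp_supported() {
    case "$1" in
        1|3|4) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example use (assumes tiffinfo is installed and temp.tiff exists):
#   spp=$(tiffinfo temp.tiff | read_spp)
#   spp_supported "$spp" || echo "temp.tiff has $spp samples per pixel; flatten it first"
```

A grayscale page with an alpha channel, for example, has 2 samples per pixel, which falls outside the accepted set. Flattening onto a white background and switching alpha off brings it back to 1.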

Prerequisites:
  1. tesseract
  2. imagemagick
  3. pdftk
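
A quick pre-flight check for these tools, separate from the main script, can save a confusing failure halfway through a long run. This is only a sketch: missing_tools is a made-up helper name, and it assumes ImageMagick is installed under its classic convert command name.

```shell
#!/bin/bash
# Hypothetical pre-flight helper, not part of the original script.

# Print the name of every argument that is not an executable on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example use: tesseract and pdftk ship binaries of the same name;
# ImageMagick provides the `convert` command.
#   missing=$(missing_tools tesseract convert pdftk)
#   [ -z "$missing" ] || { echo "Missing: $missing" >&2; exit 1; }
```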
And here is the script:

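#!/bin/bash -e

if [ -z "$2" ] ; then
        echo "usage: $0 PDF NUMPAGES" >&2
        exit 1
fi

PDF="$1"
NUM="$2"

# Zero-padded page numbers keep the per-page files in order for the final glob
for PAGE in $(seq -f "%05g" 1 "$NUM") ; do
        echo "Processing page $PAGE"
        # Extract a single page
        pdftk "$PDF" cat "$PAGE" output temp.pdf
        echo "Split PDF"
        # Rasterize at 300 DPI and flatten onto white; +matte drops the alpha channel
        convert -density 300 temp.pdf -depth 8  -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte temp.tiff
        echo "Converted to TIFF"
        # OCR the page; tesseract appends .pdf to the output base name
        tesseract temp.tiff tmp.pdf_"${PAGE}" pdf
        rm -f temp.tiff temp.pdf
done

# Reassemble the per-page PDFs in glob (page) order; with -e a failure stops here
pdftk tmp.pdf_*.pdf output ocr-output.pdf
rm -f tmp.pdf_*.pdf
echo "Output written to ocr-output.pdf"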
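
The zero padding from seq -f "%05g" matters for the reassembly step: the shell expands tmp.pdf_*.pdf in sorted order, and without padding page 10 would sort before page 2. A small sketch of the difference, using made-up helper names (page_file, unpadded_order, padded_order) that are not part of the script:

```shell
#!/bin/bash
# Hypothetical helper: build the per-page file name the script's glob expects.
page_file() {
    printf 'tmp.pdf_%05d.pdf\n' "$1"
}

# Unpadded names sort lexically, so page 10 lands before page 2.
unpadded_order() {
    printf 'tmp.pdf_%s.pdf\n' "$@" | LC_ALL=C sort
}

# Padded names sort in page order.
padded_order() {
    local n
    for n in "$@"; do page_file "$n"; done | LC_ALL=C sort
}

# Example use:
#   unpadded_order 2 10   # tmp.pdf_10.pdf comes first
#   padded_order 2 10     # tmp.pdf_00002.pdf comes first
```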
 
So why does the script need you to provide the number of pages? I could have used pdfinfo in the script, but in my testing pdfinfo returned an incorrect page count (it told me that my 50-page PDF had 820 pages!). It was easier to just ask the user.
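
Since pdftk is already a prerequisite, one alternative worth trying is its dump_data operation, which prints a NumberOfPages line. Whether it would do any better than pdfinfo on a PDF like this one is untested, so treat this as a sketch; parse_num_pages is a made-up helper name and input.pdf is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: pull the page count out of `pdftk FILE dump_data` output on stdin.
parse_num_pages() {
    awk '$1 == "NumberOfPages:" {print $2; exit}'
}

# Example use (assumes input.pdf exists):
#   NUM=$(pdftk input.pdf dump_data | parse_num_pages)
#   echo "pdftk reports $NUM pages"
```

If the reported number looks wrong, passing the page count by hand as the script does remains the safe fallback.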
