OCR on a large PDF using tesseract and pdftk

Posted by Admin • Thursday, January 19. 2017 • Category: Linux

This turned out to be harder than I thought. I found a large (50MB) PDF with about 50 pages, and none of the tesseract GUIs seemed to be able to handle it without crashing. The obvious fix was to convert the PDF to TIFF so that command-line tesseract could handle it directly, but ImageMagick couldn't manage that conversion either: it ran out of memory (even with the limit settings). So the only option was to reduce the load on all the moving parts by splitting the PDF into pages.

Even after splitting the PDF and running each page through the PDF->TIFF->Tesseract->PDF chain I was still having issues:

Error in pixReadFromTiffStream: spp not in set {1,3,4}

Huh? So it turns out that sometimes you may wind up with an alpha channel in your TIFF and tesseract can't handle this. There is a solution, fortunately. So finally, I put all of these steps together into a script:
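To make that error a little less cryptic: "spp" is the TIFF samples-per-pixel count, and the message says only 1, 3 or 4 are accepted. A grayscale page with an alpha channel, for example, has 2. Here is a minimal sketch for checking a TIFF before handing it to tesseract. It assumes tiffinfo (from libtiff) is installed and prints a "Samples/Pixel:" line; the function names are made up for illustration.

```shell
#!/bin/bash

# Succeeds only for the samples-per-pixel values listed in the error message.
spp_supported() {
    case "$1" in
        1|3|4) return 0 ;;
        *)     return 1 ;;
    esac
}

# Reads tiffinfo output on stdin and prints the first samples-per-pixel value.
# Assumes a line of the form "  Samples/Pixel: 2".
spp_from_tiffinfo() {
    sed -n 's|^[[:space:]]*Samples/Pixel:[[:space:]]*\([0-9][0-9]*\).*|\1|p' | head -n 1
}

# Hypothetical usage (needs tiffinfo on the PATH):
#   spp=$(tiffinfo temp.tiff | spp_from_tiffinfo)
#   spp_supported "$spp" || echo "temp.tiff has $spp samples per pixel, flatten it first"
```

If the check fails, flattening onto a white background and dropping the alpha channel, as the convert call in the script does, should bring the count back to 1 or 3.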

This script takes the input PDF and:

splits it into pages
converts each page to a TIFF format that tesseract can actually handle, removing alpha channels, etc
performs OCR
outputs each individual page as PDF
reassembles the PDF

Surprisingly (or not?) my resulting PDF turned out to be 10 times smaller than the original! I guess the DPI was insane.

Prerequisites:

tesseract
imagemagick
pdftk
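If you want the script to fail early when one of these is missing, something like the following could go near the top. This is only a sketch: missing_tools is a made-up helper name, and it assumes ImageMagick is installed under its classic convert command.

```shell
#!/bin/bash

# Prints the name of every listed command that is not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Hypothetical usage:
#   missing=$(missing_tools tesseract convert pdftk)
#   [ -z "$missing" ] || { echo "Missing: $missing" >&2; exit 1; }
```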

And here is the script:

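#!/bin/bash -e

# Both arguments are required: the input PDF and its page count.
if [ -z "$2" ] ; then
    echo "usage: $0 PDF NUMPAGES" >&2
    exit 1
fi

PDF="$1"
NUM="$2"

# Zero-padded page numbers keep the per-page files in order for the final glob.
for PAGE in $(seq -f "%05g" 1 "$NUM") ; do
    echo "Processing page $PAGE"
    # Extract a single page.
    pdftk "$PDF" cat "$PAGE" output temp.pdf
    echo "Split PDF"
    # Render at 300 DPI, flatten onto white and drop the alpha channel (+matte).
    convert -density 300 temp.pdf -depth 8 -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte temp.tiff
    echo "Converted to TIFF"
    # OCR the page; tesseract writes tmp.pdf_<PAGE>.pdf
    tesseract temp.tiff tmp.pdf_"${PAGE}" pdf
    rm -f temp.tiff temp.pdf
done

# Join the per-page PDFs, then clean up only if the join succeeded.
pdftk tmp.pdf_*.pdf output ocr-output.pdf && rm -f tmp.pdf_*.pdf
echo "Output written to ocr-output.pdf"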

So why does the script need you to provide the number of pages? I could have used pdfinfo in the script, but in my testing pdfinfo returned an incorrect page count (it told me that my 50-page PDF had 820 pages!). It was easier to just ask the user.
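One alternative I have not tried on this particular file: pdftk is already a prerequisite, and its dump_data operation prints a "NumberOfPages:" line. A rough sketch of pulling the count from that output follows. The helper name is made up, and there is no guarantee it would do any better than pdfinfo on a PDF like mine.

```shell
#!/bin/bash

# Reads `pdftk FILE dump_data` output on stdin and prints the page count.
pages_from_dump_data() {
    awk '$1 == "NumberOfPages:" { print $2; exit }'
}

# Hypothetical usage:
#   NUM=$(pdftk "$PDF" dump_data | pages_from_dump_data)
```

If the number looks wrong, passing NUMPAGES by hand is still the safe fallback.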


1 Comments


Vicente reis on 2018-04-17 18:20:
Hi, your script is awesome!

I've made some changes into it, take a look: https://github.com/vicentereis/pdf-ocr-converter


Akom's Tech Ruminations