OCR on a large PDF using tesseract and pdftk
Posted by Admin • Thursday, January 19. 2017 • Category: Linux
This turned out to be harder than I thought. I had a large (50MB) PDF with about 50 pages, and none of the tesseract GUIs seemed to be able to handle it without crashing. The obvious fix was to convert the PDF to TIFF so that command-line tesseract could handle it directly, but ImageMagick couldn't do that conversion either: it ran out of memory (even with the limit settings). So the only option was to reduce the load on all the moving parts by splitting the PDF into pages.
Even after splitting the PDF and running each page through the PDF->TIFF->Tesseract->PDF chain I was still having issues:
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Huh? So it turns out that sometimes you may wind up with an alpha channel in your TIFF, and tesseract can't handle this. There is a solution, fortunately: flatten the image onto a white background and drop the alpha channel during the conversion. So finally, I put all of these steps together into a script:
This script takes the input PDF and:
- splits it into pages
- converts each page to a TIFF format that tesseract can actually handle, removing alpha channels, etc
- performs OCR
- outputs each individual page as PDF
- reassembles the PDF
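To check whether a converted page would still trigger the spp error before handing it to tesseract, you can look at the TIFF's samples-per-pixel value. This is only a minimal sketch: it assumes tiffinfo from the libtiff tools is installed (it is not one of the prerequisites for the script), and spp_supported is a hypothetical helper name, not part of the script.

```shell
# Hypothetical helper: reads `tiffinfo` output on stdin and succeeds only if
# the first Samples/Pixel value is in the set {1,3,4} named by the error message.
spp_supported() {
    spp=$(awk -F: '/Samples\/Pixel/ { gsub(/[^0-9]/, "", $2); print $2; exit }')
    case "$spp" in
        1|3|4) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example use inside the page loop (assumes tiffinfo is available):
#   tiffinfo temp.tiff | spp_supported || echo "temp.tiff has an unsupported channel count" >&2
```

The helper only parses text, so it can be tried on saved tiffinfo output without touching any image files.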
Surprisingly (or not?) my resulting PDF turned out to be 10 times smaller than the original! I guess the DPI was insane.
Prerequisites:
- tesseract
- imagemagick
- pdftk
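If you want the script to fail early when one of these is missing, a check along the following lines could go at the top. It is a sketch, not part of the original script: missing_tools is a hypothetical helper name, and convert is the ImageMagick command the script calls.

```shell
# Hypothetical helper: prints the name of each listed command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example use at the top of the script:
#   missing=$(missing_tools tesseract convert pdftk)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```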
And here is the script:
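#!/bin/bash -e
if [ -z "$2" ] ; then
    echo "usage: $0 PDF NUMPAGES" >&2
    exit 1
fi
PDF="$1"
NUM="$2"
# Zero-pad the page numbers so the tmp.pdf_* glob at the end sorts in page order
for PAGE in $(seq -f "%05g" 1 "$NUM") ; do
    echo "Processing page $PAGE"
    # Extract a single page
    pdftk "$PDF" cat "$PAGE" output temp.pdf
    echo "Split PDF"
    # Render at 300 DPI, flatten onto white and drop the alpha channel
    convert -density 300 temp.pdf -depth 8 -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte temp.tiff
    echo "Converted to TIFF"
    # tesseract appends .pdf to the output base name
    tesseract temp.tiff tmp.pdf_"${PAGE}" pdf
    rm -f temp.tiff temp.pdf
done
# Merge the OCRed pages; pdftk needs the cat operation to combine several inputs
pdftk tmp.pdf_*.pdf cat output ocr-output.pdf && rm -f tmp.pdf_*.pdf
echo "Output written to ocr-output.pdf"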
So why does the script need you to provide the number of pages? I could have used pdfinfo in the script, but in my testing, pdfinfo returned incorrect values for the page count (it told me that my 50-page PDF had 820 pages!). Easier to just ask the user.
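Since pdftk is already a prerequisite, another option would be to read the page count from its dump_data report, which includes a NumberOfPages line. This is only a sketch: page_count_from_dump is a hypothetical helper name, and I have not checked whether pdftk reports the right number for a PDF that confuses pdfinfo, so treat the result with the same suspicion.

```shell
# Hypothetical helper: reads `pdftk FILE dump_data` output on stdin
# and prints the value of the first NumberOfPages line.
page_count_from_dump() {
    awk '$1 == "NumberOfPages:" { print $2; exit }'
}

# Example use in place of the NUMPAGES argument:
#   NUM=$(pdftk "$PDF" dump_data | page_count_from_dump)
```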
Hi, your script is awesome!
I've made some changes to it, take a look: https://github.com/vicentereis/pdf-ocr-converter