OCR on a large PDF using tesseract and pdftk
Posted by Admin • Thursday, January 19. 2017 • Category: Linux
This turned out to be harder than I thought. I found a large (50MB) PDF with about 50 pages, and none of the tesseract GUIs seemed to be able to handle it without crashing. The solution was to convert the PDF to TIFF so that command-line tesseract could handle it directly, but then ImageMagick couldn't handle that conversion because it was running out of memory (even with the limit settings). So the only option was to reduce the load on all the moving parts by splitting the PDF into pages.
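For illustration, splitting the PDF into single-page files could look roughly like the sketch below. It assumes `pdftk` is installed and uses its `burst` operation; the helper names (`burst_command`, `split_pdf`) and the `page_%04d.pdf` naming pattern are my own placeholders, not something taken from the original script.

```python
import subprocess
from pathlib import Path


def burst_command(pdf, out_dir):
    """Build the pdftk call that writes one PDF per page into out_dir."""
    # page_%04d.pdf is a hypothetical naming pattern; pdftk fills in the page number.
    pattern = str(Path(out_dir) / "page_%04d.pdf")
    return ["pdftk", str(pdf), "burst", "output", pattern]


def split_pdf(pdf, out_dir):
    """Split pdf into single-page PDFs and return them in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(pdf, out_dir), check=True)
    # Zero-padded names sort correctly as plain strings.
    return sorted(Path(out_dir).glob("page_*.pdf"))
```

Calling `split_pdf("big.pdf", "pages")` should leave one small PDF per page in `pages/`, so every later step only has to hold a single page in memory at a time.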
Even after splitting the PDF and running each page through the PDF->TIFF->Tesseract->PDF chain I was still having issues:
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Huh? It turns out that you can sometimes wind up with an alpha channel in your TIFF, and tesseract can't handle that. Fortunately, there is a solution. Finally, I put all of these steps together into a script.
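As a rough sketch of what such a per-page chain might look like (this is not the script from this post), the example below builds each command separately. It assumes ImageMagick's `convert`, `tesseract` with PDF output support, and `pdftk` are all on the PATH. Dropping the alpha channel with `-alpha remove -alpha off` is one common way to avoid the `spp not in set {1,3,4}` error, though it may not be the exact fix used here. The function names, the 300 dpi density, and the `ocr_work` directory are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def tiff_command(page_pdf, tiff, density=300):
    """ImageMagick call: render one PDF page to TIFF without an alpha channel."""
    return ["convert", "-density", str(density), str(page_pdf),
            # Flatten transparency onto white, then drop the alpha channel
            # so the TIFF does not carry an extra sample per pixel.
            "-background", "white", "-alpha", "remove", "-alpha", "off",
            str(tiff)]


def ocr_command(tiff, out_base, lang="eng"):
    """tesseract call: writes out_base + '.pdf' with a searchable text layer."""
    return ["tesseract", str(tiff), str(out_base), "-l", lang, "pdf"]


def merge_command(pdfs, output):
    """pdftk call: concatenate the per-page PDFs back into one file."""
    return ["pdftk", *map(str, pdfs), "cat", "output", str(output)]


def ocr_pages(page_pdfs, output, work_dir="ocr_work"):
    """Run PDF -> TIFF -> tesseract -> PDF for each page, then merge."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    done = []
    for page in page_pdfs:
        stem = Path(page).stem
        tiff = work / (stem + ".tif")
        out_base = work / (stem + "_ocr")
        subprocess.run(tiff_command(page, tiff), check=True)
        subprocess.run(ocr_command(tiff, out_base), check=True)
        tiff.unlink()  # the intermediate TIFF is large, so remove it right away
        done.append(Path(str(out_base) + ".pdf"))
    subprocess.run(merge_command(done, output), check=True)
```

Given a sorted list of single-page PDFs, `ocr_pages(pages, "searchable.pdf")` would process them one at a time and merge the results. On ImageMagick 7 the binary is usually `magick` rather than `convert`, so the first element of `tiff_command` may need adjusting.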