OCRmyPDF 3.0

OCRmyPDF adds an invisible text layer to PDF documents by passing them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer on top of the scanned page images, which makes the documents searchable and suitable for archiving.

Tags pdf ocr scanning
License MITL
State stable

Recent Releases

3.0 14 Sep 2015 17:45 minor feature: Bump to v3.0 and move repos. Test case: no longer using JHOVE. Move the repository on github.com from fritz-hh to jbarlow83.
3.0-rc9 31 Aug 2015 01:45 minor feature: Throw exception if iccprofiles not found instead of returning None. unpaper: support paletted files by conversion instead of bailing. Use png256 raster device when possible. Prevent running validation on missing file after an exception is thrown. Add test cases for additional image formats. ghostscript: quiet startup on rasterize. Bump version to -rc9.
3.0-rc8 26 Aug 2015 17:45 minor feature: Exception thrown if input PDF was missing DocumentInfo block. Bump to -rc8.
3.0-rc5 17 Aug 2015 03:15 minor feature:
3.0-rc4 07 Aug 2015 20:05 minor feature:
3.0-rc2 30 Jul 2015 06:05 minor feature:
2.2-stable 24 May 2015 06:45 minor feature: Update to jhove v1.11. Require the Python library reportlab v3.0 or newer, so that a patch to the previous version of reportlab, which caused issues for some users, could be removed.
2.0 11 Sep 2014 17:05 hidden: Check whether the language(s) passed using the -l option are supported by tesseract. Allow OCRmyPDF to be used with tesseract 3.02.01, even though OCR might fail for a few PDF files. Rationale: for some Linux distributions, no version newer than tesseract 3.02.01 is available. More robust algorithm for checking the version of the installed tesseract package.
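
The two checks mentioned in the 2.0 entry (installed tesseract version, and languages passed with -l) might look roughly like the following. This is only a sketch and not OCRmyPDF's actual code: the function names, the minimum version tuple, and the assumption that languages are joined with "+" (as tesseract's own -l option accepts) are illustrative. It works on text already captured from `tesseract --version` and `tesseract --list-langs`.

```python
import re

# Assumed minimum, taken from the 2.0 entry's mention of tesseract 3.02.01.
MIN_VERSION = (3, 2, 1)


def parse_tesseract_version(version_output):
    """Extract a (major, minor, patch) tuple from `tesseract --version` text."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("could not find a tesseract version string")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as "4.1"
    return tuple(parts[:3])


def parse_language_list(langs_output):
    """Collect language codes from `tesseract --list-langs` text."""
    langs = set()
    for line in langs_output.splitlines():
        code = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if code and not code.startswith("List of available languages"):
            langs.add(code)
    return langs


def unsupported_languages(requested, available):
    """Return the codes in a '+'-joined request that are not installed."""
    return [code for code in requested.split("+") if code and code not in available]
```

Comparing tuples of integers, e.g. `parse_tesseract_version(text) >= MIN_VERSION`, avoids the pitfalls of comparing version strings such as "3.02.01" lexically.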