OCRKit Online Help – The Missing Manual
Getting started
To recognize the text of an PDF or image file simply drag and drop it onto the OCRKit application icon or use the File menu: Open...
The preference settings are sorted into four main categories that you can reach from the top menu bar:
OCR
Here you choose the language to be used for the optical character recognition.
The mode option allows to choose special processing options for fax and dot-matrix printed documents.
For regular office text the dictionary based spelling correction is usually helpful to improve results. For other alpha numeric data, such as scientific data or financial numbers turning off this option can be an advantage.
Processing
OCRKit tries to re-use images from source files 1:1. However PDF pages may contain more than one image, or vector and text elements. In this case the resolution option controls the resolution used to rasterize the whole pages to one image. Using the match physical size for low resolution images, OCRKit can try to detect the actual page size for photos without resolution information, such as taken with a mobile phone.
The rotate option allows you to adjust the page orientation before processing - for example when the creation application rotated landscape pages. De-skew based on page content allows to correct small skew angle often produced by scanning paper in a scanner.
The color detection can be used to automatically convert color images to gray or black and white to save storage space.
Using the de-screen option you can reduce the pattern of small dots a times visible when scanning offset printed magazines or newspapers. It can also be useful to remove other small dot noise on your source material.
Filing
With the Format option you select the output format. This can be Searchable or Highly Compressed PDF, or pure text formats, such as Rich-Text or HTML.
The compression quality controls the resulting file size, and thus affect the images quality. By moving this option further to the left side you receive smaller files by compromising image quality by compression artifacts.
By default OCRKit uses the original filename, while you can choose to add an -OCR extension or edit each filename manually.
When you select or drop multiple files in OCRKit for processing, they are processed one by one. You can also select to merge all those files in one batch into a single output file. E.g. for converting a bunch of JPEG or TIFF files into one highly compressed and searchable PDF.
OCRKit is using the order as received by the operating system. While this usually is the order you selected the files in the multi-selection, at times macOS's Finder unfortunately sorts them in an arbitrary order. In any case you can choose to have OCRKit sort the files to retain a reliably order while merging all files in one batch.
Finalization
In the Finalization tab you can choose whether to remove the original document to the system trash after processing, and whether to notify another application about the new file. This can be used with Apple's Preview for visual control, or database and cloud application to archive the final document.
Imprinter
With the digital imprinter of the Pro version you can add watermarks to your documents. Commonly used marks are CONFIDENTIAL, PRELIMINARY, COPY, or similar terms that suit your workflow, in any font, shape, or rotation.
You control the language used for the text recognition, as well as all other processing settings such as the output format (PDF, RTF, HTML or plain/text) in the OCRKit Preferences... menu.
My image does not OCR well
This is usually the result of poor image quality. If your image is barely readable for a human, you can imagine it is even harder to identify the text for a computer program. The resolution for scans of regular office paperwork should be between 200 and 300 dpi (dots / pixel per inch). We recommend using 300 dpi for all regular daily office material. Using more than 300 dpi does not necessarily improve results, but mainly increases the resulting PDF files. Unless you use Automatic rotation of the Pro version the text must also be in the right, readable orientation.
AppleScript
You can also script OCRKit to integrate it into your specific workflow. For example process incoming files, via shared folder, from MFP copy machine, etc. and simply tell OCRKit to open and thus process is via AppleScript:
tell application "OCRKit" set resolution to 240 set rotation to 180 set destination app to "/Application/Some.app" -- the legacy of AppleScript POSIX path handling, ... open "Users:admin:Desktop:orderform.pdf" open POSIX path of "/Users/Admin/Desktop/orderform.pdf" end tell
Command line
Since OCRKit version 2.5 direct command line scripting is supported. This greatly simplifies the use of OCRKit in batch processing, allows to set more options and is also more robust and cross-platform than AppleSCript.
OCRKit.app/Contents/MacOS/OCRKit \ --lang en | de | fr | es | ... \ --format pdf | html | rtf | text \ --no-progress \ --output out-file in-file
Since OCRKit version 16.9 additional command line options are supported:
-r, --recursive directory Scan directory recursively for new files. Skips files from OCRKit, with text layer or vector graphics. --pattern "regex" Pattern used to match filenames during recursive scans. Defaults to "%.pdf$", recommendation for TIFF is "%.tiff?$" --log file Write log file information and statistics during recursive scan to file. --password secret Use secret password to decrypt PDF files during batch processing. --test-run [ fast ] Only run batch processing in test mode to test PDF files or to obtain page count to estimate total processing time. "fast" will only check the first page of each file, instead of going thru all pages for image and vector analyzation. --tag name Use extended attribute name to tag the processing state of files during batch processing. "macos:OCRKit (%s)" will use native macOS Finder tags instead, or simply "macos:OCRKit" not including the state attribute. The order of the state attribute are: "started", "analyzed", "processed", and can also be "encrypted"