katana.units.ocr.tesseract — Tesseract

Unit to perform Optical Character Recognition with Tesseract.

The unit inherits from katana.unit.FileUnit to ensure the target is an image.

This unit uses the Python library for Tesseract, which must be installed for the unit to run.

class katana.units.ocr.tesseract.Unit(*args, **kwargs)

Bases: katana.unit.FileUnit

GROUPS = ['ocr', 'tesseract']

These are “tags” for the unit. Because it is an OCR unit, “ocr” is included, along with the unit name, “tesseract”.

PRIORITY = 25

Priority ranges from 0 (the highest priority) to 100 (the lowest priority); 50 is the default priority. This unit is given a higher-than-default priority because it is lightweight.
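
As a rough illustration of the priority scale described above, the sketch below orders a few units so that lower numbers come first. The `UnitInfo` class and every unit name other than `tesseract` are hypothetical and are not part of Katana; Katana's real scheduling may work differently.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 50  # default value stated in the documentation above


@dataclass
class UnitInfo:
    """Hypothetical stand-in for a unit; not a Katana class."""

    name: str
    priority: int = DEFAULT_PRIORITY


def order_by_priority(units):
    """Return units sorted so lower numbers (higher priority) come first."""
    return sorted(units, key=lambda unit: unit.priority)


units = [
    UnitInfo("slow_bruteforce", priority=90),  # hypothetical heavy unit
    UnitInfo("generic"),                       # hypothetical unit at the default
    UnitInfo("tesseract", priority=25),        # value taken from this page
]
ordered_names = [unit.name for unit in order_by_priority(units)]
```

With these values, `tesseract` is ordered ahead of a unit left at the default of 50.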

RECURSE_SELF = False

The unit does not recurse into itself, since it will not produce another image.

evaluate(case: Any) → None

Evaluate the target. Attempt OCR on the target and recurse on any newfound data.
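
The flow described above can be sketched without Katana itself. Everything in this block is an assumption made for illustration: `SketchUnit`, its `found` list, and the injected `ocr` callable are hypothetical stand-ins, and the real unit hands discovered data back to Katana rather than to a list.

```python
from typing import Any, Callable, List


class SketchUnit:
    """Hypothetical stand-in for the Tesseract unit; not Katana's class."""

    GROUPS = ["ocr", "tesseract"]
    PRIORITY = 25
    RECURSE_SELF = False

    def __init__(self, image_path: str, ocr: Callable[[str], str]):
        self.image_path = image_path
        self.ocr = ocr              # assumed OCR callable: path -> text
        self.found: List[str] = []  # stands in for handing data to the framework

    def evaluate(self, case: Any) -> None:
        """Attempt OCR and record any non-empty text; returns nothing."""
        text = self.ocr(self.image_path)
        if text and text.strip():
            self.found.append(text.strip())


# Usage with a fake OCR callable, so no OCR engine is needed to run the sketch.
unit = SketchUnit("flag.png", ocr=lambda path: "  FLAG{example}\n")
result = unit.evaluate(None)  # `case` is unused by this unit
```

Injecting the OCR callable keeps the sketch runnable on a machine without Tesseract; the real unit presumably calls `attempt_ocr` directly.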

Parameters: case – A case returned by enumerate. For this unit, the enumerate function is not used.
Returns: None. This function should not return any data.

katana.units.ocr.tesseract.attempt_ocr(image_path: str) → str

Run Tesseract against an image file and return the string found.

Parameters: image_path – The path to an image file.
Returns: The string determined by Tesseract’s OCR efforts.
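
A minimal sketch of such a helper follows, assuming the Python library in question is `pytesseract` used together with Pillow (this page does not name the library, so treat that as an assumption). The name `attempt_ocr_sketch` and the optional `engine` parameter are hypothetical, added so the function can be exercised without Tesseract installed. Returning an empty string on failure is a choice made for this sketch, not documented behavior.

```python
from typing import Callable, Optional


def _pytesseract_engine(image_path: str) -> str:
    """Assumed backend: pytesseract plus Pillow, imported lazily."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def attempt_ocr_sketch(
    image_path: str, engine: Optional[Callable[[str], str]] = None
) -> str:
    """Return the text recognized in an image, or "" on failure."""
    engine = engine or _pytesseract_engine
    try:
        return engine(image_path)
    except (OSError, ImportError):
        # Unreadable file, non-image input, or missing OCR dependencies.
        return ""


# Usage with a stand-in engine; drop the `engine` argument to use pytesseract.
text = attempt_ocr_sketch("scan.png", engine=lambda path: "hello")
```

Importing the OCR dependencies lazily means a missing installation surfaces only when OCR is actually attempted, not when the module is loaded.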