katana.units.ocr.tesseract - Tesseract
Unit to perform Optical Character Recognition (OCR) with Tesseract.
The unit inherits from katana.unit.FileUnit to ensure the target is an image.
This unit uses the Python library for Tesseract, which must be installed for the unit to run.
-
class katana.units.ocr.tesseract.Unit(*args, **kwargs)
Bases: katana.unit.FileUnit
-
GROUPS = ['ocr', 'tesseract']
These are “tags” for a unit. Because this is an OCR unit, “ocr” is included, along with the unit name, “tesseract”.
-
PRIORITY = 25
Priorities range from 0 (highest) to 100 (lowest); 50 is the default priority. This unit has a higher-than-default priority because it is lightweight.
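The priority scale can be illustrated with a minimal sketch. This is not Katana's scheduler; the function `execution_order` and the unit names prefixed with `hypothetical_` are assumptions made for illustration. It only shows that a lower number is served first, so a unit with priority 25 precedes one left at the default of 50.

```python
from typing import List, Tuple

DEFAULT_PRIORITY = 50  # default priority stated in the documentation


def execution_order(units: List[Tuple[str, int]]) -> List[str]:
    """Return unit names ordered so the lowest PRIORITY number comes first."""
    # sorted() is stable, so units with equal priority keep their input order.
    return [name for name, _ in sorted(units, key=lambda unit: unit[1])]


# Hypothetical unit names, apart from "tesseract" with its documented priority.
units = [
    ("hypothetical_default", DEFAULT_PRIORITY),
    ("tesseract", 25),
    ("hypothetical_heavy", 100),
]
print(execution_order(units))
# ['tesseract', 'hypothetical_default', 'hypothetical_heavy']
```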
-
RECURSE_SELF = False
Do not recurse into itself, since the unit's output will not be another image.
-
evaluate(case: Any) → None
Evaluate the target. Attempt OCR on the target and recurse on any newfound data.
Parameters: case – A case returned by enumerate. For this unit, the enumerate function is not used.
Returns: None. This function should not return any data.
-
-
katana.units.ocr.tesseract.attempt_ocr(image_path: str) → str
Run Tesseract against an image file and return the string found.
Parameters: image_path – The path to an image file.
Returns: The string determined by Tesseract’s OCR efforts.
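A function with this shape might look like the sketch below. It is not the Katana source. It assumes the Python library for Tesseract is pytesseract, used together with Pillow, and it adds a hypothetical `engine` parameter so the sketch can be exercised without Tesseract installed. The names `attempt_ocr_sketch` and `_pytesseract_engine` are likewise illustrative.

```python
from typing import Callable, Optional


def _pytesseract_engine(image_path: str) -> str:
    """Default engine. Assumes pytesseract and Pillow are installed."""
    # Imported lazily so this module still loads when they are missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def attempt_ocr_sketch(
    image_path: str,
    engine: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the text recognised in the image at image_path.

    `engine` is a hypothetical hook, not part of Katana: any callable
    that maps an image path to a string.
    """
    if engine is None:
        engine = _pytesseract_engine
    # Design choice of this sketch: trim surrounding whitespace so an
    # image with no recognisable text yields an empty string.
    return engine(image_path).strip()
```

With Tesseract available, `attempt_ocr_sketch("scan.png")` would run the default engine on a hypothetical file `scan.png`. A caller such as `evaluate` could then treat an empty result as "nothing found" and skip recursion.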