`katana.units.pdf.pdf2text` — pdf2text

Convert PDF to Text

This unit retrieves the text included in a PDF document, using the “pdftotext” Python library.

The unit inherits from katana.unit.FileUnit to ensure the target is a PDF file.

class katana.units.pdf.pdf2text.Unit(*args, **kwargs)

BLOCKED_GROUPS = ['pdf']: This unit produces text, not PDFs, so there is no reason to run “pdf” units on its results.

GROUPS = ['pdf', 'pdftotext', 'pdf2text']: These are “tags” for a unit. Because this is a PDF unit, “pdf” is included, along with the names of the unit, “pdftotext” and “pdf2text”.

PRIORITY = 25: Priorities range from 0 (the highest) to 100 (the lowest); 50 is the default priority. This unit has a high priority if this is detected…
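
The three class attributes above can be read together as a small scheduling rule. The following is a minimal sketch, not Katana's actual scheduler: it assumes that BLOCKED_GROUPS means "do not run units tagged with these groups on data this unit produces" and that units with a lower PRIORITY value run first. The `Strings` and `PdfInfo` classes and the `follow_up_units` helper are hypothetical and exist only for illustration.

```python
from typing import List


class Pdf2Text:
    # Mirrors the attribute values documented for this unit.
    GROUPS = ["pdf", "pdftotext", "pdf2text"]
    BLOCKED_GROUPS = ["pdf"]
    PRIORITY = 25


class Strings:
    # Hypothetical unit using the default priority of 50.
    GROUPS = ["raw", "strings"]
    BLOCKED_GROUPS: List[str] = []
    PRIORITY = 50


class PdfInfo:
    # Hypothetical second unit tagged "pdf".
    GROUPS = ["pdf", "pdfinfo"]
    BLOCKED_GROUPS: List[str] = []
    PRIORITY = 40


def follow_up_units(producer: type, candidates: List[type]) -> List[type]:
    """Return the units allowed to run on the producer's output, most urgent first."""
    blocked = set(producer.BLOCKED_GROUPS)
    # Drop every candidate that carries at least one blocked tag.
    allowed = [unit for unit in candidates if not blocked & set(unit.GROUPS)]
    # 0 is the highest priority, so an ascending sort puts urgent units first.
    return sorted(allowed, key=lambda unit: unit.PRIORITY)
```

Under these assumptions, `follow_up_units(Pdf2Text, [Strings, PdfInfo, Pdf2Text])` keeps only `Strings`, because both other candidates carry the blocked “pdf” tag.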

evaluate(case: Any) → None

Evaluate the target. Extract the text out of the PDF document and recurse on any newfound text.

Parameters:	case – A case returned by `enumerate`. For this unit, the `enumerate` function is not used.
Returns:	None. This function should not return any data.
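
The extraction step that `evaluate` describes can be sketched with the “pdftotext” library directly. This is an illustration only, not the unit's source: `nonempty_pages` and `extract_pdf_text` are hypothetical helper names, and the step where Katana recurses on the extracted text is left out. The only library behaviour relied on is that `pdftotext.PDF` takes a file object opened in binary mode and yields one string per page.

```python
from typing import BinaryIO, Iterable, List


def nonempty_pages(pages: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace from each page and drop pages with no text."""
    stripped = (page.strip() for page in pages)
    return [page for page in stripped if page]


def extract_pdf_text(stream: BinaryIO) -> List[str]:
    """Return the text of every non-empty page in the PDF read from `stream`."""
    # Imported here so nonempty_pages stays usable without the library installed.
    import pdftotext

    pdf = pdftotext.PDF(stream)  # iterating yields one string per page
    return nonempty_pages(pdf)
```

A caller would open the target in binary mode, for example `with open(path, "rb") as handle: pages = extract_pdf_text(handle)`, and hand each returned string on for further analysis.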
