katana.units.pdf.pdf2text — pdf2text
Convert PDF to Text
This unit retrieves the text included in a PDF document, using the “pdftotext” Python library. The unit inherits from katana.unit.FileUnit to ensure the target is a PDF file.
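The extraction step described above can be sketched with the “pdftotext” library as follows. This is a minimal illustration, not the unit's actual source: the function names `join_pages` and `pdf_to_text` are hypothetical, and the sketch assumes the third-party `pdftotext` package is installed.

```python
from typing import Iterable


def join_pages(pages: Iterable[str]) -> str:
    """Join per-page text into one string, with a blank line between pages."""
    return "\n\n".join(page.strip() for page in pages)


def pdf_to_text(path: str) -> str:
    """Return all text found in the PDF at `path` (hypothetical helper)."""
    # Imported here so join_pages stays usable without the third-party package.
    import pdftotext

    with open(path, "rb") as handle:
        # pdftotext.PDF behaves like a sequence of page strings.
        pdf = pdftotext.PDF(handle)
        return join_pages(pdf)
```

Calling `pdf_to_text("document.pdf")` on a readable PDF would return its text as a single string; the file name here is only a placeholder.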
-
class katana.units.pdf.pdf2text.Unit(*args, **kwargs)
Bases: katana.unit.FileUnit
-
BLOCKED_GROUPS = ['pdf']
This unit should never produce a PDF, so there is no reason to run PDF units on its results.
-
GROUPS = ['pdf', 'pdftotext', 'pdf2text']
These are “tags” for the unit. Because it is a PDF unit, “pdf” is included, along with the names of the unit, “pdftotext” and “pdf2text”.
-
PRIORITY = 25
Priority ranges from 0 (the highest) to 100 (the lowest); 50 is the default priority. This unit has a high priority if this is detected…
-
RECURSE_SELF = False
Again, this unit produces no PDF, so recursing into itself would be pointless.
-
evaluate(case: Any) → None
Evaluate the target. Extract the text out of the PDF document and recurse on any newfound text.
Parameters: case – A case returned by enumerate. For this unit, the enumerate function is not used.
Returns: None. This function should not return any data.
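To make the class attributes above concrete, here is a small sketch of how PRIORITY, GROUPS, BLOCKED_GROUPS and RECURSE_SELF could interact when choosing which units to run on a unit's results. It is not Katana's scheduling code: every class and function name below is hypothetical, and the base-class defaults (other than the documented default priority of 50) are assumptions made for illustration.

```python
from typing import List, Type


class SketchUnit:
    """Hypothetical stand-in for a unit base class (not katana.unit.Unit)."""
    GROUPS: List[str] = []
    BLOCKED_GROUPS: List[str] = []
    PRIORITY: int = 50          # documented default priority
    RECURSE_SELF: bool = True   # assumed default, for illustration only


class Pdf2TextSketch(SketchUnit):
    # Values copied from the attributes documented above.
    GROUPS = ["pdf", "pdftotext", "pdf2text"]
    BLOCKED_GROUPS = ["pdf"]
    PRIORITY = 25
    RECURSE_SELF = False


class PdfInfoSketch(SketchUnit):
    # Hypothetical second PDF unit.
    GROUPS = ["pdf", "pdfinfo"]
    PRIORITY = 10


class StringsSketch(SketchUnit):
    # Hypothetical non-PDF unit with the default priority.
    GROUPS = ["raw", "strings"]


def by_priority(units: List[Type[SketchUnit]]) -> List[Type[SketchUnit]]:
    """Order units so that lower PRIORITY values (higher priority) come first."""
    return sorted(units, key=lambda unit: unit.PRIORITY)


def allowed_followups(
    parent: Type[SketchUnit], candidates: List[Type[SketchUnit]]
) -> List[Type[SketchUnit]]:
    """Filter the units that may run on data produced by `parent`."""
    allowed = []
    for unit in candidates:
        # Skip the parent itself when it opts out of self-recursion.
        if unit is parent and not parent.RECURSE_SELF:
            continue
        # Skip any unit tagged with a group the parent blocks.
        if set(unit.GROUPS) & set(parent.BLOCKED_GROUPS):
            continue
        allowed.append(unit)
    return allowed
```

Under these assumptions, text extracted by the PDF unit would be handed only to non-PDF units, and a unit with priority 25 would be ordered ahead of units left at the default of 50.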
-