katana.units.web.robots — Check robots.txt

Check robots.txt

This unit retrieves the target's robots.txt file, visits each entry listed there, and looks for a flag.

It passes a User-Agent header to act as the Googlebot crawler.

This unit inherits from katana.units.web.WebUnit, as that class contains predefined variables and methods that are used across multiple web units.

Warning

This unit automatically attempts to perform malicious actions on the target. DO NOT use this in any circumstances where you do not have the authority to operate!

class katana.units.web.robots.Unit(*args, **kwargs)

Bases: katana.units.web.WebUnit

GROUPS = ['web', 'robots', 'robots.txt']

These are “tags” for a unit. Considering it is a Web unit, “web” is included, as well as the name of the unit, “robots”.

PRIORITY = 30

Priority works with 0 being the highest priority, and 100 being the lowest priority. 50 is the default priority. This unit has a somewhat higher priority.

RECURSE_SELF = False

This unit should not recurse into itself. That would be silly.

enumerate()

Yield cases. This function looks at the robots.txt page and yields each entry it finds, to be examined by the evaluate function.

Returns: A generator, yielding a string for each URL in robots.txt.
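
The parsing idea can be illustrated with a short sketch. This is not the unit's actual implementation; it assumes the requests library and a hypothetical base_url argument, whereas the real unit resolves the URL and session from its target:

    import requests

    def enumerate_robots(base_url):
        """Yield one absolute URL for each Allow/Disallow entry in robots.txt."""
        headers = {"User-Agent": "Googlebot/2.1"}
        response = requests.get(base_url.rstrip("/") + "/robots.txt",
                                headers=headers, timeout=5)
        for line in response.text.splitlines():
            line = line.strip()
            # Only path entries are interesting; comments and User-agent
            # directives are skipped.
            if line.lower().startswith(("allow:", "disallow:")):
                path = line.split(":", 1)[1].strip()
                if path and path != "/":
                    yield base_url.rstrip("/") + path
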
evaluate(case)

Evaluate the target. Reach out to every entry in the robots.txt file and look for flags.

Parameters: case – A case returned by enumerate. For this unit, the enumerate function will yield each URL in the robots.txt file.
Returns: None. This function should not return any data.
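
The evaluation step can be sketched in the same hedged way: fetch one URL yielded by enumerate and search the response for a flag. The flag pattern and function name here are assumptions; the real unit reports any match through katana's result handling rather than printing it:

    import re
    import requests

    FLAG_PATTERN = re.compile(r"flag\{[^}]+\}")  # assumed flag format

    def evaluate_url(url):
        headers = {"User-Agent": "Googlebot/2.1"}
        response = requests.get(url, headers=headers, timeout=5)
        match = FLAG_PATTERN.search(response.text)
        if match:
            # In katana this would be registered as a unit result, not printed.
            print("[+] potential flag at {}: {}".format(url, match.group(0)))
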
katana.units.web.robots.headers = {'User-Agent': 'Googlebot/2.1'}

These headers are included in the unit's requests to simulate the Googlebot crawler.
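
As a usage illustration (with a hypothetical target URL), the same dictionary can be passed straight to requests so the server sees the request as coming from Googlebot:

    import requests

    headers = {"User-Agent": "Googlebot/2.1"}
    response = requests.get("http://example.com/robots.txt", headers=headers)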