katana.units.web.git — Dump Git Repos

Git Dumper

This unit detects whether a /.git/ directory is exposed on a website. If it is, the unit downloads the repository's files and searches the commits and objects of the public-facing Git repository for flags.

The download runs its own parallel workers on top of Katana's existing threading, so your mileage may vary.
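As a rough illustration of the detection step, a check could fetch /.git/HEAD and decide whether the body looks like a real Git HEAD file. The sketch below covers only the text check; the helper name looks_like_git_head is hypothetical and not part of katana, and the pattern assumes SHA-1 object names.

```python
import re

# Hypothetical helper, not part of katana. A Git HEAD file holds either a
# symbolic ref ("ref: refs/heads/<branch>") or, when detached, a raw object
# name. This sketch assumes 40-character SHA-1 names.
_HEAD_PATTERN = re.compile(r"^(ref: refs/\S+|[0-9a-f]{40})$")


def looks_like_git_head(body: str) -> bool:
    """Return True if the text served at <site>/.git/HEAD resembles a HEAD file."""
    return _HEAD_PATTERN.match(body.strip()) is not None
```

A custom 404 page served with status 200 would fail this check, which is the main reason to inspect the body rather than the status code alone.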

This unit inherits from katana.units.web.WebUnit, which provides many predefined variables shared across the web units.

Note

This code is shamelessly ripped from https://github.com/arthaud/git-dumper

class katana.units.web.git.DownloadWorker(pending_tasks, tasks_done, args)

Bases: katana.units.web.git.Worker

Part of the Git Dumper procedure.

Download a list of files

do_task(filepath, url, directory, retry, timeout, unit, katana)
init(url, directory, retry, timeout, unit, katana)
class katana.units.web.git.FindObjectsWorker(pending_tasks, tasks_done, args)

Bases: katana.units.web.git.DownloadWorker

Part of the Git Dumper procedure.

Find objects.

do_task(obj, url, directory, retry, timeout, unit, katana)
class katana.units.web.git.FindRefsWorker(pending_tasks, tasks_done, args)

Bases: katana.units.web.git.DownloadWorker

Part of the Git Dumper procedure.

Find refs/

do_task(filepath, url, directory, retry, timeout, unit, katana)
class katana.units.web.git.RecursiveDownloadWorker(pending_tasks, tasks_done, args)

Bases: katana.units.web.git.DownloadWorker

Part of the Git Dumper procedure.

Download a directory recursively.

do_task(filepath, url, directory, retry, timeout, unit, katana)
class katana.units.web.git.Unit(*args, **kwargs)

Bases: katana.units.web.WebUnit

BAD_MIME_TYPES = ['application/octet-stream']
GROUPS = ['web', 'git']

These are “tags” for a unit. Considering it is a Web unit, “web” is included, as well as the name of the unit, “git”.

PRIORITY = 40

Priority works with 0 being the highest priority and 100 being the lowest. 50 is the default priority. At 40, this unit has a somewhat higher priority than the default.
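To make the ordering concrete, here is a minimal sketch of how "lower number runs first" could be applied. The unit names and the run_order function are hypothetical; only the value 40 for this unit and the default of 50 come from this page, and Katana's real scheduler may work differently.

```python
# Hypothetical (name, PRIORITY) pairs for illustration only.
units = [("slow-unit", 90), ("default-unit", 50), ("git", 40)]


def run_order(units):
    """Return unit names sorted so that the smallest PRIORITY value comes first."""
    return [name for name, _priority in sorted(units, key=lambda u: u[1])]
```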

RECURSE_SELF = False

This unit should not recurse into itself. It would make no sense.

evaluate(case: Any)

Evaluate the target. If a .git repository is found, download it and look through all of the objects for a flag.

Parameters: case – A case returned by enumerate. For this unit, the enumerate function is not used.
Returns: None. This function should not return any data.
class katana.units.web.git.Worker(pending_tasks, tasks_done, args)

Bases: multiprocessing.context.Process

Part of the Git Dumper procedure.

Worker for process_tasks

do_task(task, *args)
init(*args)
run()

Method to be run in sub-process; can be overridden in sub-class

katana.units.web.git.bad_starting_links = [b'#', b'javascript:', b'https://', b'http://', b'//']

This is a blacklist to avoid inline JavaScript, anchors, and external links.

katana.units.web.git.create_intermediate_dirs(path)

Part of the Git Dumper procedure.

Create intermediate directories, if necessary
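A minimal sketch of what such a helper might look like, assuming path names a file and only its parent directories need to exist. The name create_intermediate_dirs_sketch is hypothetical; the actual implementation may differ.

```python
import os


def create_intermediate_dirs_sketch(path: str) -> None:
    """Create every missing parent directory of the file at `path`."""
    dirname = os.path.dirname(path)
    if dirname:
        # exist_ok avoids an error when another worker created it first.
        os.makedirs(dirname, exist_ok=True)
```

Using exist_ok=True matters here because several workers may try to create the same directory at nearly the same time.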

katana.units.web.git.fetch_git(unit, url, directory, jobs, retry, timeout, katana)

Dump a .git repository into the output directory.

This is the core function of the https://github.com/arthaud/git-dumper code.

katana.units.web.git.get_indexed_files(response)

Part of the Git Dumper procedure.

Return all the files in the directory index webpage.
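The idea can be sketched with the standard library alone: collect every href on the index page and keep only relative paths. This is an illustration under assumptions, not the unit's code; it takes the page text directly instead of a response object, and the names _LinkCollector and indexed_files_sketch are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def indexed_files_sketch(html_text: str) -> list:
    """Return the relative paths linked from a directory index page."""
    parser = _LinkCollector()
    parser.feed(html_text)
    files = []
    for href in parser.links:
        url = urlparse(href)
        if url.scheme or url.netloc:
            continue  # external link or javascript: URL
        path = url.path
        if not path or path.startswith(("/", ".")):
            continue  # sort links, anchors, absolute paths, "." and ".."
        files.append(path)
    return files
```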

katana.units.web.git.get_referenced_sha1(obj_file)

Part of the Git Dumper procedure.

Return all the referenced SHA1 in the given object file
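For intuition, the references inside a loose Git object can be read by hand. The sketch below is an assumption-laden illustration rather than the unit's code: it takes the raw zlib-compressed bytes of a loose object instead of a file, handles only SHA-1 repositories, and the name referenced_sha1_sketch is hypothetical.

```python
import re
import zlib


def referenced_sha1_sketch(raw: bytes) -> list:
    """Return the SHA-1 names referenced by one loose Git object."""
    data = zlib.decompress(raw)
    # A loose object is "<type> <size>\0<body>".
    header, _, body = data.partition(b"\x00")
    kind = header.split(b" ", 1)[0]
    if kind in (b"commit", b"tag"):
        # References live in the header lines before the first blank line.
        head = body.split(b"\n\n", 1)[0]
        found = re.findall(rb"^(?:tree|parent|object) ([0-9a-f]{40})$", head, re.M)
        return [sha.decode() for sha in found]
    if kind == b"tree":
        # Each entry is "<mode> <name>\0" followed by 20 raw SHA-1 bytes.
        refs, pos = [], 0
        while pos < len(body):
            nul = body.index(b"\x00", pos)
            refs.append(body[nul + 1:nul + 21].hex())
            pos = nul + 21
        return refs
    return []  # blobs reference nothing
```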

katana.units.web.git.has_a_bad_start(link)

This is a convenience function that checks whether a link starts with any entry in the bad_starting_links blacklist above.
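A minimal sketch of such a check, assuming links are handled as bytes like the blacklist entries. The list is copied from bad_starting_links above; the function name has_a_bad_start_sketch is hypothetical.

```python
# Copied from katana.units.web.git.bad_starting_links as documented above.
BAD_STARTING_LINKS = [b"#", b"javascript:", b"https://", b"http://", b"//"]


def has_a_bad_start_sketch(link: bytes) -> bool:
    """Return True if the link begins with any blacklisted prefix."""
    return any(link.startswith(prefix) for prefix in BAD_STARTING_LINKS)
```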

katana.units.web.git.is_html(response)

Return True if the response is an HTML webpage.
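One plausible way to make this decision is to trust the Content-Type header when present and fall back to looking for an <html tag in the body. This is only a sketch: it takes a headers mapping and body text rather than a response object, the name is_html_sketch is hypothetical, and the actual function may use a different test.

```python
def is_html_sketch(headers, text: str) -> bool:
    """Guess whether a response is an HTML page."""
    content_type = headers.get("Content-Type", "")
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime:
        return mime == "text/html"
    # No usable header: fall back to a crude body check.
    return "<html" in text.lower()
```

This kind of check helps separate a genuine Git file from an error page or directory listing returned in its place.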

katana.units.web.git.process_tasks(initial_tasks, worker, jobs, args=(), tasks_done=None)

Part of the Git Dumper procedure.

Process tasks in parallel.
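The general pattern, where finishing one task may reveal new tasks, can be sketched as follows. Note the swapped-in technique: this sketch uses a thread pool and processes tasks in rounds, whereas the Worker class documented above is based on multiprocessing. The signature is simplified, it assumes each task function returns a list of follow-up tasks, and the name process_tasks_sketch is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tasks_sketch(initial_tasks, do_task, jobs, tasks_done=None):
    """Run do_task over all tasks, following the new tasks each call returns."""
    tasks_done = set() if tasks_done is None else tasks_done
    # Drop duplicates and anything already handled, keeping order.
    pending = [t for t in dict.fromkeys(initial_tasks) if t not in tasks_done]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while pending:
            tasks_done.update(pending)
            next_round = []
            # Each call returns the tasks it discovered.
            for new_tasks in pool.map(do_task, pending):
                for task in new_tasks:
                    if task not in tasks_done and task not in next_round:
                        next_round.append(task)
            pending = next_round
    return tasks_done
```

Tracking finished tasks in a set keeps the crawl from looping when objects reference each other or appear in more than one listing.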