katana.units.web.git: Dump Git Repos
Git Dumper
This unit detects whether a /.git/ directory is exposed on a website.
If it is, the unit pulls down all of the files and searches for flags within
the commits and objects of the public-facing Git repository.
The download is itself parallelized, on top of Katana's own threading, so your mileage may vary.
This unit inherits from katana.units.web.WebUnit, which contains
many predefined variables that are shared across multiple web units.
Note
This code is shamelessly ripped from https://github.com/arthaud/git-dumper
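As a rough illustration of the detection step described above, the sketch below decides whether the body returned for a request to /.git/HEAD looks like a real Git HEAD file. The helper name looks_like_git_head and the choice of HEAD as the probe file are assumptions made for this example; the unit's actual check is not documented on this page.

```python
import re

# Hypothetical helper: both the name and the heuristic are assumptions.
_SHA1_RE = re.compile(r"^[0-9a-f]{40}$")


def looks_like_git_head(body: str) -> bool:
    """Return True if `body` resembles the contents of a .git/HEAD file."""
    text = body.strip()
    # A symbolic ref, for example "ref: refs/heads/master"
    if text.startswith("ref: "):
        return True
    # A detached HEAD holds a bare 40-character SHA-1
    return bool(_SHA1_RE.match(text))
```

A server that answers every path with an HTML error page would fail this check, which is the kind of false positive such a probe is meant to filter out.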
-
class katana.units.web.git.DownloadWorker(pending_tasks, tasks_done, args)
Bases: katana.units.web.git.Worker
Part of the Git Dumper procedure.
Download a list of files.
-
do_task(filepath, url, directory, retry, timeout, unit, katana)
-
init(url, directory, retry, timeout, unit, katana)
-
-
class katana.units.web.git.FindObjectsWorker(pending_tasks, tasks_done, args)
Bases: katana.units.web.git.DownloadWorker
Part of the Git Dumper procedure.
Find objects.
-
do_task(obj, url, directory, retry, timeout, unit, katana)
-
-
class katana.units.web.git.FindRefsWorker(pending_tasks, tasks_done, args)
Bases: katana.units.web.git.DownloadWorker
Part of the Git Dumper procedure.
Find refs/.
-
do_task(filepath, url, directory, retry, timeout, unit, katana)
-
-
class katana.units.web.git.RecursiveDownloadWorker(pending_tasks, tasks_done, args)
Bases: katana.units.web.git.DownloadWorker
Part of the Git Dumper procedure.
Download a directory recursively.
-
do_task(filepath, url, directory, retry, timeout, unit, katana)
-
-
class katana.units.web.git.Unit(*args, **kwargs)
Bases: katana.units.web.WebUnit
-
BAD_MIME_TYPES = ['application/octet-stream']
-
GROUPS = ['web', 'git']
These are “tags” for a unit. Considering it is a Web unit, “web” is included, as well as the name of the unit, “git”.
-
PRIORITY = 40
Priority works with 0 being the highest priority and 100 being the lowest priority. 50 is the default priority. This unit has a somewhat higher priority than the default.
-
RECURSE_SELF = False
This unit should not recurse into itself. It would make no sense.
-
evaluate(case: Any)
Evaluate the target. If a .git repository is found, download it and look through all of the objects for a flag.
Parameters: case – A case returned by enumerate. For this unit, the enumerate function is not used.
Returns: None. This function should not return any data.
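To make the "look through all of the objects for a flag" step more concrete, here is a minimal sketch that walks the loose objects of an already downloaded repository, decompresses each one, and collects regex matches. The function name find_flags_in_objects and the FLAG{...} pattern are hypothetical; how Katana actually matches flags is not described on this page.

```python
import os
import re
import tempfile
import zlib


def find_flags_in_objects(git_dir: str, flag_pattern: bytes) -> list:
    """Decompress every loose object under git_dir/objects and collect matches."""
    regex = re.compile(flag_pattern)
    found = []
    for root, _dirs, files in os.walk(os.path.join(git_dir, "objects")):
        for name in files:
            with open(os.path.join(root, name), "rb") as handle:
                raw = handle.read()
            try:
                data = zlib.decompress(raw)
            except zlib.error:
                continue  # pack files and indexes are not plain zlib streams
            found.extend(m.decode(errors="replace") for m in regex.findall(data))
    return found


# Demo against a throwaway directory containing a single fake blob object.
with tempfile.TemporaryDirectory() as git_dir:
    obj_dir = os.path.join(git_dir, "objects", "ab")
    os.makedirs(obj_dir)
    with open(os.path.join(obj_dir, "cdef"), "wb") as handle:
        handle.write(zlib.compress(b"blob 11\x00FLAG{demo}\n"))
    found = find_flags_in_objects(git_dir, rb"FLAG\{[^}]+\}")
```

This sketch only covers loose objects; objects stored in pack files would need a Git library or the git command line to unpack.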
-
-
class katana.units.web.git.Worker(pending_tasks, tasks_done, args)
Bases: multiprocessing.context.Process
Part of the Git Dumper procedure.
Worker for process_tasks.
-
do_task(task, *args)
-
init(*args)
-
run()
Method to be run in sub-process; can be overridden in sub-class.
-
-
katana.units.web.git.bad_starting_links = [b'#', b'javascript:', b'https://', b'http://', b'//']
This is a blacklist to avoid inline JavaScript, anchors, and external links.
-
katana.units.web.git.create_intermediate_dirs(path)
Part of the Git Dumper procedure.
Create intermediate directories, if necessary.
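A minimal sketch of what creating intermediate directories can look like, assuming path names a file whose parent directories may not exist yet. This is an illustration using the standard library, not the unit's confirmed implementation.

```python
import os
import tempfile


def create_intermediate_dirs(path: str) -> None:
    """Create the parent directories of `path` if they do not exist yet."""
    dirname = os.path.dirname(path)
    if dirname:
        # exist_ok avoids a race when several workers create the same directory
        os.makedirs(dirname, exist_ok=True)


# Usage: prepare the directory for a loose object file before writing it.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, ".git", "objects", "ab", "cdef")
    create_intermediate_dirs(target)
    created = os.path.isdir(os.path.dirname(target))
    file_exists = os.path.exists(target)
```

Only the directories are created; the file itself is left for the caller to write.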
-
katana.units.web.git.fetch_git(unit, url, directory, jobs, retry, timeout, katana)
Dump a .git repository into the output directory.
This is the core function of the https://github.com/arthaud/git-dumper code.
-
katana.units.web.git.get_indexed_files(response)
Part of the Git Dumper procedure.
Return all the files in the directory index webpage.
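The idea of listing files from a directory index page can be sketched with the standard library's HTML parser. The function below, list_index_links, is a hypothetical stand-in that takes the page's HTML text directly rather than a response object, and its filtering rules are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_index_links(html_text: str) -> list:
    """Return the relative entries linked from a directory index page."""
    parser = _LinkCollector()
    parser.feed(html_text)
    files = []
    for href in parser.hrefs:
        # Skip sort links, absolute paths, anchors, parent links and external URLs
        if href in (".", "./") or href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        files.append(href)
    return files


sample = (
    '<html><body><a href="../">Parent</a><a href="?C=N;O=D">Name</a>'
    '<a href="HEAD">HEAD</a><a href="objects/">objects/</a></body></html>'
)
entries = list_index_links(sample)
```

Entries that end with a slash are subdirectories, which a recursive download would visit in turn.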
-
katana.units.web.git.get_referenced_sha1(obj_file)
Part of the Git Dumper procedure.
Return all the referenced SHA1 in the given object file.
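To show what "referenced SHA1" means in practice, the sketch below handles the simplest case: a loose commit object, whose header lines name one tree and zero or more parents. The function name referenced_sha1_in_commit is hypothetical and it works on raw bytes; the real function takes an object file and presumably also covers trees and tags, which this example does not attempt.

```python
import re
import zlib


def referenced_sha1_in_commit(raw_object: bytes) -> list:
    """Return the tree and parent SHA-1s named by a zlib-compressed commit object."""
    data = zlib.decompress(raw_object)
    header, _, body = data.partition(b"\x00")
    if not header.startswith(b"commit "):
        return []  # trees store binary SHA-1s and need different parsing
    # Header lines end at the first blank line; the commit message follows.
    headers = body.split(b"\n\n", 1)[0]
    matches = re.findall(rb"^(?:tree|parent) ([0-9a-f]{40})$", headers, re.MULTILINE)
    return [m.decode() for m in matches]


# Build a fake commit object to exercise the parser.
tree_sha = "a" * 40
parent_sha = "b" * 40
commit_body = (
    f"tree {tree_sha}\nparent {parent_sha}\n"
    "author A <a@example.com> 0 +0000\n\ninitial commit\n"
).encode()
raw = zlib.compress(b"commit %d\x00" % len(commit_body) + commit_body)
refs = referenced_sha1_in_commit(raw)
```

Each returned hash maps to another object path of the form objects/xx/yyyy..., which is how one downloaded object leads to the next.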
-
katana.units.web.git.has_a_bad_start(link)
This is a convenience function that checks a link against bad_starting_links so that such links can be skipped.
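A plausible shape for this check, assuming link is a bytes value (the blacklist entries are bytes) and that "bad" means "starts with any blacklisted prefix". The body is an assumption; only the name, the parameter, and the list contents come from this page.

```python
# Copied from the documented module-level value.
bad_starting_links = [b"#", b"javascript:", b"https://", b"http://", b"//"]


def has_a_bad_start(link: bytes) -> bool:
    """Assumed behaviour: True if `link` begins with any blacklisted prefix."""
    return any(link.startswith(prefix) for prefix in bad_starting_links)
```

With this definition, b"objects/" and b"/static/app.js" pass, while b"#top" and b"//cdn.example.com/x.js" are rejected.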
-
katana.units.web.git.is_html(response)
Return True if the response is an HTML webpage.
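One way such a check could work is sketched below, assuming the response exposes headers and text attributes in the style of common HTTP client libraries. FakeResponse is a hypothetical stand-in used only to keep the example self-contained, and the heuristic itself is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing the two attributes this sketch reads."""
    text: str
    headers: dict = field(default_factory=dict)


def is_html(response) -> bool:
    """Assumed heuristic: HTML content type, or an <html tag in the body."""
    content_type = response.headers.get("Content-Type", "").lower()
    return "text/html" in content_type or "<html" in response.text.lower()
```

A check like this matters because a server that returns an HTML page for a path such as .git/HEAD is serving an error or index page, not the Git file itself.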
-
katana.units.web.git.process_tasks(initial_tasks, worker, jobs, args=(), tasks_done=None)
Part of the Git Dumper procedure.
Process tasks in parallel.
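The general pattern of processing a growing set of tasks in parallel can be sketched as follows. Note the substitutions: this example uses threads and a plain callable instead of the multiprocessing-based Worker classes documented above, and the name run_tasks and its signature are hypothetical. It only illustrates the idea that each task may discover new tasks and that every task runs at most once.

```python
import queue
import threading


def run_tasks(initial_tasks, do_task, jobs, tasks_done=None):
    """Run do_task over a growing set of tasks using `jobs` worker threads.

    do_task(task) must return an iterable of newly discovered tasks.
    Returns the set of every task that was scheduled.
    """
    tasks_done = set() if tasks_done is None else tasks_done
    pending = queue.Queue()
    lock = threading.Lock()

    def submit(task):
        # Queue a task only the first time it is seen
        with lock:
            if task in tasks_done:
                return
            tasks_done.add(task)
        pending.put(task)

    def worker():
        while True:
            task = pending.get()
            if task is None:  # sentinel: shut down
                pending.task_done()
                return
            try:
                for new_task in do_task(task):
                    submit(new_task)
            except Exception:
                pass  # a failed task is simply dropped in this sketch
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(jobs)]
    for thread in threads:
        thread.start()
    for task in initial_tasks:
        submit(task)
    pending.join()  # returns once all tasks, including discovered ones, are done
    for _ in threads:
        pending.put(None)
    for thread in threads:
        thread.join()
    return tasks_done


# Usage: each "file" points at further files, much like refs pointing at objects.
graph = {
    "HEAD": ["refs/heads/master"],
    "refs/heads/master": ["objects/aa", "HEAD"],
    "objects/aa": [],
}
seen = run_tasks(["HEAD"], lambda task: graph.get(task, []), jobs=4)
```

New tasks are submitted before the parent task is marked done, so the queue's unfinished count cannot reach zero while work is still being discovered.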