path: root/crocoite/controller.py
Age | Commit message | Author | Files | Lines
2018-09-25 | Prevent recursing into arbitrary schemes | Lars-Dominik Braun | 1 | -1/+9
HTTP(S) only.
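A minimal sketch of such a scheme check, assuming links arrive as plain URL strings. The names `ALLOWED_SCHEMES`, `is_recursable` and `filter_links` are hypothetical and not taken from the commit:

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the commit message only says "HTTP(S) only".
ALLOWED_SCHEMES = frozenset({"http", "https"})


def is_recursable(url: str) -> bool:
    """Return True if the URL's scheme is one we are willing to follow."""
    # urlsplit() normalizes the scheme to lower case.
    return urlsplit(url).scheme in ALLOWED_SCHEMES


def filter_links(links):
    """Keep only links that pass the scheme check, preserving order."""
    return [u for u in links if is_recursable(u)]
```

With this check, links such as `mailto:` or `javascript:` are dropped before they are queued for recursion.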
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 1 | -4/+14
❤️ asyncio.
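One way to run several grabs concurrently with asyncio is a semaphore-bounded `asyncio.gather`. This is an illustrative sketch only: `grab`, `grab_all` and the `concurrency` parameter are assumptions, and the actual change may be structured differently.

```python
import asyncio


async def grab(url: str) -> str:
    """Stand-in for archiving one page; a real grab would drive a browser."""
    await asyncio.sleep(0)
    return url


async def grab_all(urls, concurrency: int = 2):
    """Run grabs concurrently, with at most `concurrency` in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with limit:
            return await grab(url)

    # gather() returns results in the order the awaitables were passed.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The semaphore keeps the number of simultaneous grabs bounded, which matters when each grab holds a browser instance.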
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 1 | -1/+129
Simple and sequential.
2018-09-25 | Log extracted links | Lars-Dominik Braun | 1 | -0/+23
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 1 | -118/+3
Gonna rewrite that properly.
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 1 | -6/+16
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
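A rough sketch of what assembling such a JSON payload could look like. The field names (`python`, `packages`, `files`), the helper names and the choice of SHA-256 are assumptions for illustration, not the format the commit actually writes:

```python
import hashlib
import json
import platform
from importlib import metadata


def sha256_of(path: str) -> str:
    """Hash a file's contents; the hash algorithm is an assumption."""
    with open(path, "rb") as fd:
        return hashlib.sha256(fd.read()).hexdigest()


def environment_info(packages=(), files=()):
    """Collect interpreter version, package versions and file hashes."""
    return {
        "python": platform.python_version(),
        # metadata.version() raises PackageNotFoundError for unknown names.
        "packages": {name: metadata.version(name) for name in packages},
        "files": {path: sha256_of(path) for path in files},
    }


# Serialized form that could serve as the body of a warcinfo record.
payload = json.dumps(environment_info(), sort_keys=True)
```

Passing the names of installed distributions and the paths of the program's own source files would fill in the two mappings.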
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -23/+33
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -7/+1
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+22
This is mainly a quality-of-life change.
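One plausible reading of "quality of life" is that `__slots__` turns misspelled attribute assignments into errors. A small illustration with a hypothetical `Stats` class (not a class from the repository):

```python
class Stats:
    # Only these attribute names may be set on instances.
    __slots__ = ("requests", "failed")

    def __init__(self):
        self.requests = 0
        self.failed = 0


s = Stats()
s.requests += 1
try:
    s.requets = 5  # misspelled name, not declared in __slots__
    typo_accepted = True
except AttributeError:
    typo_accepted = False
```

Without `__slots__`, the misspelled assignment would silently create a new attribute.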
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -99/+161
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
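The producer/consumer pattern described above can be sketched with the standard library alone. The function names and the `(kind, payload)` event tuples are made up for illustration; the real code receives its events from pychrome:

```python
import queue
import threading


def browser(events: queue.Queue):
    """Producer: stands in for the browser callback thread."""
    try:
        events.put(("response", "https://example.com/"))
        raise ConnectionError("browser crashed")
    except Exception as exc:
        # Forward the failure instead of letting it die in this thread.
        events.put(("error", exc))


def consume(events: queue.Queue):
    """Consumer: runs on the main thread and sees every event in order."""
    handled = []
    while True:
        kind, payload = events.get()
        if kind == "error":
            handled.append(("crashed", str(payload)))
            break
        handled.append((kind, payload))
    return handled


events = queue.Queue()
worker = threading.Thread(target=browser, args=(events,))
worker.start()
result = consume(events)
worker.join()
```

Because the exception travels through the queue, the main thread learns about the crash and can stop cleanly instead of waiting forever.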
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 1 | -1/+1
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -5/+17
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -0/+100
Only local right now, not distributed.
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -1/+11
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -0/+103
In preparation for recursive crawls.