path: root/crocoite/controller.py
Age | Commit message | Author | Files | Lines
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -23/+33
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credit to structlog for inspiration.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -7/+1
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+22
This is mainly a quality of life change.
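The commit message does not say which benefit is meant. One plausible reading is that `__slots__` turns a mistyped attribute name into an immediate error instead of silently creating a new attribute. The sketch below illustrates that behavior; the `Stats` class and its fields are hypothetical and are not taken from crocoite.

```python
class Stats:
    """Hypothetical counter object; not a class from crocoite."""

    # Only these attribute names may be set on instances.
    __slots__ = ('requests', 'failed')

    def __init__(self):
        self.requests = 0
        self.failed = 0


def rejects_unknown_attribute(obj):
    """Return True if assigning a misspelled attribute raises AttributeError."""
    try:
        obj.request = 1  # typo: 'request' instead of 'requests'
    except AttributeError:
        return True
    return False
```

Without `__slots__` the assignment above would succeed and the typo would go unnoticed. Instances of a slotted class also carry no per-instance `__dict__`, which saves some memory.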
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -99/+161
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
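The queue-based design described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual crocoite code: `BrowserCrashed`, `producer`, `consume`, and `grab` are hypothetical names, and a plain thread stands in for the pychrome callback thread.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error signalling that the browser went away."""


def producer(events, items, crash_after=None):
    """Stand-in for the browser thread: it only enqueues, never processes."""
    for i, item in enumerate(items):
        if crash_after is not None and i == crash_after:
            # Hand the failure to the consumer instead of raising here,
            # where the main thread would never see it.
            events.put(BrowserCrashed('browser went away'))
            return
        events.put(item)
    events.put(None)  # sentinel: no more events


def consume(events, handlers):
    """Runs in the main thread and dispatches each event to every consumer."""
    while True:
        item = events.get()
        if item is None:
            return
        if isinstance(item, Exception):
            raise item  # the crash surfaces in the main thread
        for handle in handlers:
            handle(item)


def grab(items, crash_after=None):
    """Run one simulated grab and report how it ended."""
    events = queue.Queue()
    seen = []
    thread = threading.Thread(
        target=producer, args=(events, items, crash_after), daemon=True)
    thread.start()
    try:
        consume(events, [seen.append])
        status = 'ok'
    except BrowserCrashed:
        status = 'crashed'
    thread.join()
    return status, seen
```

Because the producer never touches the consumers directly, a crash becomes just another item in the queue, and the main thread decides how to react to it.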
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 1 | -1/+1
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -5/+17
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -0/+100
Only local right now, not distributed.
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -1/+11
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -0/+103
In preparation for recursive crawls.