path: root/contrib
Age | Commit message | Author | Files | Lines
2019-01-27 | irc: Add URL blacklist | Lars-Dominik Braun | 1 | -0/+3
2019-01-27 | irc: Switch configuration to JSON | Lars-Dominik Braun | 2 | -10/+12
2018-12-05 | irc: Add example config file | Lars-Dominik Braun | 1 | -0/+10
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 3 | -0/+156
Using websockets, vue and bulma.
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 1 | -229/+0
Gonna rewrite that properly.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -2/+1
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -5/+4
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
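The queue-based hand-off described in this commit message can be sketched roughly as follows. This is a minimal illustration of the general pattern, not crocoite's actual code: `BrowserCrashed`, `producer` and `consume` are hypothetical names, and a plain list of strings stands in for the pychrome event stream.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error standing in for a browser crash."""


def producer(events, items, crash_after=None):
    """Worker thread: enqueue events; enqueue failures instead of raising."""
    try:
        for index, item in enumerate(items):
            if crash_after is not None and index == crash_after:
                raise BrowserCrashed("browser went away")
            events.put(item)
    except Exception as exc:
        # Hand the exception to the main thread through the same queue.
        events.put(exc)
    finally:
        # Sentinel: tells the consumer that no more events will arrive.
        events.put(None)


def consume(items, crash_after=None):
    """Main thread: process queued events and re-raise producer errors."""
    events = queue.Queue()
    worker = threading.Thread(
        target=producer, args=(events, items, crash_after), daemon=True
    )
    worker.start()
    handled = []
    try:
        while True:
            event = events.get()
            if event is None:
                break
            if isinstance(event, Exception):
                raise event
            handled.append(event)
    finally:
        worker.join()
    return handled
```

Because the exception travels through the queue, the caller of `consume` sees the failure in the main thread and can react to it, instead of waiting forever on a worker that died silently.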
2018-05-05 | Rename command line tools | Lars-Dominik Braun | 2 | -124/+0
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab.
2018-05-05 | contrib: Add WARC merging script | Lars-Dominik Braun | 1 | -0/+70
Very useful for distributed, recursive crawls which create one WARC per page.
2018-05-04 | sopel: Use recursive, distributed controller | Lars-Dominik Braun | 1 | -2/+7
2018-05-04 | IRC plugin: Use argparse | Lars-Dominik Braun | 1 | -17/+33
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -15/+12
In preparation for recursive crawls.
2018-04-20 | Add screenshot extraction script to contrib/ | Lars-Dominik Braun | 1 | -0/+54
2018-02-22 | irc plugin: Serialize celery operations | Lars-Dominik Braun | 1 | -68/+105
This is a workaround for https://github.com/celery/celery/issues/4480
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -4/+4
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -10/+7
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -1/+14
2017-12-19 | Select default behavior scripts by site URL | Lars-Dominik Braun | 1 | -2/+24
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -0/+144
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.