path: root/crocoite/cli.py
Age | Commit message | Author | Files | Lines
2019-01-27 | irc: Switch configuration to JSON | Lars-Dominik Braun | 1 | -12/+12
2019-01-07 | Log Chrome’s responses to WARC by default | Lars-Dominik Braun | 1 | -1/+2
We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata.
2018-12-22 | Switch -recursive to asyncio’s .cancel() | Lars-Dominik Braun | 1 | -2/+3
RecursiveController used a custom .cancel() method before. Instead we can simply cancel .run() and handle the CancelledError inside run() and fetch().
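A minimal sketch of the pattern described above, not the project's actual code: the outer task is cancelled with asyncio's own `.cancel()`, and both `run()` and `fetch()` catch `CancelledError`, clean up, and re-raise. The function bodies, the URLs and the `log` list are assumptions made for illustration.

```python
import asyncio


async def fetch(url, log):
    """Stand-in for a single page grab (hypothetical body)."""
    try:
        await asyncio.sleep(3600)  # pretend the grab takes a long time
    except asyncio.CancelledError:
        log.append(f"fetch cancelled: {url}")
        raise  # re-raise so the task is marked as cancelled


async def run(urls, log):
    """Stand-in for a controller's run(): grab all URLs concurrently."""
    tasks = [asyncio.ensure_future(fetch(url, log)) for url in urls]
    try:
        await asyncio.gather(*tasks)
    except asyncio.CancelledError:
        # Cancelling run() cancels the gather(), which cancels its children.
        # Wait until every child has finished its own cleanup.
        await asyncio.gather(*tasks, return_exceptions=True)
        log.append("run cancelled")
        raise


async def main():
    log = []
    urls = ["http://example.com/a", "http://example.com/b"]
    job = asyncio.ensure_future(run(urls, log))
    await asyncio.sleep(0.01)  # give the fetches time to start
    job.cancel()               # plain asyncio cancellation, no custom method
    try:
        await job
    except asyncio.CancelledError:
        pass
    return log
```

Calling `asyncio.run(main())` returns the log entries written during cleanup; the two fetches report their cancellation before `run()` does.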
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -2/+3
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-12-02 | controller: Remove unused argument | Lars-Dominik Braun | 1 | -1/+1
Has been replaced by handler a while ago.
2018-12-01 | cli: Fix --behavior | Lars-Dominik Braun | 1 | -2/+3
2018-11-25 | single: Graceful ^C | Lars-Dominik Braun | 1 | -1/+5
Allow cancellation of timeout wait.
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -8/+5
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -3/+3
Move it to .devtools. Seems more fitting.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -6/+7
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-10-30 | Increase default timeouts | Lars-Dominik Braun | 1 | -2/+2
These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls.
2018-10-23 | single: Set and recursive: check exit status | Lars-Dominik Braun | 1 | -5/+20
Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, it might be broken.
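The exit-status check can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: the child process is a trivial Python one-liner standing in for a single-page grab, and the `stats` dictionary with its `finished` and `crashed` keys is hypothetical.

```python
import asyncio
import sys


async def grab(code):
    """Run a child process and return its exit status.

    The child is a Python one-liner standing in for a single-page grab;
    `code` is the exit status it should report.
    """
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", f"import sys; sys.exit({code})")
    return await proc.wait()


async def crawl(codes):
    stats = {"finished": 0, "crashed": 0}
    for code in codes:
        status = await grab(code)
        if status == 0:
            stats["finished"] += 1  # safe to move the WARC to its destination
        else:
            stats["crashed"] += 1   # leave the WARC alone, it might be broken
    return stats
```

For example, `asyncio.run(crawl([0, 1, 0]))` counts two finished grabs and one crashed grab.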
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 1 | -0/+8
Using websockets, vue and bulma.
2018-10-14 | irc: Graceful bot shutdown | Lars-Dominik Braun | 1 | -3/+7
Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke).
2018-10-11 | recursive: Gracefully shut down on SIGINT/TERM | Lars-Dominik Braun | 1 | -1/+4
2018-10-02 | irc: Refactoring/beautification | Lars-Dominik Braun | 1 | -3/+6
Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators.
2018-09-29 | Add documentation | Lars-Dominik Braun | 1 | -1/+6
For -recursive and -irc
2018-09-29 | irc: Limit number of processes spawned | Lars-Dominik Braun | 1 | -1/+2
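The commit above does not say how the limit is enforced. One common way to cap concurrent jobs in asyncio code is a semaphore, sketched below as an assumption rather than the bot's actual mechanism; `worker`, `run_jobs` and the `state` dictionary are hypothetical names, and the short sleep stands in for a spawned grab process.

```python
import asyncio


async def worker(n, limit, state):
    async with limit:  # waits while `max_processes` jobs are already running
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for a spawned grab process
        state["running"] -= 1
    return n


async def run_jobs(count, max_processes):
    limit = asyncio.Semaphore(max_processes)
    state = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(worker(n, limit, state) for n in range(count)))
    return results, state["peak"]
```

`state["peak"]` records the highest number of jobs that ran at the same time, which never exceeds `max_processes`.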
2018-09-29 | Add simple IRC bot | Lars-Dominik Braun | 1 | -0/+19
chromebot is back! Dropping sopel, because it does not work well with asyncio.
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 1 | -1/+3
❤️ asyncio.
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 1 | -0/+40
Simple and sequential.
2018-09-25 | Log extracted links | Lars-Dominik Braun | 1 | -2/+2
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 1 | -53/+20
Gonna rewrite that properly.
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -8/+8
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credit to structlog for inspiration.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -6/+13
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
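The producer/consumer split described above might look roughly like this. It is a simplified sketch using only the standard library, not the SiteLoader implementation: the event tuples, the `BrowserCrashed` exception and the simulated crash are all assumptions made for illustration.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error type used only in this sketch."""


def browser(events):
    """Producer: runs in its own thread and only enqueues events."""
    try:
        for i in range(3):
            events.put(("response", i))
        raise RuntimeError("tab crashed")  # simulate a browser crash
    except Exception as exc:
        events.put(("crash", exc))         # hand the failure to the consumer


def consume(events, handled):
    """Consumer: runs in the main thread, so exceptions surface there."""
    while True:
        kind, payload = events.get()
        if kind == "crash":
            raise BrowserCrashed(str(payload))
        handled.append(payload)            # e.g. write to WARC, update stats


def main():
    events = queue.Queue()
    handled = []
    threading.Thread(target=browser, args=(events,), daemon=True).start()
    try:
        consume(events, handled)
    except BrowserCrashed as exc:
        return handled, f"grab aborted: {exc}"
```

Because the crash travels through the same queue as ordinary events, the main thread sees it in order and can stop the grab instead of stalling.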
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 1 | -7/+13
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+5
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -23/+18
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -2/+15
Only local right now, not distributed.
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -2/+3
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -114/+21
In preparation for recursive crawls.
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -3/+3
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -146/+28
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -3/+7
2017-12-20 | Increase hardcoded max timeouts | Lars-Dominik Braun | 1 | -2/+2
We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing.
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -3/+3
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
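A rough sketch of decoupling writers with a queue and a dedicated thread, as the message above describes. Only the `.write_record()` name comes from the commit message; the `SerializedWriter` class, its sentinel and the `demo()` helper are assumptions, and a `BytesIO` buffer stands in for the WARC file.

```python
import io
import queue
import threading


class SerializedWriter:
    """Funnels records from many threads through a single writer thread."""

    _STOP = object()  # sentinel telling the writer thread to exit

    def __init__(self, fd):
        self._fd = fd
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain)
        self._thread.start()

    def write_record(self, record):
        # Safe to call concurrently: it only enqueues the record.
        self._queue.put(record)

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            self._fd.write(record)  # the only place that touches the file

    def close(self):
        self._queue.put(self._STOP)
        self._thread.join()


def demo():
    out = io.BytesIO()
    writer = SerializedWriter(out)

    def produce(n):
        for _ in range(50):
            writer.write_record(b"record %d\n" % n)

    threads = [threading.Thread(target=produce, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()
    return out.getvalue()
```

Callers return as soon as a record is queued, which is where the possible writeback speedup mentioned above would come from.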
2017-12-19 | Select default behavior scripts by site URL | Lars-Dominik Braun | 1 | -1/+10
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -145/+206
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -44/+49
Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-12-06 | Add flags to disable screenshot/DOM snapshot | Lars-Dominik Braun | 1 | -5/+9
2017-12-03 | Fix UTF-8 encoding name | Lars-Dominik Braun | 1 | -1/+1
HTMLSerializer uses the exact string given in <meta charset=X>, thus it should be with hyphen.
2017-12-03 | Add page screenshot to WARC | Lars-Dominik Braun | 1 | -0/+14
2017-11-29 | argparse: Add metavar | Lars-Dominik Braun | 1 | -7/+7
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -402/+50
Reusable browser communication and WARC writing.
2017-11-26 | DOM snapshot: Generate valid HTML5 | Lars-Dominik Braun | 1 | -7/+12
Some tags are “void”, i.e. cannot contain contents and don’t have a closing tag.
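To illustrate what handling void tags means for a serializer, here is a small sketch. The `serialize` function is hypothetical and far simpler than a real DOM serializer. The set below lists the void elements named in the current HTML standard and omits obsolete ones.

```python
from html import escape

# Void elements: they have no contents and no closing tag.
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
])


def serialize(tag, attrs=None, children=""):
    """Serialize one element; `children` is already-serialized markup."""
    attr_text = "".join(
        f' {name}="{escape(value, quote=True)}"'
        for name, value in (attrs or {}).items())
    if tag in VOID_ELEMENTS:
        return f"<{tag}{attr_text}>"  # no closing tag for void elements
    return f"<{tag}{attr_text}>{children}</{tag}>"
```

With this, `serialize("br")` produces a bare start tag, while `serialize("p", children="hi")` still gets its closing tag.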
2017-11-25 | Ignore duplicate URLs when saving DOM snapshot | Lars-Dominik Braun | 1 | -1/+10
2017-11-25 | Workaround broken device metrics reset | Lars-Dominik Braun | 1 | -1/+3
Apparently neither width=0, height=0 nor clearDeviceMetricsOverride() do what they should, so manually reset to 1080p screen size.
2017-11-25 | Strip on* HTML attributes | Lars-Dominik Braun | 1 | -1/+27
They can carry JavaScript as well and should not be allowed for DOM snapshots.
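Stripping event-handler attributes can be as simple as a prefix filter. The sketch below assumes attributes arrive as a plain dictionary, which may not match how the snapshot code represents them; `strip_event_handlers` is a hypothetical name.

```python
def strip_event_handlers(attrs):
    """Drop every attribute whose name starts with "on" (onclick, onload, …).

    HTML attribute names are case-insensitive, so compare in lower case.
    """
    return {name: value for name, value in attrs.items()
            if not name.lower().startswith("on")}
```

A prefix match is deliberately broad: it removes handlers the author has never heard of, at the cost of also dropping any custom attribute that happens to start with "on".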
2017-11-25 | Rename --run-before-snapshot and document --on* options | Lars-Dominik Braun | 1 | -3/+3
2017-11-24 | DOM snapshot: Save frames/subdocuments as well | Lars-Dominik Braun | 1 | -13/+36
Request all subdocuments with pierce=True, split the result and save each document. Playback with pywb works, because timestamps of the snapshots are close to each other.