path: root/crocoite/cli.py
Age | Commit message | Author | Files | Lines
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -8/+8
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -6/+13
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
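The queue-based handling described in this commit message can be illustrated with a minimal sketch. It is not the code from this commit: `BrowserCrashed`, `producer`, `consume` and `run` are hypothetical names, and plain strings stand in for the browser events.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error raised in the main thread when the browser dies."""


def producer(events, items):
    # Stand-in for the browser callback thread: it only enqueues events
    # and never handles them itself.
    for item in items:
        events.put(item)
    events.put(None)  # sentinel: no more events


def consume(events):
    # Runs in the calling (main) thread, so a crash surfaces here as an
    # exception instead of getting lost in a background thread.
    handled = []
    while True:
        item = events.get()
        if item is None:
            return handled
        if item == "crash":
            raise BrowserCrashed("browser went away")
        handled.append(item)


def run(items):
    events = queue.Queue()
    thread = threading.Thread(target=producer, args=(events, items))
    thread.start()
    try:
        return consume(events)
    finally:
        thread.join()
```

Because the consumer is the only place where events are acted upon, the caller of `run` can catch `BrowserCrashed` and decide whether to retry or abort.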
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 1 | -7/+13
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+5
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -23/+18
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -2/+15
Only local right now, not distributed.
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -2/+3
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -114/+21
In preparation for recursive crawls.
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -3/+3
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -146/+28
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -3/+7
2017-12-20 | Increase hardcoded max timeouts | Lars-Dominik Braun | 1 | -2/+2
We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing.
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -3/+3
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
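A minimal sketch of the "own thread plus queue" idea from this commit message, assuming a generic file-like object rather than the project's actual WARC writer. `SerializedWriter` is a hypothetical name; only `write_record` is borrowed from the message above.

```python
import queue
import threading


class SerializedWriter:
    """Hypothetical wrapper: only the worker thread touches the file object."""

    def __init__(self, fd):
        self._fd = fd
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain)
        self._thread.start()

    def write_record(self, data):
        # Safe to call from any thread; the record is only queued here.
        self._queue.put(data)

    def _drain(self):
        # Single consumer, so records are written one after another and
        # can never interleave.
        while True:
            data = self._queue.get()
            if data is None:  # sentinel from close()
                break
            self._fd.write(data)

    def close(self):
        # Flush everything queued so far, then stop the worker.
        self._queue.put(None)
        self._thread.join()
```

Callers return as soon as the record is queued, which is where the possible writeback speedup mentioned above would come from.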
2017-12-19 | Select default behavior scripts by site URL | Lars-Dominik Braun | 1 | -1/+10
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -145/+206
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -44/+49
Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-12-06 | Add flags to disable screenshot/DOM snapshot | Lars-Dominik Braun | 1 | -5/+9
2017-12-03 | Fix UTF-8 encoding name | Lars-Dominik Braun | 1 | -1/+1
HTMLSerializer uses the exact string given in <meta charset=X>, thus it should be with hyphen.
2017-12-03 | Add page screenshot to WARC | Lars-Dominik Braun | 1 | -0/+14
2017-11-29 | argparse: Add metavar | Lars-Dominik Braun | 1 | -7/+7
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -402/+50
Reusable browser communication and WARC writing.
2017-11-26 | DOM snapshot: Generate valid HTML5 | Lars-Dominik Braun | 1 | -7/+12
Some tags are “void”, i.e. cannot contain contents and don’t have a closing tag.
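To illustrate the point about void elements, here is a small sketch of a serializer that omits the closing tag for them. It is not the serializer used in this repository: `serialize` and `VOID_ELEMENTS` are hypothetical names, and the set lists the commonly cited HTML5 void elements.

```python
from html import escape

# Void elements have no content and no closing tag in HTML5.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def serialize(tag, attrs=None, children=()):
    # `children` are already-serialized strings; attribute values are escaped.
    attr_text = "".join(
        f' {name}="{escape(value, quote=True)}"'
        for name, value in (attrs or {}).items()
    )
    if tag in VOID_ELEMENTS:
        # Emitting "</br>" or "</img>" here would be invalid HTML5.
        return f"<{tag}{attr_text}>"
    return f"<{tag}{attr_text}>{''.join(children)}</{tag}>"
```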
2017-11-25 | Ignore duplicate URLs when saving DOM snapshot | Lars-Dominik Braun | 1 | -1/+10
2017-11-25 | Workaround broken device metrics reset | Lars-Dominik Braun | 1 | -1/+3
Apparently neither width=0, height=0 nor clearDeviceMetricsOverride() do what they should, so manually reset to 1080p screen size.
2017-11-25 | Strip on* HTML attributes | Lars-Dominik Braun | 1 | -1/+27
They can carry JavaScript as well and should not be allowed for DOM snapshots.
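A rough sketch of such a filter, assuming attributes are available as a plain name-to-value mapping. `strip_event_handlers` is a hypothetical helper, not the function added by this commit.

```python
def strip_event_handlers(attrs):
    # Drop every attribute whose name starts with "on" (onclick, onload, ...).
    # Attribute names are case-insensitive in HTML, hence the lower().
    # This deliberately over-approximates: any attribute beginning with "on"
    # is removed, whether or not it is a known event handler.
    return {
        name: value
        for name, value in attrs.items()
        if not name.lower().startswith("on")
    }
```

For example, `strip_event_handlers({"href": "/", "onclick": "x()"})` keeps only the `href` entry.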
2017-11-25 | Rename --run-before-snapshot and document --on* options | Lars-Dominik Braun | 1 | -3/+3
2017-11-24 | DOM snapshot: Save frames/subdocuments as well | Lars-Dominik Braun | 1 | -13/+36
Request all subdocuments with pierce=True, split the result and save each document. Playback with pywb works, because timestamps of the snapshots are close to each other.
2017-11-24 | Reset device metrics | Lars-Dominik Braun | 1 | -2/+5
2017-11-24 | Save onsnapshot script to WARC | Lars-Dominik Braun | 1 | -4/+8
2017-11-22 | Make <canvas> static before DOM snapshot | Lars-Dominik Braun | 1 | -8/+13
Use --run-before-snapshot=canvas-snapshot.js. Replaces <canvas> with image snapshot. We could use .captureStream() as well.
2017-11-22 | Emulate different screen sizes | Lars-Dominik Braun | 1 | -0/+25
Causes the browser to load CSS assets and <img> srcset, for example.
2017-11-22 | Add example fixups for Instagram | Lars-Dominik Braun | 1 | -3/+9
2017-11-21 | Move base64 metadata into WARC header | Lars-Dominik Braun | 1 | -1/+1
2017-11-21 | Graceful page load timeout | Lars-Dominik Braun | 1 | -9/+26
Stop scrolling script, wait for remaining resources to load.
2017-11-20 | Add page created from DOM snapshot | Lars-Dominik Braun | 1 | -6/+101
2017-11-17 | Initial import | Lars-Dominik Braun | 1 | -0/+320