crocoite.git - Web archiving using Google Chrome

Age	Commit message	Author	Files	Lines
2019-01-27	irc: Switch configuration to JSON	Lars-Dominik Braun	1	-12/+12

2019-01-07	Log Chrome’s responses to WARC by default	Lars-Dominik Braun	1	-1/+2
	We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata.
2018-12-22	Switch -recursive to asyncio’s .cancel()	Lars-Dominik Braun	1	-2/+3
	RecursiveController used a custom .cancel() method before. Instead we can simply cancel .run() and handle the CancelledError inside run() and fetch().
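The cancellation pattern described above can be sketched roughly as follows. This is an illustration only: the class body, the URL list and the sleep standing in for a grab are assumptions, not crocoite's actual code.

```python
import asyncio

class RecursiveController:
    """Hypothetical stand-in, not crocoite's real controller."""

    def __init__(self, urls):
        self.urls = list(urls)
        self.done = []
        self.cancelled = False

    async def fetch(self, url):
        try:
            await asyncio.sleep(0.05)  # placeholder for the real grab
            self.done.append(url)
        except asyncio.CancelledError:
            # Per-URL cleanup would go here; then pass the signal on to run().
            raise

    async def run(self):
        try:
            for url in self.urls:
                await self.fetch(url)
        except asyncio.CancelledError:
            # Shutdown was requested: stop quietly and report what we have.
            self.cancelled = True
        return self.done

async def main():
    urls = ['http://example.com/%d' % i for i in range(100)]
    controller = RecursiveController(urls)
    task = asyncio.ensure_future(controller.run())
    await asyncio.sleep(0.12)
    task.cancel()  # plain asyncio cancellation, no custom .cancel() method
    result = await task
    return controller, result

if __name__ == '__main__':
    controller, result = asyncio.run(main())
    print(controller.cancelled, len(result))
```

Because run() catches the CancelledError and returns normally, awaiting the task yields the partial result instead of raising.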
2018-12-21	Parse URLs by default	Lars-Dominik Braun	1	-2/+3
	Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-12-02	controller: Remove unused argument	Lars-Dominik Braun	1	-1/+1
	Has been replaced by handler a while ago.
2018-12-01	cli: Fix --behavior	Lars-Dominik Braun	1	-2/+3

2018-11-25	single: Graceful ^C	Lars-Dominik Braun	1	-1/+5
	Allow cancellation of timeout wait.
2018-11-19	Coding style	Lars-Dominik Braun	1	-8/+5
	Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-14	Async chrome process startup	Lars-Dominik Braun	1	-3/+3
	Move it to .devtools. Seems more fitting.
2018-11-06	Switch single mode to asyncio	Lars-Dominik Braun	1	-6/+7
	This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-10-30	Increase default timeouts	Lars-Dominik Braun	1	-2/+2
	These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls.
2018-10-23	single: Set and recursive: check exit status	Lars-Dominik Braun	1	-5/+20
	Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, since it might be broken.
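A minimal sketch of the exit status check, under stated assumptions: the child process here is a Python one-liner standing in for the single-page grabber, and the names run_single, crawl and the stats keys are hypothetical.

```python
import subprocess
import sys

def run_single(code):
    """Spawn a child that exits with `code` (stand-in for a single-page grab)."""
    cmd = [sys.executable, '-c', 'import sys; sys.exit(%d)' % code]
    return subprocess.run(cmd).returncode

def crawl(exit_codes):
    """Run one child per entry and count failures by exit status."""
    stats = {'finished': 0, 'crashed': 0}
    kept = []
    for i, code in enumerate(exit_codes):
        if run_single(code) == 0:
            stats['finished'] += 1
            kept.append(i)  # the WARC would be moved to its final place here
        else:
            stats['crashed'] += 1  # leave the WARC where it is, it may be broken
    return stats, kept

if __name__ == '__main__':
    print(crawl([0, 1, 0]))
```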
2018-10-14	irc: Add PoC dashboard	Lars-Dominik Braun	1	-0/+8
	Using websockets, vue and bulma.
2018-10-14	irc: Graceful bot shutdown	Lars-Dominik Braun	1	-3/+7
	Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke).
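One way to picture this shutdown behaviour is the sketch below. The Bot class and its submit, status and shutdown methods are assumptions made for illustration; the actual bot talks IRC and is structured differently.

```python
import asyncio

class Bot:
    """Hypothetical job runner, not the real chromebot."""

    def __init__(self):
        self.jobs = set()
        self.accepting = True

    def submit(self, coro):
        if not self.accepting:
            coro.close()  # refuse new work during shutdown
            return None
        task = asyncio.ensure_future(coro)
        self.jobs.add(task)
        task.add_done_callback(self.jobs.discard)
        return task

    def status(self):
        # Still usable while shutting down.
        return len(self.jobs)

    async def shutdown(self):
        self.accepting = False
        if self.jobs:
            await asyncio.wait(set(self.jobs))  # let remaining jobs finish

async def job(name):
    await asyncio.sleep(0.01)
    return name

async def demo():
    bot = Bot()
    first = bot.submit(job('a'))
    closing = asyncio.ensure_future(bot.shutdown())
    await asyncio.sleep(0)            # give shutdown() a chance to start
    rejected = bot.submit(job('b'))   # refused, shutdown is in progress
    running = bot.status()            # interaction still works
    await closing
    return first.result(), rejected, running, bot.status()

if __name__ == '__main__':
    print(asyncio.run(demo()))
```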
2018-10-11	recursive: Gracefully shut down on SIGINT/TERM	Lars-Dominik Braun	1	-1/+4

2018-10-02	irc: Refactoring/beautification	Lars-Dominik Braun	1	-3/+6
	Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators.
2018-09-29	Add documentation	Lars-Dominik Braun	1	-1/+6
	For -recursive and -irc
2018-09-29	irc: Limit number of processes spawned	Lars-Dominik Braun	1	-1/+2

2018-09-29	Add simple IRC bot	Lars-Dominik Braun	1	-0/+19
	chromebot is back! Dropping sopel, because it does not work well with asyncio.
2018-09-25	Parallelize recursive grabs	Lars-Dominik Braun	1	-1/+3
	❤️ asyncio.
2018-09-25	Add recursive controller	Lars-Dominik Braun	1	-0/+40
	Simple and sequential.
2018-09-25	Log extracted links	Lars-Dominik Braun	1	-2/+2

2018-08-21	Remove celery and recursion	Lars-Dominik Braun	1	-53/+20
	Gonna rewrite that properly.
2018-08-04	Reintroduce WARC logging	Lars-Dominik Braun	1	-8/+8
	Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-20	Synchronous SiteLoader event handling	Lars-Dominik Braun	1	-6/+13
	Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
	- Clear separation between producer (browser) and consumer (WARC, stats, …)
	- Behavior scripts now yield events as well, instead of accessing the WARC writer
	- WARC logging was removed (for now) and WARC writer does not require serialization any more
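The producer/consumer split described in this commit might look roughly like the sketch below. The event tuples, the BrowserCrashed exception and the function names are assumptions; pychrome is not used here, a plain thread plays the browser.

```python
import queue
import threading

class BrowserCrashed(Exception):
    pass

def browser(events):
    """Producer: runs in a background thread and only enqueues events."""
    for i in range(3):
        events.put(('response', 'http://example.com/%d' % i))
    events.put(('crash', None))

def consume(events):
    """Consumer: runs in the main thread, so exceptions surface there."""
    handled = []
    while True:
        kind, payload = events.get()
        if kind == 'crash':
            raise BrowserCrashed(handled)
        handled.append(payload)

def grab():
    events = queue.Queue()
    producer = threading.Thread(target=browser, args=(events,), daemon=True)
    producer.start()
    try:
        consume(events)
    except BrowserCrashed as e:
        # The crash is visible in the main thread and can be handled.
        return 'crashed', e.args[0]
    finally:
        producer.join()

if __name__ == '__main__':
    print(grab())
```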
2018-05-04	Share recursive argument parser	Lars-Dominik Braun	1	-7/+13

2018-05-04	Support --browser again for local crawls	Lars-Dominik Braun	1	-1/+5
	Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04	Add distributed recursive crawls	Lars-Dominik Braun	1	-23/+18

2018-05-04	Add support for recursive crawls	Lars-Dominik Braun	1	-2/+15
	Only local right now, not distributed.
2018-05-04	behavior: Add link extraction script	Lars-Dominik Braun	1	-2/+3

2018-05-04	Move page archiving logic to SinglePageController	Lars-Dominik Braun	1	-114/+21
	In preparation for recursive crawls.
2017-12-25	Increase default body size	Lars-Dominik Braun	1	-3/+3

2017-12-24	Refactor behavior scripts	Lars-Dominik Braun	1	-146/+28
	No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22	Add simple stats-keeping SiteLoader	Lars-Dominik Braun	1	-3/+7

2017-12-20	Increase hardcoded max timeouts	Lars-Dominik Braun	1	-2/+2
	We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing.
2017-12-19	Serialize WARC writing	Lars-Dominik Braun	1	-3/+3
	Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
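A minimal sketch of serializing writes through a queue and a dedicated thread. SerializedWriter and demo are hypothetical names, and a StringIO stands in for the WARC file; only write_record is taken from the commit message.

```python
import io
import queue
import threading

class SerializedWriter:
    """Hypothetical wrapper: only the writer thread touches the file."""

    def __init__(self, fd):
        self.fd = fd
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            record = self.queue.get()
            if record is None:  # sentinel: no more records
                break
            self.fd.write(record)

    def write_record(self, record):
        # Safe to call from any thread; the actual write happens elsewhere.
        self.queue.put(record)

    def close(self):
        self.queue.put(None)
        self.thread.join()

def demo():
    out = io.StringIO()
    writer = SerializedWriter(out)

    def producer(name):
        for i in range(50):
            writer.write_record('%s-%d\n' % (name, i))

    threads = [threading.Thread(target=producer, args=(n,))
               for n in ('logger', 'site')]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()
    return out.getvalue().splitlines()

if __name__ == '__main__':
    print(len(demo()))
```

Records from the two producers may interleave, but each record is written whole because a single thread performs every write.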
2017-12-19	Select default behavior scripts by site URL	Lars-Dominik Braun	1	-1/+10

2017-12-17	Add distributed archiving	Lars-Dominik Braun	1	-145/+206
	Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06	Start Chrome browser instance	Lars-Dominik Braun	1	-44/+49
	Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-12-06	Add flags to disable screenshot/DOM snapshot	Lars-Dominik Braun	1	-5/+9

2017-12-03	Fix UTF-8 encoding name	Lars-Dominik Braun	1	-1/+1
	HTMLSerializer uses the exact string given in <meta charset=X>, thus it should be with hyphen.
2017-12-03	Add page screenshot to WARC	Lars-Dominik Braun	1	-0/+14

2017-11-29	argparse: Add metavar	Lars-Dominik Braun	1	-7/+7

2017-11-29	Refactoring	Lars-Dominik Braun	1	-402/+50
	Reusable browser communication and WARC writing.
2017-11-26	DOM snapshot: Generate valid HTML5	Lars-Dominik Braun	1	-7/+12
	Some tags are “void”, i.e. cannot contain contents and don’t have a closing tag.
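To illustrate the void element rule, here is a small serializer sketch. The node representation (strings for text, tuples for elements) is an assumption made for this example, and the tag set lists common void elements without claiming to match the one used in the snapshot code.

```python
from html import escape

# Common void elements; not necessarily the complete or the project's list.
VOID_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
])

def serialize(node):
    """Serialize a str (text) or a (tag, attrs, children) tuple to HTML."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, attrs, children = node
    attr_text = ''.join(
        ' %s="%s"' % (name, escape(value, quote=True))
        for name, value in attrs.items())
    if tag in VOID_TAGS:
        # Void elements have no contents and no closing tag.
        return '<%s%s>' % (tag, attr_text)
    inner = ''.join(serialize(child) for child in children)
    return '<%s%s>%s</%s>' % (tag, attr_text, inner, tag)

if __name__ == '__main__':
    print(serialize(('p', {}, ['a', ('br', {}, []), 'b & c'])))
```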
2017-11-25	Ignore duplicate URLs when saving DOM snapshot	Lars-Dominik Braun	1	-1/+10

2017-11-25	Workaround broken device metrics reset	Lars-Dominik Braun	1	-1/+3
	Apparently neither width=0, height=0 nor clearDeviceMetricsOverride() do what they should, so manually reset to 1080p screen size.
2017-11-25	Strip on* HTML attributes	Lars-Dominik Braun	1	-1/+27
	They can carry JavaScript as well and should not be allowed for DOM snapshots.
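Stripping such attributes could be sketched as below. The function name and the dict-based attribute representation are assumptions; the prefix check is deliberately blunt and removes every attribute whose name starts with "on".

```python
def strip_event_handlers(attrs):
    """Return a copy of attrs without on* attributes (onclick, onload, ...)."""
    return {
        name: value
        for name, value in attrs.items()
        # Attribute names are case-insensitive in HTML, so compare lowercased.
        if not name.lower().startswith('on')
    }

if __name__ == '__main__':
    print(strip_event_handlers({'href': '/', 'onclick': 'x()'}))
```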
2017-11-25	Rename --run-before-snapshot and document --on* options	Lars-Dominik Braun	1	-3/+3

2017-11-24	DOM snapshot: Save frames/subdocuments as well	Lars-Dominik Braun	1	-13/+36
	Request all subdocuments with pierce=True, split the result and save each document. Playback with pywb works, because timestamps of the snapshots are close to each other.