crocoite.git - Web archiving using Google Chrome

Age	Commit message	Author	Files	Lines
2018-11-14	Async chrome process startup	Lars-Dominik Braun	6	-157/+161
	Move it to .devtools. Seems more fitting.
2018-11-10	tools: Fix WARC merging	Lars-Dominik Braun	2	-18/+205
	WARC-Target-URI was taken from the previous record, even if the URI was different. This essentially removes the revisited URL from the archive. Also add a few tests. And boy, warcio is a mess.
2018-11-08	devtools: Disable websocket pings to Chrome	Lars-Dominik Braun	2	-1/+12
	Chrome does not like that.
2018-11-06	Switch single mode to asyncio	Lars-Dominik Braun	5	-175/+141
	This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-11-06	Switch site loader to async DevTools communication	Lars-Dominik Braun	2	-229/+236

2018-11-06	Add simple asyncio-based DevTool communication	Lars-Dominik Braun	2	-0/+406
	Inspired by pychrome/aiochrome, but includes crash handling and async get() instead of callbacks.
2018-11-03	html: Add tests for tag/attribute stripping	Lars-Dominik Braun	1	-0/+38

2018-10-30	recursive: Actually stop the grab when canceled	Lars-Dominik Braun	1	-1/+3
	This change was lost during the merge of 958563a3602780b48599c27acf212139c2e6904d.
2018-10-30	Reduce idle wait time after stopping page	Lars-Dominik Braun	1	-4/+4

2018-10-30	Increase default timeouts	Lars-Dominik Braun	1	-2/+2
	These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls.
2018-10-23	single: Set and recursive: check exit status	Lars-Dominik Braun	2	-12/+34
	Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, it might be broken.
2018-10-22	behavior: Unload script only if the handle is valid	Lars-Dominik Braun	1	-2/+4
	For some reason with Google Chrome 70 this is not the case any more.
2018-10-14	irc: Add PoC dashboard	Lars-Dominik Braun	3	-16/+119
	Using websockets, vue and bulma.
2018-10-14	irc: Graceful bot shutdown	Lars-Dominik Braun	3	-16/+110
	Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke).
2018-10-11	recursive: Gracefully shut down on SIGINT/TERM	Lars-Dominik Braun	2	-4/+18

2018-10-10	Add timezone to logger dates	Lars-Dominik Braun	1	-1/+3
	UTC everywhere. Make that clear.
2018-10-03	controller: Depth limit does not work with i>1	Lars-Dominik Braun	1	-1/+3
	No easy way to fix this, so just limit to [0, 1] for now.
2018-10-03	irc: Fix mode parsing	Lars-Dominik Braun	2	-7/+37
	Ignore unsupported modes, add tests.
2018-10-02	irc: Refactoring/beautification	Lars-Dominik Braun	2	-101/+266
	Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators.
2018-09-29	Add documentation	Lars-Dominik Braun	2	-3/+9
	For -recursive and -irc
2018-09-29	irc: Limit number of processes spawned	Lars-Dominik Braun	2	-21/+25

2018-09-29	Add simple IRC bot	Lars-Dominik Braun	2	-0/+273
	chromebot is back! Dropping sopel, because it does not work well with asyncio.
2018-09-25	Prevent recursing into arbitrary schemes	Lars-Dominik Braun	1	-1/+9
	HTTP(S) only.
2018-09-25	Parallelize recursive grabs	Lars-Dominik Braun	2	-5/+17
	❤️ asyncio.
2018-09-25	Add recursive controller	Lars-Dominik Braun	2	-1/+169
	Simple and sequential.
2018-09-25	Immediately flush logger	Lars-Dominik Braun	1	-0/+2
	Consumers can read the latest gossip faster now.
2018-09-25	Log extracted links	Lars-Dominik Braun	2	-2/+25

2018-08-21	Remove celery and recursion	Lars-Dominik Braun	3	-317/+23
	Gonna rewrite that properly.
2018-08-17	behavior: Load more comments from Facebook	Lars-Dominik Braun	1	-0/+4

2018-08-05	test_browser: Properly handle failed requests	Lars-Dominik Braun	2	-15/+14
	Fixes test failures. Very fragile code unfortunately.
2018-08-04	Properly handle failure to retrieve request body	Lars-Dominik Braun	3	-5/+50
	Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04	Reference warcinfo record in every other record	Lars-Dominik Braun	1	-18/+30

2018-08-04	Add package information to warcinfo	Lars-Dominik Braun	3	-8/+65
	Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
2018-08-04	Reintroduce WARC logging	Lars-Dominik Braun	9	-76/+337
	Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25	browser: Fix testcase race condition	Lars-Dominik Braun	1	-0/+4

2018-06-25	warc: Add metadata to truncated records	Lars-Dominik Braun	1	-22/+28
	Specifically for a) redirects (body missing) b) bodies larger than size limit and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they?
2018-06-25	warc: Save DOM-/image screenshot as WARC conversion	Lars-Dominik Braun	6	-37/+72
	Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21	Fix a few issues pointed out by pylint	Lars-Dominik Braun	5	-22/+10

2018-06-21	browser: Add a few more tests	Lars-Dominik Braun	1	-3/+31
	Increase coverage.
2018-06-20	Move tests to pytest	Lars-Dominik Braun	2	-162/+177
	It just seems a little nicer than plain old unittest
2018-06-20	Add __slots__ to classes	Lars-Dominik Braun	5	-1/+56
	This is mainly a quality of life change
2018-06-20	Synchronous SiteLoader event handling	Lars-Dominik Braun	6	-509/+514
	Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
	- Clear separation between producer (browser) and consumer (WARC, stats, …)
	- Behavior scripts now yield events as well, instead of accessing the WARC writer
	- WARC logging was removed (for now) and WARC writer does not require serialization any more
2018-06-08	browser: Replace --remote-debugging-socket-fd	Lars-Dominik Braun	1	-23/+19
	It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir.
2018-06-03	behavior: Wrap extract links script in anonymous namespace	Lars-Dominik Braun	2	-2/+5
	Otherwise it may clash with symbols defined by the page.
2018-05-20	behavior: Patreon: Load more comments/replies	Lars-Dominik Braun	1	-0/+4

2018-05-20	behavior: Click Patreon’s “load more” button	Lars-Dominik Braun	1	-0/+6

2018-05-05	Rename command line tools	Lars-Dominik Braun	1	-0/+97
	Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab.
2018-05-05	Extract only visible and clickable links	Lars-Dominik Braun	2	-4/+29

2018-05-04	Share recursive argument parser	Lars-Dominik Braun	2	-14/+15

2018-05-04	Support --browser again for local crawls	Lars-Dominik Braun	2	-2/+6
	Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
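
The 2018-09-25 entry "Prevent recursing into arbitrary schemes" limits recursion to HTTP(S) links. A minimal sketch of such a filter follows. It is not crocoite's actual code: the names ALLOWED_SCHEMES and is_crawlable and the sample links are hypothetical, and only the standard library's urllib.parse.urlsplit is assumed.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the commit message only says "HTTP(S) only".
ALLOWED_SCHEMES = {"http", "https"}


def is_crawlable(url: str) -> bool:
    """Return True only for URLs whose scheme is http or https."""
    # urlsplit() lowercases the scheme, so "HTTP://..." is accepted too.
    return urlsplit(url).scheme in ALLOWED_SCHEMES


# Made-up links, as they might come out of a link extraction step.
links = [
    "https://example.com/a",
    "mailto:user@example.com",
    "javascript:void(0)",
    "ftp://example.com/file",
    "HTTP://EXAMPLE.COM/",
]

# Keep only the links a recursive grab would be allowed to follow.
crawlable = [url for url in links if is_crawlable(url)]
```

An allow-list is used here rather than a deny-list so that any scheme not explicitly listed (data:, tel:, and so on) is skipped by default.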