40 files changed, 3143 insertions, 1195 deletions
diff --git a/.travis.yml b/.travis.yml
index c687962..b1d417c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -1,11 +1,17 @@
 dist: xenial
 language: python
-python:
-  - "3.6"
-  - "3.6-dev"
-  - "3.7"
-  - "3.7-dev"
-  - "3.8-dev"
+matrix:
+  include:
+    - python: "3.6"
+    - python: "3.7"
+    - python: "3.8"
+    - python: "3.6-dev"
+    - python: "3.7-dev"
+    - python: "3.8-dev"
+  allow_failures:
+    - python: "3.6-dev"
+    - python: "3.7-dev"
+    - python: "3.8-dev"
 install:
   - pip install .
 script:
@@ -1,211 +1,15 @@
 crocoite
 ========
 
-Preservation for the modern web, powered by `headless Google
-Chrome`_.
-
-.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
-    :target: https://travis-ci.org/PromyLOPh/crocoite
-
-.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
-    :target: https://codecov.io/gh/PromyLOPh/crocoite
-
-.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-
-Quick start
------------
-
-These dependencies must be present to run crocoite:
-
-- Python ≥3.6
-- PyYAML_
-- aiohttp_
-- websockets_
-- warcio_
-- html5lib_
-- bottom_ (IRC client)
-- `Google Chrome`_
-
-.. _PyYAML: https://pyyaml.org/wiki/PyYAML
-.. _aiohttp: https://aiohttp.readthedocs.io/
-.. _websockets: https://websockets.readthedocs.io/
-.. _warcio: https://github.com/webrecorder/warcio
-.. _html5lib: https://github.com/html5lib/html5lib-python
-.. _bottom: https://github.com/numberoverzero/bottom
-.. _Google Chrome: https://www.google.com/chrome/
-
-The following commands clone the repository from GitHub_, set up a virtual
-environment and install crocoite:
-
-.. _GitHub: https://github.com/PromyLOPh/crocoite
-
 .. code:: bash
 
-    git clone https://github.com/PromyLOPh/crocoite.git
-    cd crocoite
-    virtualenv -p python3 sandbox
-    source sandbox/bin/activate
-    pip install .
-
-One-shot command line interface and pywb_ playback:
-
-.. code:: bash
-
-    pip install pywb
-    crocoite-grab http://example.com/ example.com.warc.gz
-    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
+    pip install crocoite pywb
+    crocoite http://example.com/ example.com.warc.gz
+    wb-manager init test && wb-manager add test example.com.warc.gz
     wayback &
     $BROWSER http://localhost:8080
 
-.. _pywb: https://github.com/ikreymer/pywb
-
-Rationale
----------
-
-Most modern websites depend heavily on executing code, usually JavaScript, on
-the user’s machine. They also make use of new and emerging Web technologies
-like HTML5, WebSockets, service workers and more. Even worse from the
-preservation point of view, they also require some form of user interaction to
-dynamically load more content (infinite scrolling, dynamic comment loading,
-etc).
-
-The naive approach of fetching a HTML page, parsing it and extracting
-links to referenced resources therefore is not sufficient to create a faithful
-snapshot of these web applications. A full browser, capable of running scripts and
-providing modern Web API’s is absolutely required for this task. Thankfully
-Google Chrome runs without a display (headless mode) and can be controlled by
-external programs, allowing them to navigate and extract or inject data.
-This section describes the solutions crocoite offers and explains design
-decisions taken.
-
-crocoite captures resources by listening to Chrome’s `network events`_ and
-requesting the response body using `Network.getResponseBody`_. This approach
-has caveats: The original HTTP requests and responses, as sent over the wire,
-are not available. They are reconstructed from parsed data. The character
-encoding for text documents is changed to UTF-8. And the content body of HTTP
-redirects cannot be retrieved due to a race condition.
-
-.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
-.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
-
-But at the same time it allows crocoite to rely on Chrome’s well-tested network
-stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
-transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
-need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
-traffic and present a fake certificate to the browser in order to store the
-transmitted content.
-
-.. _warcprox: https://github.com/internetarchive/warcprox
-
-WARC records generated by crocoite therefore are an abstract view on the
-resource they represent and not necessarily the data sent over the wire. A URL
-fetched with HTTP/2 for example will still result in a HTTP/1.1
-request/response pair in the WARC file. This may be undesireable from
-an archivist’s point of view (“save the data exactly like we received it”). But
-this level of abstraction is inevitable when dealing with more than one
-protocol.
-
-crocoite also interacts with and therefore alters the grabbed websites. It does
-so by injecting `behavior scripts`_ into the site. Typically these are written
-in JavaScript, because interacting with a page is easier this way. These
-scripts then perform different tasks: Extracting targets from visible
-hyperlinks, clicking buttons or scrolling the website to to load more content,
-as well as taking a static screenshot of ``<canvas>`` elements for the DOM
-snapshot (see below).
-
-.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
-Replaying archived WARC’s can be quite challenging and might not be possible
-with current technology (or even at all):
-
-- Some sites request assets based on screen resolution, pixel ratio and
-  supported image formats (webp). Replaying those with different parameters
-  won’t work, since assets for those are missing. Example: missguided.com.
-- Some fetch different scripts based on user agent. Example: youtube.com.
-- Requests containing randomly generated JavaScript callback function names
-  won’t work. Example: weather.com.
-- Range requests (Range: bytes=1-100) are captured as-is, making playback
-  difficult
-
-crocoite offers two methods to work around these issues. Firstly it can save a
-DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
-``<script>`` tags after the site has been fully loaded and thus can be
-displayed without executing scripts. Obviously JavaScript-based navigation
-does not work any more. Secondly it also saves a screenshot of the full page,
-so even if future browsers cannot render and display the stored HTML a fully
-rendered version of the website can be replayed instead.
-
-Advanced usage
---------------
-
-crocoite is built with the Unix philosophy (“do one thing and do it well”) in
-mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion
-use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
-It can either recurse a maximum number of levels or grab all pages with the
-same prefix as the start URL:
-
-.. code:: bash
-
-    crocoite-recursive --policy prefix http://www.example.com/dir/ output
-
-will save all pages in ``/dir/`` and below to individual files in the output
-directory ``output``. You can customize the command used to grab individual
-pages by appending it after ``output``. This way distributed grabs (ssh to a
-different machine and execute the job there, queue the command with Slurm, …)
-are possible.
-
-IRC bot
-^^^^^^^
-
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
-It reads its configuration from a config file like the example provided in
-``contrib/chromebot.ini`` and supports the following commands:
-
-a <url> -j <concurrency> -r <policy>
-    Archive <url> with <concurrency> processes according to recursion <policy>
-s <uuid>
-    Get job status for <uuid>
-r <uuid>
-    Revoke or abort running job with <uuid>
-
-Browser configuration
-^^^^^^^^^^^^^^^^^^^^^
-
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
-
-.. code:: bash
-
-    google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
-
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s builtin
-console.
-
-Another issue that might arise is related to fonts. Headless servers usually
-don’t have them installed by default and thus rendered screenshots may contain
-replacement characters (□) instead of the actual text. This affects mostly
-non-latin character sets. It is therefore recommended to install at least
-Micrsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font family
-covering a wide range of character sets.
-
-.. _devtools module: crocoite/devtools.py
-.. _Corefonts: http://corefonts.sourceforge.net/
-.. _DejaVu: https://dejavu-fonts.github.io/
-.. _Liberation: https://pagure.io/liberation-fonts
-
-Related projects
-----------------
-
-brozzler_
-    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
-    distributed crawling and immediate playback.
-Squidwarc_
-    Communicates with headless Google Chrome and uses the Network API to
-    retrieve requests like crocoite. Supports recursive crawls and page
-    scrolling, but neither custom JavaScript nor distributed crawling.
+See documentation_ for more information.
 
-.. _brozzler: https://github.com/internetarchive/brozzler
-.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
+.. _documentation: https://6xq.net/crocoite/
diff --git a/contrib/chromebot.ini b/contrib/chromebot.ini
deleted file mode 100644
index a302356..0000000
--- a/contrib/chromebot.ini
+++ /dev/null
@@ -1,10 +0,0 @@
-[irc]
-host = irc.example.com
-port = 6667
-ssl = False
-tempdir = /path/to/warc
-destdir = /path/to/tmp
-nick = chromebot
-channel = #testchannel
-process_limit = 1
-
diff --git a/contrib/chromebot.json b/contrib/chromebot.json
new file mode 100644
index 0000000..214b770
--- /dev/null
+++ b/contrib/chromebot.json
@@ -0,0 +1,16 @@
+{
+    "irc": {
+        "host": "irc.example.com",
+        "port": 6667,
+        "ssl": false,
+        "nick": "chromebot",
+        "channels": ["#testchannel"]
+    },
+    "tempdir": "/path/to/tmp",
+    "destdir": "/path/to/warc",
+    "process_limit": 1
+    "blacklist": {
+        "^https?://(.+\\.)?local(host)?/": "Not acceptable"
+    },
+    "need_voice": false
+}
diff --git a/contrib/dashboard.html b/contrib/dashboard.html
index cc09d50..49a15bc 100644
--- a/contrib/dashboard.html
+++ b/contrib/dashboard.html
@@ -4,7 +4,7 @@
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
     <title>chromebot dashboard</title>
-    <!--<script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>-->
+    <!--<script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>-->
     <script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.min.js"></script>
     <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.7/css/bulma.min.css">
     <link rel="stylesheet" href="dashboard.css">
@@ -13,8 +13,9 @@
     <noscript>Please enable JavaScript.</noscript>
     <section id="app" class="section">
         <h1 class="title">chromebot dashboard</h1>
+        <bot-status v-bind:jobs="jobs"></bot-status>
         <div class="jobs">
-            <job-item v-for="j in jobs" v-bind:job="j" v-bind:jobs="jobs" v-bind:ignored="ignored" v-bind:key="j.id"></job-item>
+            <job-item v-for="j in jobs" v-bind:job="j" v-bind:jobs="jobs" v-bind:key="j.id"></job-item>
         </div>
     </section>
     <script src="dashboard.js"></script>
diff --git a/contrib/dashboard.js b/contrib/dashboard.js
index eb34d43..b5520dc 100644
--- a/contrib/dashboard.js
+++ b/contrib/dashboard.js
@@ -1,5 +1,5 @@
 /* configuration */
-let socket = "ws://localhost:6789/",
+let socket = "wss://localhost:6789/",
     urllogMax = 100;
 
 function formatSize (bytes) {
@@ -35,19 +35,12 @@ class Job {
 }
 
 let jobs = {};
-/* list of ignored job ids, i.e. those the user deleted from the dashboard */
-let ignored = [];
 let ws = new WebSocket(socket);
 ws.onmessage = function (event) {
     var msg = JSON.parse (event.data);
     let msgdate = new Date (Date.parse (msg.date));
     var j = undefined;
-    console.log (msg);
     if (msg.job) {
-        if (ignored.includes (msg.job)) {
-            console.log ("job ignored", msg.job);
-            return;
-        }
         j = jobs[msg.job];
         if (j === undefined) {
             j = new Job (msg.job, 'unknown', '<unknown>', new Date ());
@@ -79,7 +72,7 @@ ws.onmessage = function (event) {
         } else if (rmsg.uuid == '5b8498e4-868d-413c-a67e-004516b8452c') {
             /* recursion status */
             Object.assign (j.stats, rmsg);
-        } else if (rmsg.uuid == '1680f384-744c-4b8a-815b-7346e632e8db') {
+        } else if (rmsg.uuid == 'd1288fbe-8bae-42c8-af8c-f2fa8b41794f') {
             /* fetch */
             j.addUrl (rmsg.url);
         }
@@ -91,14 +84,8 @@ ws.onerror = function (event) {
 };
 
 Vue.component('job-item', {
-    props: ['job', 'jobs', 'ignored'],
-    template: '<div class="job box" :id="job.id"><ul class="columns"><li class="jid column is-narrow"><a :href="\'#\' + job.id">{{ job.id }}</a></li><li class="url column"><a :href="job.url">{{ job.url }}</a></li><li class="status column is-narrow"><job-status v-bind:job="job"></job-status></li><li class="column is-narrow"><a class="delete" v-on:click="del(job.id)"></a></li></ul><job-stats v-bind:job="job"></job-stats></div>',
-    methods: {
-        del: function (id) {
-            Vue.delete(this.jobs, id);
-            this.ignored.push (id);
-        }
-    }
+    props: ['job', 'jobs'],
+    template: '<div class="job box" :id="job.id"><ul class="columns"><li class="jid column is-narrow"><a :href="\'#\' + job.id">{{ job.id }}</a></li><li class="url column"><a :href="job.url">{{ job.url }}</a></li><li class="status column is-narrow"><job-status v-bind:job="job"></job-status></li></ul><job-stats v-bind:job="job"></job-stats></div>',
 });
 Vue.component('job-status', {
     props: ['job'],
@@ -117,6 +104,21 @@ Vue.component('filesize', {
     template: '<span class="filesize">{{ fvalue }}</span>',
     computed: { fvalue: function () { return formatSize (this.value); } }
 });
+Vue.component('bot-status', {
+    props: ['jobs'],
+    template: '<nav class="level"><div class="level-item has-text-centered"><div><p class="heading">Pending</p><p class="title">{{ stats.pending }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Running</p><p class="title">{{ stats.running }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Finished</p><p class="title">{{ stats.finished+stats.aborted }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Transferred</p><p class="title"><filesize v-bind:value="stats.totalBytes"></filesize></p></div></div></nav>',
+    computed: {
+        stats: function () {
+            let s = {pending: 0, running: 0, finished: 0, aborted: 0, totalBytes: 0};
+            for (let k in this.jobs) {
+                let j = this.jobs[k];
+                s[j.status]++;
+                s.totalBytes += j.stats.bytesRcv;
+            }
+            return s;
+        }
+    }
+});
 
 let app = new Vue({
     el: '#app',
diff --git a/crocoite/behavior.py b/crocoite/behavior.py
index eb5478b..1610751 100644
--- a/crocoite/behavior.py
+++ b/crocoite/behavior.py
@@ -35,35 +35,41 @@ instance.
 """
 
 import asyncio, json, os.path
-from urllib.parse import urlsplit
 from base64 import b64decode
 from collections import OrderedDict
 import pkg_resources
 
 from html5lib.serializer import HTMLSerializer
+from yarl import URL
 import yaml
 
-from .util import getFormattedViewportMetrics, removeFragment
+from .util import getFormattedViewportMetrics
 from . import html
 from .html import StripAttributeFilter, StripTagFilter, ChromeTreeWalker
-from .devtools import Crashed
+from .devtools import Crashed, TabException
 
 class Script:
     """ A JavaScript resource """
 
     __slots__ = ('path', 'data')
+    datadir = 'data'
 
     def __init__ (self, path=None, encoding='utf-8'):
         self.path = path
         if path:
-            self.data = pkg_resources.resource_string (__name__, os.path.join ('data', path)).decode (encoding)
+            self.data = pkg_resources.resource_string (__name__, os.path.join (self.datadir, path)).decode (encoding)
 
     def __repr__ (self):
-        return '<Script {}>'.format (self.path)
+        return f'<Script {self.path}>'
 
     def __str__ (self):
         return self.data
 
+    @property
+    def abspath (self):
+        return pkg_resources.resource_filename (__name__,
+                os.path.join (self.datadir, self.path))
+
     @classmethod
     def fromStr (cls, data, path=None):
         s = Script ()
@@ -89,33 +95,23 @@ class Behavior:
         return True
 
     def __repr__ (self):
-        return '<Behavior {}>'.format (self.name)
+        return f'<Behavior {self.name}>'
 
     async def onload (self):
         """ After loading the page started """
         # this is a dirty hack to make this function an async generator
         return
-        yield
+        yield # pragma: no cover
 
     async def onstop (self):
         """ Before page loading is stopped """
         return
-        yield
+        yield # pragma: no cover
 
     async def onfinish (self):
         """ After the site has stopped loading """
         return
-        yield
-
-class HostnameFilter:
-    """ Limit behavior script to hostname """
-
-    hostname = None
-
-    def __contains__ (self, url):
-        url = urlsplit (url)
-        hostname = url.hostname.split ('.')[::-1]
-        return hostname[:2] == self.hostname
+        yield # pragma: no cover
 
 class JsOnload (Behavior):
     """ Execute JavaScript on page load """
@@ -141,6 +137,8 @@ class JsOnload (Behavior):
         # parameter.
         # XXX: is there a better way to do this?
         result = await tab.Runtime.evaluate (expression=str (self.script))
+        self.logger.debug ('behavior onload inject',
+                uuid='a2da9b78-5648-44c5-bfa8-5c7573e13ad3', result=result)
         exception = result.get ('exceptionDetails', None)
         result = result['result']
         assert result['type'] == 'function', result
@@ -148,23 +146,45 @@ class JsOnload (Behavior):
         constructor = result['objectId']
 
         if self.options:
-            yield Script.fromStr (json.dumps (self.options, indent=2), '{}/options'.format (self.script.path))
-        result = await tab.Runtime.callFunctionOn (
-                functionDeclaration='function(options){return new this(options);}',
-                objectId=constructor,
-                arguments=[{'value': self.options}])
-        result = result['result']
-        assert result['type'] == 'object', result
-        assert result.get ('subtype') != 'error', result
-        self.context = result['objectId']
+            yield Script.fromStr (json.dumps (self.options, indent=2), f'{self.script.path}#options')
+
+        try:
+            result = await tab.Runtime.callFunctionOn (
+                    functionDeclaration='function(options){return new this(options);}',
+                    objectId=constructor,
+                    arguments=[{'value': self.options}])
+            self.logger.debug ('behavior onload start',
+                    uuid='6c0605ae-93b3-46b3-b575-ba45790909a7', result=result)
+            result = result['result']
+            assert result['type'] == 'object', result
+            assert result.get ('subtype') != 'error', result
+            self.context = result['objectId']
+        except TabException as e:
+            if e.args[0] == -32000:
+                # the site probably reloaded. ignore this, since we’ll be
+                # re-injected into the new site by the controller.
+                self.logger.error ('jsonload onload failed',
+                        uuid='c151a863-78d1-41f4-a8e6-c022a6c5d252',
+                        exception=e.args)
+            else:
+                raise
 
     async def onstop (self):
         tab = self.loader.tab
-        assert self.context is not None
-        await tab.Runtime.callFunctionOn (functionDeclaration='function(){return this.stop();}', objectId=self.context)
-        await tab.Runtime.releaseObject (objectId=self.context)
+        try:
+            assert self.context is not None
+            await tab.Runtime.callFunctionOn (functionDeclaration='function(){return this.stop();}',
+                    objectId=self.context)
+            await tab.Runtime.releaseObject (objectId=self.context)
+        except TabException as e:
+            # cannot do anything about that. Ignoring should be fine.
+            self.logger.error ('jsonload onstop failed',
+                    uuid='1786726f-c8ec-4f79-8769-30954d4e32f5',
+                    exception=e.args,
+                    objectId=self.context)
+
         return
-        yield
+        yield # pragma: no cover
 
 ### Generic scripts ###
 
@@ -195,18 +215,25 @@ class EmulateScreenMetrics (Behavior):
         l = self.loader
         tab = l.tab
         for s in sizes:
+            self.logger.debug ('device override',
+                    uuid='3d2d8096-1a75-4830-ad79-ae5f6f97071d', **s)
             await tab.Emulation.setDeviceMetricsOverride (**s)
             # give the browser time to re-eval page and start requests
             # XXX: should wait until loader is not busy any more
             await asyncio.sleep (1)
+        self.logger.debug ('clear override',
+                uuid='f9401683-eb3a-4b86-9bb2-c8c5d876fc8d')
         await tab.Emulation.clearDeviceMetricsOverride ()
         return
-        yield
+        yield # pragma: no cover
 
 class DomSnapshotEvent:
     __slots__ = ('url', 'document', 'viewport')
 
     def __init__ (self, url, document, viewport):
+        # XXX: document encoding?
+        assert isinstance (document, bytes)
+
         self.url = url
         self.document = document
         self.viewport = viewport
@@ -235,18 +262,21 @@ class DomSnapshot (Behavior):
         viewport = await getFormattedViewportMetrics (tab)
         dom = await tab.DOM.getDocument (depth=-1, pierce=True)
+        self.logger.debug ('dom snapshot document',
+                uuid='0c720784-8bd1-4fdc-a811-84394d753539', dom=dom)
         haveUrls = set ()
         for doc in ChromeTreeWalker (dom['root']).split ():
-            rawUrl = doc['documentURL']
-            if rawUrl in haveUrls:
+            url = URL (doc['documentURL'])
+            if url in haveUrls:
                 # ignore duplicate URLs. they are usually caused by
                 # javascript-injected iframes (advertising) with no(?) src
-                self.logger.warning ('have DOM snapshot for URL {}, ignoring'.format (rawUrl))
-                continue
-            url = urlsplit (rawUrl)
-            if url.scheme in ('http', 'https'):
-                self.logger.debug ('saving DOM snapshot for url {}, base {}'.format (doc['documentURL'], doc['baseURL']))
-                haveUrls.add (rawUrl)
+                self.logger.warning ('dom snapshot duplicate',
+                        uuid='d44de989-98d4-456e-82e7-9d4c49acab5e')
+            elif url.scheme in ('http', 'https'):
+                self.logger.debug ('dom snapshot',
+                        uuid='ece7ff05-ccd9-44b5-b6a8-be25a24b96f4',
+                        base=doc["baseURL"])
+                haveUrls.add (url)
                 walker = ChromeTreeWalker (doc)
                 # remove script, to make the page static and noscript, because at the
                 # time we took the snapshot scripts were enabled
@@ -254,7 +284,7 @@ class DomSnapshot (Behavior):
                 disallowedAttributes = html.eventAttributes
                 stream = StripAttributeFilter (StripTagFilter (walker, disallowedTags), disallowedAttributes)
                 serializer = HTMLSerializer ()
-                yield DomSnapshotEvent (removeFragment (doc['documentURL']), serializer.render (stream, 'utf-8'), viewport)
+                yield DomSnapshotEvent (url.with_fragment(None), serializer.render (stream, 'utf-8'), viewport)
 
 class ScreenshotEvent:
     __slots__ = ('yoff', 'data', 'url')
@@ -267,35 +297,77 @@ class ScreenshotEvent:
 class Screenshot (Behavior):
     """
     Create screenshot from tab and write it to WARC
+
+    Chrome will allocate an additional 512MB of RAM when using this plugin.
     """
 
+    __slots__ = ('script')
+
     name = 'screenshot'
 
+    # Hardcoded max texture size of 16,384 (crbug.com/770769)
+    maxDim = 16*1024
+
+    def __init__ (self, loader, logger):
+        super ().__init__ (loader, logger)
+        self.script = Script ('screenshot.js')
+
     async def onfinish (self):
         tab = self.loader.tab
 
+        # for top-level/full-screen elements with position: fixed we need to
+        # figure out their actual size (i.e. scrollHeight) and use that when
+        # overriding the viewport size.
+        # we could do this without javascript, but that would require several
+        # round-trips to Chrome or pulling down the entire DOM+computed styles
+        tab = self.loader.tab
+        yield self.script
+        result = await tab.Runtime.evaluate (expression=str (self.script), returnByValue=True)
+        assert result['result']['type'] == 'object', result
+        result = result['result']['value']
+
+        # this is required to make the browser render more than just the small
+        # actual viewport (i.e. entire page). see
+        # https://github.com/GoogleChrome/puppeteer/blob/45873ea737b4ebe4fa7d6f46256b2ea19ce18aa7/lib/Page.js#L805
+        metrics = await tab.Page.getLayoutMetrics ()
+        contentSize = metrics['contentSize']
+        contentHeight = max (result + [contentSize['height']])
+
+        override = {
+                'width': 0,
+                'height': 0,
+                'deviceScaleFactor': 0,
+                'mobile': False,
+                'viewport': {'x': 0,
+                    'y': 0,
+                    'width': contentSize['width'],
+                    'height': contentHeight,
+                    'scale': 1}
+                }
+        self.logger.debug ('screenshot override',
+                uuid='e0affa18-cbb1-4d97-9d13-9a88f704b1b2', override=override)
+        await tab.Emulation.setDeviceMetricsOverride (**override)
+
         tree = await tab.Page.getFrameTree ()
         try:
-            url = removeFragment (tree['frameTree']['frame']['url'])
+            url = URL (tree['frameTree']['frame']['url']).with_fragment (None)
         except KeyError:
-            self.logger.error ('frame without url', tree=tree)
+            self.logger.error ('frame without url',
+                    uuid='edc2743d-b93e-4ba1-964e-db232f2f96ff', tree=tree)
             url = None
 
-        # see https://github.com/GoogleChrome/puppeteer/blob/230be28b067b521f0577206899db01f0ca7fc0d2/examples/screenshots-longpage.js
-        # Hardcoded max texture size of 16,384 (crbug.com/770769)
-        maxDim = 16*1024
-        metrics = await tab.Page.getLayoutMetrics ()
-        contentSize = metrics['contentSize']
-        width = min (contentSize['width'], maxDim)
+        width = min (contentSize['width'], self.maxDim)
         # we’re ignoring horizontal scroll intentionally. Most horizontal
         # layouts use JavaScript scrolling and don’t extend the viewport.
-        for yoff in range (0, contentSize['height'], maxDim):
-            height = min (contentSize['height'] - yoff, maxDim)
+        for yoff in range (0, contentHeight, self.maxDim):
+            height = min (contentHeight - yoff, self.maxDim)
             clip = {'x': 0, 'y': yoff, 'width': width, 'height': height, 'scale': 1}
             ret = await tab.Page.captureScreenshot (format='png', clip=clip)
             data = b64decode (ret['data'])
             yield ScreenshotEvent (url, yoff, data)
 
+        await tab.Emulation.clearDeviceMetricsOverride ()
+
 class Click (JsOnload):
     """ Generic link clicking """
 
@@ -305,7 +377,7 @@ class Click (JsOnload):
     def __init__ (self, loader, logger):
         super ().__init__ (loader, logger)
         with pkg_resources.resource_stream (__name__, os.path.join ('data', 'click.yaml')) as fd:
-            self.options['sites'] = list (yaml.load_all (fd))
+            self.options['sites'] = list (yaml.safe_load_all (fd))
 
 class ExtractLinksEvent:
     __slots__ = ('links', )
@@ -313,6 +385,16 @@ class ExtractLinksEvent:
     def __init__ (self, links):
         self.links = links
 
+    def __repr__ (self):
+        return f'<ExtractLinksEvent {self.links!r}>'
+
+def mapOrIgnore (f, l):
+    for e in l:
+        try:
+            yield f (e)
+        except:
+            pass
+
 class ExtractLinks (Behavior):
     """
     Extract links from a page using JavaScript
@@ -333,7 +415,7 @@ class ExtractLinks (Behavior):
         tab = self.loader.tab
         yield self.script
         result = await tab.Runtime.evaluate (expression=str (self.script), returnByValue=True)
-        yield ExtractLinksEvent (list (set (result['result']['value'])))
+        yield ExtractLinksEvent (list (set (mapOrIgnore (URL, result['result']['value']))))
 
 class Crash (Behavior):
     """ Crash the browser. For testing only. Obviously. """
@@ -346,7 +428,7 @@ class Crash (Behavior):
         except Crashed:
             pass
         return
-        yield
+        yield # pragma: no cover
 
 # available behavior scripts. Order matters, move those modifying the page
 # towards the end of available
diff --git a/crocoite/browser.py b/crocoite/browser.py
index c472746..3518789 100644
--- a/crocoite/browser.py
+++ b/crocoite/browser.py
@@ -23,84 +23,197 @@ Chrome browser interactions.
 """
 
 import asyncio
-from urllib.parse import urlsplit
-from base64 import b64decode
+from base64 import b64decode, b64encode
+from datetime import datetime, timedelta
 from http.server import BaseHTTPRequestHandler
 
+from yarl import URL
+from multidict import CIMultiDict
+
 from .logger import Level
 from .devtools import Browser, TabException
 
-class Item:
-    """
-    Simple wrapper containing Chrome request and response
-    """
+# These two classes’ only purpose is so we can later tell whether a body was
+# base64-encoded or a unicode string
+class Base64Body (bytes):
+    def __new__ (cls, value):
+        return bytes.__new__ (cls, b64decode (value))
+
+    @classmethod
+    def fromBytes (cls, b):
+        """ For testing """
+        return cls (b64encode (b))
+
+class UnicodeBody (bytes):
+    def __new__ (cls, value):
+        if type (value) is not str:
+            raise TypeError ('expecting unicode string')
 
-    __slots__ = ('chromeRequest', 'chromeResponse', 'chromeFinished',
-            'isRedirect', 'failed', 'body', 'requestBody')
+        return bytes.__new__ (cls, value.encode ('utf-8'))
 
-    def __init__ (self):
-        self.chromeRequest = {}
-        self.chromeResponse = {}
-        self.chromeFinished = {}
-        self.isRedirect = False
-        self.failed = False
-        self.body = None
-        self.requestBody = None
+class Request:
+    __slots__ = ('headers', 'body', 'initiator', 'hasPostData', 'method', 'timestamp')
+
+    def __init__ (self, method=None, headers=None, body=None):
+        self.headers = headers
+        self.body = body
+        self.hasPostData = False
+        self.initiator = None
+        # HTTP method
+        self.method = method
+        self.timestamp = None
+
+    def __repr__ (self):
+        return f'Request({self.method!r}, {self.headers!r}, {self.body!r})'
+
+    def __eq__ (self, b):
+        if b is None:
+            return False
+
+        if not isinstance (b, Request):
+            raise TypeError ('Can only compare equality with Request.')
+
+        # do not compare hasPostData (only required to fetch body) and
+        # timestamp (depends on time)
+        return self.headers == b.headers and \
+                self.body == b.body and \
+                self.initiator == b.initiator and \
+                self.method == b.method
+
+class Response:
+    __slots__ = ('status', 'statusText', 'headers', 'body', 'bytesReceived',
+            'timestamp', 'mimeType')
+
+    def __init__ (self, status=None, statusText=None, headers=None, body=None, mimeType=None):
+        self.status = status
+        self.statusText = statusText
+        self.headers = headers
+        self.body = body
+        # bytes received over the network (not body size!)
+        self.bytesReceived = 0
+        self.timestamp = None
+        self.mimeType = mimeType
+
+    def __repr__ (self):
+        return f'Response({self.status!r}, {self.statusText!r}, {self.headers!r}, {self.body!r}, {self.mimeType!r})'
+
+    def __eq__ (self, b):
+        if b is None:
+            return False
+
+        if not isinstance (b, Response):
+            raise TypeError ('Can only compare equality with Response.')
+
+        # do not compare bytesReceived (depends on network), timestamp
+        # (depends on time) and statusText (does not matter)
+        return self.status == b.status and \
+                self.statusText == b.statusText and \
+                self.headers == b.headers and \
+                self.body == b.body and \
+                self.mimeType == b.mimeType
+
+class ReferenceTimestamp:
+    """ Map relative timestamp to absolute timestamp """
+
+    def __init__ (self, relative, absolute):
+        self.relative = timedelta (seconds=relative)
+        self.absolute = datetime.utcfromtimestamp (absolute)
+
+    def __call__ (self, relative):
+        if not isinstance (relative, timedelta):
+            relative = timedelta (seconds=relative)
+        return self.absolute + (relative-self.relative)
+
+class RequestResponsePair:
+    __slots__ = ('request', 'response', 'id', 'url', 'remoteIpAddress',
+            'protocol', 'resourceType', '_time')
+
+    def __init__ (self, id=None, url=None, request=None, response=None):
+        self.request = request
+        self.response = response
+        self.id = id
+        self.url = url
+        self.remoteIpAddress = None
+        self.protocol = None
+        self.resourceType = None
+        self._time = None
 
     def __repr__ (self):
-        return '<Item {}>'.format (self.url)
-
-    @property
-    def request (self):
-        return self.chromeRequest.get ('request', {})
-
-    @property
-    def response (self):
-        return self.chromeResponse.get ('response', {})
-
-    @property
-    def initiator (self):
-        return self.chromeRequest['initiator']
-
-    @property
-    def id (self):
-        return self.chromeRequest['requestId']
-
-    @property
-    def encodedDataLength (self):
-        return self.chromeFinished['encodedDataLength']
-
-    @property
-    def url (self):
-        return self.response.get ('url', self.request.get ('url'))
-
-    @property
-    def parsedUrl (self):
-        return urlsplit (self.url)
-
-    @property
-    def requestHeaders (self):
-        # the response object may contain refined headers, which were
-        # *actually* sent over the wire
-        return self._unfoldHeaders (self.response.get ('requestHeaders', self.request['headers']))
-
-    @property
-    def responseHeaders (self):
-        return self._unfoldHeaders (self.response['headers'])
-
-    @property
-    def statusText (self):
-        text = self.response.get ('statusText')
-        if text:
-            return text
-        text = BaseHTTPRequestHandler.responses.get (self.response['status'])
-        if text:
-            return text[0]
-        return 'No status text available'
-
-    @property
-    def resourceType (self):
-        return self.chromeResponse.get ('type', self.chromeRequest.get ('type', None))
+        return f'RequestResponsePair({self.id!r}, {self.url!r}, {self.request!r}, {self.response!r})'
+
+    def __eq__ (self, b):
+        if not isinstance (b, RequestResponsePair):
+            raise TypeError (f'Can only compare with {self.__class__.__name__}')
+
+        # do not compare id and _time. These depend on external factors and do
+        # not influence the request/response *content*
+        return self.request == b.request and \
+                self.response == b.response and \
+                self.url == b.url and \
+                self.remoteIpAddress == b.remoteIpAddress and \
+                self.protocol == b.protocol and \
+                self.resourceType == b.resourceType
+
+    def fromRequestWillBeSent (self, req):
+        """ Set request data from Chrome Network.requestWillBeSent event """
+        r = req['request']
+
+        self.id = req['requestId']
+        self.url = URL (r['url'])
+        self.resourceType = req.get ('type')
+        self._time = ReferenceTimestamp (req['timestamp'], req['wallTime'])
+
+        assert self.request is None, req
+        self.request = Request ()
+        self.request.initiator = req['initiator']
+        self.request.headers = CIMultiDict (self._unfoldHeaders (r['headers']))
+        self.request.hasPostData = r.get ('hasPostData', False)
+        self.request.method = r['method']
+        self.request.timestamp = self._time (req['timestamp'])
+        if self.request.hasPostData:
+            postData = r.get ('postData')
+            if postData is not None:
+                self.request.body = UnicodeBody (postData)
+
+    def fromResponse (self, r, timestamp=None, resourceType=None):
+        """
+        Set response data from Chrome’s Response object.
+
+        Request must exist. Updates if response was set before. Sometimes
+        fromResponseReceived is triggered twice by Chrome. No idea why.
+        """
+        assert self.request is not None, (self.request, r)
+
+        if not timestamp:
+            timestamp = self.request.timestamp
+
+        self.remoteIpAddress = r.get ('remoteIPAddress')
+        self.protocol = r.get ('protocol')
+        if resourceType:
+            self.resourceType = resourceType
+
+        # a response may contain updated request headers (i.e. those actually
+        # sent over the wire)
+        if 'requestHeaders' in r:
+            self.request.headers = CIMultiDict (self._unfoldHeaders (r['requestHeaders']))
+
+        self.response = Response ()
+        self.response.headers = CIMultiDict (self._unfoldHeaders (r['headers']))
+        self.response.status = r['status']
+        self.response.statusText = r['statusText']
+        self.response.timestamp = timestamp
+        self.response.mimeType = r['mimeType']
+
+    def fromResponseReceived (self, resp):
+        """ Set response data from Chrome Network.responseReceived """
+        return self.fromResponse (resp['response'],
+                self._time (resp['timestamp']), resp['type'])
+
+    def fromLoadingFinished (self, data):
+        self.response.bytesReceived = data['encodedDataLength']
+
+    def fromLoadingFailed (self, data):
+        self.response = None
 
     @staticmethod
     def _unfoldHeaders (headers):
@@ -114,67 +227,46 @@ class Item:
             items.append ((k, v))
         return items
 
-    def setRequest (self, req):
-        self.chromeRequest = req
-
-    def setResponse (self, resp):
-        self.chromeResponse = resp
-
-    def setFinished (self, finished):
-        self.chromeFinished = finished
-
     async def prefetchRequestBody (self, tab):
-        # request body
-        req = self.request
-        postData = req.get ('postData')
-        if postData:
-            self.requestBody = postData.encode ('utf8'), False
-        elif req.get ('hasPostData', False):
+        if self.request.hasPostData and self.request.body is None:
             try:
                 postData = await tab.Network.getRequestPostData (requestId=self.id)
-                postData = postData['postData']
-                self.requestBody = b64decode (postData), True
+                self.request.body = UnicodeBody (postData['postData'])
             except TabException:
-                self.requestBody = None
-        else:
-            self.requestBody = None, False
+                self.request.body = None
 
     async def prefetchResponseBody (self, tab):
-        # get response body
+        """ Fetch response body """
         try:
             body = await tab.Network.getResponseBody (requestId=self.id)
-            rawBody = body['body']
-            base64Encoded = body['base64Encoded']
-            if base64Encoded:
-                rawBody = b64decode (rawBody)
+            if
body['base64Encoded']: + self.response.body = Base64Body (body['body']) else: - rawBody = rawBody.encode ('utf8') - self.body = rawBody, base64Encoded + self.response.body = UnicodeBody (body['body']) except TabException: - self.body = None + self.response.body = None + +class NavigateError (IOError): + pass -class VarChangeEvent: - """ Notify when variable is changed """ +class PageIdle: + """ Page idle event """ - __slots__ = ('_value', 'event') + __slots__ = ('idle', ) - def __init__ (self, value): - self._value = value - self.event = asyncio.Event() + def __init__ (self, idle): + self.idle = idle - def set (self, value): - if value != self._value: - self._value = value - # unblock waiting threads - self.event.set () - self.event.clear () + def __bool__ (self): + return self.idle - def get (self): - return self._value +class FrameNavigated: + __slots__ = ('id', 'url', 'mimeType') - async def wait (self): - await self.event.wait () - return self._value + def __init__ (self, id, url, mimeType): + self.id = id + self.url = URL (url) + self.mimeType = mimeType class SiteLoader: """ @@ -183,18 +275,18 @@ class SiteLoader: XXX: track popup windows/new tabs and close them """ - __slots__ = ('requests', 'browser', 'url', 'logger', 'tab', '_iterRunning', 'idle', '_framesLoading') + __slots__ = ('requests', 'browser', 'logger', 'tab', '_iterRunning', + '_framesLoading', '_rootFrame') allowedSchemes = {'http', 'https'} - def __init__ (self, browser, url, logger): + def __init__ (self, browser, logger): self.requests = {} self.browser = Browser (url=browser) - self.url = url - self.logger = logger.bind (context=type (self).__name__, url=url) + self.logger = logger.bind (context=type (self).__name__) self._iterRunning = [] - self.idle = VarChangeEvent (True) self._framesLoading = set () + self._rootFrame = None async def __aenter__ (self): tab = self.tab = await self.browser.__aenter__ () @@ -236,6 +328,7 @@ class SiteLoader: tab.Page.javascriptDialogOpening: 
self._javascriptDialogOpening, tab.Page.frameStartedLoading: self._frameStartedLoading, tab.Page.frameStoppedLoading: self._frameStoppedLoading, + tab.Page.frameNavigated: self._frameNavigated, } # The implementation is a little advanced. Why? The goal here is to @@ -247,36 +340,46 @@ class SiteLoader: # we need to block (yield) for every item completed, but not # handled by the consumer (caller). running = self._iterRunning - running.append (asyncio.ensure_future (self.tab.get ())) + tabGetTask = asyncio.ensure_future (self.tab.get ()) + running.append (tabGetTask) while True: done, pending = await asyncio.wait (running, return_when=asyncio.FIRST_COMPLETED) for t in done: result = t.result () if result is None: pass - elif isinstance (result, Item): - yield result - else: + elif t == tabGetTask: method, data = result f = handler.get (method, None) if f is not None: task = asyncio.ensure_future (f (**data)) pending.add (task) - pending.add (asyncio.ensure_future (self.tab.get ())) + tabGetTask = asyncio.ensure_future (self.tab.get ()) + pending.add (tabGetTask) + else: + yield result running = pending self._iterRunning = running - async def start (self): - await self.tab.Page.navigate(url=self.url) + async def navigate (self, url): + ret = await self.tab.Page.navigate(url=url) + self.logger.debug ('navigate', + uuid='9d47ded2-951f-4e09-86ee-fd4151e20666', result=ret) + if 'errorText' in ret: + raise NavigateError (ret['errorText']) + self._rootFrame = ret['frameId'] # internal chrome callbacks async def _requestWillBeSent (self, **kwargs): + self.logger.debug ('requestWillBeSent', + uuid='b828d75a-650d-42d2-8c66-14f4547512da', args=kwargs) + reqId = kwargs['requestId'] req = kwargs['request'] - logger = self.logger.bind (reqId=reqId, reqUrl=req['url']) + url = URL (req['url']) + logger = self.logger.bind (reqId=reqId, reqUrl=url) - url = urlsplit (req['url']) if url.scheme not in self.allowedSchemes: return @@ -286,38 +389,44 @@ class SiteLoader: # redirects never 
“finish” loading, but yield another requestWillBeSent with this key set redirectResp = kwargs.get ('redirectResponse') if redirectResp: - # create fake responses - resp = {'requestId': reqId, 'response': redirectResp, 'timestamp': kwargs['timestamp']} - item.setResponse (resp) - resp = {'requestId': reqId, 'encodedDataLength': 0, 'timestamp': kwargs['timestamp']} - item.setFinished (resp) - item.isRedirect = True - logger.info ('redirect', uuid='85eaec41-e2a9-49c2-9445-6f19690278b8', target=req['url']) + if item.url != url: + # this happens for unknown reasons. the docs simply state + # it can differ in case of a redirect. Fix it and move on. + logger.warning ('redirect url differs', + uuid='558a7df7-2258-4fe4-b16d-22b6019cc163', + expected=item.url) + redirectResp['url'] = str (item.url) + item.fromResponse (redirectResp) + logger.info ('redirect', uuid='85eaec41-e2a9-49c2-9445-6f19690278b8', target=url) + # XXX: queue this? no need to wait for it await item.prefetchRequestBody (self.tab) - # cannot fetch request body due to race condition (item id reused) + # cannot fetch response body due to race condition (item id reused) ret = item else: logger.warning ('request exists', uuid='2c989142-ba00-4791-bb03-c2a14e91a56b') - item = Item () - item.setRequest (kwargs) + item = RequestResponsePair () + item.fromRequestWillBeSent (kwargs) self.requests[reqId] = item - logger.debug ('request', uuid='55c17564-1bd0-4499-8724-fa7aad65478f') return ret async def _responseReceived (self, **kwargs): + self.logger.debug ('responseReceived', + uuid='ecd67e69-401a-41cb-b4ec-eeb1f1ec6abb', args=kwargs) + reqId = kwargs['requestId'] item = self.requests.get (reqId) if item is None: return resp = kwargs['response'] - logger = self.logger.bind (reqId=reqId, respUrl=resp['url']) - url = urlsplit (resp['url']) + url = URL (resp['url']) + logger = self.logger.bind (reqId=reqId, respUrl=url) + if item.url != url: + logger.error ('url mismatch', uuid='7385f45f-0b06-4cbc-81f9-67bcd72ee7d0', 
respUrl=url) if url.scheme in self.allowedSchemes: - logger.debug ('response', uuid='84461c4e-e8ef-4cbd-8e8e-e10a901c8bd0') - item.setResponse (kwargs) + item.fromResponseReceived (kwargs) else: logger.warning ('scheme forbidden', uuid='2ea6e5d7-dd3b-4881-b9de-156c1751c666') @@ -326,32 +435,37 @@ class SiteLoader: Item was fully loaded. For some items the request body is not available when responseReceived is fired, thus move everything here. """ + self.logger.debug ('loadingFinished', + uuid='35479405-a5b5-4395-8c33-d3601d1796b9', args=kwargs) + reqId = kwargs['requestId'] item = self.requests.pop (reqId, None) if item is None: # we never recorded this request (blacklisted scheme, for example) return + if not item.response: + # chrome failed to send us a responseReceived event for this item, + # so we can’t record it (missing request/response headers) + self.logger.error ('response missing', + uuid='fac3ab96-3f9b-4c5a-95c7-f83b675cdcb9', requestId=item.id) + return + req = item.request - logger = self.logger.bind (reqId=reqId, reqUrl=req['url']) - resp = item.response - if req['url'] != resp['url']: - logger.error ('url mismatch', uuid='7385f45f-0b06-4cbc-81f9-67bcd72ee7d0', respUrl=resp['url']) - url = urlsplit (resp['url']) - if url.scheme in self.allowedSchemes: - logger.info ('finished', uuid='5a8b4bad-f86a-4fe6-a53e-8da4130d6a02') - item.setFinished (kwargs) + if item.url.scheme in self.allowedSchemes: + item.fromLoadingFinished (kwargs) + # XXX queue both await asyncio.gather (item.prefetchRequestBody (self.tab), item.prefetchResponseBody (self.tab)) return item async def _loadingFailed (self, **kwargs): + self.logger.info ('loadingFailed', + uuid='4a944e85-5fae-4aa6-9e7c-e578b29392e4', args=kwargs) + reqId = kwargs['requestId'] - self.logger.warning ('loading failed', - uuid='68410f13-6eea-453e-924e-c1af4601748b', - errorText=kwargs['errorText'], - blockedReason=kwargs.get ('blockedReason')) + logger = self.logger.bind (reqId=reqId) item = self.requests.pop 
(reqId, None) if item is not None: - item.failed = True + item.fromLoadingFailed (kwargs) return item async def _entryAdded (self, **kwargs): @@ -381,11 +495,25 @@ class SiteLoader: uuid='3ef7292e-8595-4e89-b834-0cc6bc40ee38', **kwargs) async def _frameStartedLoading (self, **kwargs): + self.logger.debug ('frameStartedLoading', + uuid='bbeb39c0-3304-4221-918e-f26bd443c566', args=kwargs) + self._framesLoading.add (kwargs['frameId']) - self.idle.set (False) + return PageIdle (False) async def _frameStoppedLoading (self, **kwargs): + self.logger.debug ('frameStoppedLoading', + uuid='fcbe8110-511c-4cbb-ac2b-f61a5782c5a0', args=kwargs) + self._framesLoading.remove (kwargs['frameId']) if not self._framesLoading: - self.idle.set (True) + return PageIdle (True) + + async def _frameNavigated (self, **kwargs): + self.logger.debug ('frameNavigated', + uuid='0e876f7d-7129-4612-8632-686f42ac6e1f', args=kwargs) + frame = kwargs['frame'] + if self._rootFrame == frame['id']: + assert frame.get ('parentId', None) is None, "root frame must not have a parent" + return FrameNavigated (frame['id'], frame['url'], frame['mimeType']) diff --git a/crocoite/cli.py b/crocoite/cli.py index c3c41a4..04bbb19 100644 --- a/crocoite/cli.py +++ b/crocoite/cli.py @@ -22,27 +22,68 @@ Command line interface """ -import argparse, sys, signal, asyncio, os +import argparse, sys, signal, asyncio, os, json +from traceback import TracebackException from enum import IntEnum +from yarl import URL +from http.cookies import SimpleCookie +import pkg_resources +try: + import manhole + manhole.install (patch_fork=False, oneshot_on='USR1') +except ModuleNotFoundError: + pass -from . import behavior +from . 
import behavior, browser from .controller import SinglePageController, \ ControllerSettings, StatsHandler, LogHandler, \ RecursiveController, DepthLimit, PrefixLimit from .devtools import Passthrough, Process from .warc import WarcHandler -from .logger import Logger, JsonPrintConsumer, DatetimeConsumer, WarcHandlerConsumer +from .logger import Logger, JsonPrintConsumer, DatetimeConsumer, \ + WarcHandlerConsumer, Level from .devtools import Crashed +def absurl (s): + """ argparse: Absolute URL """ + u = URL (s) + if u.is_absolute (): + return u + raise argparse.ArgumentTypeError ('Must be absolute') + +def cookie (s): + """ argparse: Cookie """ + c = SimpleCookie (s) + # for some reason the constructor does not raise an exception if the cookie + # supplied is invalid. It’ll simply be empty. + if len (c) != 1: + raise argparse.ArgumentTypeError ('Invalid cookie') + # we want a single Morsel + return next (iter (c.values ())) + +def cookiejar (f): + """ argparse: Cookies from file """ + cookies = [] + try: + with open (f, 'r') as fd: + for l in fd: + l = l.lstrip () + if l and not l.startswith ('#'): + cookies.append (cookie (l)) + except FileNotFoundError: + raise argparse.ArgumentTypeError (f'Cookie jar "{f}" does not exist') + return cookies + class SingleExitStatus(IntEnum): """ Exit status for single-shot command line """ Ok = 0 Fail = 1 BrowserCrash = 2 + Navigate = 3 def single (): - parser = argparse.ArgumentParser(description='Save website to WARC using Google Chrome.') - parser.add_argument('--browser', help='DevTools URL', metavar='URL') + parser = argparse.ArgumentParser(description='crocoite helper tools to fetch individual pages.') + parser.add_argument('--browser', help='DevTools URL', type=absurl, metavar='URL') parser.add_argument('--timeout', default=1*60*60, type=int, help='Maximum time for archival', metavar='SEC') parser.add_argument('--idle-timeout', default=30, type=int, help='Maximum idle seconds (i.e. 
no requests)', dest='idleTimeout', metavar='SEC') parser.add_argument('--behavior', help='Enable behavior script', @@ -50,7 +91,19 @@ def single (): default=list (behavior.availableMap.keys ()), choices=list (behavior.availableMap.keys ()), metavar='NAME', nargs='*') - parser.add_argument('url', help='Website URL', metavar='URL') + parser.add_argument('--warcinfo', help='Add extra information to warcinfo record', + metavar='JSON', type=json.loads) + # re-using curl’s short/long switch names whenever possible + parser.add_argument('-k', '--insecure', + action='store_true', + help='Disable certificate validation') + parser.add_argument ('-b', '--cookie', type=cookie, metavar='SET-COOKIE', + action='append', default=[], help='Cookies in Set-Cookie format.') + parser.add_argument ('-c', '--cookie-jar', dest='cookieJar', + type=cookiejar, metavar='FILE', + default=pkg_resources.resource_filename (__name__, 'data/cookies.txt'), + help='Cookie jar file, read-only.') + parser.add_argument('url', help='Website URL', type=absurl, metavar='URL') parser.add_argument('output', help='WARC filename', metavar='FILE') args = parser.parse_args () @@ -61,13 +114,19 @@ def single (): service = Process () if args.browser: service = Passthrough (args.browser) - settings = ControllerSettings (idleTimeout=args.idleTimeout, timeout=args.timeout) + settings = ControllerSettings ( + idleTimeout=args.idleTimeout, + timeout=args.timeout, + insecure=args.insecure, + cookies=args.cookieJar + args.cookie, + ) with open (args.output, 'wb') as fd, WarcHandler (fd, logger) as warcHandler: logger.connect (WarcHandlerConsumer (warcHandler)) handler = [StatsHandler (), LogHandler (logger), warcHandler] b = list (map (lambda x: behavior.availableMap[x], args.enabledBehaviorNames)) controller = SinglePageController (url=args.url, settings=settings, - service=service, handler=handler, behavior=b, logger=logger) + service=service, handler=handler, behavior=b, logger=logger, + warcinfo=args.warcinfo) try: 
loop = asyncio.get_event_loop() run = asyncio.ensure_future (controller.run ()) @@ -79,9 +138,20 @@ def single (): ret = SingleExitStatus.Ok except Crashed: ret = SingleExitStatus.BrowserCrash + except asyncio.CancelledError: + # don’t log this one + pass + except browser.NavigateError: + ret = SingleExitStatus.Navigate + except Exception as e: + ret = SingleExitStatus.Fail + logger.error ('cli exception', + uuid='7fd69858-ecaa-4225-b213-8ab880aa3cc5', + traceback=list (TracebackException.from_exception (e).format ())) finally: r = handler[0].stats logger.info ('stats', context='cli', uuid='24d92d16-770e-4088-b769-4020e127a7ff', **r) + logger.info ('exit', context='cli', uuid='9b1bd603-f7cd-4745-895a-5b894a5166f2', status=ret) return ret @@ -92,68 +162,84 @@ def parsePolicy (recursive, url): return DepthLimit (int (recursive)) elif recursive == 'prefix': return PrefixLimit (url) - raise ValueError ('Unsupported') + raise argparse.ArgumentTypeError ('Unsupported recursion mode') def recursive (): logger = Logger (consumer=[DatetimeConsumer (), JsonPrintConsumer ()]) - parser = argparse.ArgumentParser(description='Recursively run crocoite-grab.') - parser.add_argument('--policy', help='Recursion policy', metavar='POLICY') - parser.add_argument('--tempdir', help='Directory for temporary files', metavar='DIR') - parser.add_argument('--prefix', help='Output filename prefix, supports templates {host} and {date}', metavar='FILENAME', default='{host}-{date}-') - parser.add_argument('--concurrency', '-j', help='Run at most N jobs', metavar='N', default=1, type=int) - parser.add_argument('url', help='Seed URL', metavar='URL') - parser.add_argument('output', help='Output directory', metavar='DIR') - parser.add_argument('command', help='Fetch command, supports templates {url} and {dest}', metavar='CMD', nargs='*', default=['crocoite-grab', '{url}', '{dest}']) + parser = argparse.ArgumentParser(description='Save website to WARC using Google Chrome.') + parser.add_argument('-j', 
'--concurrency', + help='Run at most N jobs concurrently', metavar='N', default=1, + type=int) + parser.add_argument('-r', '--recursion', help='Recursion policy', + metavar='POLICY') + parser.add_argument('--tempdir', help='Directory for temporary files', + metavar='DIR') + parser.add_argument('url', help='Seed URL', type=absurl, metavar='URL') + parser.add_argument('output', + help='Output file, supports templates {host}, {date} and {seqnum}', + metavar='FILE') + parser.add_argument('command', + help='Fetch command, supports templates {url} and {dest}', + metavar='CMD', nargs='*', + default=['crocoite-single', '{url}', '{dest}']) args = parser.parse_args () try: - policy = parsePolicy (args.policy, args.url) - except ValueError: - parser.error ('Invalid argument for --policy') - - os.makedirs (args.output, exist_ok=True) + policy = parsePolicy (args.recursion, args.url) + except argparse.ArgumentTypeError as e: + parser.error (str (e)) - controller = RecursiveController (url=args.url, output=args.output, - command=args.command, logger=logger, policy=policy, - tempdir=args.tempdir, prefix=args.prefix, - concurrency=args.concurrency) + try: + controller = RecursiveController (url=args.url, output=args.output, + command=args.command, logger=logger, policy=policy, + tempdir=args.tempdir, concurrency=args.concurrency) + except ValueError as e: + parser.error (str (e)) + run = asyncio.ensure_future (controller.run ()) loop = asyncio.get_event_loop() - stop = lambda signum: controller.cancel () + stop = lambda signum: run.cancel () loop.add_signal_handler (signal.SIGINT, stop, signal.SIGINT) loop.add_signal_handler (signal.SIGTERM, stop, signal.SIGTERM) - loop.run_until_complete(controller.run ()) - loop.close() + try: + loop.run_until_complete(run) + except asyncio.CancelledError: + pass + finally: + loop.close() return 0 def irc (): - from configparser import ConfigParser + import json, re from .irc import Chromebot logger = Logger (consumer=[DatetimeConsumer (), 
JsonPrintConsumer ()]) parser = argparse.ArgumentParser(description='IRC bot.') - parser.add_argument('--config', '-c', help='Config file location', metavar='PATH', default='chromebot.ini') + parser.add_argument('--config', '-c', help='Config file location', metavar='PATH', default='chromebot.json') args = parser.parse_args () - config = ConfigParser () - config.read (args.config) + with open (args.config) as fd: + config = json.load (fd) s = config['irc'] + blacklist = dict (map (lambda x: (re.compile (x[0], re.I), x[1]), config['blacklist'].items ())) loop = asyncio.get_event_loop() bot = Chromebot ( - host=s.get ('host'), - port=s.getint ('port'), - ssl=s.getboolean ('ssl'), - nick=s.get ('nick'), - channels=[s.get ('channel')], - tempdir=s.get ('tempdir'), - destdir=s.get ('destdir'), - processLimit=s.getint ('process_limit'), + host=s['host'], + port=s['port'], + ssl=s['ssl'], + nick=s['nick'], + channels=s['channels'], + tempdir=config['tempdir'], + destdir=config['destdir'], + processLimit=config['process_limit'], logger=logger, + blacklist=blacklist, + needVoice=config['need_voice'], loop=loop) stop = lambda signum: bot.cancel () loop.add_signal_handler (signal.SIGINT, stop, signal.SIGINT) diff --git a/crocoite/controller.py b/crocoite/controller.py index f8b1420..8374b4e 100644 --- a/crocoite/controller.py +++ b/crocoite/controller.py @@ -22,58 +22,56 @@ Controller classes, handling actions required for archival """ -import time -import tempfile, asyncio, json, os +import time, tempfile, asyncio, json, os, shutil, signal from itertools import islice from datetime import datetime -from urllib.parse import urlparse from operator import attrgetter +from abc import ABC, abstractmethod +from yarl import URL from . 
import behavior as cbehavior -from .browser import SiteLoader, Item -from .util import getFormattedViewportMetrics, getSoftwareInfo, removeFragment +from .browser import SiteLoader, RequestResponsePair, PageIdle, FrameNavigated +from .util import getFormattedViewportMetrics, getSoftwareInfo from .behavior import ExtractLinksEvent +from .devtools import toCookieParam class ControllerSettings: - __slots__ = ('idleTimeout', 'timeout') + __slots__ = ('idleTimeout', 'timeout', 'insecure', 'cookies') - def __init__ (self, idleTimeout=2, timeout=10): + def __init__ (self, idleTimeout=2, timeout=10, insecure=False, cookies=None): self.idleTimeout = idleTimeout self.timeout = timeout + self.insecure = insecure + self.cookies = cookies or [] - def toDict (self): - return dict (idleTimeout=self.idleTimeout, timeout=self.timeout) + def __repr__ (self): + return f'<ControllerSetting idleTimeout={self.idleTimeout!r}, timeout={self.timeout!r}, insecure={self.insecure!r}, cookies={self.cookies!r}>' defaultSettings = ControllerSettings () -class EventHandler: +class EventHandler (ABC): """ Abstract base class for event handler """ __slots__ = () - # this handler wants to know about exceptions before they are reraised by - # the controller - acceptException = False - - def push (self, item): + @abstractmethod + async def push (self, item): raise NotImplementedError () class StatsHandler (EventHandler): __slots__ = ('stats', ) - acceptException = True - def __init__ (self): self.stats = {'requests': 0, 'finished': 0, 'failed': 0, 'bytesRcv': 0} - def push (self, item): - if isinstance (item, Item): + async def push (self, item): + if isinstance (item, RequestResponsePair): self.stats['requests'] += 1 - if item.failed: + if not item.response: self.stats['failed'] += 1 else: self.stats['finished'] += 1 - self.stats['bytesRcv'] += item.encodedDataLength + self.stats['bytesRcv'] += item.response.bytesReceived class LogHandler (EventHandler): """ Handle items by logging information about 
them """ @@ -83,7 +81,7 @@ class LogHandler (EventHandler): def __init__ (self, logger): self.logger = logger.bind (context=type (self).__name__) - def push (self, item): + async def push (self, item): if isinstance (item, ExtractLinksEvent): # limit number of links per message, so json blob won’t get too big it = iter (item.links) @@ -102,6 +100,71 @@ class ControllerStart: def __init__ (self, payload): self.payload = payload +class IdleStateTracker (EventHandler): + """ Track SiteLoader’s idle state by listening to PageIdle events """ + + __slots__ = ('_idle', '_loop', '_idleSince') + + def __init__ (self, loop): + self._idle = True + self._loop = loop + + self._idleSince = self._loop.time () + + async def push (self, item): + if isinstance (item, PageIdle): + self._idle = bool (item) + if self._idle: + self._idleSince = self._loop.time () + + async def wait (self, timeout): + """ Wait until page has been idle for at least timeout seconds. If the + page has been idle before calling this function it may return + immediately. """ + + assert timeout > 0 + while True: + if self._idle: + now = self._loop.time () + sleep = timeout-(now-self._idleSince) + if sleep <= 0: + break + else: + # not idle, check again after timeout expires + sleep = timeout + await asyncio.sleep (sleep) + +class InjectBehaviorOnload (EventHandler): + """ Control behavior script injection based on frame navigation messages. + When a page is reloaded (for whatever reason), the scripts need to be + reinjected. 
""" + + __slots__ = ('controller', '_loaded') + + def __init__ (self, controller): + self.controller = controller + self._loaded = False + + async def push (self, item): + if isinstance (item, FrameNavigated): + await self._runon ('load') + self._loaded = True + + async def stop (self): + if self._loaded: + await self._runon ('stop') + + async def finish (self): + if self._loaded: + await self._runon ('finish') + + async def _runon (self, method): + controller = self.controller + for b in controller._enabledBehavior: + f = getattr (b, 'on' + method) + async for item in f (): + await controller.processItem (item) + class SinglePageController: """ Archive a single page url. @@ -110,120 +173,141 @@ class SinglePageController: (stats, warc writer). """ - __slots__ = ('url', 'service', 'behavior', 'settings', 'logger', 'handler') + __slots__ = ('url', 'service', 'behavior', 'settings', 'logger', 'handler', + 'warcinfo', '_enabledBehavior') def __init__ (self, url, logger, \ service, behavior=cbehavior.available, \ - settings=defaultSettings, handler=[]): + settings=defaultSettings, handler=None, \ + warcinfo=None): self.url = url self.service = service self.behavior = behavior self.settings = settings self.logger = logger.bind (context=type (self).__name__, url=url) - self.handler = handler + self.handler = handler or [] + self.warcinfo = warcinfo - def processItem (self, item): + async def processItem (self, item): for h in self.handler: - h.push (item) + await h.push (item) async def run (self): logger = self.logger async def processQueue (): async for item in l: - self.processItem (item) + await self.processItem (item) + + idle = IdleStateTracker (asyncio.get_event_loop ()) + self.handler.append (idle) + behavior = InjectBehaviorOnload (self) + self.handler.append (behavior) - async with self.service as browser, SiteLoader (browser, self.url, logger=logger) as l: + async with self.service as browser, SiteLoader (browser, logger=logger) as l: handle = 
asyncio.ensure_future (processQueue ()) + timeoutProc = asyncio.ensure_future (asyncio.sleep (self.settings.timeout)) - start = time.time () + # configure browser + tab = l.tab + await tab.Security.setIgnoreCertificateErrors (ignore=self.settings.insecure) + await tab.Network.setCookies (cookies=list (map (toCookieParam, self.settings.cookies))) # not all behavior scripts are allowed for every URL, filter them - enabledBehavior = list (filter (lambda x: self.url in x, + self._enabledBehavior = list (filter (lambda x: self.url in x, map (lambda x: x (l, logger), self.behavior))) - version = await l.tab.Browser.getVersion () + version = await tab.Browser.getVersion () payload = { 'software': getSoftwareInfo (), 'browser': { 'product': version['product'], 'useragent': version['userAgent'], - 'viewport': await getFormattedViewportMetrics (l.tab), + 'viewport': await getFormattedViewportMetrics (tab), }, 'tool': 'crocoite-single', # not the name of the cli utility 'parameters': { 'url': self.url, 'idleTimeout': self.settings.idleTimeout, 'timeout': self.settings.timeout, - 'behavior': list (map (attrgetter('name'), enabledBehavior)), + 'behavior': list (map (attrgetter('name'), self._enabledBehavior)), + 'insecure': self.settings.insecure, + 'cookies': list (map (lambda x: x.OutputString(), self.settings.cookies)), }, } - self.processItem (ControllerStart (payload)) + if self.warcinfo: + payload['extra'] = self.warcinfo + await self.processItem (ControllerStart (payload)) - await l.start () - for b in enabledBehavior: - async for item in b.onload (): - self.processItem (item) + await l.navigate (self.url) - # wait until the browser has a) been idle for at least - # settings.idleTimeout or b) settings.timeout is exceeded - timeoutProc = asyncio.ensure_future (asyncio.sleep (self.settings.timeout)) - idleTimeout = None + idleProc = asyncio.ensure_future (idle.wait (self.settings.idleTimeout)) while True: - idleProc = asyncio.ensure_future (l.idle.wait ()) try: finished, 
pending = await asyncio.wait([idleProc, timeoutProc, handle], - return_when=asyncio.FIRST_COMPLETED, timeout=idleTimeout) + return_when=asyncio.FIRST_COMPLETED) except asyncio.CancelledError: idleProc.cancel () timeoutProc.cancel () break - if not finished: - # idle timeout - idleProc.cancel () - timeoutProc.cancel () - break - elif handle in finished: + if handle in finished: # something went wrong while processing the data + logger.error ('fetch failed', + uuid='43a0686a-a3a9-4214-9acd-43f6976f8ff3') idleProc.cancel () timeoutProc.cancel () handle.result () assert False # previous line should always raise Exception elif timeoutProc in finished: # global timeout + logger.debug ('global timeout', + uuid='2f858adc-9448-4ace-94b4-7cd1484c0728') idleProc.cancel () timeoutProc.result () break elif idleProc in finished: - # idle state change - isIdle = idleProc.result () - if isIdle: - # browser is idle, start the clock - idleTimeout = self.settings.idleTimeout - else: - idleTimeout = None - - for b in enabledBehavior: - async for item in b.onstop (): - self.processItem (item) - await l.tab.Page.stopLoading () + # idle timeout + logger.debug ('idle timeout', + uuid='90702590-94c4-44ef-9b37-02a16de444c3') + idleProc.result () + timeoutProc.cancel () + break + await behavior.stop () + await tab.Page.stopLoading () await asyncio.sleep (1) + await behavior.finish () - for b in enabledBehavior: - async for item in b.onfinish (): - self.processItem (item) - - # wait until loads from behavior scripts are done - await asyncio.sleep (1) - if not l.idle.get (): - while not await l.idle.wait (): pass + # wait until loads from behavior scripts are done and browser is + # idle for at least 1 second + try: + await asyncio.wait_for (idle.wait (1), timeout=1) + except (asyncio.TimeoutError, asyncio.CancelledError): + pass if handle.done (): handle.result () else: handle.cancel () +class SetEntry: + """ A object, to be used with sets, that compares equality only on its + primary 
property. """ + def __init__ (self, value, **props): + self.value = value + for k, v in props.items (): + setattr (self, k, v) + + def __eq__ (self, b): + assert isinstance (b, SetEntry) + return self.value == b.value + + def __hash__ (self): + return hash (self.value) + + def __repr__ (self): + return f'<SetEntry {self.value!r}>' + class RecursionPolicy: """ Abstract recursion policy """ @@ -242,19 +326,17 @@ class DepthLimit (RecursionPolicy): __slots__ = ('maxdepth', ) def __init__ (self, maxdepth=0): - if maxdepth < 0 or maxdepth > 1: - raise ValueError ('Unsupported') self.maxdepth = maxdepth def __call__ (self, urls): - if self.maxdepth <= 0: - return {} - else: - self.maxdepth -= 1 - return urls + newurls = set () + for u in urls: + if u.depth <= self.maxdepth: + newurls.add (u) + return newurls def __repr__ (self): - return '<DepthLimit {}>'.format (self.maxdepth) + return f'<DepthLimit {self.maxdepth}>' class PrefixLimit (RecursionPolicy): """ @@ -271,7 +353,11 @@ class PrefixLimit (RecursionPolicy): self.prefix = prefix def __call__ (self, urls): - return set (filter (lambda u: u.startswith (self.prefix), urls)) + return set (filter (lambda u: str(u.value).startswith (str (self.prefix)), urls)) + +def hasTemplate (s): + """ Return True if string s has string templates """ + return '{' in s and '}' in s class RecursiveController: """ @@ -281,47 +367,59 @@ class RecursiveController: """ __slots__ = ('url', 'output', 'command', 'logger', 'policy', 'have', - 'pending', 'stats', 'prefix', 'tempdir', 'running', 'concurrency', '_quit') + 'pending', 'stats', 'tempdir', 'running', 'concurrency', + 'copyLock') SCHEME_WHITELIST = {'http', 'https'} - def __init__ (self, url, output, command, logger, prefix='{host}-{date}-', + def __init__ (self, url, output, command, logger, tempdir=None, policy=DepthLimit (0), concurrency=1): self.url = url self.output = output self.command = command - self.prefix = prefix self.logger = logger.bind (context=type(self).__name__, 
seedurl=url) self.policy = policy self.tempdir = tempdir + # A lock if only a single output file (no template) is requested + self.copyLock = None if hasTemplate (output) else asyncio.Lock () + # some sanity checks. XXX move to argparse? + if self.copyLock and os.path.exists (self.output): + raise ValueError ('Output file exists') # tasks currently running self.running = set () # max number of tasks running self.concurrency = concurrency # keep in sync with StatsHandler self.stats = {'requests': 0, 'finished': 0, 'failed': 0, 'bytesRcv': 0, 'crashed': 0, 'ignored': 0} - # initiate graceful shutdown - self._quit = False - async def fetch (self, url): + async def fetch (self, entry, seqnum): """ Fetch a single URL using an external command - command is usually crocoite-grab + command is usually crocoite-single """ + assert isinstance (entry, SetEntry) + + url = entry.value + depth = entry.depth logger = self.logger.bind (url=url) def formatCommand (e): - return e.format (url=url, dest=dest.name) + # provide means to disable variable expansion + if e.startswith ('!'): + return e[1:] + else: + return e.format (url=url, dest=dest.name) - def formatPrefix (p): - return p.format (host=urlparse (url).hostname, date=datetime.utcnow ().isoformat ()) + def formatOutput (p): + return p.format (host=url.host, + date=datetime.utcnow ().isoformat (), seqnum=seqnum) def logStats (): logger.info ('stats', uuid='24d92d16-770e-4088-b769-4020e127a7ff', **self.stats) - if urlparse (url).scheme not in self.SCHEME_WHITELIST: + if url.scheme not in self.SCHEME_WHITELIST: self.stats['ignored'] += 1 logStats () self.logger.warning ('scheme not whitelisted', url=url, @@ -329,69 +427,115 @@ class RecursiveController: return dest = tempfile.NamedTemporaryFile (dir=self.tempdir, - prefix=formatPrefix (self.prefix), suffix='.warc.gz', + prefix=os.path.basename (self.output) + '-', suffix='.warc.gz', delete=False) - destpath = os.path.join (self.output, os.path.basename (dest.name)) command = 
list (map (formatCommand, self.command)) - logger.info ('fetch', uuid='1680f384-744c-4b8a-815b-7346e632e8db', command=command, destfile=destpath) - process = await asyncio.create_subprocess_exec (*command, stdout=asyncio.subprocess.PIPE, - stderr=asyncio.subprocess.DEVNULL, stdin=asyncio.subprocess.DEVNULL, - start_new_session=True) - while True: - data = await process.stdout.readline () - if not data: - break - data = json.loads (data) - uuid = data.get ('uuid') - if uuid == '8ee5e9c9-1130-4c5c-88ff-718508546e0c': - links = set (self.policy (map (removeFragment, data.get ('links', [])))) - links.difference_update (self.have) - self.pending.update (links) - elif uuid == '24d92d16-770e-4088-b769-4020e127a7ff': - for k in self.stats.keys (): - self.stats[k] += data.get (k, 0) + logger.info ('fetch', uuid='d1288fbe-8bae-42c8-af8c-f2fa8b41794f', + command=command) + try: + process = await asyncio.create_subprocess_exec (*command, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.DEVNULL, + stdin=asyncio.subprocess.DEVNULL, + start_new_session=True, limit=100*1024*1024) + while True: + data = await process.stdout.readline () + if not data: + break + data = json.loads (data) + uuid = data.get ('uuid') + if uuid == '8ee5e9c9-1130-4c5c-88ff-718508546e0c': + links = set (self.policy (map (lambda x: SetEntry (URL(x).with_fragment(None), depth=depth+1), data.get ('links', [])))) + links.difference_update (self.have) + self.pending.update (links) + elif uuid == '24d92d16-770e-4088-b769-4020e127a7ff': + for k in self.stats.keys (): + self.stats[k] += data.get (k, 0) + logStats () + except asyncio.CancelledError: + # graceful cancellation + process.send_signal (signal.SIGINT) + except Exception as e: + process.kill () + raise e + finally: + code = await process.wait() + if code == 0: + if self.copyLock is None: + # atomically move once finished + lastDestpath = None + while True: + # XXX: must generate a new name every time, otherwise + # this loop never terminates + 
destpath = formatOutput (self.output) + assert destpath != lastDestpath + lastDestpath = destpath + + # python does not have rename(…, …, RENAME_NOREPLACE), + # but this is safe nonetheless, since we’re + # single-threaded + if not os.path.exists (destpath): + # create the directory, so templates like + # /{host}/{date}/… are possible + os.makedirs (os.path.dirname (destpath), exist_ok=True) + os.rename (dest.name, destpath) + break + else: + # atomically (in the context of this process) append to + # existing file + async with self.copyLock: + with open (dest.name, 'rb') as infd, \ + open (self.output, 'ab') as outfd: + shutil.copyfileobj (infd, outfd) + os.unlink (dest.name) + else: + self.stats['crashed'] += 1 logStats () - code = await process.wait() - if code == 0: - # atomically move once finished - os.rename (dest.name, destpath) - else: - self.stats['crashed'] += 1 - logStats () - - def cancel (self): - """ Gracefully cancel this job, waiting for existing workers to shut down """ - self.logger.info ('cancel', - uuid='d58154c8-ec27-40f2-ab9e-e25c1b21cd88', - pending=len (self.pending), have=len (self.have), - running=len (self.running)) - self._quit = True async def run (self): def log (): + # self.have includes running jobs self.logger.info ('recursing', uuid='5b8498e4-868d-413c-a67e-004516b8452c', - pending=len (self.pending), have=len (self.have), + pending=len (self.pending), + have=len (self.have)-len(self.running), running=len (self.running)) - self.have = set () - self.pending = set ([self.url]) - - while self.pending and not self._quit: - # since pending is a set this picks a random item, which is fine - u = self.pending.pop () - self.have.add (u) - t = asyncio.ensure_future (self.fetch (u)) - self.running.add (t) - + seqnum = 1 + try: + self.have = set () + self.pending = set ([SetEntry (self.url, depth=0)]) + + while self.pending: + # since pending is a set this picks a random item, which is fine + u = self.pending.pop () + self.have.add (u) + t = 
asyncio.ensure_future (self.fetch (u, seqnum)) + self.running.add (t) + seqnum += 1 + + log () + + if len (self.running) >= self.concurrency or not self.pending: + done, pending = await asyncio.wait (self.running, + return_when=asyncio.FIRST_COMPLETED) + self.running.difference_update (done) + # propagate exceptions + for r in done: + r.result () + except asyncio.CancelledError: + self.logger.info ('cancel', + uuid='d58154c8-ec27-40f2-ab9e-e25c1b21cd88', + pending=len (self.pending), + have=len (self.have)-len (self.running), + running=len (self.running)) + finally: + done = await asyncio.gather (*self.running, + return_exceptions=True) + # propagate exceptions + for r in done: + if isinstance (r, Exception): + raise r + self.running = set () log () - if len (self.running) >= self.concurrency or not self.pending: - done, pending = await asyncio.wait (self.running, - return_when=asyncio.FIRST_COMPLETED) - self.running.difference_update (done) - - done = asyncio.gather (*self.running) - self.running = set () - log () - diff --git a/crocoite/data/click.yaml b/crocoite/data/click.yaml index f88d24d..78278b9 100644 --- a/crocoite/data/click.yaml +++ b/crocoite/data/click.yaml @@ -2,91 +2,116 @@ # Example URLs are random. Believe me. match: ^www\.facebook\.com$ selector: - - description: show more comments - selector: a.UFIPagerLink[role=button] + - description: Show comments and replies/nested comments on user pages. + selector: form[action="/ajax/ufi/modify.php"] a[data-testid^="UFI2CommentsPagerRenderer/pager_depth_"] urls: ["https://www.facebook.com/tagesschau"] - - description: show nested comments - selector: a.UFICommentLink[role=button] - - description: initially show comments below a single post/video, i.e. /user/post/123 - selector: form.commentable_item a[data-comment-prelude-ref=action_link_bling][rel=ignore] + - description: Initially show comments below a single post/video, i.e. /user/post/123. 
+ selector: form[action="/ajax/ufi/modify.php"] a[data-testid="UFI2CommentsCount/root"] urls: ["https://www.facebook.com/tagesschau/posts/10157061068659407"] - - description: close the “register now” nag screen. for better screen shots + - description: Close the “register now” nag screen. For screenshots. selector: a#expanding_cta_close_button[role=button] urls: ["https://www.facebook.com/tagesschau"] --- match: ^twitter\.com$ selector: - - description: expand threads + - description: Expand threads. selector: a.ThreadedConversation-moreRepliesLink urls: ["https://twitter.com/realDonaldTrump/status/1068826073775964160"] - - description: show hidden profiles + - description: Show hidden profiles. selector: button.ProfileWarningTimeline-button urls: ["https://twitter.com/CookieCyboid"] - - description: show hidden/sensitive media. For screen-/snapshots. + - description: Show hidden/sensitive media. For screen-/snapshots. selector: button.Tombstone-action.js-display-this-media urls: ["https://twitter.com/CookieCyboid/status/1070807283305713665"] + - description: Show more replies. + selector: button.ThreadedConversation-showMoreThreadsButton + urls: ["https://twitter.com/fuglydug/status/1172160128101076995"] --- match: ^disqus\.com$ selector: - - description: load more comments + - description: Load more comments. selector: a.load-more__button multi: True --- -match: ^(www|np)\.reddit\.com$ +# new layout +match: ^www\.reddit\.com$ selector: - - description: show more comments, reddit’s javascript ignores events if too frequent - selector: span.morecomments a + - description: Show more comments. + selector: div[id^=moreComments-] > div > p + # reddit’s javascript ignores events if too frequent throttle: 500 - # disabled: No idea why it is not working. The selector is fine. 
- #urls: ["https://www.reddit.com/r/funny/comments/a21rxz/well_this_was_a_highlight_of_my_day/"] + urls: ["https://www.reddit.com/r/subredditcancer/comments/b2b80f/we_are_moderators_of_rwatchpeopledie_amaa_just/"] --- -match: ^www\.instagram\.com$ +# old layout +match: ^(old|np)\.reddit\.com$ selector: - - description: load more comments - selector: article div ul li button[type=button] - multi: True - urls: ["https://www.instagram.com/p/BqvAm_XnmdJ/"] + - description: Show more comments. + selector: span.morecomments a + # reddit’s javascript ignores events if too frequent + throttle: 500 + urls: ["https://old.reddit.com/r/subredditcancer/comments/b2b80f/we_are_moderators_of_rwatchpeopledie_amaa_just/"] --- match: ^www\.youtube\.com$ selector: - - description: expand comment thread - selector: ytd-comment-thread-renderer div.more-button + - description: Expand single comment. + selector: ytd-comment-thread-renderer span[slot=more-button] urls: ["https://www.youtube.com/watch?v=udtFqQuBFSc"] + - description: Show more comment thread replies. + selector: div.ytd-comment-replies-renderer > yt-next-continuation > paper-button + urls: ["https://www.youtube.com/watch?v=Lov0T3eXI2k"] + multi: True --- match: ^www\.patreon\.com$ selector: - - description: load more content - # this selector is so long, because there are no stable css classes - selector: div.col-xs-12 > div > div > div > div[display="flex"] > div > button[tabindex="0"][color="tertiary"][type="button"] - urls: ["https://www.patreon.com/nkjemisin"] - - description: load more comments - selector: div[display=flex] div[display=block] a[color="dark"][role="button"][tabindex="0"] + - description: Load more comments. 
+ selector: div[data-tag=post-card] button[data-tag=loadMoreCommentsCta] urls: ["https://www.patreon.com/posts/what-im-on-22124040"] - - description: load more replies - selector: div > a[scale="0"][color=blue][size="1"] --- -match: ^(www\.)?gab\.ai$ +match: ^(www\.)?gab\.com$ selector: - - description: more replies - selector: post-detail post-comment .post-comment__replies__count a - urls: ["https://gab.ai/gab/posts/40014689"] - - description: more comments - selector: post-detail .post-comment-list__loading a - urls: ["https://gab.ai/gab/posts/41804462"] - - description: more posts - selector: post-list a.post-list__load-more + - description: Load more posts. + selector: div.item-list[role=feed] button.load-more multi: True - urls: ["https://gab.ai/gab"] + urls: ["https://gab.com/gab"] --- match: ^(www\.)?github\.com$ selector: - - description: show hidden issue items + - description: Show hidden issue items. urls: ["https://github.com/dominictarr/event-stream/issues/116"] selector: div#discussion_bucket form.ajax-pagination-form button.ajax-pagination-btn --- match: ^www\.gamasutra\.com$ selector: - - description: Load more comments + - description: Load more comments. urls: ["http://www.gamasutra.com/blogs/RaminShokrizade/20130626/194933/The_Top_F2P_Monetization_Tricks.php"] selector: div#dynamiccomments div.viewTopCmts a - +--- +match: ^(www\.)?steamcommunity\.com$ +selector: + - description: Load more content. + urls: ["https://steamcommunity.com/app/252950/reviews/?p=1&browsefilter=toprated&filterLanguage=all"] + selector: "#GetMoreContentBtn a" + multi: True +--- +match: ^imgur\.com$ +selector: + - description: Load more images of an album. + urls: ["https://imgur.com/a/JG1yc"] + selector: div.js-post-truncated a.post-loadall + - description: Expand all comments. For snapshots. + urls: ["https://imgur.com/a/JG1yc"] + selector: div.comments-info span.comments-expand + - description: Show bad replies. For snapshots. 
+ urls: ["https://imgur.com/gallery/jRzMfRG"] + selector: div#comments div.bad-captions a.link +--- +match: ^(www\.)?vimeo\.com$ +selector: + - description: Load more videos on profile page. + urls: ["https://vimeo.com/dsam4a"] + selector: div.profile_main div.profile-load-more__button--wrapper button +# XXX: this works when using a non-headless browser, but does not otherwise +# - description: Expand video comments +# urls: ["https://vimeo.com/22439234"] +# selector: section#comments button.iris_comment-more +# multi: True diff --git a/crocoite/data/cookies.txt b/crocoite/data/cookies.txt new file mode 100644 index 0000000..6ac62c3 --- /dev/null +++ b/crocoite/data/cookies.txt @@ -0,0 +1,9 @@ +# Default cookies for crocoite. This file does *not* use Netscape’s cookie +# file format. Lines are expected to be in Set-Cookie format. +# And this line is a comment. + +# Reddit: +# skip over 18 prompt +over18=1; Domain=www.reddit.com +# skip quarantined subreddit prompt +_options={%22pref_quarantine_optin%22:true}; Domain=www.reddit.com diff --git a/crocoite/data/extract-links.js b/crocoite/data/extract-links.js index 4d1a3d0..5a4f9f0 100644 --- a/crocoite/data/extract-links.js +++ b/crocoite/data/extract-links.js @@ -25,11 +25,26 @@ function isClickable (o) { } /* --- end copy&paste */ -let x = document.body.querySelectorAll('a[href]'); let ret = []; +['a[href]', 'area[href]'].forEach (function (s) { + let x = document.querySelectorAll(s); + for (let i=0; i < x.length; i++) { + if (isClickable (x[i])) { + ret.push (x[i].href); + } + } +}); + +/* If Chrome loads plain-text documents it’ll wrap them into <pre>. Check those + * for links as well, assuming the whole line is a link (i.e. list of links). 
*/ +let x = document.querySelectorAll ('body > pre'); for (let i=0; i < x.length; i++) { - if (isClickable (x[i])) { - ret.push (x[i].href); + if (isVisible (x[i])) { + x[i].innerText.split ('\n').forEach (function (s) { + if (s.match ('^https?://')) { + ret.push (s); + } + }); } } return ret; /* immediately return results, for use with Runtime.evaluate() */ diff --git a/crocoite/data/screenshot.js b/crocoite/data/screenshot.js new file mode 100644 index 0000000..a9a41e1 --- /dev/null +++ b/crocoite/data/screenshot.js @@ -0,0 +1,20 @@ +/* Find and scrollable full-screen elements and return their actual size + */ +(function () { +/* limit the number of elements queried */ +let elem = document.querySelectorAll ('body > div'); +let ret = []; +for (let i = 0; i < elem.length; i++) { + let e = elem[i]; + let s = window.getComputedStyle (e); + if (s.getPropertyValue ('position') == 'fixed' && + s.getPropertyValue ('overflow') == 'auto' && + s.getPropertyValue ('left') == '0px' && + s.getPropertyValue ('right') == '0px' && + s.getPropertyValue ('top') == '0px' && + s.getPropertyValue ('bottom') == '0px') { + ret.push (e.scrollHeight); + } +} +return ret; /* immediately return results, for use with Runtime.evaluate() */ +})(); diff --git a/crocoite/devtools.py b/crocoite/devtools.py index b071d2e..8b5c69d 100644 --- a/crocoite/devtools.py +++ b/crocoite/devtools.py @@ -25,7 +25,12 @@ Communication with Google Chrome through its DevTools protocol. import json, asyncio, logging, os from tempfile import mkdtemp import shutil +from http.cookies import Morsel + import aiohttp, websockets +from yarl import URL + +from .util import StrJsonEncoder logger = logging.getLogger (__name__) @@ -37,18 +42,17 @@ class Browser: Destroyed upon exit. 
""" - __slots__ = ('session', 'url', 'tab', 'loop') + __slots__ = ('session', 'url', 'tab') - def __init__ (self, url, loop=None): - self.url = url + def __init__ (self, url): + self.url = URL (url) self.session = None self.tab = None - self.loop = loop async def __aiter__ (self): """ List all tabs """ - async with aiohttp.ClientSession (loop=self.loop) as session: - async with session.get ('{}/json/list'.format (self.url)) as r: + async with aiohttp.ClientSession () as session: + async with session.get (self.url.with_path ('/json/list')) as r: resp = await r.json () for tab in resp: if tab['type'] == 'page': @@ -58,22 +62,35 @@ class Browser: """ Create tab """ assert self.tab is None assert self.session is None - self.session = aiohttp.ClientSession (loop=self.loop) - async with self.session.get ('{}/json/new'.format (self.url)) as r: + self.session = aiohttp.ClientSession () + async with self.session.get (self.url.with_path ('/json/new')) as r: resp = await r.json () self.tab = await Tab.create (**resp) return self.tab - async def __aexit__ (self, *args): + async def __aexit__ (self, excType, excValue, traceback): assert self.tab is not None assert self.session is not None + await self.tab.close () - async with self.session.get ('{}/json/close/{}'.format (self.url, self.tab.id)) as r: - resp = await r.text () - assert resp == 'Target is closing' + + try: + async with self.session.get (self.url.with_path (f'/json/close/{self.tab.id}')) as r: + resp = await r.text () + assert resp == 'Target is closing' + except aiohttp.client_exceptions.ClientConnectorError: + # oh boy, the whole browser crashed instead + if excType is Crashed: + # exception is reraised by `return False` + pass + else: + # this one is more important + raise + self.tab = None await self.session.close () self.session = None + return False class TabFunction: @@ -101,13 +118,13 @@ class TabFunction: return hash (self.name) def __getattr__ (self, k): - return TabFunction ('{}.{}'.format (self.name, 
k), self.tab) + return TabFunction (f'{self.name}.{k}', self.tab) async def __call__ (self, **kwargs): return await self.tab (self.name, **kwargs) def __repr__ (self): - return '<TabFunction {}>'.format (self.name) + return f'<TabFunction {self.name}>' class TabException (Exception): pass @@ -154,8 +171,8 @@ class Tab: self.msgid += 1 message = {'method': method, 'params': kwargs, 'id': msgid} t = self.transactions[msgid] = {'event': asyncio.Event (), 'result': None} - logger.debug ('← {}'.format (message)) - await self.ws.send (json.dumps (message)) + logger.debug (f'← {message}') + await self.ws.send (json.dumps (message, cls=StrJsonEncoder)) await t['event'].wait () ret = t['result'] del self.transactions[msgid] @@ -189,7 +206,7 @@ class Tab: # right now we cannot recover from this await markCrashed (e) break - logger.debug ('→ {}'.format (msg)) + logger.debug (f'→ {msg}') if 'id' in msg: msgid = msg['id'] t = self.transactions.get (msgid, None) @@ -266,11 +283,11 @@ class Process: async def __aenter__ (self): assert self.p is None - self.userDataDir = mkdtemp () + self.userDataDir = mkdtemp (prefix=__package__ + '-chrome-userdata-') # see https://github.com/GoogleChrome/chrome-launcher/blob/master/docs/chrome-flags-for-tools.md args = [self.binary, '--window-size={},{}'.format (*self.windowSize), - '--user-data-dir={}'.format (self.userDataDir), # use temporary user dir + f'--user-data-dir={self.userDataDir}', # use temporary user dir '--no-default-browser-check', '--no-first-run', # don’t show first run screen '--disable-breakpad', # no error reports @@ -315,12 +332,26 @@ class Process: if port is None: raise Exception ('Chrome died on us.') - return 'http://localhost:{}'.format (port) + return URL.build(scheme='http', host='localhost', port=port) async def __aexit__ (self, *exc): - self.p.terminate () - await self.p.wait () - shutil.rmtree (self.userDataDir) + try: + self.p.terminate () + await self.p.wait () + except ProcessLookupError: + # ok, fine, dead 
already + pass + + # Try to delete the temporary directory multiple times. It looks like + # Chrome will change files in there even after it exited (i.e. .wait() + # returned). Very strange. + for i in range (5): + try: + shutil.rmtree (self.userDataDir) + break + except: + await asyncio.sleep (0.2) + self.p = None return False @@ -328,7 +359,7 @@ class Passthrough: __slots__ = ('url', ) def __init__ (self, url): - self.url = url + self.url = URL (url) async def __aenter__ (self): return self.url @@ -336,3 +367,26 @@ class Passthrough: async def __aexit__ (self, *exc): return False +def toCookieParam (m): + """ + Convert Python’s http.cookies.Morsel to Chrome’s CookieParam, see + https://chromedevtools.github.io/devtools-protocol/1-3/Network#type-CookieParam + """ + + assert isinstance (m, Morsel) + + out = {'name': m.key, 'value': m.value} + + # unsupported by chrome + for k in ('max-age', 'comment', 'version'): + if m[k]: + raise ValueError (f'Unsupported cookie attribute {k} set, cannot convert') + + for mname, cname in [('expires', None), ('path', None), ('domain', None), ('secure', None), ('httponly', 'httpOnly')]: + value = m[mname] + if value: + cname = cname or mname + out[cname] = value + + return out + diff --git a/crocoite/html.py b/crocoite/html.py index fec9760..30f6ca5 100644 --- a/crocoite/html.py +++ b/crocoite/html.py @@ -107,6 +107,8 @@ eventAttributes = {'onabort', 'onvolumechange', 'onwaiting'} +default_namespace = constants.namespaces["html"] + class ChromeTreeWalker (TreeWalker): """ Recursive html5lib TreeWalker for Google Chrome method DOM.getDocument @@ -122,11 +124,14 @@ class ChromeTreeWalker (TreeWalker): elif name == '#document': for child in node.get ('children', []): yield from self.recurse (child) + elif name == '#cdata-section': + # html5lib cannot generate cdata, so we’re faking it by using + # an empty tag + yield from self.emptyTag (default_namespace, + '![CDATA[' + node['nodeValue'] + ']]', {}) else: - assert False, name + 
assert False, (name, node) else: - default_namespace = constants.namespaces["html"] - attributes = node.get ('attributes', []) convertedAttr = {} for i in range (0, len (attributes), 2): diff --git a/crocoite/irc.py b/crocoite/irc.py index 99485e4..d9c0634 100644 --- a/crocoite/irc.py +++ b/crocoite/irc.py @@ -22,16 +22,19 @@ IRC bot “chromebot” """ -import asyncio, argparse, uuid, json, tempfile +import asyncio, argparse, json, tempfile, time, random, os, shlex from datetime import datetime from urllib.parse import urlsplit -from enum import IntEnum, Enum +from enum import IntEnum, unique from collections import defaultdict from abc import abstractmethod from functools import wraps import bottom import websockets +from .util import StrJsonEncoder +from .cli import cookie + ### helper functions ### def prettyTimeDelta (seconds): """ @@ -53,7 +56,7 @@ def prettyBytes (b): while b >= 1024 and len (prefixes) > 1: b /= 1024 prefixes.pop (0) - return '{:.1f} {}'.format (b, prefixes[0]) + return f'{b:.1f} {prefixes[0]}' def isValidUrl (s): url = urlsplit (s) @@ -84,13 +87,45 @@ class Status(IntEnum): aborted = 3 finished = 4 +# see https://arxiv.org/html/0901.4016 on how to build proquints (human +# pronounceable unique ids) +toConsonant = 'bdfghjklmnprstvz' +toVowel = 'aiou' + +def u16ToQuint (v): + """ Transform a 16 bit unsigned integer into a single quint """ + assert 0 <= v < 2**16 + # quints are “big-endian” + return ''.join ([ + toConsonant[(v>>(4+2+4+2))&0xf], + toVowel[(v>>(4+2+4))&0x3], + toConsonant[(v>>(4+2))&0xf], + toVowel[(v>>4)&0x3], + toConsonant[(v>>0)&0xf], + ]) + +def uintToQuint (v, length=2): + """ Turn any integer into a proquint with fixed length """ + assert 0 <= v < 2**(length*16) + + return '-'.join (reversed ([u16ToQuint ((v>>(x*16))&0xffff) for x in range (length)])) + +def makeJobId (): + """ Create job id from time and randomness source """ + # allocate 48 bits for the time (in milliseconds) and add 16 random bits + # at the end (just to be 
sure) for a total of 64 bits. Should be enough to + # avoid collisions. + randbits = 16 + stamp = (int (time.time ()*1000) << randbits) | random.randint (0, 2**randbits-1) + return uintToQuint (stamp, 4) + class Job: """ Archival job """ __slots__ = ('id', 'stats', 'rstats', 'started', 'finished', 'nick', 'status', 'process', 'url') def __init__ (self, url, nick): - self.id = str (uuid.uuid4 ()) + self.id = makeJobId () self.stats = {} self.rstats = {} self.started = datetime.utcnow () @@ -104,32 +139,40 @@ class Job: def formatStatus (self): stats = self.stats rstats = self.rstats - return '{} ({}) {}. {} pages finished, {} pending; {} crashed, {} requests, {} failed, {} received.'.format ( - self.url, - self.id, - self.status.name, - rstats.get ('have', 0), - rstats.get ('pending', 0), - stats.get ('crashed', 0), - stats.get ('requests', 0), - stats.get ('failed', 0), - prettyBytes (stats.get ('bytesRcv', 0))) - -class NickMode(Enum): - operator = '@' - voice = '+' + return (f"{self.url} ({self.id}) {self.status.name}. 
" + f"{rstats.get ('have', 0)} pages finished, " + f"{rstats.get ('pending', 0)} pending; " + f"{stats.get ('crashed', 0)} crashed, " + f"{stats.get ('requests', 0)} requests, " + f"{stats.get ('failed', 0)} failed, " + f"{prettyBytes (stats.get ('bytesRcv', 0))} received.") + +@unique +class NickMode(IntEnum): + # the actual numbers don’t matter, but their order must be strictly + # increasing (with priviledge level) + operator = 100 + voice = 10 @classmethod def fromMode (cls, mode): return {'v': cls.voice, 'o': cls.operator}[mode] + @classmethod + def fromNickPrefix (cls, mode): + return {'@': cls.operator, '+': cls.voice}[mode] + + @property + def human (self): + return {self.operator: 'operator', self.voice: 'voice'}[self] + class User: """ IRC user """ __slots__ = ('name', 'modes') - def __init__ (self, name, modes=set ()): + def __init__ (self, name, modes=None): self.name = name - self.modes = modes + self.modes = modes or set () def __eq__ (self, b): return self.name == b.name @@ -138,15 +181,21 @@ class User: return hash (self.name) def __repr__ (self): - return '<User {} {}>'.format (self.name, self.modes) + return f'<User {self.name} {self.modes}>' + + def hasPriv (self, p): + if p is None: + return True + else: + return self.modes and max (self.modes) >= p @classmethod def fromName (cls, name): """ Get mode and name from NAMES command """ try: - modes = {NickMode(name[0])} + modes = {NickMode.fromNickPrefix (name[0])} name = name[1:] - except ValueError: + except KeyError: modes = set () return cls (name, modes) @@ -159,7 +208,8 @@ class ReplyContext: self.user = user def __call__ (self, message): - self.client.send ('PRIVMSG', target=self.target, message='{}: {}'.format (self.user.name, message)) + self.client.send ('PRIVMSG', target=self.target, + message=f'{self.user.name}: {message}') class RefCountEvent: """ @@ -200,9 +250,9 @@ class ArgparseBot (bottom.Client): __slots__ = ('channels', 'nick', 'parser', 'users', '_quit') - def __init__ (self, 
host, port, ssl, nick, logger, channels=[], loop=None): + def __init__ (self, host, port, ssl, nick, logger, channels=None, loop=None): super().__init__ (host=host, port=port, ssl=ssl, loop=loop) - self.channels = channels + self.channels = channels or [] self.nick = nick # map channel -> nick -> user self.users = defaultdict (dict) @@ -259,8 +309,13 @@ class ArgparseBot (bottom.Client): self.send ('JOIN', channel=c) # no need for NAMES here, server sends this automatically - async def onNameReply (self, target, channel_type, channel, users, **kwargs): - self.users[channel] = dict (map (lambda x: (x.name, x), map (User.fromName, users))) + async def onNameReply (self, channel, users, **kwargs): + # channels may be too big for a single message + addusers = dict (map (lambda x: (x.name, x), map (User.fromName, users))) + if channel not in self.users: + self.users[channel] = addusers + else: + self.users[channel].update (addusers) @staticmethod def parseMode (mode): @@ -274,7 +329,7 @@ class ArgparseBot (bottom.Client): ret.append ((action, c)) return ret - async def onMode (self, nick, user, host, channel, modes, params, **kwargs): + async def onMode (self, channel, modes, params, **kwargs): if channel not in self.channels: return @@ -290,7 +345,7 @@ class ArgparseBot (bottom.Client): # unknown mode, ignore pass - async def onPart (self, nick, user, host, message, channel, **kwargs): + async def onPart (self, nick, channel, **kwargs): if channel not in self.channels: return @@ -312,23 +367,27 @@ class ArgparseBot (bottom.Client): async def onMessage (self, nick, target, message, **kwargs): """ Message received """ - if target in self.channels and message.startswith (self.nick): + msgPrefix = self.nick + ':' + if target in self.channels and message.startswith (msgPrefix): user = self.users[target].get (nick, User (nick)) reply = ReplyContext (client=self, target=target, user=user) - # channel message that starts with our nick - command = message.split (' ')[1:] + # 
shlex.split supports quoting arguments, which str.split() does not + command = shlex.split (message[len (msgPrefix):]) try: args = self.parser.parse_args (command) except Exception as e: - reply ('{} -- {}'.format (e.args[1], e.args[0].format_usage ())) + reply (f'{e.args[1]} -- {e.args[0].format_usage ()}') return - if not args: - reply ('Sorry, I don’t understand {}'.format (command)) + if not args or not hasattr (args, 'func'): + reply (f'Sorry, I don’t understand {command}') return + minPriv = getattr (args, 'minPriv', None) if self._quit.armed and not getattr (args, 'allowOnShutdown', False): reply ('Sorry, I’m shutting down and cannot accept your request right now.') + elif not user.hasPriv (minPriv): + reply (f'Sorry, you need the privilege {minPriv.human} to use this command.') else: with self._quit: await args.func (user=user, args=args, reply=reply) @@ -336,23 +395,14 @@ class ArgparseBot (bottom.Client): async def onDisconnect (self, **kwargs): """ Auto-reconnect """ self.logger.info ('disconnect', uuid='4c74b2c8-2403-4921-879d-2279ad85db72') - if not self._quit.armed: - await asyncio.sleep (10, loop=self.loop) - self.logger.info ('reconnect', uuid='c53555cb-e1a4-4b69-b1c9-3320269c19d7') - await self.connect () - -def voice (func): - """ Calling user must have voice or ops """ - @wraps (func) - async def inner (self, *args, **kwargs): - user = kwargs.get ('user') - reply = kwargs.get ('reply') - if not user.modes.intersection ({NickMode.operator, NickMode.voice}): - reply ('Sorry, you must have voice to use this command.') - else: - ret = await func (self, *args, **kwargs) - return ret - return inner + while True: + if not self._quit.armed: + await asyncio.sleep (10, loop=self.loop) + self.logger.info ('reconnect', uuid='c53555cb-e1a4-4b69-b1c9-3320269c19d7') + try: + await self.connect () + finally: + break def jobExists (func): """ Chromebot job exists """ @@ -363,38 +413,45 @@ def jobExists (func): reply = kwargs.get ('reply') j = self.jobs.get 
(args.id, None) if not j: - reply ('Job {} is unknown'.format (args.id)) + reply (f'Job {args.id} is unknown') else: ret = await func (self, job=j, **kwargs) return ret return inner class Chromebot (ArgparseBot): - __slots__ = ('jobs', 'tempdir', 'destdir', 'processLimit') + __slots__ = ('jobs', 'tempdir', 'destdir', 'processLimit', 'blacklist', 'needVoice') + + def __init__ (self, host, port, ssl, nick, logger, channels=None, + tempdir=None, destdir='.', processLimit=1, + blacklist={}, needVoice=False, loop=None): + self.needVoice = needVoice - def __init__ (self, host, port, ssl, nick, logger, channels=[], - tempdir=tempfile.gettempdir(), destdir='.', processLimit=1, - loop=None): super().__init__ (host=host, port=port, ssl=ssl, nick=nick, logger=logger, channels=channels, loop=loop) self.jobs = {} - self.tempdir = tempdir + self.tempdir = tempdir or tempfile.gettempdir() self.destdir = destdir self.processLimit = asyncio.Semaphore (processLimit) + self.blacklist = blacklist def getParser (self): parser = NonExitingArgumentParser (prog=self.nick + ': ', add_help=False) subparsers = parser.add_subparsers(help='Sub-commands') archiveparser = subparsers.add_parser('a', help='Archive a site', add_help=False) - #archiveparser.add_argument('--timeout', default=1*60*60, type=int, help='Maximum time for archival', metavar='SEC', choices=[60, 1*60*60, 2*60*60]) - #archiveparser.add_argument('--idle-timeout', default=10, type=int, help='Maximum idle seconds (i.e. 
no requests)', dest='idleTimeout', metavar='SEC', choices=[1, 10, 20, 30, 60]) - #archiveparser.add_argument('--max-body-size', default=None, type=int, dest='maxBodySize', help='Max body size', metavar='BYTES', choices=[1*1024*1024, 10*1024*1024, 100*1024*1024]) archiveparser.add_argument('--concurrency', '-j', default=1, type=int, help='Parallel workers for this job', choices=range (1, 5)) archiveparser.add_argument('--recursive', '-r', help='Enable recursion', choices=['0', '1', 'prefix'], default='0') + archiveparser.add_argument('--insecure', '-k', + help='Disable certificate checking', action='store_true') + # parsing the cookie here, so we can give an early feedback without + # waiting for the job to crash on invalid arguments. + archiveparser.add_argument('--cookie', '-b', type=cookie, + help='Add a cookie', action='append', default=[]) archiveparser.add_argument('url', help='Website URL', type=isValidUrl, metavar='URL') - archiveparser.set_defaults (func=self.handleArchive) + archiveparser.set_defaults (func=self.handleArchive, + minPriv=NickMode.voice if self.needVoice else None) statusparser = subparsers.add_parser ('s', help='Get job status', add_help=False) statusparser.add_argument('id', help='Job id', metavar='UUID') @@ -402,31 +459,70 @@ class Chromebot (ArgparseBot): abortparser = subparsers.add_parser ('r', help='Revoke/abort job', add_help=False) abortparser.add_argument('id', help='Job id', metavar='UUID') - abortparser.set_defaults (func=self.handleAbort, allowOnShutdown=True) + abortparser.set_defaults (func=self.handleAbort, allowOnShutdown=True, + minPriv=NickMode.voice if self.needVoice else None) return parser - @voice + def isBlacklisted (self, url): + for k, v in self.blacklist.items(): + if k.match (url): + return v + return False + async def handleArchive (self, user, args, reply): """ Handle the archive command """ - j = Job (args.url, user.name) - assert j.id not in self.jobs, 'duplicate job id' + msg = self.isBlacklisted (args.url) + 
if msg: + reply (f'{args.url} cannot be queued: {msg}') + return + + # make sure the job id is unique. Since ids are time-based we can just + # wait. + while True: + j = Job (args.url, user.name) + if j.id not in self.jobs: + break + time.sleep (0.01) self.jobs[j.id] = j logger = self.logger.bind (job=j.id) - cmdline = ['crocoite-recursive', args.url, '--tempdir', self.tempdir, - '--prefix', j.id + '-{host}-{date}-', '--policy', - args.recursive, '--concurrency', str (args.concurrency), - self.destdir] - showargs = { 'recursive': args.recursive, 'concurrency': args.concurrency, } + if args.insecure: + showargs['insecure'] = args.insecure + warcinfo = {'chromebot': { + 'jobid': j.id, + 'user': user.name, + 'queued': j.started, + 'url': args.url, + 'recursive': args.recursive, + 'concurrency': args.concurrency, + }} + grabCmd = ['crocoite-single'] + # prefix warcinfo with !, so it won’t get expanded + grabCmd.extend (['--warcinfo', + '!' + json.dumps (warcinfo, cls=StrJsonEncoder)]) + for v in args.cookie: + grabCmd.extend (['--cookie', v.OutputString ()]) + if args.insecure: + grabCmd.append ('--insecure') + grabCmd.extend (['{url}', '{dest}']) + cmdline = ['crocoite', + '--tempdir', self.tempdir, + '--recursion', args.recursive, + '--concurrency', str (args.concurrency), + args.url, + os.path.join (self.destdir, + j.id + '-{host}-{date}-{seqnum}.warc.gz'), + '--'] + grabCmd + strargs = ', '.join (map (lambda x: '{}={}'.format (*x), showargs.items ())) - reply ('{} has been queued as {} with {}'.format (args.url, j.id, strargs)) + reply (f'{args.url} has been queued as {j.id} with {strargs}') logger.info ('queue', user=user.name, url=args.url, cmdline=cmdline, uuid='36cc34a6-061b-4cc5-84a9-4ab6552c8d75') @@ -437,7 +533,7 @@ class Chromebot (ArgparseBot): stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.DEVNULL, stdin=asyncio.subprocess.DEVNULL, - start_new_session=True) + start_new_session=True, limit=100*1024*1024) while True: data = await 
j.process.stdout.readline () if not data: @@ -477,7 +573,6 @@ class Chromebot (ArgparseBot): rstats = job.rstats reply (job.formatStatus ()) - @voice @jobExists async def handleAbort (self, user, args, reply, job): """ Handle abort command """ @@ -541,7 +636,11 @@ class Dashboard: if not buf: return - data = json.loads (buf) + try: + data = json.loads (buf) + except json.decoder.JSONDecodeError: + # ignore invalid + return msgid = data['uuid'] if msgid in self.ignoreMsgid: @@ -554,9 +653,8 @@ class Dashboard: elif msgid == '5c0f9a11-dcd8-4182-a60f-54f4d3ab3687': nesteddata = data['data'] nestedmsgid = nesteddata['uuid'] - if nestedmsgid == '1680f384-744c-4b8a-815b-7346e632e8db': + if nestedmsgid == 'd1288fbe-8bae-42c8-af8c-f2fa8b41794f': del nesteddata['command'] - del nesteddata['destfile'] buf = json.dumps (data) for c in self.clients: diff --git a/crocoite/logger.py b/crocoite/logger.py index cddc42d..ac389ca 100644 --- a/crocoite/logger.py +++ b/crocoite/logger.py @@ -34,6 +34,8 @@ from enum import IntEnum from pytz import utc +from .util import StrJsonEncoder + class Level(IntEnum): DEBUG = 0 INFO = 1 @@ -41,9 +43,9 @@ class Level(IntEnum): ERROR = 3 class Logger: - def __init__ (self, consumer=[], bindings={}): - self.bindings = bindings - self.consumer = consumer + def __init__ (self, consumer=None, bindings=None): + self.bindings = bindings or {} + self.consumer = consumer or [] def __call__ (self, level, *args, **kwargs): if not isinstance (level, Level): @@ -102,24 +104,13 @@ class PrintConsumer (Consumer): sys.stderr.flush () return kwargs -class JsonEncoder (json.JSONEncoder): - def default (self, obj): - if isinstance (obj, datetime): - return obj.isoformat () - - # make sure serialization always succeeds - try: - return json.JSONEncoder.default(self, obj) - except TypeError: - return str (obj) - class JsonPrintConsumer (Consumer): - def __init__ (self, minLevel=Level.INFO): + def __init__ (self, minLevel=Level.DEBUG): self.minLevel = minLevel def 
__call__ (self, **kwargs): if kwargs['level'] >= self.minLevel: - json.dump (kwargs, sys.stdout, cls=JsonEncoder) + json.dump (kwargs, sys.stdout, cls=StrJsonEncoder) sys.stdout.write ('\n') sys.stdout.flush () return kwargs @@ -130,12 +121,12 @@ class DatetimeConsumer (Consumer): return kwargs class WarcHandlerConsumer (Consumer): - def __init__ (self, warc, minLevel=Level.INFO): + def __init__ (self, warc, minLevel=Level.DEBUG): self.warc = warc self.minLevel = minLevel def __call__ (self, **kwargs): if kwargs['level'] >= self.minLevel: - self.warc._writeLog (json.dumps (kwargs, cls=JsonEncoder)) + self.warc._writeLog (json.dumps (kwargs, cls=StrJsonEncoder)) return kwargs diff --git a/crocoite/test_behavior.py b/crocoite/test_behavior.py index 280b35b..1efea08 100644 --- a/crocoite/test_behavior.py +++ b/crocoite/test_behavior.py @@ -18,19 +18,24 @@ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # THE SOFTWARE. -import asyncio, os, yaml, re -from urllib.parse import urlparse +import asyncio, os, yaml, re, math, struct from functools import partial +from operator import attrgetter + import pytest +from yarl import URL +from aiohttp import web import pkg_resources from .logger import Logger from .devtools import Process -from .behavior import Scroll, Behavior -from .controller import SinglePageController +from .behavior import Scroll, Behavior, ExtractLinks, ExtractLinksEvent, Crash, \ + Screenshot, ScreenshotEvent, DomSnapshot, DomSnapshotEvent, mapOrIgnore +from .controller import SinglePageController, EventHandler, ControllerSettings +from .devtools import Crashed with pkg_resources.resource_stream (__name__, os.path.join ('data', 'click.yaml')) as fd: - sites = list (yaml.load_all (fd)) + sites = list (yaml.safe_load_all (fd)) clickParam = [] for o in sites: for s in o['selector']: @@ -67,7 +72,7 @@ class ClickTester (Behavior): # assert any (map (lambda x: x['type'] == 'click', listeners)), listeners return - yield + yield # 
pragma: no cover @pytest.mark.parametrize("url,selector", clickParam) @pytest.mark.asyncio @@ -77,8 +82,10 @@ async def test_click_selectors (url, selector): Make sure the CSS selector exists on an example url """ logger = Logger () + settings = ControllerSettings (idleTimeout=5, timeout=10) # Some selectors are loaded dynamically and require scrolling controller = SinglePageController (url=url, logger=logger, + settings=settings, service=Process (), behavior=[Scroll, partial(ClickTester, selector=selector)]) await controller.run () @@ -87,12 +94,173 @@ matchParam = [] for o in sites: for s in o['selector']: for u in s.get ('urls', []): - matchParam.append ((o['match'], u)) + matchParam.append ((o['match'], URL (u))) @pytest.mark.parametrize("match,url", matchParam) @pytest.mark.asyncio async def test_click_match (match, url): """ Test urls must match """ - host = urlparse (url).netloc - assert re.match (match, host, re.I) + # keep this aligned with click.js + assert re.match (match, url.host, re.I) + + +class AccumHandler (EventHandler): + """ Test adapter that accumulates all incoming items """ + __slots__ = ('data') + + def __init__ (self): + super().__init__ () + self.data = [] + + async def push (self, item): + self.data.append (item) + +async def simpleServer (url, response): + async def f (req): + return web.Response (body=response, status=200, content_type='text/html', charset='utf-8') + + app = web.Application () + app.router.add_route ('GET', url.path, f) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, url.host, url.port) + await site.start() + return runner + +@pytest.mark.asyncio +async def test_extract_links (): + """ + Make sure the CSS selector exists on an example url + """ + + url = URL.build (scheme='http', host='localhost', port=8080) + runner = await simpleServer (url, """<html><head></head> + <body> + <div> + <a href="/relative">foo</a> + <a href="http://example.com/absolute/">foo</a> + <a 
href="https://example.com/absolute/secure">foo</a> + <a href="#anchor">foo</a> + <a href="http://neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_%c3%bcberraschen">foo</a> + + <a href="/hidden/visibility" style="visibility: hidden">foo</a> + <a href="/hidden/display" style="display: none">foo</a> + <div style="display: none"> + <a href="/hidden/display/insidediv">foo</a> + </div> + <!--<a href="/hidden/comment">foo</a>--> + + <p><img src="shapes.png" usemap="#shapes"> + <map name="shapes"><area shape=rect coords="50,50,100,100" href="/map/rect"></map></p> + </div> + </body></html>""") + + try: + handler = AccumHandler () + logger = Logger () + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[ExtractLinks], handler=[handler]) + await controller.run () + + links = [] + for d in handler.data: + if isinstance (d, ExtractLinksEvent): + links.extend (d.links) + assert sorted (links) == sorted ([ + url.with_path ('/relative'), + url.with_fragment ('anchor'), + URL ('http://neue_preise_f%C3%BCr_zahnimplantate_k%C3%B6nnten_sie_%C3%BCberraschen'), + URL ('http://example.com/absolute/'), + URL ('https://example.com/absolute/secure'), + url.with_path ('/hidden/visibility'), # XXX: shall we ignore these as well? + url.with_path ('/map/rect'), + ]) + finally: + await runner.cleanup () + +@pytest.mark.asyncio +async def test_crash (): + """ + Crashing through Behavior works? + """ + + url = URL.build (scheme='http', host='localhost', port=8080) + runner = await simpleServer (url, '<html></html>') + + try: + logger = Logger () + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[Crash]) + with pytest.raises (Crashed): + await controller.run () + finally: + await runner.cleanup () + +@pytest.mark.asyncio +async def test_screenshot (): + """ + Make sure screenshots are taken and have the correct dimensions. We can’t + and don’t want to check their content. 
+ """ + # ceil(0) == 0, so starting with 1 + for expectHeight in (1, Screenshot.maxDim, Screenshot.maxDim+1, Screenshot.maxDim*2+Screenshot.maxDim//2): + url = URL.build (scheme='http', host='localhost', port=8080) + runner = await simpleServer (url, f'<html><body style="margin: 0; padding: 0;"><div style="height: {expectHeight}"></div></body></html>') + + try: + handler = AccumHandler () + logger = Logger () + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[Screenshot], handler=[handler]) + await controller.run () + + screenshots = list (filter (lambda x: isinstance (x, ScreenshotEvent), handler.data)) + assert len (screenshots) == math.ceil (expectHeight/Screenshot.maxDim) + totalHeight = 0 + for s in screenshots: + assert s.url == url + # PNG ident is fixed, IHDR is always the first chunk + assert s.data.startswith (b'\x89PNG\r\n\x1a\n\x00\x00\x00\x0dIHDR') + width, height = struct.unpack ('>II', s.data[16:24]) + assert height <= Screenshot.maxDim + totalHeight += height + # screenshot height is at least canvas height (XXX: get hardcoded + # value from devtools.Process) + assert totalHeight == max (expectHeight, 1080) + finally: + await runner.cleanup () + +@pytest.mark.asyncio +async def test_dom_snapshot (): + """ + Behavior plug-in works, <canvas> is replaced by static image, <script> is + stripped. 
Actual conversion from Chrome DOM to HTML is validated by module + .test_html + """ + + url = URL.build (scheme='http', host='localhost', port=8080) + runner = await simpleServer (url, f'<html><body><p>ÄÖÜäöü</p><script>alert("yes");</script><canvas id="canvas" width="1" height="1">Alternate text.</canvas></body></html>') + + try: + handler = AccumHandler () + logger = Logger () + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[DomSnapshot], handler=[handler]) + await controller.run () + + snapshots = list (filter (lambda x: isinstance (x, DomSnapshotEvent), handler.data)) + assert len (snapshots) == 1 + doc = snapshots[0].document + assert doc.startswith ('<HTML><HEAD><meta charset=utf-8></HEAD><BODY><P>ÄÖÜäöü</P><IMG id=canvas width=1 height=1 src="data:image/png;base64,'.encode ('utf-8')) + assert doc.endswith ('></BODY></HTML>'.encode ('utf-8')) + finally: + await runner.cleanup () + +def test_mapOrIgnore (): + def fail (x): + if x < 50: + raise Exception () + return x+1 + + assert list (mapOrIgnore (fail, range (100))) == list (range (51, 101)) diff --git a/crocoite/test_browser.py b/crocoite/test_browser.py index 06492b1..7084214 100644 --- a/crocoite/test_browser.py +++ b/crocoite/test_browser.py @@ -18,104 +18,30 @@ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # THE SOFTWARE. 
-import asyncio -import pytest +import asyncio, socket from operator import itemgetter -from aiohttp import web from http.server import BaseHTTPRequestHandler +from datetime import datetime + +from yarl import URL +from aiohttp import web +from multidict import CIMultiDict -from .browser import Item, SiteLoader, VarChangeEvent +from hypothesis import given +import hypothesis.strategies as st +from hypothesis.provisional import domains +import pytest + +from .browser import RequestResponsePair, SiteLoader, Request, \ + UnicodeBody, ReferenceTimestamp, Base64Body, UnicodeBody, Request, \ + Response, NavigateError, PageIdle, FrameNavigated from .logger import Logger, Consumer from .devtools import Crashed, Process # if you want to know what’s going on: +#import logging #logging.basicConfig(level=logging.DEBUG) -class TItem (Item): - """ This should be as close to Item as possible """ - - __slots__ = ('bodySend', '_body', '_requestBody') - base = 'http://localhost:8000/' - - def __init__ (self, path, status, headers, bodyReceive, bodySend=None, requestBody=None, failed=False, isRedirect=False): - super ().__init__ () - self.chromeResponse = {'response': {'headers': headers, 'status': status, 'url': self.base + path}} - self.body = bodyReceive, False - self.bodySend = bodyReceive if not bodySend else bodySend - self.requestBody = requestBody, False - self.failed = failed - self.isRedirect = isRedirect - -testItems = [ - TItem ('binary', 200, {'Content-Type': 'application/octet-stream'}, b'\x00\x01\x02', failed=True), - TItem ('attachment', 200, - {'Content-Type': 'text/plain; charset=utf-8', - 'Content-Disposition': 'attachment; filename="attachment.txt"', - }, - 'This is a simple text file with umlauts. 
ÄÖU.'.encode ('utf8'), failed=True), - TItem ('encoding/utf8', 200, {'Content-Type': 'text/plain; charset=utf-8'}, - 'This is a test, äöü μνψκ ¥¥¥¿ýý¡'.encode ('utf8')), - TItem ('encoding/iso88591', 200, {'Content-Type': 'text/plain; charset=ISO-8859-1'}, - 'This is a test, äöü.'.encode ('utf8'), - 'This is a test, äöü.'.encode ('ISO-8859-1')), - TItem ('encoding/latin1', 200, {'Content-Type': 'text/plain; charset=latin1'}, - 'This is a test, äöü.'.encode ('utf8'), - 'This is a test, äöü.'.encode ('latin1')), - TItem ('image', 200, {'Content-Type': 'image/png'}, - # 1×1 png image - b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x00\x00\x00\x00:~\x9bU\x00\x00\x00\nIDAT\x08\x1dc\xf8\x0f\x00\x01\x01\x01\x006_g\x80\x00\x00\x00\x00IEND\xaeB`\x82'), - TItem ('empty', 200, {'Content-Type': 'text/plain'}, b''), - TItem ('headers/duplicate', 200, [('Content-Type', 'text/plain'), ('Duplicate', '1'), ('Duplicate', '2')], b''), - TItem ('headers/fetch/req', 200, {'Content-Type': 'text/plain'}, b''), - TItem ('headers/fetch/html', 200, {'Content-Type': 'text/html'}, - r"""<html><body><script> - let h = new Headers([["custom", "1"]]); - fetch("/headers/fetch/req", {"method": "GET", "headers": h}).then(x => console.log("done")); - </script></body></html>""".encode ('utf8')), - TItem ('redirect/301/empty', 301, {'Location': '/empty'}, b'', isRedirect=True), - TItem ('redirect/301/redirect/301/empty', 301, {'Location': '/redirect/301/empty'}, b'', isRedirect=True), - TItem ('nonexistent', 404, {}, b''), - TItem ('html', 200, {'Content-Type': 'text/html'}, - '<html><body><img src="/image"><img src="/nonexistent"></body></html>'.encode ('utf8')), - TItem ('html/alert', 200, {'Content-Type': 'text/html'}, - '<html><body><script>window.addEventListener("beforeunload", function (e) { e.returnValue = "bye?"; return e.returnValue; }); alert("stopping here"); if (confirm("are you sure?") || prompt ("42?")) { window.location = "/nonexistent"; 
}</script><script>document.write(\'<img src="/image">\');</script></body></html>'.encode ('utf8')), - TItem ('html/fetchPost', 200, {'Content-Type': 'text/html'}, - r"""<html><body><script> - let a = fetch("/html/fetchPost/binary", {"method": "POST", "body": "\x00"}); - let b = fetch("/html/fetchPost/form", {"method": "POST", "body": new URLSearchParams({"data": "!"})}); - let c = fetch("/html/fetchPost/binary/large", {"method": "POST", "body": "\x00".repeat(100*1024)}); - let d = fetch("/html/fetchPost/form/large", {"method": "POST", "body": new URLSearchParams({"data": "!".repeat(100*1024)})}); - </script></body></html>""".encode ('utf8')), - TItem ('html/fetchPost/binary', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'\x00'), - TItem ('html/fetchPost/form', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'data=%21'), - # XXX: these should trigger the need for getRequestPostData, but they don’t. oh well. - TItem ('html/fetchPost/binary/large', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=(100*1024)*b'\x00'), - TItem ('html/fetchPost/form/large', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'data=' + (100*1024)*b'%21'), - ] -testItemMap = dict ([(item.parsedUrl.path, item) for item in testItems]) - -def itemToResponse (item): - async def f (req): - headers = item.response['headers'] - return web.Response(body=item.bodySend, status=item.response['status'], - headers=headers) - return f - -@pytest.fixture -async def server (): - """ Simple HTTP server for testing notifications """ - import logging - logging.basicConfig(level=logging.DEBUG) - app = web.Application(debug=True) - for item in testItems: - app.router.add_route ('*', item.parsedUrl.path, itemToResponse (item)) - runner = web.AppRunner(app) - await runner.setup() - site = web.TCPSite(runner, 'localhost', 8080) - await site.start() - yield app - await runner.cleanup () - class AssertConsumer 
(Consumer): def __call__ (self, **kwargs): assert 'uuid' in kwargs @@ -128,164 +54,334 @@ def logger (): return Logger (consumer=[AssertConsumer ()]) @pytest.fixture -async def loader (server, logger): - def f (path): - if path.startswith ('/'): - path = 'http://localhost:8080{}'.format (path) - return SiteLoader (browser, path, logger) - async with Process () as browser: - yield f - -async def itemsLoaded (l, items): - items = dict ([(i.parsedUrl.path, i) for i in items]) - async for item in l: - assert item.chromeResponse is not None - golden = items.pop (item.parsedUrl.path) - if not golden: - assert False, 'url {} not supposed to be fetched'.format (item.url) - assert item.failed == golden.failed - if item.failed: - # response will be invalid if request failed - if not items: - break - else: - continue - assert item.isRedirect == golden.isRedirect - if golden.isRedirect: - assert item.body is None - else: - assert item.body[0] == golden.body[0] - assert item.requestBody[0] == golden.requestBody[0] - assert item.response['status'] == golden.response['status'] - assert item.statusText == BaseHTTPRequestHandler.responses.get (item.response['status'])[0] - for k, v in golden.responseHeaders: - actual = list (map (itemgetter (1), filter (lambda x: x[0] == k, item.responseHeaders))) - assert v in actual - - # we’re done when everything has been loaded - if not items: - break - -async def literalItem (lf, item, deps=[]): - async with lf (item.parsedUrl.path) as l: - await l.start () - await asyncio.wait_for (itemsLoaded (l, [item] + deps), timeout=30) +async def loader (logger): + async with Process () as browser, SiteLoader (browser, logger) as l: + yield l @pytest.mark.asyncio -async def test_empty (loader): - await literalItem (loader, testItemMap['/empty']) +async def test_crash (loader): + with pytest.raises (Crashed): + await loader.tab.Page.crash () @pytest.mark.asyncio -async def test_headers_duplicate (loader): - """ - Some headers, like Set-Cookie can be 
present multiple times. Chrome - separates these with a newline. - """ - async with loader ('/headers/duplicate') as l: - await l.start () - async for it in l: - if it.parsedUrl.path == '/headers/duplicate': - assert not it.failed - dup = list (filter (lambda x: x[0] == 'Duplicate', it.responseHeaders)) - assert len(dup) == 2 - assert list(sorted(map(itemgetter(1), dup))) == ['1', '2'] - break +async def test_invalidurl (loader): + host = 'nonexistent.example' -@pytest.mark.asyncio -async def test_headers_req (loader): - """ - Custom request headers. JavaScript’s Headers() does not support duplicate - headers, so we can’t generate those. - """ - async with loader ('/headers/fetch/html') as l: - await l.start () - async for it in l: - if it.parsedUrl.path == '/headers/fetch/req': - assert not it.failed - dup = list (filter (lambda x: x[0] == 'custom', it.requestHeaders)) - assert len(dup) == 1 - assert list(sorted(map(itemgetter(1), dup))) == ['1'] - break + # make sure the url does *not* resolve (some DNS intercepting ISP’s mess + # with this) + loop = asyncio.get_event_loop () + try: + resolved = await loop.getaddrinfo (host, None) + except socket.gaierror: + url = URL.build (scheme='http', host=host) + with pytest.raises (NavigateError): + await loader.navigate (url) + else: + pytest.skip (f'host {host} resolved to {resolved}') -@pytest.mark.asyncio -async def test_redirect (loader): - await literalItem (loader, testItemMap['/redirect/301/empty'], [testItemMap['/empty']]) - # chained redirects - await literalItem (loader, testItemMap['/redirect/301/redirect/301/empty'], [testItemMap['/redirect/301/empty'], testItemMap['/empty']]) +timestamp = st.one_of ( + st.integers(min_value=0, max_value=2**32-1), + st.floats (min_value=0, max_value=2**32-1), + ) -@pytest.mark.asyncio -async def test_encoding (loader): - """ Text responses are transformed to UTF-8. Make sure this works - correctly. 
""" - for item in {testItemMap['/encoding/utf8'], testItemMap['/encoding/latin1'], testItemMap['/encoding/iso88591']}: - await literalItem (loader, item) +@given(timestamp, timestamp, timestamp) +def test_referencetimestamp (relativeA, absoluteA, relativeB): + ts = ReferenceTimestamp (relativeA, absoluteA) + absoluteA = datetime.utcfromtimestamp (absoluteA) + absoluteB = ts (relativeB) + assert (absoluteA < absoluteB and relativeA < relativeB) or \ + (absoluteA >= absoluteB and relativeA >= relativeB) + assert abs ((absoluteB - absoluteA).total_seconds () - (relativeB - relativeA)) < 10e-6 -@pytest.mark.asyncio -async def test_binary (loader): - """ Browser should ignore content it cannot display (i.e. octet-stream) """ - await literalItem (loader, testItemMap['/binary']) +def urls (): + """ Build http/https URL """ + scheme = st.sampled_from (['http', 'https']) + # Path must start with a slash + pathSt = st.builds (lambda x: '/' + x, st.text ()) + args = st.fixed_dictionaries ({ + 'scheme': scheme, + 'host': domains (), + 'port': st.one_of (st.none (), st.integers (min_value=1, max_value=2**16-1)), + 'path': pathSt, + 'query_string': st.text (), + 'fragment': st.text (), + }) + return st.builds (lambda x: URL.build (**x), args) -@pytest.mark.asyncio -async def test_image (loader): - """ Images should be displayed inline """ - await literalItem (loader, testItemMap['/image']) +def urlsStr (): + return st.builds (lambda x: str (x), urls ()) -@pytest.mark.asyncio -async def test_attachment (loader): - """ And downloads won’t work in headless mode, even if it’s just a text file """ - await literalItem (loader, testItemMap['/attachment']) +asciiText = st.text (st.characters (min_codepoint=32, max_codepoint=126)) -@pytest.mark.asyncio -async def test_html (loader): - await literalItem (loader, testItemMap['/html'], [testItemMap['/image'], testItemMap['/nonexistent']]) - # make sure alerts are dismissed correctly (image won’t load otherwise) - await literalItem (loader, 
testItemMap['/html/alert'], [testItemMap['/image']]) +def chromeHeaders (): + # token as defined by https://tools.ietf.org/html/rfc7230#section-3.2.6 + token = st.sampled_from('abcdefghijklmnopqrstuvwxyz0123456789!#$%&\'*+-.^_`|~') + # XXX: the value should be asciiText without leading/trailing spaces + return st.dictionaries (token, token) -@pytest.mark.asyncio -async def test_post (loader): - """ XHR POST request with binary data""" - await literalItem (loader, testItemMap['/html/fetchPost'], - [testItemMap['/html/fetchPost/binary'], - testItemMap['/html/fetchPost/binary/large'], - testItemMap['/html/fetchPost/form'], - testItemMap['/html/fetchPost/form/large']]) +def fixedDicts (fixed, dynamic): + return st.builds (lambda x, y: x.update (y), st.fixed_dictionaries (fixed), st.lists (dynamic)) -@pytest.mark.asyncio -async def test_crash (loader): - async with loader ('/html') as l: - await l.start () - with pytest.raises (Crashed): - await l.tab.Page.crash () +def chromeRequestWillBeSent (reqid, url): + methodSt = st.sampled_from (['GET', 'POST', 'PUT', 'DELETE']) + return st.fixed_dictionaries ({ + 'requestId': reqid, + 'initiator': st.just ('Test'), + 'wallTime': timestamp, + 'timestamp': timestamp, + 'request': st.fixed_dictionaries ({ + 'url': url, + 'method': methodSt, + 'headers': chromeHeaders (), + # XXX: postData, hasPostData + }) + }) -@pytest.mark.asyncio -async def test_invalidurl (loader): - url = 'http://nonexistent.example/' - async with loader (url) as l: - await l.start () - async for it in l: - assert it.failed - break +def chromeResponseReceived (reqid, url): + mimeTypeSt = st.one_of (st.none (), st.just ('text/html')) + remoteIpAddressSt = st.one_of (st.none (), st.just ('127.0.0.1')) + protocolSt = st.one_of (st.none (), st.just ('h2')) + statusCodeSt = st.integers (min_value=100, max_value=999) + typeSt = st.sampled_from (['Document', 'Stylesheet', 'Image', 'Media', + 'Font', 'Script', 'TextTrack', 'XHR', 'Fetch', 'EventSource', + 
'WebSocket', 'Manifest', 'SignedExchange', 'Ping', + 'CSPViolationReport', 'Other']) + return st.fixed_dictionaries ({ + 'requestId': reqid, + 'timestamp': timestamp, + 'type': typeSt, + 'response': st.fixed_dictionaries ({ + 'url': url, + 'requestHeaders': chromeHeaders (), # XXX: make this optional + 'headers': chromeHeaders (), + 'status': statusCodeSt, + 'statusText': asciiText, + 'mimeType': mimeTypeSt, + 'remoteIPAddress': remoteIpAddressSt, + 'protocol': protocolSt, + }) + }) + +def chromeReqResp (): + # XXX: will this generate the same url for all testcases? + reqid = st.shared (st.text (), 'reqresp') + url = st.shared (urlsStr (), 'reqresp') + return st.tuples (chromeRequestWillBeSent (reqid, url), + chromeResponseReceived (reqid, url)) + +def requestResponsePair (): + def f (creq, cresp, hasPostData, reqBody, respBody): + i = RequestResponsePair () + i.fromRequestWillBeSent (creq) + i.request.hasPostData = hasPostData + if hasPostData: + i.request.body = reqBody + + if cresp is not None: + i.fromResponseReceived (cresp) + if respBody is not None: + i.response.body = respBody + return i + + bodySt = st.one_of ( + st.none (), + st.builds (UnicodeBody, st.text ()), + st.builds (Base64Body.fromBytes, st.binary ()) + ) + return st.builds (lambda reqresp, hasPostData, reqBody, respBody: + f (reqresp[0], reqresp[1], hasPostData, reqBody, respBody), + chromeReqResp (), st.booleans (), bodySt, bodySt) + +@given(chromeReqResp ()) +def test_requestResponsePair (creqresp): + creq, cresp = creqresp + + item = RequestResponsePair () + + assert item.id is None + assert item.url is None + assert item.request is None + assert item.response is None + + item.fromRequestWillBeSent (creq) + + assert item.id == creq['requestId'] + url = URL (creq['request']['url']) + assert item.url == url + assert item.request is not None + assert item.request.timestamp == datetime.utcfromtimestamp (creq['wallTime']) + assert set (item.request.headers.keys ()) == set
(creq['request']['headers'].keys ()) + assert item.response is None + + item.fromResponseReceived (cresp) + + # url will not be overwritten + assert item.id == creq['requestId'] == cresp['requestId'] + assert item.url == url + assert item.request is not None + assert set (item.request.headers.keys ()) == set (cresp['response']['requestHeaders'].keys ()) + assert item.response is not None + assert set (item.response.headers.keys ()) == set (cresp['response']['headers'].keys ()) + assert (item.response.timestamp - item.request.timestamp).total_seconds () - \ + (cresp['timestamp'] - creq['timestamp']) < 10e-6 + +@given(chromeReqResp ()) +def test_requestResponsePair_eq (creqresp): + creq, cresp = creqresp + + item = RequestResponsePair () + item2 = RequestResponsePair () + assert item == item + assert item == item2 + + item.fromRequestWillBeSent (creq) + assert item != item2 + item2.fromRequestWillBeSent (creq) + assert item == item + assert item == item2 + + item.fromResponseReceived (cresp) + assert item != item2 + item2.fromResponseReceived (cresp) + assert item == item + assert item == item2 + + # XXX: test for inequality with different parameters + +### Google Chrome integration tests ### + +serverUrl = URL.build (scheme='http', host='localhost', port=8080) +items = [ + RequestResponsePair ( + url=serverUrl.with_path ('/encoding/utf-8'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=utf-8')]), + body=UnicodeBody ('äöü'), mimeType='text/html') + ), + RequestResponsePair ( + url=serverUrl.with_path ('/encoding/latin1'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=latin1')]), + body=UnicodeBody ('äöü'), mimeType='text/html') + ), + RequestResponsePair ( + url=serverUrl.with_path ('/encoding/utf-16'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 
'text/html; charset=utf-16')]), + body=UnicodeBody ('äöü'), mimeType='text/html') + ), + RequestResponsePair ( + url=serverUrl.with_path ('/encoding/ISO-8859-1'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=ISO-8859-1')]), + body=UnicodeBody ('äöü'), mimeType='text/html') + ), + RequestResponsePair ( + url=serverUrl.with_path ('/status/200'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/plain')]), + body=b'', + mimeType='text/plain'), + ), + # redirects never have a response body + RequestResponsePair ( + url=serverUrl.with_path ('/status/301'), + request=Request (method='GET'), + response=Response (status=301, + headers=CIMultiDict ([('Content-Type', 'text/plain'), + ('Location', str (serverUrl.with_path ('/status/301/redirected')))]), + body=None, + mimeType='text/plain'), + ), + RequestResponsePair ( + url=serverUrl.with_path ('/image/png'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'image/png')]), + body=Base64Body.fromBytes (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x00\x00\x00\x00:~\x9bU\x00\x00\x00\nIDAT\x08\x1dc\xf8\x0f\x00\x01\x01\x01\x006_g\x80\x00\x00\x00\x00IEND\xaeB`\x82'), + mimeType='image/png'), + ), + RequestResponsePair ( + url=serverUrl.with_path ('/script/alert'), + request=Request (method='GET'), + response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=utf-8')]), + body=UnicodeBody ('''<html><body><script> +window.addEventListener("beforeunload", function (e) { + e.returnValue = "bye?"; + return e.returnValue; +}); +alert("stopping here"); +if (confirm("are you sure?") || prompt ("42?")) { + window.location = "/nonexistent"; +} +</script></body></html>'''), mimeType='text/html') + ), + ] @pytest.mark.asyncio -async def test_varchangeevent (): - e = VarChangeEvent (True) 
- assert e.get () == True - - # no change at all - w = asyncio.ensure_future (e.wait ()) - finished, pending = await asyncio.wait ([w], timeout=0.1) - assert not finished and pending - - # no change - e.set (True) - finished, pending = await asyncio.wait ([w], timeout=0.1) - assert not finished and pending - - # changed - e.set (False) - await asyncio.sleep (0.1) # XXX: is there a yield() ? - assert w.done () - ret = w.result () - assert ret == False - assert e.get () == ret +# would be nice if we could use hypothesis here somehow +@pytest.mark.parametrize("golden", items) +async def test_integration_item (loader, golden): + async def f (req): + body = golden.response.body + contentType = golden.response.headers.get ('content-type', '') if golden.response.headers is not None else '' + charsetOff = contentType.find ('charset=') + if isinstance (body, UnicodeBody) and charsetOff != -1: + encoding = contentType[charsetOff+len ('charset='):] + body = golden.response.body.decode ('utf-8').encode (encoding) + return web.Response (body=body, status=golden.response.status, + headers=golden.response.headers) + + app = web.Application () + app.router.add_route (golden.request.method, golden.url.path, f) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, serverUrl.host, serverUrl.port) + try: + await site.start() + except Exception as e: + pytest.skip (e) + + haveReqResp = False + haveNavigated = False + try: + await loader.navigate (golden.url) + + it = loader.__aiter__ () + while True: + try: + item = await asyncio.wait_for (it.__anext__ (), timeout=1) + except asyncio.TimeoutError: + break + # XXX: can only check the first req/resp right now (due to redirect) + if isinstance (item, RequestResponsePair) and not haveReqResp: + # we do not know this in advance + item.request.initiator = None + item.request.headers = None + item.remoteIpAddress = None + item.protocol = None + item.resourceType = None + + if item.response: + assert 
item.response.statusText is not None + item.response.statusText = None + + del item.response.headers['server'] + del item.response.headers['content-length'] + del item.response.headers['date'] + assert item == golden + haveReqResp = True + elif isinstance (item, FrameNavigated): + # XXX: can’t check this, because of the redirect + #assert item.url == golden.url + haveNavigated = True + finally: + assert haveReqResp + assert haveNavigated + await runner.cleanup () + +def test_page_idle (): + for v in (True, False): + idle = PageIdle (v) + assert bool (idle) == v + diff --git a/crocoite/test_controller.py b/crocoite/test_controller.py new file mode 100644 index 0000000..7216a42 --- /dev/null +++ b/crocoite/test_controller.py @@ -0,0 +1,203 @@ +# Copyright (c) 2017–2018 crocoite contributors +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +import asyncio + +from yarl import URL +from aiohttp import web + +import pytest + +from .logger import Logger +from .controller import ControllerSettings, SinglePageController, SetEntry, \ + IdleStateTracker +from .browser import PageIdle +from .devtools import Process +from .test_browser import loader + +@pytest.mark.asyncio +async def test_controller_timeout (): + """ Make sure the controller terminates, even if the site keeps reloading/fetching stuff """ + + async def f (req): + return web.Response (body="""<html> +<body> +<p>hello</p> +<script> +window.setTimeout (function () { window.location = '/' }, 250); +window.setInterval (function () { fetch('/').then (function (e) { console.log (e) }) }, 150); +</script> +</body> +</html>""", status=200, content_type='text/html', charset='utf-8') + + url = URL.build (scheme='http', host='localhost', port=8080) + app = web.Application () + app.router.add_route ('GET', '/', f) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, url.host, url.port) + await site.start() + + loop = asyncio.get_event_loop () + try: + logger = Logger () + settings = ControllerSettings (idleTimeout=1, timeout=5) + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[], settings=settings) + # give the controller a little more time to finish, since there are + # hard-coded asyncio.sleep calls in there right now. 
+ # XXX fix this + before = loop.time () + await asyncio.wait_for (controller.run (), timeout=settings.timeout*2) + after = loop.time () + assert after-before >= settings.timeout, (settings.timeout*2, after-before) + finally: + # give the browser some time to close before interrupting the + # connection by destroying the HTTP server + await asyncio.sleep (1) + await runner.cleanup () + +@pytest.mark.asyncio +async def test_controller_idle_timeout (): + """ Make sure the controller terminates, even if the site keeps reloading/fetching stuff """ + + async def f (req): + return web.Response (body="""<html> +<body> +<p>hello</p> +<script> +window.setInterval (function () { fetch('/').then (function (e) { console.log (e) }) }, 2000); +</script> +</body> +</html>""", status=200, content_type='text/html', charset='utf-8') + + url = URL.build (scheme='http', host='localhost', port=8080) + app = web.Application () + app.router.add_route ('GET', '/', f) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, url.host, url.port) + await site.start() + + loop = asyncio.get_event_loop () + try: + logger = Logger () + settings = ControllerSettings (idleTimeout=1, timeout=60) + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[], settings=settings) + before = loop.time () + await asyncio.wait_for (controller.run (), settings.timeout*2) + after = loop.time () + assert settings.idleTimeout <= after-before <= settings.idleTimeout*2+3 + finally: + await runner.cleanup () + +def test_set_entry (): + a = SetEntry (1, a=2, b=3) + assert a == a + assert hash (a) == hash (a) + + b = SetEntry (1, a=2, b=4) + assert a == b + assert hash (a) == hash (b) + + c = SetEntry (2, a=2, b=3) + assert a != c + assert hash (a) != hash (c) + +@pytest.mark.asyncio +async def test_idle_state_tracker (): + # default is idle + loop = asyncio.get_event_loop () + idle = IdleStateTracker (loop) + assert idle._idle + + # idle change + await 
idle.push (PageIdle (False)) + assert not idle._idle + + # nothing happens for other objects + await idle.push ({}) + assert not idle._idle + + # no state change -> wait does not return + with pytest.raises (asyncio.TimeoutError): + await asyncio.wait_for (idle.wait (0.1), timeout=1) + + # wait at least timeout + delta = 0.2 + timeout = 1 + await idle.push (PageIdle (True)) + assert idle._idle + start = loop.time () + await idle.wait (timeout) + end = loop.time () + assert (timeout-delta) < (end-start) < (timeout+delta) + +@pytest.fixture +async def recordingServer (): + """ Simple HTTP server that records raw requests """ + url = URL ('http://localhost:8080') + reqs = [] + async def record (request): + reqs.append (request) + return web.Response(text='ok', content_type='text/plain') + app = web.Application() + app.add_routes([web.get(url.path, record)]) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite (runner, url.host, url.port) + await site.start() + yield url, reqs + await runner.cleanup () + +from .test_devtools import tab, browser +from http.cookies import Morsel, SimpleCookie + +@pytest.mark.asyncio +async def test_set_cookies (tab, recordingServer): + """ Make sure cookies are set properly and only affect the domain they were + set for """ + + logger = Logger () + + url, reqs = recordingServer + + cookies = [] + c = Morsel () + c.set ('foo', 'bar', '') + c['domain'] = 'localhost' + cookies.append (c) + c = Morsel () + c.set ('buz', 'beef', '') + c['domain'] = 'nonexistent.example' + + settings = ControllerSettings (idleTimeout=1, timeout=60, cookies=cookies) + controller = SinglePageController (url=url, logger=logger, + service=Process (), behavior=[], settings=settings) + await asyncio.wait_for (controller.run (), settings.timeout*2) + + assert len (reqs) == 1 + req = reqs[0] + reqCookies = SimpleCookie (req.headers['cookie']) + assert len (reqCookies) == 1 + c = next (iter (reqCookies.values ())) + assert c.key == cookies[0].key + 
assert c.value == cookies[0].value diff --git a/crocoite/test_devtools.py b/crocoite/test_devtools.py index 74d223f..bd1a828 100644 --- a/crocoite/test_devtools.py +++ b/crocoite/test_devtools.py @@ -24,7 +24,8 @@ import pytest from aiohttp import web import websockets -from .devtools import Browser, Tab, MethodNotFound, Crashed, InvalidParameter, Process, Passthrough +from .devtools import Browser, Tab, MethodNotFound, Crashed, \ + InvalidParameter, Process, Passthrough @pytest.fixture async def browser (): @@ -38,8 +39,9 @@ async def tab (browser): # make sure there are no transactions left over (i.e. no unawaited requests) assert not tab.transactions +docBody = "<html><body><p>Hello, world</p></body></html>" async def hello(request): - return web.Response(text="Hello, world") + return web.Response(text=docBody, content_type='text/html') @pytest.fixture async def server (): @@ -73,8 +75,10 @@ async def test_tab_close (browser): @pytest.mark.asyncio async def test_tab_notify_enable_disable (tab): - """ Make sure enabling/disabling notifications works for all known namespaces """ - for name in ('Debugger', 'DOM', 'Log', 'Network', 'Page', 'Performance', 'Profiler', 'Runtime', 'Security'): + """ Make sure enabling/disabling notifications works for all known + namespaces """ + for name in ('Debugger', 'DOM', 'Log', 'Network', 'Page', 'Performance', + 'Profiler', 'Runtime', 'Security'): f = getattr (tab, name) await f.enable () await f.disable () @@ -109,14 +113,45 @@ async def test_tab_crash (tab): async def test_load (tab, server): await tab.Network.enable () await tab.Page.navigate (url='http://localhost:8080') - method, req = await tab.get () - assert method == tab.Network.requestWillBeSent - method, resp = await tab.get () - assert method == tab.Network.responseReceived - assert tab.pending == 0 - body = await tab.Network.getResponseBody (requestId=req['requestId']) - assert body['body'] == "Hello, world" + + haveRequest = False + haveResponse = False + haveData 
= False + haveFinished = False + haveBody = False + req = None + resp = None + while not haveBody: + method, data = await tab.get () + + # it can be either of those two in no specified order + if method in (tab.Network.requestWillBeSent, tab.Network.requestWillBeSentExtraInfo) and not haveResponse: + if req is None: + req = data + assert data['requestId'] == req['requestId'] + haveRequest = True + elif method in (tab.Network.responseReceived, tab.Network.responseReceivedExtraInfo) and haveRequest: + if resp is None: + resp = data + assert data['requestId'] == resp['requestId'] + haveResponse = True + elif haveRequest and haveResponse and method == tab.Network.dataReceived: + assert data['dataLength'] == len (docBody) + assert data['requestId'] == req['requestId'] + haveData = True + elif haveData: + assert method == tab.Network.loadingFinished + assert data['requestId'] == req['requestId'] + haveBody = True + elif haveFinished: + body = await tab.Network.getResponseBody (requestId=req['requestId']) + assert body['body'] == docBody + haveBody = True + else: + assert False, (method, req) + await tab.Network.disable () + assert tab.pending == 0 @pytest.mark.asyncio async def test_recv_failure(browser): @@ -149,7 +184,8 @@ async def test_tab_function (tab): @pytest.mark.asyncio async def test_tab_function_hash (tab): - d = {tab.Network.enable: 1, tab.Network.disable: 2, tab.Page: 3, tab.Page.enable: 4} + d = {tab.Network.enable: 1, tab.Network.disable: 2, tab.Page: 3, + tab.Page.enable: 4} assert len (d) == 4 @pytest.mark.asyncio @@ -168,5 +204,5 @@ async def test_passthrough (): url = 'http://localhost:12345' async with Passthrough (url) as u: - assert u == url + assert str (u) == url diff --git a/crocoite/test_html.py b/crocoite/test_html.py index c71697a..c17903b 100644 --- a/crocoite/test_html.py +++ b/crocoite/test_html.py @@ -18,9 +18,11 @@ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # THE SOFTWARE. 
+import asyncio import pytest, html5lib from html5lib.serializer import HTMLSerializer from html5lib.treewalkers import getTreeWalker +from aiohttp import web from .html import StripTagFilter, StripAttributeFilter, ChromeTreeWalker from .test_devtools import tab, browser @@ -58,3 +60,37 @@ async def test_treewalker (tab): elif i == 1: assert result == framehtml +cdataDoc = '<test><![CDATA[Hello world]]></test>' +xmlHeader = '<?xml version="1.0" encoding="UTF-8"?>' +async def hello(request): + return web.Response(text=xmlHeader + cdataDoc, content_type='text/xml') + +@pytest.fixture +async def server (): + """ Simple HTTP server for testing notifications """ + app = web.Application() + app.add_routes([web.get('/test.xml', hello)]) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, 'localhost', 8080) + await site.start() + yield app + await runner.cleanup () + +@pytest.mark.asyncio +async def test_treewalker_cdata (tab, server): + ret = await tab.Page.navigate (url='http://localhost:8080/test.xml') + # wait until loaded XXX: replace with idle check + await asyncio.sleep (0.5) + dom = await tab.DOM.getDocument (depth=-1, pierce=True) + docs = list (ChromeTreeWalker (dom['root']).split ()) + assert len(docs) == 1 + for i, doc in enumerate (docs): + walker = ChromeTreeWalker (doc) + serializer = HTMLSerializer () + result = serializer.render (iter(walker)) + # chrome will display a pretty-printed viewer *plus* the original + # source (stripped of its xml header) + assert cdataDoc in result + + diff --git a/crocoite/test_irc.py b/crocoite/test_irc.py index 4d80a6d..9344de4 100644 --- a/crocoite/test_irc.py +++ b/crocoite/test_irc.py @@ -19,7 +19,7 @@ # THE SOFTWARE. 
import pytest -from .irc import ArgparseBot, RefCountEvent +from .irc import ArgparseBot, RefCountEvent, User, NickMode def test_mode_parse (): assert ArgparseBot.parseMode ('+a') == [('+', 'a')] @@ -51,3 +51,20 @@ def test_refcountevent_arm_with (event): event.arm () assert not event.event.is_set () assert event.event.is_set () + +def test_nick_mode (): + a = User.fromName ('a') + a2 = User.fromName ('a') + a3 = User.fromName ('+a') + b = User.fromName ('+b') + c = User.fromName ('@c') + + # equality is based on name only, not mode + assert a == a2 + assert a == a3 + assert a != b + + assert a.hasPriv (None) and not a.hasPriv (NickMode.voice) and not a.hasPriv (NickMode.operator) + assert b.hasPriv (None) and b.hasPriv (NickMode.voice) and not b.hasPriv (NickMode.operator) + assert c.hasPriv (None) and c.hasPriv (NickMode.voice) and c.hasPriv (NickMode.operator) + diff --git a/crocoite/test_logger.py b/crocoite/test_logger.py index 3af1321..26e420a 100644 --- a/crocoite/test_logger.py +++ b/crocoite/test_logger.py @@ -80,3 +80,12 @@ def test_datetime (logger): ret = logger.debug() assert 'date' in ret +def test_independence (): + """ Make sure two instances are completely independent """ + l1 = Logger () + c = QueueConsumer () + l1.connect (c) + l2 = Logger () + l2.info (nothing='nothing') + assert not c.data + diff --git a/crocoite/test_tools.py b/crocoite/test_tools.py index 947d020..416b954 100644 --- a/crocoite/test_tools.py +++ b/crocoite/test_tools.py @@ -25,9 +25,9 @@ import pytest from warcio.archiveiterator import ArchiveIterator from warcio.warcwriter import WARCWriter from warcio.statusandheaders import StatusAndHeaders +from pkg_resources import parse_version -from .tools import mergeWarc -from .util import packageUrl +from .tools import mergeWarc, Errata, FixableErrata @pytest.fixture def writer(): @@ -48,9 +48,11 @@ def recordsEqual(golden, underTest): def makeGolden(writer, records): # additional warcinfo is written. Content does not matter. 
- record = writer.create_warc_record (packageUrl ('warcinfo'), 'warcinfo', + record = writer.create_warc_record ( + '', + 'warcinfo', payload=b'', - warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'}) + warc_headers_dict={'Content-Type': 'application/json; charset=utf-8'}) records.insert (0, record) return records @@ -96,7 +98,7 @@ def test_different_payload(writer): httpHeaders = StatusAndHeaders('200 OK', {}, protocol='HTTP/1.1') record = writer.create_warc_record ('http://example.com/', 'response', - payload=BytesIO('data{}'.format(i).encode ('utf8')), + payload=BytesIO(f'data{i}'.encode ('utf8')), warc_headers_dict=warcHeaders, http_headers=httpHeaders) records.append (record) @@ -195,3 +197,28 @@ def test_resp_revisit_other_url(writer): output.seek(0) recordsEqual (makeGolden (writer, records), ArchiveIterator (output)) +def test_errata_contains(): + """ Test version matching """ + e = Errata('some-uuid', 'description', ['a<1.0']) + assert {'a': parse_version('0.1')} in e + assert {'a': parse_version('1.0')} not in e + assert {'b': parse_version('1.0')} not in e + + e = Errata('some-uuid', 'description', ['a<1.0,>0.1']) + assert {'a': parse_version('0.1')} not in e + assert {'a': parse_version('0.2')} in e + assert {'a': parse_version('1.0')} not in e + + # a AND b + e = Errata('some-uuid', 'description', ['a<1.0', 'b>1.0']) + assert {'a': parse_version('0.1')} not in e + assert {'b': parse_version('1.1')} not in e + assert {'a': parse_version('0.1'), 'b': parse_version('1.1')} in e + +def test_errata_fixable (): + e = Errata('some-uuid', 'description', ['a<1.0', 'b>1.0']) + assert not e.fixable + + e = FixableErrata('some-uuid', 'description', ['a<1.0', 'b>1.0']) + assert e.fixable + diff --git a/crocoite/test_warc.py b/crocoite/test_warc.py new file mode 100644 index 0000000..3ec310c --- /dev/null +++ b/crocoite/test_warc.py @@ -0,0 +1,225 @@ +# Copyright (c) 2018 crocoite contributors +# +# Permission is hereby granted, free of charge, to any 
person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +from tempfile import NamedTemporaryFile +import json, urllib +from operator import itemgetter + +from warcio.archiveiterator import ArchiveIterator +from yarl import URL +from multidict import CIMultiDict +from hypothesis import given, reproduce_failure +import hypothesis.strategies as st +import pytest + +from .warc import WarcHandler +from .logger import Logger, WarcHandlerConsumer +from .controller import ControllerStart +from .behavior import Script, ScreenshotEvent, DomSnapshotEvent +from .browser import RequestResponsePair, Base64Body, UnicodeBody +from .test_browser import requestResponsePair, urls + +def test_log (): + logger = Logger () + + with NamedTemporaryFile() as fd: + with WarcHandler (fd, logger) as handler: + warclogger = WarcHandlerConsumer (handler) + logger.connect (warclogger) + golden = [] + + assert handler.log.tell () == 0 + golden.append (logger.info (foo=1, bar='baz', encoding='äöü⇔ΓΨ')) + assert handler.log.tell () != 0 + + handler.maxLogSize = 0 + golden.append (logger.info (bar=1, baz='baz')) + # should flush the log + assert handler.log.tell () == 0 + + fd.seek (0) + for it in ArchiveIterator (fd): + headers = it.rec_headers + assert headers['warc-type'] == 'metadata' + assert 'warc-target-uri' not in headers + assert headers['x-crocoite-type'] == 'log' + assert headers['content-type'] == f'application/json; charset={handler.logEncoding}' + + while True: + l = it.raw_stream.readline () + if not l: + break + data = json.loads (l.strip ()) + assert data == golden.pop (0) + +def jsonObject (): + """ JSON-encodable objects """ + return st.dictionaries (st.text (), st.one_of (st.integers (), st.text ())) + +def viewport (): + return st.builds (lambda x, y: f'{x}x{y}', st.integers (), st.integers ()) + +def event (): + return st.one_of ( + st.builds (ControllerStart, jsonObject ()), + st.builds (Script.fromStr, st.text (), st.one_of(st.none (), st.text ())), + st.builds (ScreenshotEvent, urls (), st.integers (), st.binary ()), + 
st.builds (DomSnapshotEvent, urls (), st.builds (lambda x: x.encode ('utf-8'), st.text ()), viewport()), + requestResponsePair (), + ) + +@pytest.mark.asyncio +@given (st.lists (event ())) +async def test_push (golden): + def checkWarcinfoId (headers): + if lastWarcinfoRecordid is not None: + assert headers['WARC-Warcinfo-ID'] == lastWarcinfoRecordid + + lastWarcinfoRecordid = None + + # null logger + logger = Logger () + with open('/tmp/test.warc.gz', 'w+b') as fd: + with WarcHandler (fd, logger) as handler: + for g in golden: + await handler.push (g) + + fd.seek (0) + it = iter (ArchiveIterator (fd)) + for g in golden: + if isinstance (g, ControllerStart): + rec = next (it) + + headers = rec.rec_headers + assert headers['warc-type'] == 'warcinfo' + assert 'warc-target-uri' not in headers + assert 'x-crocoite-type' not in headers + + data = json.load (rec.raw_stream) + assert data == g.payload + + lastWarcinfoRecordid = headers['warc-record-id'] + assert lastWarcinfoRecordid + elif isinstance (g, Script): + rec = next (it) + + headers = rec.rec_headers + assert headers['warc-type'] == 'resource' + assert headers['content-type'] == 'application/javascript; charset=utf-8' + assert headers['x-crocoite-type'] == 'script' + checkWarcinfoId (headers) + if g.path: + assert URL (headers['warc-target-uri']) == URL ('file://' + g.abspath) + else: + assert 'warc-target-uri' not in headers + + data = rec.raw_stream.read ().decode ('utf-8') + assert data == g.data + elif isinstance (g, ScreenshotEvent): + # XXX: check refers-to header + rec = next (it) + + headers = rec.rec_headers + assert headers['warc-type'] == 'conversion' + assert headers['x-crocoite-type'] == 'screenshot' + checkWarcinfoId (headers) + assert URL (headers['warc-target-uri']) == g.url, (headers['warc-target-uri'], g.url) + assert headers['warc-refers-to'] is None + assert int (headers['X-Crocoite-Screenshot-Y-Offset']) == g.yoff + + assert rec.raw_stream.read () == g.data + elif isinstance (g, 
DomSnapshotEvent): + rec = next (it) + + headers = rec.rec_headers + assert headers['warc-type'] == 'conversion' + assert headers['x-crocoite-type'] == 'dom-snapshot' + checkWarcinfoId (headers) + assert URL (headers['warc-target-uri']) == g.url + assert headers['warc-refers-to'] is None + + assert rec.raw_stream.read () == g.document + elif isinstance (g, RequestResponsePair): + rec = next (it) + + # request + headers = rec.rec_headers + assert headers['warc-type'] == 'request' + assert 'x-crocoite-type' not in headers + checkWarcinfoId (headers) + assert URL (headers['warc-target-uri']) == g.url + assert headers['x-chrome-request-id'] == g.id + + assert CIMultiDict (rec.http_headers.headers) == g.request.headers + if g.request.hasPostData: + if g.request.body is not None: + assert rec.raw_stream.read () == g.request.body + else: + # body fetch failed + assert headers['warc-truncated'] == 'unspecified' + assert not rec.raw_stream.read () + else: + assert not rec.raw_stream.read () + + # response + if g.response: + rec = next (it) + headers = rec.rec_headers + httpheaders = rec.http_headers + assert headers['warc-type'] == 'response' + checkWarcinfoId (headers) + assert URL (headers['warc-target-uri']) == g.url + assert headers['x-chrome-request-id'] == g.id + assert 'x-crocoite-type' not in headers + + # these are checked separately + filteredHeaders = CIMultiDict (httpheaders.headers) + for b in {'content-type', 'content-length'}: + if b in g.response.headers: + g.response.headers.popall (b) + if b in filteredHeaders: + filteredHeaders.popall (b) + assert filteredHeaders == g.response.headers + + expectedContentType = g.response.mimeType + if expectedContentType is not None: + assert httpheaders['content-type'].startswith (expectedContentType) + + if g.response.body is not None: + assert rec.raw_stream.read () == g.response.body + assert httpheaders['content-length'] == str (len (g.response.body)) + # body is never truncated if it exists + assert 
headers['warc-truncated'] is None + + # unencoded strings are converted to utf8 + if isinstance (g.response.body, UnicodeBody) and httpheaders['content-type'] is not None: + assert httpheaders['content-type'].endswith ('; charset=utf-8') + else: + # body fetch failed + assert headers['warc-truncated'] == 'unspecified' + assert not rec.raw_stream.read () + # content-length header should be kept intact + else: + assert False, f"invalid golden type {type(g)}" # pragma: no cover + + # no further records + with pytest.raises (StopIteration): + next (it) + diff --git a/crocoite/tools.py b/crocoite/tools.py index e2dc6a7..a2ddaa3 100644 --- a/crocoite/tools.py +++ b/crocoite/tools.py @@ -24,13 +24,23 @@ Misc tools import shutil, sys, os, logging, argparse, json from io import BytesIO + from warcio.archiveiterator import ArchiveIterator from warcio.warcwriter import WARCWriter -from .util import packageUrl, getSoftwareInfo +from yarl import URL + +from pkg_resources import parse_version, parse_requirements + +from .util import getSoftwareInfo, StrJsonEncoder +from .warc import jsonMime, makeContentType def mergeWarc (files, output): + # stats unique = 0 revisit = 0 + uniqueLength = 0 + revisitLength = 0 + payloadMap = {} writer = WARCWriter (output, gzip=True) @@ -48,9 +58,9 @@ def mergeWarc (files, output): 'parameters': {'inputs': files}, } payload = BytesIO (json.dumps (warcinfo, indent=2).encode ('utf-8')) - record = writer.create_warc_record (packageUrl ('warcinfo'), 'warcinfo', + record = writer.create_warc_record ('', 'warcinfo', payload=payload, - warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'}) + warc_headers_dict={'Content-Type': makeContentType (jsonMime, 'utf-8')}) writer.write_record (record) for l in files: @@ -60,13 +70,15 @@ def mergeWarc (files, output): headers = record.rec_headers rid = headers.get_header('WARC-Record-ID') csum = headers.get_header('WARC-Payload-Digest') + length = int (headers.get_header ('Content-Length')) dup = 
payloadMap.get (csum, None) if dup is None: payloadMap[csum] = {'uri': headers.get_header('WARC-Target-URI'), 'id': rid, 'date': headers.get_header('WARC-Date')} unique += 1 + uniqueLength += length else: - logging.debug ('Record {} is duplicate of {}'.format (rid, dup['id'])) + logging.debug (f'Record {rid} is duplicate of {dup["id"]}') # Payload may be identical, but HTTP headers are # (probably) not. Include them. record = writer.create_revisit_record ( @@ -76,10 +88,21 @@ def mergeWarc (files, output): record.rec_headers.add_header ('WARC-Truncated', 'length') record.rec_headers.add_header ('WARC-Refers-To', dup['id']) revisit += 1 + revisitLength += length else: unique += 1 writer.write_record (record) - logging.info ('Wrote {} unique records, {} revisits'.format (unique, revisit)) + json.dump (dict ( + unique=dict (records=unique, bytes=uniqueLength), + revisit=dict (records=revisit, bytes=revisitLength), + ratio=dict ( + records=unique/(unique+revisit), + bytes=uniqueLength/(uniqueLength+revisitLength) + ), + ), + sys.stdout, + cls=StrJsonEncoder) + sys.stdout.write ('\n') def mergeWarcCli(): parser = argparse.ArgumentParser(description='Merge WARCs, reads filenames from stdin.') @@ -97,13 +120,19 @@ def extractScreenshot (): Extract page screenshots from a WARC generated by crocoite into files """ - parser = argparse.ArgumentParser(description='Extract screenshots.') - parser.add_argument('-f', '--force', action='store_true', help='Overwrite existing files') - parser.add_argument('input', type=argparse.FileType ('rb'), help='Input WARC') + parser = argparse.ArgumentParser(description='Extract screenshots from ' + 'WARC, write JSON info to stdout.') + parser.add_argument('-f', '--force', action='store_true', + help='Overwrite existing files') + parser.add_argument('-1', '--one', action='store_true', + help='Only extract the first screenshot into a file named prefix') + parser.add_argument('input', type=argparse.FileType ('rb'), + help='Input WARC') 
parser.add_argument('prefix', help='Output file prefix') args = parser.parse_args() + i = 0 with args.input: for record in ArchiveIterator (args.input): headers = record.rec_headers @@ -112,13 +141,177 @@ def extractScreenshot (): 'X-Crocoite-Screenshot-Y-Offset' not in headers: continue - urlSanitized = headers.get_header('WARC-Target-URI').replace ('/', '_') - xoff = 0 + url = URL (headers.get_header ('WARC-Target-URI')) yoff = int (headers.get_header ('X-Crocoite-Screenshot-Y-Offset')) - outpath = '{}-{}-{}-{}.png'.format (args.prefix, urlSanitized, xoff, yoff) + outpath = f'{args.prefix}{i:05d}.png' if not args.one else args.prefix if args.force or not os.path.exists (outpath): + json.dump ({'file': outpath, 'url': url, 'yoff': yoff}, + sys.stdout, cls=StrJsonEncoder) + sys.stdout.write ('\n') with open (outpath, 'wb') as out: shutil.copyfileobj (record.raw_stream, out) + i += 1 else: - print ('not overwriting {}'.format (outpath)) + print (f'not overwriting {outpath}', file=sys.stderr) + + if args.one: + break + +class Errata: + __slots__ = ('uuid', 'description', 'url', 'affects') + + def __init__ (self, uuid, description, affects, url=None): + self.uuid = uuid + self.description = description + self.url = url + # slightly abusing setuptool’s version parsing/matching here + self.affects = list (parse_requirements(affects)) + + def __contains__ (self, pkg): + """ + Return True if the versions in pkg are affected by this errata + + pkg must be a mapping from project_name to version + """ + matchedAll = [] + for a in self.affects: + haveVersion = pkg.get (a.project_name, None) + matchedAll.append (haveVersion is not None and haveVersion in a) + return all (matchedAll) + + def __repr__ (self): + return f'{self.__class__.__name__}({self.uuid!r}, {self.description!r}, {self.affects!r})' + + @property + def fixable (self): + return getattr (self, 'applyFix', None) is not None + + def toDict (self): + return {'uuid': self.uuid, + 'description': self.description, + 
'url': self.url, + 'affects': list (map (str, self.affects)), + 'fixable': self.fixable} + +class FixableErrata(Errata): + __slots__ = ('stats') + + def __init__ (self, uuid, description, affects, url=None): + super().__init__ (uuid, description, affects, url) + # statistics for fixable erratas + self.stats = dict (records=dict (fixed=0, processed=0)) + + def applyFix (self, record): + raise NotImplementedError () # pragma: no cover + +class ContentTypeErrata (FixableErrata): + def __init__ (self): + super().__init__ ( + uuid='552c13dc-56e5-4539-9ad8-184ccae60930', + description='Content-Type header uses wrong argument name encoding instead of charset.', + url='https://github.com/PromyLOPh/crocoite/issues/19', + affects=['crocoite==1.0.0']) + + def applyFix (self, record): + # XXX: this is ugly. warcio’s write_record replaces any Content-Type + # header we’re setting with this one. But printing rec_headers shows + # the header, not .content_type. + contentType = record.content_type + if '; encoding=' in contentType: + contentType = contentType.replace ('; encoding=', '; charset=') + record.content_type = contentType + self.stats['records']['fixed'] += 1 + + self.stats['records']['processed'] += 1 + return record + +bugs = [ + Errata (uuid='34a176b3-ad3d-430f-a082-68087f304572', + description='Generated by version < 1.0. 
No erratas are supported for this version.', + affects=['crocoite<1.0'], + ), + ContentTypeErrata (), + ] + +def makeReport (fd): + alreadyFixed = set () + + for record in ArchiveIterator (fd): + if record.rec_type == 'warcinfo': + try: + data = json.load (record.raw_stream) + # errata records precceed everything else and indicate which + # ones were fixed already + if data['tool'] == 'crocoite-errata': + alreadyFixed.update (data['parameters']['errata']) + else: + haveVersions = dict ([(pkg['projectName'], parse_version(pkg['version'])) for pkg in data['software']['self']]) + yield from filter (lambda b: haveVersions in b and b.uuid not in alreadyFixed, bugs) + except json.decoder.JSONDecodeError: + pass + +def errataCheck (args): + hasErrata = False + for item in makeReport (args.input): + json.dump (item.toDict (), sys.stdout) + sys.stdout.write ('\n') + sys.stdout.flush () + hasErrata = True + return int (hasErrata) + +def errataFix (args): + errata = args.errata + + with args.input as infd, args.output as outfd: + writer = WARCWriter (outfd, gzip=True) + + warcinfo = { + 'software': getSoftwareInfo (), + 'tool': 'crocoite-errata', # not the name of the cli tool + 'parameters': {'errata': [errata.uuid]}, + } + payload = BytesIO (json.dumps (warcinfo, indent=2).encode ('utf-8')) + record = writer.create_warc_record ('', 'warcinfo', + payload=payload, + warc_headers_dict={'Content-Type': makeContentType (jsonMime, 'utf-8')}) + writer.write_record (record) + + for record in ArchiveIterator (infd): + fixedRecord = errata.applyFix (record) + writer.write_record (fixedRecord) + json.dump (errata.stats, sys.stdout) + sys.stdout.write ('\n') + sys.stdout.flush () + +def uuidToErrata (uuid, onlyFixable=True): + try: + e = next (filter (lambda x: x.uuid == uuid, bugs)) + except StopIteration: + raise argparse.ArgumentTypeError (f'Errata {uuid} does not exist') + if not isinstance (e, FixableErrata): + raise argparse.ArgumentTypeError (f'Errata {uuid} is not fixable') + 
return e + +def errata (): + parser = argparse.ArgumentParser(description=f'Show/fix erratas for WARCs generated by {__package__}.') + parser.add_argument('input', metavar='INPUT', type=argparse.FileType ('rb'), help='Input WARC') + + # XXX: required argument does not work here?! + subparsers = parser.add_subparsers() + + checkparser = subparsers.add_parser('check', help='Show erratas') + checkparser.set_defaults (func=errataCheck) + + fixparser = subparsers.add_parser('fix', help='Fix erratas') + fixparser.add_argument('errata', metavar='UUID', type=uuidToErrata, help='Apply fix for this errata') + fixparser.add_argument('output', metavar='OUTPUT', type=argparse.FileType ('wb'), help='Output WARC') + fixparser.set_defaults (func=errataFix) + + args = parser.parse_args() + + if not hasattr (args, 'func'): + parser.print_usage () + parser.exit () + + return args.func (args) diff --git a/crocoite/util.py b/crocoite/util.py index bd26909..da377a3 100644 --- a/crocoite/util.py +++ b/crocoite/util.py @@ -22,26 +22,30 @@ Random utility functions """ -import random, sys, platform +import random, sys, platform, os, json, urllib +from datetime import datetime import hashlib, pkg_resources -from urllib.parse import urlsplit, urlunsplit -def packageUrl (path): - """ - Create URL for package data stored into WARC - """ - return 'urn:' + __package__ + ':' + path +from yarl import URL + +class StrJsonEncoder (json.JSONEncoder): + """ JSON encoder that turns unknown classes into a string and thus never + fails """ + def default (self, obj): + if isinstance (obj, datetime): + return obj.isoformat () + + # make sure serialization always succeeds + try: + return json.JSONEncoder.default(self, obj) + except TypeError: + return str (obj) async def getFormattedViewportMetrics (tab): layoutMetrics = await tab.Page.getLayoutMetrics () # XXX: I’m not entirely sure which one we should use here - return '{}x{}'.format (layoutMetrics['layoutViewport']['clientWidth'], - 
layoutMetrics['layoutViewport']['clientHeight']) - -def removeFragment (u): - """ Remove fragment from url (i.e. #hashvalue) """ - s = urlsplit (u) - return urlunsplit ((s.scheme, s.netloc, s.path, s.query, '')) + viewport = layoutMetrics['layoutViewport'] + return f"{viewport['clientWidth']}x{viewport['clientHeight']}" def getSoftwareInfo (): """ Get software info for inclusion into warcinfo """ @@ -79,7 +83,7 @@ def getRequirements (dist): pkg = getattr (m, '__package__', None) # is loaded? if pkg in modules: - if f: + if f and os.path.isfile (f): with open (f, 'rb') as fd: contents = fd.read () h = hashlib.new ('sha512') diff --git a/crocoite/warc.py b/crocoite/warc.py index ebc460d..415b487 100644 --- a/crocoite/warc.py +++ b/crocoite/warc.py @@ -24,24 +24,36 @@ Classes writing data to WARC files import json, threading from io import BytesIO -from urllib.parse import urlsplit from datetime import datetime +from http.server import BaseHTTPRequestHandler from warcio.timeutils import datetime_to_iso_date from warcio.warcwriter import WARCWriter from warcio.statusandheaders import StatusAndHeaders +from yarl import URL -from .util import packageUrl +from .util import StrJsonEncoder from .controller import EventHandler, ControllerStart from .behavior import Script, DomSnapshotEvent, ScreenshotEvent -from .browser import Item +from .browser import RequestResponsePair, UnicodeBody + +# the official mimetype for json, according to https://tools.ietf.org/html/rfc8259 +jsonMime = 'application/json' +# mime for javascript, according to https://tools.ietf.org/html/rfc4329#section-7.2 +jsMime = 'application/javascript' + +def makeContentType (mime, charset=None): + """ Create value of Content-Type WARC header with optional charset """ + s = [mime] + if charset: + s.extend (['; charset=', charset]) + return ''.join (s) class WarcHandler (EventHandler): __slots__ = ('logger', 'writer', 'documentRecords', 'log', 'maxLogSize', 'logEncoding', 'warcinfoRecordId') - def __init__ 
(self, fd, - logger): + def __init__ (self, fd, logger): self.logger = logger self.writer = WARCWriter (fd, gzip=True) @@ -68,6 +80,7 @@ class WarcHandler (EventHandler): Adds default WARC headers. """ + assert url is None or isinstance (url, URL) d = {} if self.warcinfoRecordId: @@ -75,8 +88,11 @@ class WarcHandler (EventHandler): d.update (warc_headers_dict) warc_headers_dict = d - record = self.writer.create_warc_record (url, kind, payload=payload, - warc_headers_dict=warc_headers_dict, http_headers=http_headers) + record = self.writer.create_warc_record (str (url) if url else '', + kind, + payload=payload, + warc_headers_dict=warc_headers_dict, + http_headers=http_headers) self.writer.write_record (record) return record @@ -85,72 +101,52 @@ class WarcHandler (EventHandler): logger = self.logger.bind (reqId=item.id) req = item.request - resp = item.response - url = urlsplit (resp['url']) - - path = url.path - if url.query: - path += '?' + url.query - httpHeaders = StatusAndHeaders('{} {} HTTP/1.1'.format (req['method'], path), - item.requestHeaders, protocol='HTTP/1.1', is_http_request=True) - initiator = item.initiator + url = item.url + + path = url.relative().with_fragment(None) + httpHeaders = StatusAndHeaders(f'{req.method} {path} HTTP/1.1', + req.headers, protocol='HTTP/1.1', is_http_request=True) warcHeaders = { - 'X-Chrome-Initiator': json.dumps (initiator), + # required to correlate request with log entries 'X-Chrome-Request-ID': item.id, - 'WARC-Date': datetime_to_iso_date (datetime.utcfromtimestamp (item.chromeRequest['wallTime'])), + 'WARC-Date': datetime_to_iso_date (req.timestamp), } - if item.requestBody is not None: - payload, payloadBase64Encoded = item.requestBody - else: + body = item.request.body + if item.request.hasPostData and body is None: # oops, don’t know what went wrong here - logger.error ('requestBody missing', uuid='ee9adc58-e723-4595-9feb-312a67ead6a0') + logger.error ('requestBody missing', + 
uuid='ee9adc58-e723-4595-9feb-312a67ead6a0') warcHeaders['WARC-Truncated'] = 'unspecified' - payload = None - - if payload: - payload = BytesIO (payload) - warcHeaders['X-Chrome-Base64Body'] = str (payloadBase64Encoded) - record = self.writeRecord (req['url'], 'request', - payload=payload, http_headers=httpHeaders, + else: + body = BytesIO (body) + record = self.writeRecord (url, 'request', + payload=body, http_headers=httpHeaders, warc_headers_dict=warcHeaders) return record.rec_headers['WARC-Record-ID'] def _writeResponse (self, item, concurrentTo): # fetch the body reqId = item.id - rawBody = None - base64Encoded = False - bodyTruncated = None - if item.isRedirect or item.body is None: - # redirects reuse the same request, thus we cannot safely retrieve - # the body (i.e getResponseBody may return the new location’s - # body). No body available means we failed to retrieve it. - bodyTruncated = 'unspecified' - else: - rawBody, base64Encoded = item.body # now the response resp = item.response warcHeaders = { 'WARC-Concurrent-To': concurrentTo, - 'WARC-IP-Address': resp.get ('remoteIPAddress', ''), - 'X-Chrome-Protocol': resp.get ('protocol', ''), - 'X-Chrome-FromDiskCache': str (resp.get ('fromDiskCache')), - 'X-Chrome-ConnectionReused': str (resp.get ('connectionReused')), + # required to correlate request with log entries 'X-Chrome-Request-ID': item.id, - 'WARC-Date': datetime_to_iso_date (datetime.utcfromtimestamp ( - item.chromeRequest['wallTime']+ - (item.chromeResponse['timestamp']-item.chromeRequest['timestamp']))), + 'WARC-Date': datetime_to_iso_date (resp.timestamp), } - if bodyTruncated: - warcHeaders['WARC-Truncated'] = bodyTruncated - else: - warcHeaders['X-Chrome-Base64Body'] = str (base64Encoded) + # conditional WARC headers + if item.remoteIpAddress: + warcHeaders['WARC-IP-Address'] = item.remoteIpAddress - httpHeaders = StatusAndHeaders('{} {}'.format (resp['status'], - item.statusText), item.responseHeaders, - protocol='HTTP/1.1') + # HTTP headers 
+ statusText = resp.statusText or \ + BaseHTTPRequestHandler.responses.get ( + resp.status, ('No status text available', ))[0] + httpHeaders = StatusAndHeaders(f'{resp.status} {statusText}', + resp.headers, protocol='HTTP/1.1') # Content is saved decompressed and decoded, remove these headers blacklistedHeaders = {'transfer-encoding', 'content-encoding'} @@ -160,20 +156,21 @@ class WarcHandler (EventHandler): # chrome sends nothing but utf8 encoded text. Fortunately HTTP # headers take precedence over the document’s <meta>, thus we can # easily override those. - contentType = resp.get ('mimeType') - if contentType: - if not base64Encoded: - contentType += '; charset=utf-8' - httpHeaders.replace_header ('content-type', contentType) - - if rawBody is not None: - httpHeaders.replace_header ('content-length', '{:d}'.format (len (rawBody))) - bodyIo = BytesIO (rawBody) + if resp.mimeType: + charset = 'utf-8' if isinstance (resp.body, UnicodeBody) else None + contentType = makeContentType (resp.mimeType, charset=charset) + httpHeaders.replace_header ('Content-Type', contentType) + + # response body + body = resp.body + if body is None: + warcHeaders['WARC-Truncated'] = 'unspecified' else: - bodyIo = BytesIO () + httpHeaders.replace_header ('Content-Length', str (len (body))) + body = BytesIO (body) - record = self.writeRecord (resp['url'], 'response', - warc_headers_dict=warcHeaders, payload=bodyIo, + record = self.writeRecord (item.url, 'response', + warc_headers_dict=warcHeaders, payload=body, http_headers=httpHeaders) if item.resourceType == 'Document': @@ -182,32 +179,38 @@ class WarcHandler (EventHandler): def _writeScript (self, item): writer = self.writer encoding = 'utf-8' - self.writeRecord (packageUrl ('script/{}'.format (item.path)), 'metadata', + # XXX: yes, we’re leaking information about the user here, but this is + # the one and only source URL of the scripts. 
+ uri = URL(f'file://{item.abspath}') if item.path else None + self.writeRecord (uri, 'resource', payload=BytesIO (str (item).encode (encoding)), - warc_headers_dict={'Content-Type': 'application/javascript; charset={}'.format (encoding)}) + warc_headers_dict={ + 'Content-Type': makeContentType (jsMime, encoding), + 'X-Crocoite-Type': 'script', + }) def _writeItem (self, item): - if item.failed: - # should have been handled by the logger already - return - + assert item.request concurrentTo = self._writeRequest (item) - self._writeResponse (item, concurrentTo) + # items that failed loading don’t have a response + if item.response: + self._writeResponse (item, concurrentTo) def _addRefersTo (self, headers, url): refersTo = self.documentRecords.get (url) if refersTo: headers['WARC-Refers-To'] = refersTo else: - self.logger.error ('No document record found for {}'.format (url)) + self.logger.error (f'No document record found for {url}') return headers def _writeDomSnapshot (self, item): writer = self.writer - warcHeaders = {'X-DOM-Snapshot': str (True), + warcHeaders = { + 'X-Crocoite-Type': 'dom-snapshot', 'X-Chrome-Viewport': item.viewport, - 'Content-Type': 'text/html; charset=utf-8', + 'Content-Type': makeContentType ('text/html', 'utf-8') } self._addRefersTo (warcHeaders, item.url) @@ -218,53 +221,53 @@ class WarcHandler (EventHandler): def _writeScreenshot (self, item): writer = self.writer - warcHeaders = {'Content-Type': 'image/png', - 'X-Crocoite-Screenshot-Y-Offset': str (item.yoff)} + warcHeaders = { + 'Content-Type': makeContentType ('image/png'), + 'X-Crocoite-Screenshot-Y-Offset': str (item.yoff), + 'X-Crocoite-Type': 'screenshot', + } self._addRefersTo (warcHeaders, item.url) self.writeRecord (item.url, 'conversion', payload=BytesIO (item.data), warc_headers_dict=warcHeaders) - def _writeControllerStart (self, item): - payload = BytesIO (json.dumps (item.payload, indent=2).encode ('utf-8')) + def _writeControllerStart (self, item, encoding='utf-8'): + 
payload = BytesIO (json.dumps (item.payload, indent=2, cls=StrJsonEncoder).encode (encoding)) writer = self.writer - warcinfo = self.writeRecord (packageUrl ('warcinfo'), 'warcinfo', - warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'}, + warcinfo = self.writeRecord (None, 'warcinfo', + warc_headers_dict={'Content-Type': makeContentType (jsonMime, encoding)}, payload=payload) self.warcinfoRecordId = warcinfo.rec_headers['WARC-Record-ID'] def _flushLogEntries (self): - writer = self.writer - self.log.seek (0) - # XXX: we should use the type continuation here - self.writeRecord (packageUrl ('log'), 'resource', payload=self.log, - warc_headers_dict={'Content-Type': 'text/plain; encoding={}'.format (self.logEncoding)}) - self.log = BytesIO () + if self.log.tell () > 0: + writer = self.writer + self.log.seek (0) + warcHeaders = { + 'Content-Type': makeContentType (jsonMime, self.logEncoding), + 'X-Crocoite-Type': 'log', + } + self.writeRecord (None, 'metadata', payload=self.log, + warc_headers_dict=warcHeaders) + self.log = BytesIO () def _writeLog (self, item): """ Handle log entries, called by .logger.WarcHandlerConsumer only """ self.log.write (item.encode (self.logEncoding)) self.log.write (b'\n') - # instead of locking, check we’re running in the main thread - if self.log.tell () > self.maxLogSize and \ - threading.current_thread () is threading.main_thread (): + if self.log.tell () > self.maxLogSize: self._flushLogEntries () route = {Script: _writeScript, - Item: _writeItem, + RequestResponsePair: _writeItem, DomSnapshotEvent: _writeDomSnapshot, ScreenshotEvent: _writeScreenshot, ControllerStart: _writeControllerStart, } - def push (self, item): - processed = False + async def push (self, item): for k, v in self.route.items (): if isinstance (item, k): v (self, item) - processed = True break - if not processed: - self.logger.debug ('unknown event {}'.format (repr (item))) - diff --git a/doc/_ext/clicklist.py b/doc/_ext/clicklist.py new file mode 
100644 index 0000000..a69452c --- /dev/null +++ b/doc/_ext/clicklist.py @@ -0,0 +1,45 @@ +""" +Render click.yaml config file into human-readable list of supported sites +""" + +import pkg_resources, yaml +from docutils import nodes +from docutils.parsers.rst import Directive +from yarl import URL + +class ClickList (Directive): + def run(self): + # XXX: do this once only + fd = pkg_resources.resource_stream ('crocoite', 'data/click.yaml') + config = list (yaml.safe_load_all (fd)) + + l = nodes.definition_list () + for site in config: + urls = set () + v = nodes.definition () + vl = nodes.bullet_list () + v += vl + for s in site['selector']: + i = nodes.list_item () + i += nodes.paragraph (text=s['description']) + vl += i + urls.update (map (lambda x: URL(x).with_path ('/'), s.get ('urls', []))) + + item = nodes.definition_list_item () + term = ', '.join (map (lambda x: x.host, urls)) if urls else site['match'] + k = nodes.term (text=term) + item += k + + item += v + l += item + return [l] + +def setup(app): + app.add_directive ("clicklist", ClickList) + + return { + 'version': '0.1', + 'parallel_read_safe': True, + 'parallel_write_safe': True, + } + diff --git a/doc/conf.py b/doc/conf.py new file mode 100644 index 0000000..8336c27 --- /dev/null +++ b/doc/conf.py @@ -0,0 +1,44 @@ +# -*- coding: utf-8 -*- +import os, sys + +# -- Project information ----------------------------------------------------- + +project = 'crocoite' +copyright = '2019 crocoite contributors' +author = 'crocoite contributors' + +# -- General configuration --------------------------------------------------- + +sys.path.append(os.path.abspath("./_ext")) +extensions = [ + 'sphinx.ext.viewcode', + 'sphinx.ext.autodoc', + 'clicklist', +] + +# Add any paths that contain templates here, relative to this directory. 
+templates_path = ['_templates'] + +source_suffix = '.rst' +master_doc = 'index' +language = 'en' +exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] +pygments_style = 'tango' + +# -- Options for HTML output ------------------------------------------------- + +html_theme = 'alabaster' +html_theme_options = { + "description": "Preservation for the modern web", + "github_user": "PromyLOPh", + "github_repo": "crocoite", + "travis_button": True, + "github_button": True, + "codecov_button": True, + "fixed_sidebar": True, +} +#html_static_path = ['_static'] +html_sidebars = { + '**': ['about.html', 'navigation.html', 'searchbox.html'], +} + diff --git a/doc/develop.rst b/doc/develop.rst new file mode 100644 index 0000000..801ab21 --- /dev/null +++ b/doc/develop.rst @@ -0,0 +1,39 @@ +Development +----------- + +Generally crocoite provides reasonable defaults for Google Chrome via +:py:mod:`crocoite.devtools`. When debugging this software it might be necessary +to open a non-headless instance of the browser by running + +.. code:: bash + + google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs + +and then passing the option :option:`--browser=http://localhost:9222` to +:program:`crocoite-single`. This allows human intervention through the +browser’s builtin console. + +Release guide +^^^^^^^^^^^^^ + +crocoite uses `semantic versioning`_. To create a new release, bump the version +number in ``setup.py`` according to the linked guide, create distribution +packages:: + + python setup.py sdist bdist_wheel + +Verify them:: + + twine check dist/* + +Try to install and use them in a separate sandbox. And finally sign and upload +a new version to pypi_:: + + gpg --detach-sign --armor dist/*.tar.gz + twine upload dist/* + +Then update the documentation using :program:`sphing-doc` and upload it as well. + +.. _semantic versioning: https://semver.org/spec/v2.0.0.html +.. 
_pypi: https://pypi.org + diff --git a/doc/index.rst b/doc/index.rst new file mode 100644 index 0000000..53f5f77 --- /dev/null +++ b/doc/index.rst @@ -0,0 +1,36 @@ +crocoite +======== + +Preservation for the modern web, powered by `headless Google +Chrome`_. + +.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome + +.. toctree:: + :maxdepth: 1 + :hidden: + + usage.rst + plugins.rst + rationale.rst + develop.rst + related.rst + +Features +-------- + +Google Chrome-powered + HTML renderer, JavaScript engine and network stack, supporting modern web + technologies and protocols +WARC output + Includes all network requests made by the browser +Site interaction + :ref:`Auto-expand on-click content <click>`, infinite-scrolling +DOM snapshot + Contains the page’s state, renderable without JavaScript +Image screenshot + Entire page +Machine-readable interface + Easy integration into custom tools/scripts + + diff --git a/doc/plugins.rst b/doc/plugins.rst new file mode 100644 index 0000000..062e1bf --- /dev/null +++ b/doc/plugins.rst @@ -0,0 +1,16 @@ +Plugins +======= + +crocoite comes with plug-ins that modify loaded sites’ or interact with them. + +.. _click: + +click +----- + +The following sites are currently supported. Note this is an ongoing +battle against layout changes and thus older software versions will stop +working very soon. + +.. clicklist:: + diff --git a/doc/rationale.rst b/doc/rationale.rst new file mode 100644 index 0000000..f37db7c --- /dev/null +++ b/doc/rationale.rst @@ -0,0 +1,76 @@ +Rationale +--------- + +Most modern websites depend heavily on executing code, usually JavaScript, on +the user’s machine. They also make use of new and emerging Web technologies +like HTML5, WebSockets, service workers and more. Even worse from the +preservation point of view, they also require some form of user interaction to +dynamically load more content (infinite scrolling, dynamic comment loading, +etc). 
+ +The naive approach of fetching a HTML page, parsing it and extracting +links to referenced resources therefore is not sufficient to create a faithful +snapshot of these web applications. A full browser, capable of running scripts and +providing modern Web API’s is absolutely required for this task. Thankfully +Google Chrome runs without a display (headless mode) and can be controlled by +external programs, allowing them to navigate and extract or inject data. +This section describes the solutions crocoite offers and explains design +decisions taken. + +crocoite captures resources by listening to Chrome’s `network events`_ and +requesting the response body using `Network.getResponseBody`_. This approach +has caveats: The original HTTP requests and responses, as sent over the wire, +are not available. They are reconstructed from parsed data. The character +encoding for text documents is changed to UTF-8. And the content body of HTTP +redirects cannot be retrieved due to a race condition. + +.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network +.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody + +But at the same time it allows crocoite to rely on Chrome’s well-tested network +stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as +transport protocols like SSL and QUIC. Depending on Chrome also eliminates the +need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL +traffic and present a fake certificate to the browser in order to store the +transmitted content. + +.. _warcprox: https://github.com/internetarchive/warcprox + +WARC records generated by crocoite therefore are an abstract view on the +resource they represent and not necessarily the data sent over the wire. A URL +fetched with HTTP/2 for example will still result in a HTTP/1.1 +request/response pair in the WARC file. 
This may be undesireable from +an archivist’s point of view (“save the data exactly like we received it”). But +this level of abstraction is inevitable when dealing with more than one +protocol. + +crocoite also interacts with and therefore alters the grabbed websites. It does +so by injecting `behavior scripts`_ into the site. Typically these are written +in JavaScript, because interacting with a page is easier this way. These +scripts then perform different tasks: Extracting targets from visible +hyperlinks, clicking buttons or scrolling the website to to load more content, +as well as taking a static screenshot of ``<canvas>`` elements for the DOM +snapshot (see below). + +.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data + +Replaying archived WARC’s can be quite challenging and might not be possible +with current technology (or even at all): + +- Some sites request assets based on screen resolution, pixel ratio and + supported image formats (webp). Replaying those with different parameters + won’t work, since assets for those are missing. Example: missguided.com. +- Some fetch different scripts based on user agent. Example: youtube.com. +- Requests containing randomly generated JavaScript callback function names + won’t work. Example: weather.com. +- Range requests (Range: bytes=1-100) are captured as-is, making playback + difficult + +crocoite offers two methods to work around these issues. Firstly it can save a +DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus +``<script>`` tags after the site has been fully loaded and thus can be +displayed without executing scripts. Obviously JavaScript-based navigation +does not work any more. Secondly it also saves a screenshot of the full page, +so even if future browsers cannot render and display the stored HTML a fully +rendered version of the website can be replayed instead. 
+ diff --git a/doc/related.rst b/doc/related.rst new file mode 100644 index 0000000..62e2569 --- /dev/null +++ b/doc/related.rst @@ -0,0 +1,14 @@ +Related projects +---------------- + +brozzler_ + Uses Google Chrome as well, but intercepts traffic using a proxy. Supports + distributed crawling and immediate playback. +Squidwarc_ + Communicates with headless Google Chrome and uses the Network API to + retrieve requests like crocoite. Supports recursive crawls and page + scrolling, but neither custom JavaScript nor distributed crawling. + +.. _brozzler: https://github.com/internetarchive/brozzler +.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc + diff --git a/doc/usage.rst b/doc/usage.rst new file mode 100644 index 0000000..34a3e7b --- /dev/null +++ b/doc/usage.rst @@ -0,0 +1,162 @@ +Usage +----- + +Quick start using pywb_, expects Google Chrome to be installed already: + +.. code:: bash + + pip install crocoite pywb + crocoite http://example.com/ example.com.warc.gz + wb-manager init test && wb-manager add test example.com.warc.gz + wayback & + $BROWSER http://localhost:8080 + +.. _pywb: https://github.com/ikreymer/pywb + +It is recommended to install at least Micrsoft’s Corefonts_ as well as DejaVu_, +Liberation_ or a similar font family covering a wide range of character sets. +Otherwise page screenshots may be unusable due to missing glyphs. + +.. _Corefonts: http://corefonts.sourceforge.net/ +.. _DejaVu: https://dejavu-fonts.github.io/ +.. _Liberation: https://pagure.io/liberation-fonts + +Recursion +^^^^^^^^^ + +.. program:: crocoite + +By default crocoite will only retrieve the URL specified on the command line. +However it can follow links as well. There’s currently two recursion strategies +available, depth- and prefix-based. + +.. code:: bash + + crocoite -r 1 https://example.com/ example.com.warc.gz + +will retrieve ``example.com`` and all pages directly refered to by it. 
+Increasing the number increases the depth, so a value of :samp:`2` would first grab +``example.com``, queue all pages linked there as well as every reference on +each of those pages. + +On the other hand + +.. code:: bash + + crocoite -r prefix https://example.com/dir/ example.com.warc.gz + +will retrieve the URL specified and all pages referenced which have the same +URL prefix. There trailing slash is significant. Without it crocoite would also +grab ``/dir-something`` or ``/dir.html`` for example. + +If an output file template is used each page is written to an individual file. For example + +.. code:: bash + + crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz' + +will write one file page page to files like +:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique to +each page of a single job and should always be used. + +When running a recursive job, increasing the concurrency (i.e. how many pages +are fetched at the same time) can speed up the process. For example you can +pass :option:`-j` :samp:`4` to retrieve four pages at the same time. Keep in mind +that each process starts a full browser that requires a lot of resources (one +to two GB of RAM and one or two CPU cores). + +Customizing +^^^^^^^^^^^ + +.. program:: crocoite-single + +Under the hood :program:`crocoite` starts one instance of +:program:`crocoite-single` to fetch each page. You can customize its options by +appending a command template like this: + +.. code:: bash + + crocoite -r prefix https://example.com example.com.warc.gz -- \ + crocoite-single --timeout 5 -k '{url}' '{dest}' + +This reduces the global timeout to 5 seconds and ignores TLS errors. If an +option is prefixed with an exclamation mark (``!``) it will not be expanded. +This is useful for passing :option:`--warcinfo`, which expects JSON-encoded data. + +Command line options +^^^^^^^^^^^^^^^^^^^^ + +Below is a list of all command line arguments available: + +.. 
program:: crocoite + +crocoite +++++++++ + +Front-end with recursion support and simple job management. + +.. option:: -j N, --concurrency N + + Maximum number of concurrent fetch jobs. + +.. option:: -r POLICY, --recursion POLICY + + Enables recursion based on POLICY, which can be a positive integer + (recursion depth) or the string :kbd:`prefix`. + +.. option:: --tempdir DIR + + Directory for temporary WARC files. + +.. program:: crocoite-single + +crocoite-single ++++++++++++++++ + +Back-end to fetch a single page. + +.. option:: -b SET-COOKIE, --cookie SET-COOKIE + + Add cookie to browser’s cookie jar. This option always *appends* cookies, + replacing those provided by :option:`-c`. + + .. versionadded:: 1.1 + +.. option:: -c FILE, --cookie-jar FILE + + Load cookies from FILE. :program:`crocoite` provides a default cookie file, + which contains cookies to, for example, circumvent age restrictions. This + option *replaces* that default file. + + .. versionadded:: 1.1 + +.. option:: --idle-timeout SEC + + Time after which a page is considered “idle”. + +.. option:: -k, --insecure + + Allow insecure connections, i.e. self-signed ore expired HTTPS certificates. + +.. option:: --timeout SEC + + Global archiving timeout. + + +.. option:: --warcinfo JSON + + Inject additional JSON-encoded information into the resulting WARC. + +IRC bot +^^^^^^^ + +A simple IRC bot (“chromebot”) is provided with the command :program:`crocoite-irc`. 
+It reads its configuration from a config file like the example provided in +:file:`contrib/chromebot.json` and supports the following commands: + +a <url> -j <concurrency> -r <policy> -k -b <set-cookie> + Archive <url> with <concurrency> processes according to recursion <policy> +s <uuid> + Get job status for <uuid> +r <uuid> + Revoke or abort running job with <uuid> @@ -4,3 +4,5 @@ test=pytest addopts=--cov-report=html --cov-report=xml --cov=crocoite --cov-config=setup.cfg [coverage:run] branch=True +[build_sphinx] +builder=dirhtml @@ -2,13 +2,15 @@ from setuptools import setup setup( name='crocoite', - version='0.1.0', + version='1.1.1', author='Lars-Dominik Braun', author_email='lars+crocoite@6xq.net', + url='https://6xq.net/crocoite/', packages=['crocoite'], license='LICENSE.txt', description='Save website to WARC using Google Chrome.', long_description=open('README.rst').read(), + long_description_content_type='text/x-rst', install_requires=[ 'warcio', 'html5lib>=0.999999999', @@ -17,20 +19,39 @@ setup( 'websockets', 'aiohttp', 'PyYAML', + 'yarl>=1.4,<1.5', + 'multidict', ], + extras_require={ + 'manhole': ['manhole>=1.6'], + }, entry_points={ 'console_scripts': [ - 'crocoite-grab = crocoite.cli:single', - 'crocoite-recursive = crocoite.cli:recursive', + # the main executable + 'crocoite = crocoite.cli:recursive', + # backend helper + 'crocoite-single = crocoite.cli:single', + # irc bot and dashboard 'crocoite-irc = crocoite.cli:irc', 'crocoite-irc-dashboard = crocoite.cli:dashboard', + # misc tools 'crocoite-merge-warc = crocoite.tools:mergeWarcCli', 'crocoite-extract-screenshot = crocoite.tools:extractScreenshot', + 'crocoite-errata = crocoite.tools:errata', ], }, package_data={ 'crocoite': ['data/*'], }, - setup_requires=["pytest-runner"], - tests_require=["pytest", 'pytest-asyncio', 'pytest-cov'], + setup_requires=['pytest-runner'], + tests_require=["pytest", 'pytest-asyncio', 'pytest-cov', 'hypothesis'], + python_requires='>=3.6', + classifiers=[ + 
'Development Status :: 5 - Production/Stable', + 'License :: OSI Approved :: MIT License', + 'Operating System :: POSIX', + 'Programming Language :: Python :: 3.6', + 'Programming Language :: Python :: 3.7', + 'Topic :: Internet :: WWW/HTTP', + ], )