summaryrefslogtreecommitdiff
path: root/README.rst
diff options
context:
space:
mode:
authorLars-Dominik Braun <lars@6xq.net>2018-09-29 16:51:57 +0200
committerLars-Dominik Braun <lars@6xq.net>2018-09-29 16:51:57 +0200
commit07c34b2d004f16798c17ed479679a511c6bd2f29 (patch)
treed3b0696a23ed155ab5fad067f6ab003166343e77 /README.rst
parentcbcdde65aa667369b0890a042e5b44d6b1e377aa (diff)
downloadcrocoite-07c34b2d004f16798c17ed479679a511c6bd2f29.tar.gz
crocoite-07c34b2d004f16798c17ed479679a511c6bd2f29.tar.bz2
crocoite-07c34b2d004f16798c17ed479679a511c6bd2f29.zip
Add documentation
For -recursive and -irc
Diffstat (limited to 'README.rst')
-rw-r--r--README.rst35
1 files changed, 35 insertions, 0 deletions
diff --git a/README.rst b/README.rst
index b1fce2c..61c5b04 100644
--- a/README.rst
+++ b/README.rst
@@ -17,10 +17,12 @@ The following dependencies must be present to run crocoite:
- pychrome_
- warcio_
- html5lib_
+- bottom_ (IRC client)
.. _pychrome: https://github.com/fate0/pychrome
.. _warcio: https://github.com/webrecorder/warcio
.. _html5lib: https://github.com/html5lib/html5lib-python
+.. _bottom: https://github.com/numberoverzero/bottom
It is recommended to prepare a virtualenv and let pip handle the dependency
resolution for Python packages instead:
@@ -119,6 +121,39 @@ does not work any more. Secondly it also saves a screenshot of the full page,
so even if future browsers cannot render and display the stored HTML a fully
rendered version of the website can be replayed instead.
+Advanced usage
+--------------
+
+crocoite is built with the Unix philosophy (“do one thing and do it well”) in
+mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion
+use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
+It can either recurse a maximum number of levels or grab all pages with the
+same prefix as the start URL:
+
+.. code:: bash
+
+ crocoite-recursive --policy prefix http://www.example.com/dir/ output
+
+will save all pages in ``/dir/`` and below to individual files in the output
+directory ``output``. You can customize the command used to grab individual
+pages by appending it after ``output``. This way distributed grabs (ssh to a
+different machine and execute the job there, queue the command with Slurm, …)
+are possible.
+
+IRC bot
+^^^^^^^
+
+A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
+It reads its configuration from a config file like the example provided in
+``contrib/chromebot.ini`` and supports the following commands:
+
+a <url> -j <concurrency> -r <policy>
+ Archive <url> with <concurrency> processes according to recursion <policy>
+s <uuid>
+ Get job status for <uuid>
+r <uuid>
+ Revoke or abort running job with <uuid>
+
Related projects
----------------