author: Lars-Dominik Braun <lars@6xq.net>, 2019-07-06 15:31:53 +0200
committer: Lars-Dominik Braun <lars@6xq.net>, 2019-07-06 15:31:53 +0200
commit: 46d6c7f296e8b29db307cd180f3743e36b29ffe3
tree: b1758c501904f6d826e719dc24564adaf701238b /doc
parent: a8a0de408cbb00c0dbb19acd0d26eef53f778ec0
Improve documentation
Diffstat (limited to 'doc')
-rw-r--r-- doc/develop.rst | 16
-rw-r--r-- doc/usage.rst | 64
2 files changed, 68 insertions, 12 deletions
diff --git a/doc/develop.rst b/doc/develop.rst
index 8a8e8bd..801ab21 100644
--- a/doc/develop.rst
+++ b/doc/develop.rst
@@ -1,19 +1,17 @@
Development
-----------
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
+Generally crocoite provides reasonable defaults for Google Chrome via
+:py:mod:`crocoite.devtools`. When debugging this software it might be necessary
+to open a non-headless instance of the browser by running
.. code:: bash
google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s builtin
-console.
-
-.. _devtools module: crocoite/devtools.py
+and then passing the option :option:`--browser=http://localhost:9222` to
+:program:`crocoite-single`. This allows human intervention through the
+browser’s builtin console.
Release guide
^^^^^^^^^^^^^
@@ -34,7 +32,7 @@ a new version to pypi_::
gpg --detach-sign --armor dist/*.tar.gz
twine upload dist/*
-Then update the documentation using ``sphing-doc`` and upload it as well.
+Then update the documentation using :program:`sphinx-doc` and upload it as well.
.. _semantic versioning: https://semver.org/spec/v2.0.0.html
.. _pypi: https://pypi.org
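Note on the debugging hunk in doc/develop.rst above: before passing ``--browser=http://localhost:9222`` to :program:`crocoite-single`, it can help to confirm that the non-headless Chrome instance is actually listening. The following Python sketch is not part of crocoite. It assumes Chrome's DevTools HTTP endpoint ``/json/version`` is reachable on the chosen port, and the helper names ``version_url`` and ``probe_browser`` are hypothetical, introduced only for illustration.

```python
import json
import urllib.request
from urllib.parse import urljoin


def version_url(browser: str) -> str:
    """Build the URL of Chrome's DevTools version endpoint (hypothetical helper)."""
    # Add a trailing slash so urljoin appends to the base instead of
    # replacing its last path segment.
    base = browser if browser.endswith("/") else browser + "/"
    return urljoin(base, "json/version")


def probe_browser(browser: str = "http://localhost:9222", timeout: float = 2.0) -> dict:
    """Return the metadata Chrome reports on its debugging port.

    Raises urllib.error.URLError if nothing is listening at the address.
    """
    with urllib.request.urlopen(version_url(browser), timeout=timeout) as resp:
        return json.load(resp)


# Example (requires a Chrome started with --remote-debugging-port=9222):
#   info = probe_browser()
#   print(info.get("Browser"), info.get("webSocketDebuggerUrl"))
```

If the probe raises an error, the browser was most likely started without ``--remote-debugging-port`` or on a different port than the one given to ``--browser``.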
diff --git a/doc/usage.rst b/doc/usage.rst
index b070f5c..9bba693 100644
--- a/doc/usage.rst
+++ b/doc/usage.rst
@@ -21,14 +21,72 @@ Otherwise page screenshots may be unusable due to missing glyphs.
.. _DejaVu: https://dejavu-fonts.github.io/
.. _Liberation: https://pagure.io/liberation-fonts
+Recursion
+^^^^^^^^^
+
+By default crocoite will only retrieve the URL specified on the command line.
+However, it can follow links as well. There are currently two recursion strategies
+available, depth- and prefix-based.
+
+.. code:: bash
+
+ crocoite -r 1 https://example.com/ example.com.warc.gz
+
+will retrieve ``example.com`` and all pages directly referred to by it.
+Increasing the number increases the depth, so a value of :samp:`2` would first grab
+``example.com``, queue all pages linked there as well as every reference on
+each of those pages.
+
+On the other hand
+
+.. code:: bash
+
+ crocoite -r prefix https://example.com/dir/ example.com.warc.gz
+
+will retrieve the URL specified and all pages referenced which have the same
+URL prefix. The trailing slash is significant. Without it crocoite would also
+grab ``/dir-something`` or ``/dir.html`` for example.
+
+If an output file template is used each page is written to an individual file. For example
+
+.. code:: bash
+
+ crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'
+
+will write one file per page, with names like
+:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique to
+each page of a single job and should always be used.
+
+When running a recursive job, increasing the concurrency (i.e. how many pages
+are fetched at the same time) can speed up the process. For example you can
+pass :option:`-j 4` to retrieve four pages at the same time. Keep in mind that each
+process starts a full browser that requires a lot of resources (one to two GB
+of RAM and one or two CPU cores).
+
+Customizing
+^^^^^^^^^^^
+
+Under the hood crocoite starts one instance of :program:`crocoite-single` to fetch
+each page. You can customize its options by appending a command template like
+this:
+
+.. code:: bash
+
+ crocoite -r prefix https://example.com example.com.warc.gz -- \
+ crocoite-single --timeout 5 -k '{url}' '{dest}'
+
+This reduces the global timeout to 5 seconds and ignores TLS errors. If an
+option is prefixed with an exclamation mark (``!``) it will not be expanded.
+This is useful for passing :option:`--warcinfo`, which expects JSON-encoded data.
+
IRC bot
^^^^^^^
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
+A simple IRC bot (“chromebot”) is provided with the command :program:`crocoite-irc`.
It reads its configuration from a config file like the example provided in
-``contrib/chromebot.json`` and supports the following commands:
+:file:`contrib/chromebot.json` and supports the following commands:
-a <url> -j <concurrency> -r <policy>
+a <url> -j <concurrency> -r <policy> -k
Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
Get job status for <uuid>