Diffstat (limited to 'doc')
-rw-r--r-- doc/develop.rst | 16
-rw-r--r-- doc/usage.rst | 64
2 files changed, 68 insertions, 12 deletions
diff --git a/doc/develop.rst b/doc/develop.rst
index 8a8e8bd..801ab21 100644
--- a/doc/develop.rst
+++ b/doc/develop.rst
@@ -1,19 +1,17 @@
 Development
 -----------
 
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
+Generally crocoite provides reasonable defaults for Google Chrome via
+:py:mod:`crocoite.devtools`. When debugging this software it might be necessary
+to open a non-headless instance of the browser by running
 
 .. code:: bash
 
     google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
 
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s builtin
-console.
-
-.. _devtools module: crocoite/devtools.py
+and then passing the option :option:`--browser=http://localhost:9222` to
+:program:`crocoite-single`. This allows human intervention through the
+browser’s builtin console.
 
 Release guide
 ^^^^^^^^^^^^^
@@ -34,7 +32,7 @@ a new version to pypi_::
     gpg --detach-sign --armor dist/*.tar.gz
     twine upload dist/*
 
-Then update the documentation using ``sphing-doc`` and upload it as well.
+Then update the documentation using :program:`sphing-doc` and upload it as well.
 
 .. _semantic versioning: https://semver.org/spec/v2.0.0.html
 .. _pypi: https://pypi.org
diff --git a/doc/usage.rst b/doc/usage.rst
index b070f5c..9bba693 100644
--- a/doc/usage.rst
+++ b/doc/usage.rst
@@ -21,14 +21,72 @@ Otherwise page screenshots may be unusable due to missing glyphs.
 .. _DejaVu: https://dejavu-fonts.github.io/
 .. _Liberation: https://pagure.io/liberation-fonts
 
+Recursion
+^^^^^^^^^
+
+By default crocoite will only retrieve the URL specified on the command line.
+However it can follow links as well. There’s currently two recursion strategies
+available, depth- and prefix-based.
+
+.. code:: bash
+
+    crocoite -r 1 https://example.com/ example.com.warc.gz
+
+will retrieve ``example.com`` and all pages directly refered to by it.
+Increasing the number increases the depth, so a value of :samp:`2` would first grab
+``example.com``, queue all pages linked there as well as every reference on
+each of those pages.
+
+On the other hand
+
+.. code:: bash
+
+    crocoite -r prefix https://example.com/dir/ example.com.warc.gz
+
+will retrieve the URL specified and all pages referenced which have the same
+URL prefix. There trailing slash is significant. Without it crocoite would also
+grab ``/dir-something`` or ``/dir.html`` for example.
+
+If an output file template is used each page is written to an individual file. For example
+
+.. code:: bash
+
+    crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'
+
+will write one file page page to files like
+:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique to
+each page of a single job and should always be used.
+
+When running a recursive job, increasing the concurrency (i.e. how many pages
+are fetched at the same time) can speed up the process. For example you can
+pass :option:`-j 4` to retrieve four pages at the same time. Keep in mind that each
+process starts a full browser that requires a lot of resources (one to two GB
+of RAM and one or two CPU cores).
+
+Customizing
+^^^^^^^^^^^
+
+Under the hood crocoite starts one instance of :program:`crocoite-single` to fetch
+each page. You can customize its options by appending a command template like
+this:
+
+.. code:: bash
+
+    crocoite -r prefix https://example.com example.com.warc.gz -- \
+        crocoite-single --timeout 5 -k '{url}' '{dest}'
+
+This reduces the global timeout to 5 seconds and ignores TLS errors. If an
+option is prefixed with an exclamation mark (``!``) it will not be expanded.
+This is useful for passing :option:`--warcinfo`, which expects JSON-encoded data.
+
 IRC bot
 ^^^^^^^
 
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
+A simple IRC bot (“chromebot”) is provided with the command :program:`crocoite-irc`.
 It reads its configuration from a config file like the example provided in
-``contrib/chromebot.json`` and supports the following commands:
+:file:`contrib/chromebot.json` and supports the following commands:
 
-a <url> -j <concurrency> -r <policy>
+a <url> -j <concurrency> -r <policy> -k
     Archive <url> with <concurrency> processes according to recursion <policy>
 s <uuid>
     Get job status for <uuid>