path: root/crocoite/tools.py
Age  Commit message  Author  Files  Lines
2019-07-28  Fix wrong Content-Type header parameter  Lars-Dominik Braun  1  -12/+102
In line with HTTP, the “encoding” parameter should be called “charset”. Fixable errata item created. Fixes issue #19.
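The difference between the two parameter names can be illustrated with the standard library's header parser. This is a minimal sketch, not the code from the commit: `content_type_header` and `parse_charset` are hypothetical helper names introduced here for illustration.

```python
from email.message import Message


def content_type_header(mime_type: str, charset: str) -> str:
    """Build a Content-Type value using the standard "charset" parameter."""
    return f'{mime_type}; charset={charset}'


def parse_charset(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_param() reads parameters of the Content-Type header by default.
    return msg.get_param('charset')


good = content_type_header('text/html', 'utf-8')
bad = 'text/html; encoding=utf-8'  # the non-standard parameter name

print(parse_charset(good))
print(parse_charset(bad))
```

A consumer that looks for `charset` finds nothing in the second value, which is presumably why the misnamed parameter warranted an errata item.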
2019-07-02  Stabilize WARC headers  Lars-Dominik Braun  1  -2/+2
In preparation for 1.0 release:
- Correct mime types
- Add X-Crocoite-Type, so logs, scripts, dom-snapshots and screenshots can be identified easily
- Remove random WARC headers like X-Chrome-Initiator. We don’t want to maintain those.
- Remove non-standard urn-based package URLs. Can’t use them without a urn-registration.
2019-06-28  tools: Add missing \n to JSON output  Lars-Dominik Braun  1  -0/+1
Fixes 76811bd3f0b3fc8688939e31fdab2c71c89cc75b
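For context, `json.dump` does not emit a trailing newline, so consecutive records run together unless one is written explicitly. The sketch below shows the general pattern under that assumption; `dump_line` is a hypothetical helper, not a function from tools.py.

```python
import io
import json


def dump_line(obj, fd):
    """Write obj as one JSON document followed by a newline."""
    json.dump(obj, fd)
    fd.write('\n')  # json.dump itself never adds this


buf = io.StringIO()
dump_line({'a': 1}, buf)
dump_line({'b': 2}, buf)

# With the newline in place, each line parses on its own.
records = [json.loads(line) for line in buf.getvalue().splitlines()]
```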
2019-06-27  extract-screenshot: Allow extracting only the first screenshot  Lars-Dominik Braun  1  -1/+6
2019-06-27  merge: Dump machine-readable info  Lars-Dominik Braun  1  -2/+18
2018-12-31  extract-screenshot: Remove URL from filename  Lars-Dominik Braun  1  -8/+19
URLs can get quite long, overflowing the file name length limit. Instead use sequential filenames and output metadata to stdout.
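The idea of decoupling the filename from the URL can be sketched as follows. The function name, the filename pattern and the metadata keys are assumptions made for this illustration; the actual tool may use different ones.

```python
def plan_screenshot_files(urls, prefix):
    """Yield (filename, metadata) pairs with short, sequential filenames.

    The URL is kept in the metadata only, so its length never affects
    the length of the filename.
    """
    for n, url in enumerate(urls):
        filename = f'{prefix}{n:05d}.png'
        yield filename, {'file': filename, 'url': url}


long_url = 'http://example.com/' + 'a' * 1000
plan = list(plan_screenshot_files([long_url, 'http://example.org/'], 'shot-'))
```

The metadata records could then be printed to stdout, one per line, so that a caller can map each file back to its URL.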
2018-12-24  Use f-strings where possible  Lars-Dominik Braun  1  -7/+6
Replaces str.format, which is less readable due to its separation of format and arguments.
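A small comparison shows the readability argument. The variable names and the message text are made up for this example and do not come from tools.py.

```python
name = 'example.warc.gz'
count = 3

# str.format: the placeholders and their values are separated.
old = 'merged {} records into {}'.format(count, name)

# f-string: each value appears at the place where it is used.
new = f'merged {count} records into {name}'
```

Both expressions produce the same string; only the second keeps each value next to its position in the text.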
2018-12-17  Add simple errata tool  Lars-Dominik Braun  1  -0/+71
Fixes #9.
2018-12-08  tools: Add version info to merged WARCs  Lars-Dominik Braun  1  -1/+23
In preparation for #9. I was hoping to reuse one of schema.org’s microdata schemas, but neither Action (archival action) nor SoftwareApplication (version information) seems to be suitable.
2018-11-19  Coding style  Lars-Dominik Braun  1  -1/+1
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-17  tools: Add original HTTP header to revisit record  Lars-Dominik Braun  1  -1/+4
The payloads may be the same, but the headers are usually not.
2018-11-10  tools: Fix WARC merging  Lars-Dominik Braun  1  -18/+17
WARC-Target-URI was taken from the previous record, even if the URI was different. This essentially removes the revisited URL from the archive. Also add a few tests. And boy, warcio is a mess.
2018-06-25  warc: Save DOM-/image screenshot as WARC conversion  Lars-Dominik Braun  1  -13/+16
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-05-05  Rename command line tools  Lars-Dominik Braun  1  -0/+97
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab.