path: root/crocoite/util.py
Age | Commit message | Author | Files | Lines
2019-07-02 | Stabilize WARC headers | Lars-Dominik Braun | 1 | -6/+2
In preparation for 1.0 release:
- Correct mime types
- Add X-Crocoite-Type, so logs, scripts, dom-snapshots and screenshots can be identified easily
- Remove random WARC headers like X-Chrome-Initiator. We don’t want to maintain those.
- Remove non-standard urn-based package URLs. Can’t use them without a urn-registration
2018-12-25 | warc: Add tests | Lars-Dominik Braun | 1 | -2/+2
Using hypothesis-based testcase generation. This is quite nice compared to manual test data generation, since it catches a lot more corner cases (if done right). This commit also fixes a few issues, including:
- log records will only be written if the log is nonempty
- properly quote packageUrl paths
- drop old thread checking code
- use placeholder url for scripts without name
2018-12-24 | Use f-strings where possible | Lars-Dominik Braun | 1 | -2/+2
Replaces str.format, which is less readable due to its separation of format and arguments.
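A minimal sketch of the readability difference this commit message describes. The function names and the message text are hypothetical and are not taken from crocoite/util.py; both functions produce the same string.

```python
def describe_with_format(name: str, count: int) -> str:
    # str.format: placeholders and their arguments are written in separate places
    return "{} wrote {} records".format(name, count)


def describe_with_fstring(name: str, count: int) -> str:
    # f-string: each value appears inline, where it is used
    return f"{name} wrote {count} records"
```

Note that f-strings require Python 3.6 or newer.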
2018-12-21 | util: Skip missing source files | Lars-Dominik Braun | 1 | -1/+1
Requirement extraction fails if the package is an .egg file (i.e. not extracted). Do not try to compute checksum/file length for them.
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -7/+15
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-12-08 | tools: Add version info to merged WARCs | Lars-Dominik Braun | 1 | -1/+13
In preparation for #9. I was hoping to reuse one of schema.org’s microdata schemas, but neither Action (archival action) nor SoftwareApplication (version information) seems to be suitable.
2018-12-01 | util: Remove unused function | Lars-Dominik Braun | 1 | -5/+0
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -1/+1
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -2/+2
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 1 | -1/+44
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
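The kind of JSON payload described here could be assembled roughly as follows. This is a sketch under stated assumptions, not the code from this commit: the function names, the dictionary layout, the choice of SHA-256 and the explicit `packages` argument are all hypothetical. Files that do not exist on disk are skipped rather than hashed.

```python
import hashlib
import json
import os
import platform


def file_summary(path):
    """Return length and SHA-256 of a file, or None if it is not a regular file."""
    if not os.path.isfile(path):
        return None
    with open(path, "rb") as fd:
        data = fd.read()
    return {
        "path": path,
        "length": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # hash algorithm is an assumption
    }


def environment_info(packages):
    """Describe the runtime environment as a JSON-serializable dict.

    packages maps a package name to (version, [source file paths]);
    this input format is hypothetical.
    """
    dependencies = []
    for name, (version, paths) in sorted(packages.items()):
        summaries = (file_summary(p) for p in paths)
        files = [s for s in summaries if s is not None]
        dependencies.append({"name": name, "version": version, "files": files})
    return {
        "python": {
            "implementation": platform.python_implementation(),
            "version": platform.python_version(),
        },
        "dependencies": dependencies,
    }
```

Calling `json.dumps(environment_info(...))` on the result yields a string that could serve as a record body.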
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -0/+6
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -0/+43
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.