path: root/crocoite/warc.py
Age | Commit message | Author | Files | Lines
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -11/+7
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -2/+2
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -23/+9
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 1 | -1/+15
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04 | Reference warcinfo record in every other record | Lars-Dominik Braun | 1 | -18/+30
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 1 | -1/+5
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
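A minimal sketch of how such a JSON warcinfo payload could be assembled, using only the standard library. The field names (`python`, `dependencies`, `files`), the choice of SHA-512, and the helper `build_warcinfo_payload` are assumptions for illustration, not the format crocoite actually writes.

```python
import hashlib
import json
import os
import platform
import tempfile


def build_warcinfo_payload(paths, dependencies):
    """Return a JSON string describing the environment.

    All field names here are hypothetical; only the kinds of data
    (Python version, dependency versions, file hashes) come from the
    commit message.
    """
    files = {}
    for path in paths:
        with open(path, 'rb') as fd:
            files[os.path.basename(path)] = hashlib.sha512(fd.read()).hexdigest()
    info = {
        'python': {
            'version': platform.python_version(),
            'implementation': platform.python_implementation(),
        },
        'dependencies': dependencies,
        'files': files,
    }
    return json.dumps(info, sort_keys=True)


# Usage: hash one throwaway file and record one made-up dependency.
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, 'example.py')
    with open(example_path, 'wb') as fd:
        fd.write(b'print("hello")\n')
    payload = build_warcinfo_payload([example_path], {'example-package': '1.0'})

decoded = json.loads(payload)
```

The resulting string would be used as the body of the warcinfo record; hashing the source files is what lets a reader of the WARC check which exact code produced it.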
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -4/+34
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Add metadata to truncated records | Lars-Dominik Braun | 1 | -22/+28
Specifically for a) redirects (body missing) b) bodies larger than size limit and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they?
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -9/+30
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -4/+0
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+2
This is mainly a quality of life change.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -101/+64
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -3/+3
In preparation for recursive crawls.
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -21/+2
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -7/+5
If there is any and it was not included in the response already.
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -1/+1
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -11/+2
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -2/+4
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -2/+3
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -4/+6
2017-12-22 | Don’t write WARC record if body cannot be retrieved | Lars-Dominik Braun | 1 | -19/+48
+refactoring.
2017-12-20 | Fix HTTP headers using the same key more than once | Lars-Dominik Braun | 1 | -2/+15
This is an undocumented DevTools feature.
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -0/+35
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
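The idea of decoupling writers with a queue and a dedicated thread can be sketched roughly as follows. This is an illustration under assumptions, not the code from this commit: `SerializedWriter`, its `sink` callable, and the `produce` helper are hypothetical names, and a plain list stands in for the WARC file.

```python
import queue
import threading


class SerializedWriter:
    """Funnel records from many threads through a single writer thread."""

    _STOP = object()  # sentinel telling the writer thread to exit

    def __init__(self, sink):
        # sink: hypothetical callable that persists one record; only the
        # writer thread ever calls it, so calls can never interleave.
        self._sink = sink
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write_record(self, record):
        # Safe to call from any thread; returns without waiting for I/O.
        self._queue.put(record)

    def _run(self):
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            self._sink(record)

    def close(self):
        # Flush everything queued so far, then stop the writer thread.
        self._queue.put(self._STOP)
        self._thread.join()


def produce(writer, name, count):
    for i in range(count):
        writer.write_record((name, i))


# Usage: two producers (think logger and site writer) share one writer.
written = []
writer = SerializedWriter(written.append)
producers = [
    threading.Thread(target=produce, args=(writer, name, 100))
    for name in ('logger', 'site')
]
for t in producers:
    t.start()
for t in producers:
    t.join()
writer.close()
```

Because producers only enqueue and return, slow disk writes no longer block them, which is where the possible writeback speedup would come from.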
2017-12-17 | Don’t fetch redirected request body | Lars-Dominik Braun | 1 | -8/+12
We can’t do that safely due to a race condition.
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -0/+6
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+174
Reusable browser communication and WARC writing.