path: root/crocoite/warc.py
Age | Commit message | Author | Files | Lines
2019-01-07 | Log Chrome’s responses to WARC by default | Lars-Dominik Braun | 1 | -5/+0
We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata.
2019-01-03 | browser: Turn Item into RequestResponsePair | Lars-Dominik Braun | 1 | -54/+42
Previously Item was just a simple wrapper around Chrome’s Network.* events. This turned out to be quite nasty when testing, so its replacement, RequestResponsePair, does some level of abstraction. This makes testing a lot easier, since we can now simply instantiate it without building a proper DevTools event. Should come without any functional changes.
2018-12-25 | warc: Add tests | Lars-Dominik Braun | 1 | -15/+14
Using hypothesis-based testcase generation. This is quite nice compared to manual test data generation, since it catches a lot more corner cases (if done right). This commit also fixes a few issues, including:
- log records will only be written if the log is nonempty
- properly quote packageUrl paths
- drop old thread checking code
- use placeholder url for scripts without name
2018-12-24 | Use f-strings where possible | Lars-Dominik Braun | 1 | -9/+10
Replaces str.format, which is less readable due to its separation of format and arguments.
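To illustrate the readability argument, here is a minimal sketch comparing the two styles. The variable names and the message text are made up for the example and are not taken from warc.py.

```python
# Hypothetical values, for illustration only.
url = "http://example.com/"
size = 1024

# str.format: the template and its arguments are separated.
old = "fetched {} ({} bytes)".format(url, size)

# f-string: each value appears at the place where it is used.
new = f"fetched {url} ({size} bytes)"

# Both spellings produce the same string.
assert old == new
```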
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -11/+7
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -2/+2
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -23/+9
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 1 | -1/+15
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04 | Reference warcinfo record in every other record | Lars-Dominik Braun | 1 | -18/+30
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 1 | -1/+5
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
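A rough sketch of what such a JSON payload could look like, using only the standard library. The function name `build_warcinfo_payload`, the field names and the choice of SHA-512 are assumptions made for this illustration; the actual layout written by warc.py may differ.

```python
import hashlib
import json
import sys


def build_warcinfo_payload(packages, files):
    """Return a JSON-encoded description of the environment as bytes.

    packages: mapping of package name -> version string (hypothetical input)
    files: mapping of file name -> file content as bytes (hypothetical input)
    """
    info = {
        # Version of the running interpreter, e.g. [3, 7, 2].
        "python": {"version": list(sys.version_info[:3])},
        "packages": [
            {"name": name, "version": version}
            for name, version in sorted(packages.items())
        ],
        # One hex digest per source file, so the exact code can be identified.
        "files": {
            name: hashlib.sha512(data).hexdigest()
            for name, data in sorted(files.items())
        },
    }
    return json.dumps(info, sort_keys=True).encode("utf-8")


payload = build_warcinfo_payload(
    {"example-package": "1.0"},
    {"warc.py": b"print('hi')\n"},
)
```

The resulting bytes would then be used as the body of the warcinfo record.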
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -4/+34
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Add metadata to truncated records | Lars-Dominik Braun | 1 | -22/+28
Specifically for a) redirects (body missing) b) bodies larger than size limit and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they?
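As a sketch of the idea, the WARC format defines a `WARC-Truncated` header whose value names the reason a record is incomplete. The helper below and its mapping of cases to reasons are assumptions for illustration only, not the code from this commit.

```python
def truncation_headers(body, max_size):
    """Return extra WARC headers for a response body that may be incomplete.

    body is None when no body could be retrieved (for example a redirect
    or a failed fetch); max_size is the configured size limit in bytes.
    """
    if body is None:
        # No body at all: the reason is not one of the specific ones.
        return {"WARC-Truncated": "unspecified"}
    if len(body) > max_size:
        # Body exceeds the limit and will be cut off.
        return {"WARC-Truncated": "length"}
    # Complete body: no extra header needed.
    return {}


missing = truncation_headers(None, 1024)
too_large = truncation_headers(b"x" * 2048, 1024)
complete = truncation_headers(b"ok", 1024)
```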
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -9/+30
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -4/+0
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+2
This is mainly a quality of life change.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -101/+64
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
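The queue-based design described above can be sketched as follows. This is a simplified stand-in, not the SiteLoader code: `BrowserCrashed`, `producer` and `consume` are hypothetical names, and the "browser" is just a thread that emits three events and then fails.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception standing in for a browser failure."""


def producer(events):
    # Runs in a background thread, like the browser event callbacks.
    try:
        for i in range(3):
            events.put(("event", i))
        raise BrowserCrashed("tab went away")
    except Exception as exc:
        # Hand the failure to the consumer instead of losing it in this thread.
        events.put(("error", exc))


def consume(events):
    """Yield events in the calling thread and re-raise producer errors."""
    while True:
        kind, payload = events.get()
        if kind == "error":
            raise payload
        yield payload


events = queue.Queue()
threading.Thread(target=producer, args=(events,), daemon=True).start()

handled = []
crash = None
try:
    for item in consume(events):
        handled.append(item)
except BrowserCrashed as exc:
    # The main thread sees the crash and can shut down cleanly.
    crash = exc
```

Because the exception travels through the same queue as ordinary events, the consumer processes everything emitted before the crash and only then sees the failure.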
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -3/+3
In preparation for recursive crawls.
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -21/+2
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -7/+5
If there is any and it was not included in the response already.
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -1/+1
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -11/+2
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -2/+4
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -2/+3
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -4/+6
2017-12-22 | Don’t write WARC record if body cannot be retrieved | Lars-Dominik Braun | 1 | -19/+48
+refactoring.
2017-12-20 | Fix HTTP headers using the same key more than once | Lars-Dominik Braun | 1 | -2/+15
This is an undocumented DevTools feature.
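A minimal sketch of how such headers could be expanded, assuming that DevTools reports a repeated header as a single key whose values are joined by newlines. That assumption is not stated in the commit message, and `unfold_headers` is a hypothetical helper name.

```python
def unfold_headers(headers):
    """Expand newline-joined values into one (name, value) pair each."""
    pairs = []
    for name, value in headers.items():
        # Assumption: repeated headers arrive as "value1\nvalue2".
        for single in value.split("\n"):
            pairs.append((name, single))
    return pairs


folded = {"Set-Cookie": "a=1\nb=2", "Content-Type": "text/html"}
unfolded = unfold_headers(folded)
```

Returning a list of pairs rather than a dict keeps every occurrence of a repeated header.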
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -0/+35
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
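The pattern described here, one dedicated writer thread fed through a queue, might look roughly like this. `SerializedWriter` is a hypothetical class written for illustration, and an in-memory `io.BytesIO` stands in for the WARC file.

```python
import io
import queue
import threading


class SerializedWriter:
    """Funnel writes from many threads through a single writer thread."""

    _STOP = object()  # sentinel that tells the writer thread to exit

    def __init__(self, stream):
        self._stream = stream
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Only this thread ever touches the stream, so records never interleave.
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            self._stream.write(record)

    def write_record(self, record):
        # Safe to call from any thread; returns without waiting for the write.
        self._queue.put(record)

    def close(self):
        # Flush everything queued so far, then stop the writer thread.
        self._queue.put(self._STOP)
        self._thread.join()


stream = io.BytesIO()
writer = SerializedWriter(stream)
producers = [
    threading.Thread(target=writer.write_record, args=(f"record {i}\n".encode(),))
    for i in range(4)
]
for thread in producers:
    thread.start()
for thread in producers:
    thread.join()
writer.close()
```

The order of records depends on thread scheduling, but each record is written whole.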
2017-12-17 | Don’t fetch redirected request body | Lars-Dominik Braun | 1 | -8/+12
We can’t do that safely due to a race-condition.
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -0/+6
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+174
Reusable browser communication and WARC writing.