path: root/crocoite/warc.py
Age | Commit message | Author | Files | Lines
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -4/+0
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+2
This is mainly a quality of life change
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -101/+64
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
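The queue-based design described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from warc.py: `BrowserCrashed`, `producer`, `consume` and `grab` are hypothetical names, and a plain thread stands in for the pychrome callback thread. The point is that the producer only enqueues, while the main thread dequeues and is therefore the place where a crash surfaces as an exception.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception signalling that the browser went away."""


def producer(events, items, crash_after=None):
    """Stand-in for the browser callback thread: it only enqueues events."""
    for i, item in enumerate(items):
        if crash_after is not None and i == crash_after:
            # Hand the failure to the consumer instead of raising it here,
            # where nobody in the main thread would notice.
            events.put(BrowserCrashed('browser crashed'))
            return
        events.put(item)
    events.put(None)  # sentinel: no more events


def consume(events, timeout=5):
    """Runs in the main thread; yields events and re-raises crashes there."""
    while True:
        event = events.get(timeout=timeout)
        if event is None:
            return
        if isinstance(event, Exception):
            raise event
        yield event


def grab(items, crash_after=None):
    """Process all events in the calling thread and report what was handled."""
    events = queue.Queue()
    thread = threading.Thread(
        target=producer, args=(events, items, crash_after), daemon=True)
    thread.start()
    handled = []
    try:
        for event in consume(events):
            handled.append(event)  # a WARC writer or stats consumer would go here
    except BrowserCrashed:
        handled.append('crashed')
    thread.join()
    return handled
```

Because every consumer runs in the same thread in this sketch, none of them needs its own locking, which matches the note that the WARC writer no longer requires serialization.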
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -3/+3
In preparation for recursive crawls.
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -21/+2
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -7/+5
If there is any and it was not included in the response already.
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -1/+1
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -11/+2
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -2/+4
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -2/+3
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -4/+6
2017-12-22 | Don’t write WARC record if body cannot be retrieved | Lars-Dominik Braun | 1 | -19/+48
+refactoring.
2017-12-20 | Fix HTTP headers using the same key more than once | Lars-Dominik Braun | 1 | -2/+15
This is an undocumented DevTools feature.
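As an illustration of what such a fix might involve: the sketch below assumes that DevTools delivers headers as a mapping from name to a single string, with repeated headers joined by a newline. That encoding is an assumption made for this example, since the commit message does not spell it out, and `unfold_headers` is a hypothetical helper, not a function from warc.py.

```python
def unfold_headers(headers):
    """Expand a name -> value mapping into a list of (name, value) pairs.

    Assumption: a header sent more than once arrives as one value with the
    individual values separated by newlines.
    """
    pairs = []
    for name, value in headers.items():
        for single in value.split('\n'):
            pairs.append((name, single))
    return pairs
```

Returning a list of pairs rather than a dict keeps every occurrence of a repeated header such as `Set-Cookie`, which a plain dict cannot represent.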
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -0/+35
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
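A writer thread decoupled by a queue, as described above, could look roughly like this. It is a sketch under assumptions: `SerializedWriter` is a hypothetical class, records are plain bytes, and a `None` sentinel is used for shutdown. Only the method name `write_record` is taken from the commit message.

```python
import queue
import threading


class SerializedWriter:
    """Funnels records from many threads through a single writer thread."""

    def __init__(self, fd):
        self._fd = fd
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Only this thread touches the file object, so records cannot
        # interleave on disk.
        while True:
            record = self._queue.get()
            if record is None:  # sentinel: stop writing
                break
            self._fd.write(record)

    def write_record(self, record):
        """Safe to call from any thread; returns without waiting for I/O."""
        self._queue.put(record)

    def close(self):
        """Flush everything queued so far and stop the writer thread."""
        self._queue.put(None)
        self._thread.join()
```

Callers only pay for a queue insertion, so slow disk writes no longer block them, which is the possible speed-up the message alludes to.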
2017-12-17 | Don’t fetch redirected request body | Lars-Dominik Braun | 1 | -8/+12
We can’t do that safely due to a race condition.
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -0/+6
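For illustration, converting a browser-supplied timestamp into a WARC-Date value might look like the sketch below. It assumes the timestamp is wall-clock time in seconds since the Unix epoch; the commit does not say which DevTools field is used, and `warc_date` is a hypothetical helper. WARC-Date is a UTC timestamp of the form `YYYY-MM-DDThh:mm:ssZ`.

```python
from datetime import datetime, timezone


def warc_date(wall_time):
    """Format seconds since the Unix epoch as a WARC-Date string in UTC."""
    moment = datetime.fromtimestamp(wall_time, tz=timezone.utc)
    # Fractional seconds are dropped; the format has second resolution.
    return moment.strftime('%Y-%m-%dT%H:%M:%SZ')
```

Taking the date from the browser rather than from the time of writing keeps the record date close to when the request actually happened.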
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+174
Reusable browser communication and WARC writing.