path: root/crocoite/browser.py
Age | Commit message | Author | Files | Lines
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -17/+11
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-11-24 | browser: Ignore load failures for nonexisting requests | Lars-Dominik Braun | 1 | -2/+3
Fixes None dereference.
2018-11-22 | controller: Improve idle waiting | Lars-Dominik Braun | 1 | -2/+38
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -4/+2
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-17 | browser: clearBrowserCookies is supported unconditionally | Lars-Dominik Braun | 1 | -4/+1
canClearBrowserCookies apparently has been removed from protocol 1.3.
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -72/+0
Move it to .devtools. Seems more fitting.
2018-11-06 | Switch site loader to async DevTools communication | Lars-Dominik Braun | 1 | -107/+126
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 1 | -5/+4
Fixes test failures. Very fragile code unfortunately.
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 1 | -1/+3
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -18/+30
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -3/+7
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -6/+3
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 1 | -162/+0
It just seems a little nicer than plain old unittest
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+11
This is mainly a quality of life change
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -252/+165
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
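The producer/consumer split described in this commit message can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the commit: `BrowserCrashed`, `producer` and `consume` are hypothetical names, and the string `"crash"` merely simulates a browser failure.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error type standing in for a lost browser connection."""


def producer(events, messages):
    """Background thread: only enqueues events, never processes them."""
    try:
        for msg in messages:
            if msg == "crash":  # simulated failure, purely illustrative
                raise BrowserCrashed("browser process died")
            events.put(("event", msg))
    except Exception as exc:
        # Forward the exception so it is not lost inside this thread.
        events.put(("error", exc))
    finally:
        events.put(("done", None))


def consume(messages):
    """Main-thread loop: drain the queue and re-raise producer failures."""
    events = queue.Queue()
    worker = threading.Thread(target=producer, args=(events, messages), daemon=True)
    worker.start()
    handled = []
    try:
        while True:
            kind, payload = events.get()
            if kind == "error":
                raise payload
            if kind == "done":
                return handled
            handled.append(payload)
    finally:
        worker.join()
```

Because every event passes through the queue, a failure in the producer surfaces in the consumer's own call stack, where it can be handled like any other exception.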
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir.
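Polling a file for the chosen port could look like the sketch below. It assumes the file is named `DevToolsActivePort` and carries the port number on its first line, which is what Chrome writes when started with `--remote-debugging-port=0`; the function name `wait_for_port` and its parameters are hypothetical and not taken from the commit.

```python
import os
import time


def wait_for_port(user_data_dir, timeout=10.0, interval=0.1):
    """Poll <user_data_dir>/DevToolsActivePort until a port number appears."""
    # Assumption: file name and layout as written by Chrome itself.
    path = os.path.join(user_data_dir, "DevToolsActivePort")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(path) as fd:
                line = fd.readline()
            # Only trust a fully written first line.
            if line.endswith("\n"):
                return int(line)
        except FileNotFoundError:
            pass  # Chrome has not created the file yet
        time.sleep(interval)
    raise TimeoutError("no DevTools port found in " + path)
```

A caller would start Chrome with a fresh user-data-dir, call `wait_for_port` on that directory, and then connect to the returned port on localhost.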
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+1
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66
Use an actual class that supports multiple invocations.
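For context: a context manager built with `contextlib.contextmanager` wraps a generator and is single-use once created, whereas a class with `__enter__` and `__exit__` can be entered repeatedly. The sketch below shows the class-based shape; `Service` is a hypothetical stand-in, not the class introduced by this commit.

```python
class Service:
    """Hypothetical reusable context manager; it only tracks its own state."""

    def __init__(self, name):
        self.name = name
        self.active = False
        self.starts = 0

    def __enter__(self):
        if self.active:
            raise RuntimeError("already running")
        # A real implementation would start the browser process here.
        self.active = True
        self.starts += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real implementation would stop the process and clean up here.
        self.active = False
        return False  # do not swallow exceptions
```

The same instance can then be used in several consecutive `with` blocks, each one starting from a clean state.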
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -0/+22
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -1/+15
If there is any and it was not included in the response already.
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -9/+9
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4
When something goes wrong, these block the entire grab.
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37
alert, confirm, prompt and beforeunload
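In the DevTools protocol, dialogs are announced by the `Page.javascriptDialogOpening` event and answered with `Page.handleJavaScriptDialog`, which takes an `accept` flag. The sketch below maps an event to the parameters of that call. The policy shown (dismiss alert/confirm/prompt, accept beforeunload so navigation can continue) is an assumption for illustration, and `dialog_response` is a hypothetical helper, not a function from this file.

```python
def dialog_response(event):
    """Return parameters for Page.handleJavaScriptDialog for a dialog event."""
    kind = event.get("type")
    if kind in ("alert", "confirm", "prompt"):
        # Assumed policy: dismiss so the page is not blocked forever.
        return {"accept": False}
    if kind == "beforeunload":
        # Assumed policy: accept so the page may unload.
        return {"accept": True}
    raise ValueError("unknown dialog type: {!r}".format(kind))
```

Keeping the decision in a pure function makes it easy to test without a running browser; the caller is responsible for actually sending the command.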
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190
To be expanded, but it’s a start…
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1
We passed it to the child and don’t need it any more.
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -2/+19
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12
2017-12-23 | Set fake finished response for redirects | Lars-Dominik Braun | 1 | -1/+4
Fixes bcfbdd9b45b7e872ee77e1366197443d855d8c7c
2017-12-23 | Drain tab event queue before stopping | Lars-Dominik Braun | 1 | -0/+2
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -2/+33
2017-12-22 | SiteLoader: Save entire finished response | Lars-Dominik Braun | 1 | -2/+9
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -6/+15
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -0/+52
Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-11-29 | Add missing timestamp to response data for redirects | Lars-Dominik Braun | 1 | -1/+1
Fixes 6f628ca24ac2b243dd4a611ff1ecff2d35aaa019
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -8/+8
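As a rough illustration of what this commit title implies: DevTools network events carry a wall-clock time in seconds since the Unix epoch, while WARC-Date expects a UTC timestamp in ISO 8601 form. A conversion might look like the sketch below; `warc_date` is a hypothetical helper, and which protocol field the commit actually reads is not stated in this log.

```python
from datetime import datetime, timezone


def warc_date(wall_time):
    """Format seconds since the Unix epoch as a WARC-Date string (UTC)."""
    moment = datetime.fromtimestamp(wall_time, tz=timezone.utc)
    # Second precision; fractional seconds are truncated.
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Using the browser's own timestamp rather than the time of writing keeps the record date tied to when the request actually happened.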
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+209
Reusable browser communication and WARC writing.