path: root/crocoite/browser.py
Age | Commit message | Author | Files | Lines
2019-10-13 | browser: Work around missing responseReceived events | Lars-Dominik Braun | 1 | -0/+7
Looks like Chrome extensively reuses request ids now. Sucks, since we relied on their uniqueness. For now ignore requests without a dedicated responseReceived event. See issue #24.
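The workaround described above can be sketched roughly as follows. This is a minimal illustration, not crocoite's actual code: the `PendingRequest` and `Tracker` classes and their handler names are hypothetical. Only the event names they mirror (`requestWillBeSent`, `responseReceived`, `loadingFinished`) come from Chrome's DevTools Network domain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PendingRequest:
    """Hypothetical record for one in-flight request."""
    request_id: str
    url: str
    response: Optional[dict] = None  # filled in by on_response_received


class Tracker:
    """Hypothetical tracker keyed by the DevTools request id."""

    def __init__(self) -> None:
        self.pending: Dict[str, PendingRequest] = {}
        self.finished: List[PendingRequest] = []

    def on_request_will_be_sent(self, request_id: str, url: str) -> None:
        # If Chrome reuses an id, the new request replaces the old entry.
        self.pending[request_id] = PendingRequest(request_id, url)

    def on_response_received(self, request_id: str, response: dict) -> None:
        item = self.pending.get(request_id)
        if item is not None:
            item.response = response

    def on_loading_finished(self, request_id: str) -> None:
        item = self.pending.pop(request_id, None)
        if item is None or item.response is None:
            # No dedicated responseReceived event was seen: ignore the request.
            return
        self.finished.append(item)
```

The trade-off is that a request without its own response event is dropped silently, which matches the "for now" wording of the commit message.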
2019-06-18 | Re-inject behavior scripts on site reload | Lars-Dominik Braun | 1 | -1/+22
Fixes #13. Event handler’s push() is async now.
2019-06-18 | Fix idle state tracking race condition | Lars-Dominik Braun | 1 | -31/+18
Closes #16. Expose SiteLoader’s page idle changes through events and move state tracking into controller event handler. Relies on tracking time instead of asyncio event, which is more reliable.
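One way to read "tracking time instead of asyncio event" is sketched below. This is an assumption about the general technique, not the project's implementation: the `IdleTracker` class, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional


class IdleTracker:
    """Hypothetical handler: remembers when the page last became idle."""

    def __init__(self, idle_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout = idle_timeout
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.idle_since: Optional[float] = None

    def on_idle_change(self, idle: bool) -> None:
        # Called for every idle state change event emitted by the loader.
        if idle:
            if self.idle_since is None:
                self.idle_since = self.clock()
        else:
            # Any activity resets the timer.
            self.idle_since = None

    def is_done(self) -> bool:
        # Done only if the page has stayed idle for the whole timeout.
        return (self.idle_since is not None
                and self.clock() - self.idle_since >= self.idle_timeout)
```

Because the state is a plain timestamp, a late or duplicated event can only reset or keep the timer. There is no set/clear pair that two coroutines could interleave.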
2019-03-16 | browser: Raise exception if navigation failed | Lars-Dominik Braun | 1 | -0/+5
Stop early if there’s nothing to do.
2019-03-16 | browser: Use different UUID for loadingFinished/Failed | Lars-Dominik Braun | 1 | -1/+1
2019-01-07 | Log Chrome’s responses to WARC by default | Lars-Dominik Braun | 1 | -9/+21
We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata.
2019-01-05 | browser: Do not overwrite request data when prefetching | Lars-Dominik Braun | 1 | -2/+0
Needs a testcase.
2019-01-03 | browser: Turn Item into RequestResponsePair | Lars-Dominik Braun | 1 | -111/+215
Previously Item was just a simple wrapper around Chrome’s Network.* events. This turned out to be quite nasty when testing, so its replacement, RequestResponsePair, does some level of abstraction. This makes testing a lot easier, since we can now simply instantiate it without building a proper DevTools event. Should come without any functional changes.
2018-12-24 | Use f-strings where possible | Lars-Dominik Braun | 1 | -1/+1
Replaces str.format, which is less readable due to its separation of format and arguments.
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -17/+11
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-11-24 | browser: Ignore load failures for nonexisting requests | Lars-Dominik Braun | 1 | -2/+3
Fixes None dereference.
2018-11-22 | controller: Improve idle waiting | Lars-Dominik Braun | 1 | -2/+38
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -4/+2
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-17 | browser: clearBrowserCookies is supported unconditionally | Lars-Dominik Braun | 1 | -4/+1
canClearBrowserCookies apparently has been removed from protocol 1.3.
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -72/+0
Move it to .devtools. Seems more fitting.
2018-11-06 | Switch site loader to async DevTools communication | Lars-Dominik Braun | 1 | -107/+126
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 1 | -5/+4
Fixes test failures. Very fragile code unfortunately.
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 1 | -1/+3
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -18/+30
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -3/+7
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -6/+3
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 1 | -162/+0
It just seems a little nicer than plain old unittest.
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+11
This is mainly a quality of life change.
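The commit message does not say which benefit was intended. One plausible reading, offered here as an assumption, is that `__slots__` turns a misspelled attribute assignment into an immediate error. The `SlottedItem` class below is hypothetical and only illustrates that behavior.

```python
class SlottedItem:
    """Hypothetical class: only the listed attributes may be assigned."""

    __slots__ = ("url", "status")

    def __init__(self, url: str, status: int) -> None:
        self.url = url
        self.status = status


def try_typo(item: SlottedItem) -> bool:
    """Return True if assigning a misspelled attribute is rejected."""
    try:
        item.stauts = 404  # deliberate typo for "status"
    except AttributeError:
        return True
    return False
```

Without `__slots__`, the typo would silently create a new attribute and leave `status` unchanged.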
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -252/+165
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
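The producer/consumer split described above can be illustrated with the standard library alone. This is a sketch under stated assumptions rather than the project's code: `BrowserCrashed`, `producer`, and `consume` are hypothetical names, and the "crash" marker stands in for however the real loader detects a dead browser.

```python
import queue
import threading
from typing import Callable


class BrowserCrashed(Exception):
    """Hypothetical exception signalling that the browser went away."""


def producer(events: queue.Queue, count: int) -> None:
    # Runs in a background thread. It only enqueues and never handles events,
    # so nothing can fail silently here.
    for i in range(count):
        events.put(("event", i))
    events.put(("crash", None))


def consume(events: queue.Queue, handle: Callable[[int], None]) -> None:
    # Runs in the main thread, so any exception raised here reaches the caller.
    while True:
        kind, payload = events.get(timeout=5)
        if kind == "crash":
            raise BrowserCrashed("browser is gone, aborting grab")
        handle(payload)


def run(count: int = 3) -> list:
    """Start the producer, drain the queue, and report what was handled."""
    events: queue.Queue = queue.Queue()
    handled: list = []
    worker = threading.Thread(target=producer, args=(events, count), daemon=True)
    worker.start()
    try:
        consume(events, handled.append)
    except BrowserCrashed:
        pass  # the main thread decides how to recover
    worker.join()
    return handled
```

Since consumers such as a WARC writer are only ever called from the main thread in this arrangement, they need no locking, which fits the note that the writer no longer requires serialization.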
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir.
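The polling approach can be sketched as below. The commit message does not name the file. This sketch assumes it is `DevToolsActivePort`, which Chrome writes into the user data directory when started with `--remote-debugging-port=0`, with the port number on the first line. The function name and its parameters are hypothetical.

```python
import time
from pathlib import Path


def wait_for_devtools_port(user_data_dir: str, timeout: float = 10.0,
                           interval: float = 0.1) -> int:
    """Poll the (assumed) DevToolsActivePort file until a port can be read."""
    port_file = Path(user_data_dir) / "DevToolsActivePort"
    deadline = time.monotonic() + timeout
    while True:
        try:
            lines = port_file.read_text().splitlines()
            if lines:
                return int(lines[0])
        except (FileNotFoundError, ValueError):
            pass  # file not written yet, or only partially written
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no port found in {port_file}")
        time.sleep(interval)
```

A caller would start Chrome with a fresh `--user-data-dir`, call this function, and then connect to `localhost` on the returned port.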
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+1
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66
Use an actual class that supports multiple invocations.
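The difference between the two styles is illustrated below with hypothetical names (`decorated`, `Reusable`). An object produced by `contextlib.contextmanager` wraps a generator and can be entered only once, while a class with `__enter__` and `__exit__` can be entered as often as needed.

```python
from contextlib import contextmanager


@contextmanager
def decorated():
    # Generator-based: the returned object is single-use.
    yield "resource"


class Reusable:
    """Hypothetical class-based context manager that can be entered repeatedly."""

    def __init__(self) -> None:
        self.entered = 0

    def __enter__(self) -> "Reusable":
        self.entered += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        return False  # do not swallow exceptions


def reuse_fails(cm) -> bool:
    """Enter cm twice and report whether the second attempt raised."""
    with cm:
        pass
    try:
        with cm:
            pass
    except (RuntimeError, AttributeError):
        return True
    return False
```

Here `reuse_fails(decorated())` reports a failure on the second `with`, while `reuse_fails(Reusable())` does not.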
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -0/+22
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -1/+15
If there is any and it was not included in the response already.
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -9/+9
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4
When something goes wrong, these block the entire grab.
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37
alert, confirm, prompt and beforeunload.
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190
To be expanded, but it’s a start…
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1
We passed it to the child and don’t need it any more.
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -2/+19
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12
2017-12-23 | Set fake finished response for redirects | Lars-Dominik Braun | 1 | -1/+4
Fixes bcfbdd9b45b7e872ee77e1366197443d855d8c7c
2017-12-23 | Drain tab event queue before stopping | Lars-Dominik Braun | 1 | -0/+2
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -2/+33
2017-12-22 | SiteLoader: Save entire finished response | Lars-Dominik Braun | 1 | -2/+9
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -6/+15
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -0/+52
Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-11-29 | Add missing timestamp to response data for redirects | Lars-Dominik Braun | 1 | -1/+1
Fixes 6f628ca24ac2b243dd4a611ff1ecff2d35aaa019
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -8/+8
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+209
Reusable browser communication and WARC writing.