path: root/crocoite/controller.py
Age | Commit message | Author | Files | Lines
2019-05-30 | controller: Fix -recursive stats | Lars-Dominik Braun | 1 | -2/+5
The stats previously included running jobs. Remove them.
2019-05-30 | controller: Correctly re-raise exceptions | Lars-Dominik Braun | 1 | -1/+2
asyncio.gather returns the task’s results or exception, not task objects. Probably a copy&paste error.
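A minimal sketch of the behaviour described above, assuming gather() is called with return_exceptions=True (the commit message does not say so). The coroutine names are hypothetical and not taken from crocoite.

```python
import asyncio


async def ok():
    return "done"


async def fail():
    raise ValueError("boom")


async def main():
    # With return_exceptions=True, gather() returns plain results and
    # exception instances in argument order. It never returns Task objects,
    # so calling .exception() or .result() on the items would be a bug.
    results = await asyncio.gather(ok(), fail(), return_exceptions=True)
    errors = [r for r in results if isinstance(r, Exception)]
    return results, errors


results, errors = asyncio.run(main())
```

To re-raise, the caller can simply `raise errors[0]` once it has inspected the list.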
2019-05-30 | controller: Fix DepthLimit | Lars-Dominik Braun | 1 | -11/+31
The policy itself must be stateless, since there can be multiple ExtractLinks events (which would cause DepthLimit to reduce its depth every time).
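One way to picture a stateless policy is to pass the current depth in with every call instead of decrementing a counter inside the policy. This is only a sketch: the class below and its call signature are hypothetical and do not reproduce crocoite's actual DepthLimit.

```python
class StatelessDepthLimit:
    """Hypothetical policy: keeps only the configured limit, no counters."""

    def __init__(self, max_depth):
        self.max_depth = max_depth

    def __call__(self, links, depth):
        # The depth of the page that produced the links comes from the
        # caller, so repeated link-extraction events for the same page
        # always get the same answer.
        return set(links) if depth < self.max_depth else set()


policy = StatelessDepthLimit(1)
first = policy({"https://example.com/a"}, depth=0)
second = policy({"https://example.com/a"}, depth=0)
too_deep = policy({"https://example.com/b"}, depth=1)
```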
2019-05-05 | irc: Add job info to warcinfo record | Lars-Dominik Braun | 1 | -1/+5
2019-05-05 | cli: Allow adding extra data to warcinfo record | Lars-Dominik Braun | 1 | -2/+7
2019-03-16 | Add more debug messages | Lars-Dominik Braun | 1 | -2/+10
…to controller and behavior
2019-03-05 | Replace mutable default arguments | Lars-Dominik Braun | 1 | -2/+2
This fixes IRC permission checks. Previously all users who joined the channel after the bot stored their modes in the same set(). Can be detected with pylint W0102.
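The pitfall behind this fix can be reproduced in a few lines. The two classes below are hypothetical stand-ins, not crocoite's IRC code; they only show why a default of set() ends up shared between instances.

```python
class SharedUser:
    # The default set() is created once, when the function is defined,
    # and reused for every call that omits `modes` (pylint W0102).
    def __init__(self, name, modes=set()):
        self.name = name
        self.modes = modes


class FixedUser:
    # Using None as a sentinel gives every instance its own set.
    def __init__(self, name, modes=None):
        self.name = name
        self.modes = set() if modes is None else modes


a, b = SharedUser("a"), SharedUser("b")
a.modes.add("o")
shared = "o" in b.modes  # b sees a's mode

c, d = FixedUser("c"), FixedUser("d")
c.modes.add("o")
isolated = "o" not in d.modes  # d is unaffected
```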
2019-01-27 | recursive: Avoid deadlock if unknown exception occurs | Lars-Dominik Braun | 1 | -0/+9
Kill the subprocess and make sure we retrieve exceptions from .fetch()
2019-01-27 | Increase subprocess’ StreamReader limits | Lars-Dominik Braun | 1 | -1/+1
We’re sending quite big JSON objects since 3a2fcc69a8eb4237b2862b3e291971d38748f115.
2019-01-26 | controller: Make sure idleTimeout is always applied | Lars-Dominik Braun | 1 | -1/+3
If the browser goes idle before we enter `while True` we never notice and thus the idleTimeout is never applied.
2019-01-05 | controller: Fix PrefixLimit | Lars-Dominik Braun | 1 | -1/+1
Probably broken by the transition to URL() in commit 5e444dd6511d97308a84ae9c86ebf14547d01f01. And yes, we desperately need some tests for this.
2019-01-03 | browser: Turn Item into RequestResponsePair | Lars-Dominik Braun | 1 | -6/+6
Previously Item was just a simple wrapper around Chrome’s Network.* events. This turned out to be quite nasty when testing, so its replacement, RequestResponsePair, does some level of abstraction. This makes testing a lot easier, since we can now simply instantiate it without building a proper DevTools event. Should come without any functional changes.
2018-12-24 | Use f-strings where possible | Lars-Dominik Braun | 1 | -1/+1
Replaces str.format, which is less readable due to its separation of format and arguments.
2018-12-22 | Fix recursive mode’s URL parsing | Lars-Dominik Braun | 1 | -1/+2
Broken by commit 5e444dd6511d97308a84ae9c86ebf14547d01f01. URLs read from stdin must be converted from str.
2018-12-22 | Switch -recursive to asyncio’s .cancel() | Lars-Dominik Braun | 1 | -53/+55
RecursiveController used a custom .cancel() method before. Instead we can simply cancel .run() and handle the CancelledError inside run() and fetch().
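The pattern described above can be sketched with plain asyncio. The run() and fetch() coroutines here are hypothetical stand-ins that only mirror the names in the commit message; they are not RecursiveController's real methods.

```python
import asyncio


async def fetch(log):
    try:
        await asyncio.sleep(3600)  # stands in for a long-running grab
    except asyncio.CancelledError:
        log.append("fetch cleaned up")
        raise  # keep propagating so the task ends up cancelled


async def run(log):
    try:
        await fetch(log)
    except asyncio.CancelledError:
        log.append("run cancelled")
        raise


async def main():
    log = []
    task = asyncio.ensure_future(run(log))
    await asyncio.sleep(0)  # give run() a chance to start
    task.cancel()  # no custom cancel method needed
    try:
        await task
    except asyncio.CancelledError:
        log.append("caller saw cancellation")
    return log


log = asyncio.run(main())
```

Re-raising CancelledError after cleanup matters: swallowing it would make the task look like it finished normally.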
2018-12-21 | Remove unused EventHandler property | Lars-Dominik Braun | 1 | -6/+0
Crash detection was moved into -recursive’s return code checking a while ago.
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -5/+4
Use library yarl (already pulled in by aiohttp). No URL processed should be a string.
2018-12-08 | controller: Reraise queue processing errors early | Lars-Dominik Braun | 1 | -1/+7
2018-12-08 | tools: Add version info to merged WARCs | Lars-Dominik Braun | 1 | -11/+4
In preparation for #9. I was hoping to reuse one of schema.org’s microdata schemas, but neither Action (archival action) nor SoftwareApplication (version information) seems to be suitable.
2018-12-02 | controller: Add only enabled behavior scripts to warcinfo | Lars-Dominik Braun | 1 | -5/+5
2018-12-02 | controller: Remove unused argument | Lars-Dominik Braun | 1 | -4/+3
It was replaced by handler a while ago.
2018-11-25 | single: Graceful ^C | Lars-Dominik Braun | 1 | -1/+8
Allow cancellation of timeout wait.
2018-11-24 | behavior: Fix scrolling | Lars-Dominik Braun | 1 | -2/+2
- Introduce stop() method callable from Python. Looks like the old method (global variable) was not working (any more?). This is much better anyway.
- Restore state of scrolled elements (not window). Fixes weird screenshots of twitter.com.
2018-11-22 | controller: Improve idle waiting | Lars-Dominik Braun | 1 | -16/+27
2018-11-19 | controller: Add parameters to warcinfo | Lars-Dominik Braun | 1 | -0/+7
Add parameters the grab was run with, so we can actually reproduce a run.
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -20/+16
Fix a few random issues pointed out by pylint, mainly unused imports.
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -66/+66
Move it to .devtools. Seems more fitting.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -103/+75
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-10-30 | recursive: Actually stop the grab when canceled | Lars-Dominik Braun | 1 | -1/+3
This change was lost during the merge of 958563a3602780b48599c27acf212139c2e6904d.
2018-10-30 | Reduce idle wait time after stopping page | Lars-Dominik Braun | 1 | -4/+4
2018-10-23 | single: Set and recursive: check exit status | Lars-Dominik Braun | 1 | -7/+14
Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, it might be broken.
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 1 | -6/+12
Using websockets, vue and bulma.
2018-10-11 | recursive: Gracefully shut down on SIGINT/TERM | Lars-Dominik Braun | 1 | -3/+14
2018-10-03 | controller: Depth limit does not work with i>1 | Lars-Dominik Braun | 1 | -1/+3
No easy way to fix this, so just limit to [0, 1] for now.
2018-09-29 | Add documentation | Lars-Dominik Braun | 1 | -2/+3
For -recursive and -irc
2018-09-25 | Prevent recursing into arbitrary schemes | Lars-Dominik Braun | 1 | -1/+9
HTTP(S) only.
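A scheme filter of this kind might look like the sketch below. It uses the standard library's urllib.parse to stay self-contained; the helper name and the allow-list constant are hypothetical, and the actual check in crocoite may be written differently.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the commit message only says "HTTP(S) only".
ALLOWED_SCHEMES = {"http", "https"}


def is_recursable(url):
    """Return True if the extracted link may be queued for recursion."""
    return urlsplit(url).scheme in ALLOWED_SCHEMES


links = [
    "https://example.com/page",
    "http://example.com/other",
    "mailto:someone@example.com",
    "javascript:void(0)",
    "ftp://example.com/file",
]
queued = [link for link in links if is_recursable(link)]
```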
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 1 | -4/+14
❤️ asyncio.
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 1 | -1/+129
Simple and sequential.
2018-09-25 | Log extracted links | Lars-Dominik Braun | 1 | -0/+23
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 1 | -118/+3
Gonna rewrite that properly.
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 1 | -6/+16
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -23/+33
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -7/+1
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+22
This is mainly a quality of life change
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -99/+161
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
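The queue-based design described in this commit can be illustrated with the standard library alone. Everything below is a hypothetical sketch: the exception class, the event tuples and the function names are invented for illustration and do not come from crocoite or pychrome.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical stand-in for a crash reported by the browser."""


def producer(events):
    # Runs in a background thread, like a browser event callback would.
    for i in range(3):
        events.put(("response", i))
    # Errors travel through the same queue instead of dying in the thread.
    events.put(BrowserCrashed("tab crashed"))


def consume(events, handled):
    # Runs in the main thread, so a raised exception reaches the caller.
    while True:
        item = events.get()
        if isinstance(item, Exception):
            raise item
        handled.append(item)


events = queue.Queue()
handled = []
thread = threading.Thread(target=producer, args=(events,))
thread.start()
try:
    consume(events, handled)
    crashed = False
except BrowserCrashed:
    crashed = True
thread.join()
```

Because the consumer alone touches `handled`, downstream state needs no locking, which matches the note that the WARC writer no longer requires serialization.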
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 1 | -1/+1
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -5/+17
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -0/+100
Only local right now, not distributed.
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -1/+11
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -0/+103
In preparation for recursive crawls.