path: root/crocoite
Age | Commit message | Author | Files | Lines
2018-11-10 | tools: Fix WARC merging | Lars-Dominik Braun | 2 | -18/+205
WARC-Target-URI was taken from the previous record, even if the URI was different. This essentially removes the revisited URL from the archive. Also add a few tests. And boy, warcio is a mess.
2018-11-08 | devtools: Disable websocket pings to Chrome | Lars-Dominik Braun | 2 | -1/+12
Chrome does not like that.
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 5 | -175/+141
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1.
2018-11-06 | Switch site loader to async DevTools communication | Lars-Dominik Braun | 2 | -229/+236
2018-11-06 | Add simple asyncio-based DevTool communication | Lars-Dominik Braun | 2 | -0/+406
Inspired by pychrome/aiochrome, but includes crash handling and async get() instead of callbacks.
2018-11-03 | html: Add tests for tag/attribute stripping | Lars-Dominik Braun | 1 | -0/+38
2018-10-30 | recursive: Actually stop the grab when canceled | Lars-Dominik Braun | 1 | -1/+3
This change was lost during the merge of 958563a3602780b48599c27acf212139c2e6904d.
2018-10-30 | Reduce idle wait time after stopping page | Lars-Dominik Braun | 1 | -4/+4
2018-10-30 | Increase default timeouts | Lars-Dominik Braun | 1 | -2/+2
These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls.
2018-10-23 | single: Set and recursive: check exit status | Lars-Dominik Braun | 2 | -12/+34
Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, since it might be broken.
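The exit status check described above could look roughly like the following. This is only a sketch under assumptions: `finish_grab`, the `stats` keys and the temporary/destination paths are hypothetical names chosen for illustration and are not taken from the crocoite source.

```python
import shutil


def finish_grab(returncode: int, temp_warc: str, dest_warc: str, stats: dict) -> bool:
    """Decide what to do with a WARC after the grab process has exited.

    All names here are hypothetical. A non-zero exit status is treated as
    a crash: the counter is incremented and the WARC stays where it is.
    """
    if returncode != 0:
        stats["crashed"] += 1
        return False
    # Only a clean exit moves the WARC to its final location.
    shutil.move(temp_warc, dest_warc)
    stats["finished"] += 1
    return True
```

Leaving the file in place on failure keeps a possibly truncated WARC out of the output directory while still allowing manual inspection.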
2018-10-22 | behavior: Unload script only if the handle is valid | Lars-Dominik Braun | 1 | -2/+4
For some reason with Google Chrome 70 this is not the case any more.
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 3 | -16/+119
Using websockets, vue and bulma.
2018-10-14 | irc: Graceful bot shutdown | Lars-Dominik Braun | 3 | -16/+110
Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke).
2018-10-11 | recursive: Gracefully shut down on SIGINT/TERM | Lars-Dominik Braun | 2 | -4/+18
2018-10-10 | Add timezone to logger dates | Lars-Dominik Braun | 1 | -1/+3
UTC everywhere. Make that clear.
2018-10-03 | controller: Depth limit does not work with i>1 | Lars-Dominik Braun | 1 | -1/+3
No easy way to fix this, so just limit to [0, 1] for now.
2018-10-03 | irc: Fix mode parsing | Lars-Dominik Braun | 2 | -7/+37
Ignore unsupported modes, add tests.
2018-10-02 | irc: Refactoring/beautification | Lars-Dominik Braun | 2 | -101/+266
Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators.
2018-09-29 | Add documentation | Lars-Dominik Braun | 2 | -3/+9
For -recursive and -irc
2018-09-29 | irc: Limit number of processes spawned | Lars-Dominik Braun | 2 | -21/+25
2018-09-29 | Add simple IRC bot | Lars-Dominik Braun | 2 | -0/+273
chromebot is back! Dropping sopel, because it does not work well with asyncio.
2018-09-25 | Prevent recursing into arbitrary schemes | Lars-Dominik Braun | 1 | -1/+9
HTTP(S) only.
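A scheme filter of this kind can be sketched with the standard library. `ALLOWED_SCHEMES` and `is_crawlable` are hypothetical names used for illustration; the actual check in the commit may differ.

```python
from urllib.parse import urlsplit

# Assumption: mirrors the "HTTP(S) only" rule from the commit message.
ALLOWED_SCHEMES = {"http", "https"}


def is_crawlable(url: str) -> bool:
    """Return True if the URL uses a scheme the crawler may recurse into."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES


links = [
    "https://example.com/a",
    "mailto:me@example.com",
    "javascript:void(0)",
    "HTTP://example.com/b",
]
# Keep only links that are safe to hand to the recursive controller.
crawlable = [u for u in links if is_crawlable(u)]
```

An allowlist is used rather than a blocklist, so unknown schemes extracted from a page are rejected by default.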
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 2 | -5/+17
❤️ asyncio.
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 2 | -1/+169
Simple and sequential.
2018-09-25 | Immediately flush logger | Lars-Dominik Braun | 1 | -0/+2
Consumers can read the latest gossip faster now.
2018-09-25 | Log extracted links | Lars-Dominik Braun | 2 | -2/+25
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 3 | -317/+23
Gonna rewrite that properly.
2018-08-17 | behavior: Load more comments from Facebook | Lars-Dominik Braun | 1 | -0/+4
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 2 | -15/+14
Fixes test failures. Very fragile code unfortunately.
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 3 | -5/+50
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here.
2018-08-04 | Reference warcinfo record in every other record | Lars-Dominik Braun | 1 | -18/+30
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 3 | -8/+65
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC.
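For illustration, a JSON warcinfo payload along these lines might be assembled as follows. The field names, the choice of SHA-512 and the function names are assumptions made for this sketch; the commit message does not specify the actual layout.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha512(path: str) -> str:
    """Hex digest of a file's contents (hash choice is an assumption)."""
    with open(path, "rb") as fd:
        return hashlib.sha512(fd.read()).hexdigest()


def build_warcinfo_payload(packages, files) -> bytes:
    """Serialize environment details as a JSON payload (hypothetical layout)."""
    deps = []
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None  # package is not installed in this environment
        deps.append({"name": name, "version": version})
    payload = {
        "python": {
            "implementation": platform.python_implementation(),
            "version": platform.python_version(),
        },
        "dependencies": deps,
        "files": [{"path": p, "sha512": file_sha512(p)} for p in files],
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")
```

Recording versions and file hashes together is what lets a later reader reconstruct the exact environment that produced the WARC.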
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 9 | -76/+337
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration.
2018-06-25 | browser: Fix testcase race condition | Lars-Dominik Braun | 1 | -0/+4
2018-06-25 | warc: Add metadata to truncated records | Lars-Dominik Braun | 1 | -22/+28
Specifically for a) redirects (body missing) b) bodies larger than size limit and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they?
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 6 | -37/+72
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 5 | -22/+10
2018-06-21 | browser: Add a few more tests | Lars-Dominik Braun | 1 | -3/+31
Increase coverage.
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 2 | -162/+177
It just seems a little nicer than plain old unittest.
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 5 | -1/+56
This is mainly a quality of life change.
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 6 | -509/+514
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
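The queue-based hand-off described in this commit can be illustrated with a small standalone sketch. It does not use pychrome; `BrowserCrashed`, `producer` and `consume` are hypothetical stand-ins that show how an exception raised in the browser thread reaches the main thread through the same queue as ordinary events.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error type standing in for a browser crash."""


def producer(events: queue.Queue) -> None:
    """Stand-in for the browser thread: emits two events, then crashes."""
    try:
        events.put(("request", "https://example.com/"))
        events.put(("response", "https://example.com/"))
        raise BrowserCrashed("target crashed")
    except Exception as exc:
        # Hand the exception to the main thread instead of losing it here.
        events.put(exc)


def consume(events: queue.Queue):
    """Main-thread loop: collect events until an exception arrives."""
    handled = []
    while True:
        item = events.get()
        if isinstance(item, Exception):
            return handled, item
        handled.append(item)


events = queue.Queue()
thread = threading.Thread(target=producer, args=(events,))
thread.start()
handled, crash = consume(events)
thread.join()
```

Because the consumer sees the crash as just another queue item, it can stop cleanly and finalize whatever was already processed instead of blocking forever.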
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir.
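A polling loop for this approach might look like the sketch below. The commit message does not name the file; this assumes it is `DevToolsActivePort`, which Chrome writes into its user data directory with the chosen port on the first line when started with `--remote-debugging-port=0`. The function name, timeout and interval are hypothetical.

```python
import os
import time


def wait_for_devtools_port(user_data_dir: str, timeout: float = 10.0,
                           interval: float = 0.1) -> int:
    """Poll the (assumed) DevToolsActivePort file until a port is readable."""
    path = os.path.join(user_data_dir, "DevToolsActivePort")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(path) as fd:
                first_line = fd.readline().strip()
            if first_line:
                # First line holds the port Chrome picked for itself.
                return int(first_line)
        except FileNotFoundError:
            pass  # Chrome has not written the file yet
        time.sleep(interval)
    raise TimeoutError("Chrome did not report a DevTools port in time")
```

Bounding the wait with a deadline means a browser that fails to start surfaces as an error rather than a hang.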
2018-06-03 | behavior: Wrap extract links script in anonymous namespace | Lars-Dominik Braun | 2 | -2/+5
Otherwise it may clash with symbols defined by the page.
2018-05-20 | behavior: Patreon: Load more comments/replies | Lars-Dominik Braun | 1 | -0/+4
2018-05-20 | behavior: Click Patreon’s “load more” button | Lars-Dominik Braun | 1 | -0/+6
2018-05-05 | Rename command line tools | Lars-Dominik Braun | 1 | -0/+97
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab.
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 2 | -4/+29
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 2 | -14/+15
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 2 | -2/+6
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 3 | -31/+91