Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -8/+8 | |
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration. | |||||
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -6/+13 | |
Previously, a browser crash stalled the entire grab, because events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary: (1) clear separation between producer (browser) and consumer (WARC, stats, …); (2) behavior scripts now yield events as well, instead of accessing the WARC writer; (3) WARC logging was removed (for now), and the WARC writer no longer requires serialization. | |||||
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 1 | -7/+13 | |
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+5 | |
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3 | |||||
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -23/+18 | |
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -2/+15 | |
Only local right now, not distributed. | |||||
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -2/+3 | |
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -114/+21 | |
In preparation for recursive crawls. | |||||
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -3/+3 | |
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -146/+28 | |
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well. | |||||
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -3/+7 | |
2017-12-20 | Increase hardcoded max timeouts | Lars-Dominik Braun | 1 | -2/+2 | |
We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing. | |||||
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -3/+3 | |
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well. | |||||
2017-12-19 | Select default behavior scripts by site URL | Lars-Dominik Braun | 1 | -1/+10 | |
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -145/+206 | |
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work. | |||||
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -44/+49 | |
Unless --browser argument is given. Uses sane settings and a temporary profile directory. | |||||
2017-12-06 | Add flags to disable screenshot/DOM snapshot | Lars-Dominik Braun | 1 | -5/+9 | |
2017-12-03 | Fix UTF-8 encoding name | Lars-Dominik Braun | 1 | -1/+1 | |
HTMLSerializer uses the exact string given in <meta charset=X>, so the encoding name must be spelled with a hyphen. | |||||
2017-12-03 | Add page screenshot to WARC | Lars-Dominik Braun | 1 | -0/+14 | |
2017-11-29 | argparse: Add metavar | Lars-Dominik Braun | 1 | -7/+7 | |
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -402/+50 | |
Reusable browser communication and WARC writing. | |||||
2017-11-26 | DOM snapshot: Generate valid HTML5 | Lars-Dominik Braun | 1 | -7/+12 | |
Some tags are “void”, i.e. they cannot have content and have no closing tag. | |||||
2017-11-25 | Ignore duplicate URLs when saving DOM snapshot | Lars-Dominik Braun | 1 | -1/+10 | |
2017-11-25 | Workaround broken device metrics reset | Lars-Dominik Braun | 1 | -1/+3 | |
Apparently neither width=0, height=0 nor clearDeviceMetricsOverride() does what it should, so manually reset to 1080p screen size. | |||||
2017-11-25 | Strip on* HTML attributes | Lars-Dominik Braun | 1 | -1/+27 | |
They can carry JavaScript as well and should not be allowed for DOM snapshots. | |||||
2017-11-25 | Rename --run-before-snapshot and document --on* options | Lars-Dominik Braun | 1 | -3/+3 | |
2017-11-24 | DOM snapshot: Save frames/subdocuments as well | Lars-Dominik Braun | 1 | -13/+36 | |
Request all subdocuments with pierce=True, split the result and save each document. Playback with pywb works, because timestamps of the snapshots are close to each other. | |||||
2017-11-24 | Reset device metrics | Lars-Dominik Braun | 1 | -2/+5 | |
2017-11-24 | Save onsnapshot script to WARC | Lars-Dominik Braun | 1 | -4/+8 | |
2017-11-22 | Make <canvas> static before DOM snapshot | Lars-Dominik Braun | 1 | -8/+13 | |
Use --run-before-snapshot=canvas-snapshot.js. Replaces <canvas> with image snapshot. We could use .captureStream() as well. | |||||
2017-11-22 | Emulate different screen sizes | Lars-Dominik Braun | 1 | -0/+25 | |
Causes the browser to load CSS assets and <img> srcset, for example. | |||||
2017-11-22 | Add example fixups for Instagram | Lars-Dominik Braun | 1 | -3/+9 | |
2017-11-21 | Move base64 metadata into WARC header | Lars-Dominik Braun | 1 | -1/+1 | |
2017-11-21 | Graceful page load timeout | Lars-Dominik Braun | 1 | -9/+26 | |
Stop scrolling script, wait for remaining resources to load. | |||||
2017-11-20 | Add page created from DOM snapshot | Lars-Dominik Braun | 1 | -6/+101 | |
2017-11-17 | Initial import | Lars-Dominik Braun | 1 | -0/+320 | |
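
The 2018-06-20 entry describes moving browser events into a queue that the main thread drains, so that a failure in the browser thread is no longer lost. The following is a minimal sketch of that general pattern using only the Python standard library. It is not the project's code: `BrowserCrashed`, `producer` and `consume` are hypothetical names, and the string `"crash"` merely simulates a browser failure.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception standing in for a browser crash."""


def producer(events, items):
    """Runs in a background thread, as a browser callback would."""
    try:
        for item in items:
            if item == "crash":
                raise BrowserCrashed("browser went away")
            events.put(("event", item))
        events.put(("done", None))
    except Exception as exc:
        # Hand the exception to the consumer instead of losing it in this thread.
        events.put(("error", exc))


def consume(items):
    """Process all queued events in the calling (main) thread."""
    events = queue.Queue()
    worker = threading.Thread(target=producer, args=(events, items), daemon=True)
    worker.start()
    handled = []
    while True:
        kind, payload = events.get()
        if kind == "event":
            handled.append(payload)
        elif kind == "error":
            worker.join()
            # Re-raise in the main thread so the caller can react to the crash.
            raise payload
        else:
            break
    worker.join()
    return handled
```

Because the consumer is the only code that touches downstream state (here the `handled` list), a single-threaded writer needs no locking of its own, which is consistent with the note that the WARC writer no longer requires serialization.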