See also Browser Automation
Goal: Have an all-in-one tool that can do the following:
- Crawl links
- Scrape (fetch data)
- Parse
- Extract structured data, like https://github.com/crwlrsoft/schema-org
- Check links
- Be part of an end-to-end (E2E) testing / smoke testing infrastructure
- Ideally, evolve later to capture live HTTP traffic and replay it in a test environment, in order to continuously test with real data
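To make the first few goals concrete (crawl links, parse, structured data extraction, link classification), here is a minimal sketch of the logic on a single page. It is only an illustration of the technique, not a tool proposal: it is written in Python with the standard library so it runs anywhere, while the actual tool would still need to be in PHP. The names `PageParser` and `analyse`, the sample HTML, and the host `tiki.example.com` are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects <a href> values and JSON-LD blocks from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            # schema.org data is commonly embedded as JSON-LD script blocks
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def analyse(html, base_url):
    """Return internal links, external links, and JSON-LD data for one page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in parser.links:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, etc.
        (internal if parts.netloc == base_host else external).append(absolute)
    return {"internal": internal, "external": external, "json_ld": parser.json_ld}


# Hypothetical sample page, used only to exercise the sketch
SAMPLE = """
<html><body>
<a href="/tiki-index.php">Home</a>
<a href="https://example.org/docs">Docs</a>
<a href="mailto:admin@example.com">Mail</a>
<script type="application/ld+json">{"@type": "Article", "headline": "Hello"}</script>
</body></html>
"""

if __name__ == "__main__":
    print(analyse(SAMPLE, "https://tiki.example.com/page"))
```

A link checker would then request each collected URL and record the HTTP status, and a crawler would feed the internal links back into the same function. Both steps need network access and authentication handling, so they are left out of this sketch.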
We want this for both:
- Internal Tiki URLs
  - As anonymous
  - As a specific user
- External
  - The open web
  - Corporate info behind the firewall
  - Password-protected sites (ideally)
Use cases
- Run on various Tiki instances, as in https://dev.tiki.org/Pre-dogfood-servers-for-Tiki-26-release-process
- Test new versions of PHP when Using GlitchTip as part of the Tiki development process. Ref: PHP8
- Support the risky last part of the migration to PSR-12 (we already did all the non-risky parts)
- Facilitate the creation of tests for complex Tiki instances, mostly PluginList and Trackers
- Fetch data from external sites so the upcoming Tiki AI Chatbot can refer to external sources
- Replace PluginCasperJS with something modern and supported
- Later, combine with Machine Learning
Requirements
- In PHP, so it can be shipped in any Tiki instance
Potential tools
PHP
Roach
Spatie Crawler
Symfony Panther
Symfony BrowserKit
https://github.com/symfony/browser-kit
crwlr
- https://github.com/crwlrsoft/crawler
- https://www.crwlr.software/blog/good-reasons-to-use-the-crwlr-library
spekulatius/phpscraper
Laravel Dusk
Codeception
https://github.com/Codeception/Codeception
Arachnid
PHP spider
Acquia BLT (Build and Launch Tool)
This is for Drupal and has a testing component. Listed here as a source of good ideas (e.g., which building blocks do they use?)
Not PHP
- https://github.com/cypress-io/cypress
- https://github.com/microsoft/playwright
- https://github.com/SeleniumHQ/selenium
URLs
- https://packagist.org/?query=Scraper
- https://packagist.org/?query=Crawler
- Examples from other communities: https://github.com/topics/testing?l=php