roach-php-bundle

Symfony bundle for Roach PHP.

Roach is a complete web scraping toolkit for PHP. It is heavily inspired by (read: a shameless clone of) the popular Scrapy package for Python.

The Symfony bundle mostly provides the necessary container bindings for the various services Roach uses, as well as making certain configuration options available via a config file. To learn about how to actually start using Roach itself, check out the rest of the documentation.

Installing the Symfony bundle

Add nelexa/roach-php-bundle to your composer.json file:

composer require nelexa/roach-php-bundle

Versions & Dependencies

Bundle version   roach-php/core version   Symfony version   PHP version(s)
0.3.0            0.3.0                    ^5.3 | ^6.0       >= 8.0
1.0.0            ^1.0.0                   ^6.0              >= 8.0
1.1.0            1.1.*                    ^6.0              >= 8.0

Register the bundle:

Register the bundle in config/bundles.php (Symfony Flex does this automatically):

return [
    //...
    \Nelexa\RoachPhpBundle\RoachPhpBundle::class => ['all' => true],
];

Available Commands

The Roach Symfony bundle registers a few console commands to make our development experience as pleasant as possible.

Run spider

php bin/console roach:run

When run without arguments, the command lists all available spiders:


 Choose a spider class:
  [0] App\Spider\GoogleSpider
  [1] App\Spider\FacebookSpider
  [2] App\Spider\TwitterSpider

Simply select the desired spider (▼ or ▲) or enter its number and press Enter.

You can also pass the spider's class name, or one of its aliases, as the first argument. For example, for the class App\Spider\GoogleSpider you can use any of the aliases GoogleSpider, google_spider or google.

php bin/console roach:run google
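
As a rough illustration of the alias convention above, the following shell sketch derives the three forms from a fully qualified class name. It mirrors only the naming pattern given in the example (short class name, snake_case, snake_case without the _spider suffix); the function name spider_aliases is hypothetical, and the bundle's actual resolution logic may differ.

```shell
#!/bin/sh
# Hypothetical helper: prints the alias forms described above for a spider class.
# This illustrates the naming convention only; it is not the bundle's own code.
spider_aliases() {
    class=$1
    # Strip the namespace: App\Spider\GoogleSpider -> GoogleSpider
    short=${class##*\\}
    # CamelCase -> snake_case: GoogleSpider -> google_spider
    snake=$(printf '%s\n' "$short" \
        | sed -E 's/([a-z0-9])([A-Z])/\1_\2/g' \
        | tr '[:upper:]' '[:lower:]')
    # Drop the trailing "_spider" suffix, if present: google_spider -> google
    base=${snake%_spider}
    printf '%s\n%s\n%s\n' "$short" "$snake" "$base"
}

spider_aliases 'App\Spider\GoogleSpider'
```

Each printed line is a candidate value for the first argument of roach:run under this assumed convention.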

Sometimes it is useful to override the number of concurrent requests and the request delay. To do this, pass the --concurrency and --delay options.

php bin/console roach:run google --concurrency 8 --delay 2

These options override the $concurrency and $requestDelay public properties of your spider.

Add the --output (-o) option and you can save the collected data to a JSON file.

php bin/console roach:run google --output 'path/to/data.json'
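
The options above can be combined in a single invocation. The wrapper below is a small sketch for doing so from a script, such as a scheduled job. The function name run_spider and the ROACH_CONSOLE environment variable are hypothetical conveniences introduced here; the roach:run command and its --concurrency, --delay and --output options are the ones described above.

```shell
#!/bin/sh
# Hypothetical wrapper: run one spider with explicit concurrency, delay and output file.
# ROACH_CONSOLE is an assumed override for the console entry point
# (default: php bin/console); it is not something the bundle itself reads.
run_spider() {
    spider=$1
    output=$2
    # Left unquoted on purpose so "php bin/console" splits into command and argument.
    ${ROACH_CONSOLE:-php bin/console} roach:run "$spider" \
        --concurrency "${3:-8}" \
        --delay "${4:-2}" \
        --output "$output"
}

# Example call (requires a Symfony project with the bundle installed):
# run_spider google 'path/to/data.json'
```

The third and fourth arguments are optional and fall back to the values used in the examples above (8 and 2).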

Starting the REPL

Roach ships with an interactive shell (often called a Read-Evaluate-Print Loop, or REPL for short), which makes prototyping our spiders a breeze. We can use the provided roach:shell command to launch a new REPL session.

php bin/console roach:shell "https://roach-php.dev/docs/introduction"

Generator classes

First install Symfony MakerBundle.

composer require --dev symfony/maker-bundle

Create a new roach spider class

php bin/console make:roach:spider

Create a new roach extension class

php bin/console make:roach:extension

Create a new roach item processor class

php bin/console make:roach:item:processor

Create a new roach downloader request middleware class

php bin/console make:roach:middleware:downloader:request

Create a new roach downloader response middleware class

php bin/console make:roach:middleware:downloader:response

Create a new roach spider item middleware class

php bin/console make:roach:middleware:spider:item

Create a new roach spider request middleware class

php bin/console make:roach:middleware:spider:request

Create a new roach spider response middleware class

php bin/console make:roach:middleware:spider:response

Changelog

Changes are documented on the releases page.

License

The MIT License (MIT). Please see LICENSE for more information.