Showing 31 changed files with 4,552 additions and 0 deletions.

128 changes: 128 additions & 0 deletions
website/versioned_docs/version-1.0.0/api/BasicCrawler.md

---
id: version-1.0.0-basic-crawler
title: BasicCrawler
original_id: basic-crawler
---

<a name="basiccrawler"></a>

Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of
URLs enabling recursive crawling of websites.

`BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a
crawler that already facilitates this functionality, please consider using [`CheerioCrawler`](../api/cheerio-crawler),
[`PuppeteerCrawler`](../api/puppeteer-crawler) or [`PlaywrightCrawler`](../api/playwright-crawler).

`BasicCrawler` invokes the user-provided [`BasicCrawlerOptions.handleRequestFunction`](../typedefs/basic-crawler-options#handlerequestfunction) for
each [`Request`](../api/request) object, which represents a single URL to crawl. The [`Request`](../api/request) objects are fed from the
[`RequestList`](../api/request-list) or the [`RequestQueue`](../api/request-queue) instances provided by the
[`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) or
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) constructor options, respectively.

If both [`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) and
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) options are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.

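To make the combined setup concrete, here is a minimal sketch of passing both sources; the list name, URLs and the enqueued link are placeholders, and it assumes the SDK's `Apify.openRequestList()` and `Apify.openRequestQueue()` helpers:

```javascript
// A sketch of combining a static start list with a dynamic queue.
const requestList = await Apify.openRequestList('start-urls', [
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
const requestQueue = await Apify.openRequestQueue();

const crawler = new Apify.BasicCrawler({
    requestList,
    requestQueue,
    handleRequestFunction: async ({ request }) => {
        // URLs discovered at runtime can be enqueued for later processing.
        await requestQueue.addRequest({ url: 'http://www.example.com/page-3' });
    },
});

await crawler.run();
```
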
The crawler finishes if there are no more [`Request`](../api/request) objects to crawl.

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `BasicCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `BasicCrawler` constructor.

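For illustration, a minimal sketch of tuning concurrency through the constructor; the numbers are arbitrary and `desiredConcurrencyRatio` stands in for any other [`AutoscaledPool`](../api/autoscaled-pool) option:

```javascript
// A sketch of concurrency tuning; values are arbitrary.
const crawler = new Apify.BasicCrawler({
    requestList,
    // Convenience shortcuts for the AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option goes here.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    handleRequestFunction: async ({ request }) => {
        /* ... */
    },
});
```
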
**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // 'request' contains an instance of the Request class
        // Here we simply fetch the HTML of the page and store it to a dataset
        const { body } = await Apify.utils.requestAsBrowser({ url: request.url });
        await Apify.pushData({
            url: request.url,
            html: body,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`BasicCrawler.run()`](../api/basic-crawler#run) function. You can use it to change the concurrency settings on the
fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).

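As a rough sketch of such runtime control (assuming the crawler was constructed as in the example above, and that the pool needs a moment to be created inside `run()`):

```javascript
// A sketch of controlling a running crawler through autoscaledPool.
const runPromise = crawler.run();

// The pool is created inside run(), so wait briefly before touching it.
await Apify.utils.sleep(5000);

// Change the concurrency ceiling on the fly.
crawler.autoscaledPool.maxConcurrency = 10;

// Pause the crawler, do some maintenance work, then let it continue.
await crawler.autoscaledPool.pause();
// ... maintenance ...
crawler.autoscaledPool.resume();

await runPromise;
```
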
---

<a name="basiccrawler"></a>

## `new BasicCrawler(options)`

**Parameters**:

- **`options`**: [`BasicCrawlerOptions`](../typedefs/basic-crawler-options) - All `BasicCrawler` parameters are passed via an options object.

---

<a name="run"></a>

## `basicCrawler.run()`

Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

**Returns**:

`Promise<void>`

---

158 changes: 158 additions & 0 deletions
website/versioned_docs/version-1.0.0/api/CheerioCrawler.md

---
id: version-1.0.0-cheerio-crawler
title: CheerioCrawler
original_id: cheerio-crawler
---

<a name="cheeriocrawler"></a>

Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML
parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites.

Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website
requires JavaScript to display the content, you might need to use [`PuppeteerCrawler`](../api/puppeteer-crawler) or
[`PlaywrightCrawler`](../api/playwright-crawler) instead, because they load the pages using a full-featured headless browser.

`CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio) and
then invokes the user-provided [`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction) to extract page
data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM.

The source URLs are represented using [`Request`](../api/request) objects that are fed from [`RequestList`](../api/request-list) or
[`RequestQueue`](../api/request-queue) instances provided by the
[`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) or
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) constructor options, respectively.

If both [`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) and
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.

The crawler finishes when there are no more [`Request`](../api/request) objects to crawl.

`CheerioCrawler` downloads the web pages using the [`utils.requestAsBrowser()`](../api/utils#requestasbrowser) utility function.

By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the
`Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the
[`CheerioCrawlerOptions.additionalMimeTypes`](../typedefs/cheerio-crawler-options#additionalmimetypes) constructor option. Beware that the parsing
behavior differs for HTML, XML, JSON and other types of content. For details, see
[`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction).

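As an illustration, a rough sketch of accepting JSON responses alongside HTML; the URLs are placeholders and the exact shape of `body` and `contentType` for non-HTML responses should be verified against [`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction):

```javascript
// A sketch of processing both HTML and JSON responses.
const crawler = new Apify.CheerioCrawler({
    requestList,
    additionalMimeTypes: ['application/json'],
    handlePageFunction: async ({ request, body, contentType, $ }) => {
        if (contentType.type === 'application/json') {
            // JSON is not parsed by Cheerio, so work with the raw body instead.
            await Apify.pushData({ url: request.url, data: JSON.parse(body) });
            return;
        }
        // HTML pages come with the Cheerio handle as usual.
        await Apify.pushData({ url: request.url, title: $('title').text() });
    },
});
```
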
New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `CheerioCrawler` constructor.

**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, response, body, contentType, $ }) => {
        const data = [];

        // Do some data extraction from the page with Cheerio.
        $('.some-collection').each((index, el) => {
            data.push({
                title: $(el)
                    .find('.some-title')
                    .text(),
            });
        });

        // Save the data to dataset.
        await Apify.pushData({
            url: request.url,
            html: body,
            data,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `proxyConfiguration`

**Type**: [`ProxyConfiguration`](../api/proxy-configuration)

A reference to the underlying [`ProxyConfiguration`](../api/proxy-configuration) class that manages the crawler's proxies. Only available if used by
the crawler.

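For context, a brief sketch of how a proxy configuration is typically created and passed to the crawler; the proxy group name is a placeholder and assumes access to Apify Proxy:

```javascript
// A sketch of routing crawler traffic through Apify Proxy.
const proxyConfiguration = await Apify.createProxyConfiguration({
    groups: ['GROUP_NAME'], // placeholder group name
});

const crawler = new Apify.CheerioCrawler({
    requestList,
    proxyConfiguration,
    handlePageFunction: async ({ request, $ }) => {
        await Apify.pushData({ url: request.url, title: $('title').text() });
    },
});
```
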
---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`CheerioCrawler.run()`](../api/cheerio-crawler#run) function. You can use it to change the concurrency settings on
the fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).

---

<a name="cheeriocrawler"></a>

## `new CheerioCrawler(options)`

**Parameters**:

- **`options`**: [`CheerioCrawlerOptions`](../typedefs/cheerio-crawler-options) - All `CheerioCrawler` parameters are passed via an options object.

---

<a name="use"></a>

## `cheerioCrawler.use(extension)`

**EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers.

**Parameters**:

- **`extension`**: `CrawlerExtension` - Crawler extension that overrides the crawler configuration.

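Purely as an illustration, attaching an extension looks like the sketch below; `MyUnblockerExtension` is a hypothetical `CrawlerExtension` implementation, not part of the SDK:

```javascript
// A sketch of attaching a hypothetical CrawlerExtension.
const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, $ }) => {
        /* ... */
    },
});

crawler.use(new MyUnblockerExtension()); // MyUnblockerExtension is hypothetical
await crawler.run();
```
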
---