Commit e8c70cf: Build docs v1.0.0

mnmkng committed Jan 25, 2021
1 parent d6798df commit e8c70cf

Showing 31 changed files with 4,552 additions and 0 deletions.
767 changes: 767 additions & 0 deletions website/versioned_docs/version-1.0.0/api/Apify.md

Large diffs are not rendered by default.

128 changes: 128 additions & 0 deletions website/versioned_docs/version-1.0.0/api/BasicCrawler.md
@@ -0,0 +1,128 @@
---
id: version-1.0.0-basic-crawler
title: BasicCrawler
original_id: basic-crawler
---

<a name="basiccrawler"></a>

Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of
URLs, enabling recursive crawling of websites.

`BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a
crawler that provides this functionality out of the box, please consider using [`CheerioCrawler`](../api/cheerio-crawler),
[`PuppeteerCrawler`](../api/puppeteer-crawler) or [`PlaywrightCrawler`](../api/playwright-crawler).

`BasicCrawler` invokes the user-provided [`BasicCrawlerOptions.handleRequestFunction`](../typedefs/basic-crawler-options#handlerequestfunction) for
each [`Request`](../api/request) object, which represents a single URL to crawl. The [`Request`](../api/request) objects are fed from the
[`RequestList`](../api/request-list) or the [`RequestQueue`](../api/request-queue) instances provided by the
[`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) or
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) constructor options, respectively.

If both [`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) and
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) options are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.
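
For illustration, a minimal sketch of combining the two sources (the URLs are placeholders): the list is drained first, and any URLs enqueued by the
handler are processed afterwards from the queue, with duplicates skipped based on the request's unique key.

```javascript
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/start' }],
});
await requestList.initialize();
const requestQueue = await Apify.openRequestQueue();

const crawler = new Apify.BasicCrawler({
    requestList,
    requestQueue,
    handleRequestFunction: async ({ request }) => {
        // Discover and enqueue follow-up URLs; the queue deduplicates them.
        await requestQueue.addRequest({ url: 'http://www.example.com/next-page' });
    },
});
```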

The crawler finishes if there are no more [`Request`](../api/request) objects to crawl.

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `BasicCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `BasicCrawler` constructor.
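
As a rough sketch, the concurrency-related options might be combined like this (the values are illustrative, and `desiredConcurrencyRatio` is shown
only as an assumed example of an [`AutoscaledPool`](../api/autoscaled-pool) option):

```javascript
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => { /* ... */ },
    // Convenience shortcuts for the AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option can be passed here.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
});
```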

**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // 'request' contains an instance of the Request class
        // Here we simply fetch the HTML of the page and store it to a dataset
        const { body } = await Apify.utils.requestAsBrowser(request);
        await Apify.pushData({
            url: request.url,
            html: body,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`BasicCrawler.run()`](../api/basic-crawler#run) function. You can use it to change the concurrency settings on the
fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).
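
For illustration, a minimal sketch of steering a running crawl through this property (the timer and values are arbitrary, and the pool is assumed to
be initialized shortly after `run()` is called):

```javascript
const runPromise = crawler.run();

// Give the crawler a moment to initialize, then steer it from outside the handler.
setTimeout(async () => {
    crawler.autoscaledPool.maxConcurrency = 10; // change concurrency on the fly
    await crawler.autoscaledPool.pause();       // temporarily stop processing new requests
    // ...later, either resume or end the run early:
    await crawler.autoscaledPool.abort();       // resolves the promise returned by run()
}, 60 * 1000);

await runPromise;
```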

---

<a name="basiccrawler"></a>

## `new BasicCrawler(options)`

**Parameters**:

- **`options`**: [`BasicCrawlerOptions`](../typedefs/basic-crawler-options) - All `BasicCrawler` parameters are passed via an options object.

---

<a name="run"></a>

## `basicCrawler.run()`

Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

**Returns**:

`Promise<void>`

---
158 changes: 158 additions & 0 deletions website/versioned_docs/version-1.0.0/api/CheerioCrawler.md
@@ -0,0 +1,158 @@
---
id: version-1.0.0-cheerio-crawler
title: CheerioCrawler
original_id: cheerio-crawler
---

<a name="cheeriocrawler"></a>

Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML
parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target
website requires JavaScript to render its content, you might need to use [`PuppeteerCrawler`](../api/puppeteer-crawler) or
[`PlaywrightCrawler`](../api/playwright-crawler) instead, because those crawlers load the pages in a full-featured headless browser.

`CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio) and
then invokes the user-provided [`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction) to extract page
data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM.

The source URLs are represented using [`Request`](../api/request) objects that are fed from [`RequestList`](../api/request-list) or
[`RequestQueue`](../api/request-queue) instances provided by the
[`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) or
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) constructor options, respectively.

If both [`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) and
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.

The crawler finishes when there are no more [`Request`](../api/request) objects to crawl.

`CheerioCrawler` downloads the web pages using the [`utils.requestAsBrowser()`](../api/utils#requestasbrowser) utility function.

By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the
`Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the
[`CheerioCrawlerOptions.additionalMimeTypes`](../typedefs/cheerio-crawler-options#additionalmimetypes) constructor option. Beware that the parsing
behavior differs for HTML, XML, JSON and other types of content. For details, see
[`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction).
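
For example, a sketch that also accepts JSON responses; treat the parsed `json` context property as an assumption based on the
`handlePageFunction` documentation:

```javascript
const crawler = new Apify.CheerioCrawler({
    requestList,
    // Accept JSON responses in addition to text/html and application/xhtml+xml.
    additionalMimeTypes: ['application/json'],
    handlePageFunction: async ({ request, body, json, $ }) => {
        // For JSON responses, `json` is assumed to hold the parsed object and `$` is not useful;
        // for HTML responses, `$` is the Cheerio function over the parsed DOM.
        await Apify.pushData({ url: request.url, data: json || null, html: json ? null : body });
    },
});
```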

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `CheerioCrawler` constructor.

**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, response, body, contentType, $ }) => {
        const data = [];

        // Do some data extraction from the page with Cheerio.
        $('.some-collection').each((index, el) => {
            data.push({
                title: $(el)
                    .find('.some-title')
                    .text(),
            });
        });

        // Save the data to dataset.
        await Apify.pushData({
            url: request.url,
            html: body,
            data,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `proxyConfiguration`

**Type**: [`ProxyConfiguration`](../api/proxy-configuration)

A reference to the underlying [`ProxyConfiguration`](../api/proxy-configuration) class that manages the crawler's proxies. Only available if used by
the crawler.
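
A minimal sketch of wiring a proxy configuration into the crawler (the proxy URL is a placeholder):

```javascript
const proxyConfiguration = await Apify.createProxyConfiguration({
    // Custom proxies; with no options, Apify Proxy is used instead.
    proxyUrls: ['http://my-proxy.example.com:8000'],
});

const crawler = new Apify.CheerioCrawler({
    requestList,
    proxyConfiguration,
    handlePageFunction: async ({ request, $ }) => {
        // Requests are rotated through the configured proxies.
        await Apify.pushData({ url: request.url, title: $('title').text() });
    },
});
```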

---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`CheerioCrawler.run()`](../api/cheerio-crawler#run) function. You can use it to change the concurrency settings on
the fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).

---

<a name="cheeriocrawler"></a>

## `new CheerioCrawler(options)`

**Parameters**:

- **`options`**: [`CheerioCrawlerOptions`](../typedefs/cheerio-crawler-options) - All `CheerioCrawler` parameters are passed via an options object.

---

<a name="use"></a>

## `cheerioCrawler.use(extension)`

**EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers.

**Parameters**:

- **`extension`**: `CrawlerExtension` - Crawler extension that overrides the crawler configuration.

---