Showing 31 changed files with 4,552 additions and 0 deletions.

128 changes: 128 additions & 0 deletions
website/versioned_docs/version-1.0.0/api/BasicCrawler.md

---
id: version-1.0.0-basic-crawler
title: BasicCrawler
original_id: basic-crawler
---

<a name="basiccrawler"></a>

Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of
URLs enabling recursive crawling of websites.

`BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a
crawler that already facilitates this functionality, please consider using [`CheerioCrawler`](../api/cheerio-crawler),
[`PuppeteerCrawler`](../api/puppeteer-crawler) or [`PlaywrightCrawler`](../api/playwright-crawler).

`BasicCrawler` invokes the user-provided [`BasicCrawlerOptions.handleRequestFunction`](../typedefs/basic-crawler-options#handlerequestfunction) for
each [`Request`](../api/request) object, which represents a single URL to crawl. The [`Request`](../api/request) objects are fed from the
[`RequestList`](../api/request-list) or the [`RequestQueue`](../api/request-queue) instances provided by the
[`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) or
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) constructor options, respectively.

If both [`BasicCrawlerOptions.requestList`](../typedefs/basic-crawler-options#requestlist) and
[`BasicCrawlerOptions.requestQueue`](../typedefs/basic-crawler-options#requestqueue) options are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.

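To make the combined setup concrete, here is a minimal sketch of passing both sources; the list name, URLs and the enqueued link are placeholders, and it assumes the SDK's `Apify.openRequestList()` and `Apify.openRequestQueue()` helpers:

```javascript
// A sketch of combining a static start list with a dynamic queue.
const requestList = await Apify.openRequestList('start-urls', [
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
const requestQueue = await Apify.openRequestQueue();

const crawler = new Apify.BasicCrawler({
    requestList,
    requestQueue,
    handleRequestFunction: async ({ request }) => {
        // URLs discovered at runtime can be enqueued for later processing.
        await requestQueue.addRequest({ url: 'http://www.example.com/page-3' });
    },
});

await crawler.run();
```
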
The crawler finishes if there are no more [`Request`](../api/request) objects to crawl.

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `BasicCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `BasicCrawler` constructor.

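For illustration, a minimal sketch of tuning concurrency through the constructor; the numbers are arbitrary and `desiredConcurrencyRatio` stands in for any other [`AutoscaledPool`](../api/autoscaled-pool) option:

```javascript
// A sketch of concurrency tuning; values are arbitrary.
const crawler = new Apify.BasicCrawler({
    requestList,
    // Convenience shortcuts for the AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option goes here.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    handleRequestFunction: async ({ request }) => {
        /* ... */
    },
});
```
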
**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // 'request' contains an instance of the Request class
        // Here we simply fetch the HTML of the page and store it to a dataset
        const { body } = await Apify.utils.requestAsBrowser({ url: request.url });
        await Apify.pushData({
            url: request.url,
            html: body,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`BasicCrawler.run()`](../api/basic-crawler#run) function. You can use it to change the concurrency settings on the
fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).

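As a rough sketch of such runtime control (assuming the crawler was constructed as in the example above, and that the pool needs a moment to be created inside `run()`):

```javascript
// A sketch of controlling a running crawler through autoscaledPool.
const runPromise = crawler.run();

// The pool is created inside run(), so wait briefly before touching it.
await Apify.utils.sleep(5000);

// Change the concurrency ceiling on the fly.
crawler.autoscaledPool.maxConcurrency = 10;

// Pause the crawler, do some maintenance work, then let it continue.
await crawler.autoscaledPool.pause();
// ... maintenance ...
crawler.autoscaledPool.resume();

await runPromise;
```
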
---

<a name="basiccrawler"></a>

## `new BasicCrawler(options)`

**Parameters**:

- **`options`**: [`BasicCrawlerOptions`](../typedefs/basic-crawler-options) - All `BasicCrawler` parameters are passed via an options object.

---

<a name="run"></a>

## `basicCrawler.run()`

Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

**Returns**:

`Promise<void>`

---

158 changes: 158 additions & 0 deletions
website/versioned_docs/version-1.0.0/api/CheerioCrawler.md

---
id: version-1.0.0-cheerio-crawler
title: CheerioCrawler
original_id: cheerio-crawler
---

<a name="cheeriocrawler"></a>

Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML
parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites.

Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website
requires JavaScript to display the content, you might need to use [`PuppeteerCrawler`](../api/puppeteer-crawler) or
[`PlaywrightCrawler`](../api/playwright-crawler) instead, because they load the pages using a full-featured headless browser.

`CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio) and
then invokes the user-provided [`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction) to extract page
data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM.

The source URLs are represented using [`Request`](../api/request) objects that are fed from [`RequestList`](../api/request-list) or
[`RequestQueue`](../api/request-queue) instances provided by the
[`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) or
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) constructor options, respectively.

If both [`CheerioCrawlerOptions.requestList`](../typedefs/cheerio-crawler-options#requestlist) and
[`CheerioCrawlerOptions.requestQueue`](../typedefs/cheerio-crawler-options#requestqueue) are used, the instance first processes URLs from the
[`RequestList`](../api/request-list) and automatically enqueues all of them to [`RequestQueue`](../api/request-queue) before it starts their
processing. This ensures that a single URL is not crawled multiple times.

The crawler finishes when there are no more [`Request`](../api/request) objects to crawl.

`CheerioCrawler` downloads the web pages using the [`utils.requestAsBrowser()`](../api/utils#requestasbrowser) utility function.

By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the
`Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the
[`CheerioCrawlerOptions.additionalMimeTypes`](../typedefs/cheerio-crawler-options#additionalmimetypes) constructor option. Beware that the parsing
behavior differs for HTML, XML, JSON and other types of content. For details, see
[`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction).

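As an illustration, a rough sketch of accepting JSON responses alongside HTML; the URLs are placeholders and the exact shape of `body` and `contentType` for non-HTML responses should be verified against [`CheerioCrawlerOptions.handlePageFunction`](../typedefs/cheerio-crawler-options#handlepagefunction):

```javascript
// A sketch of processing both HTML and JSON responses.
const crawler = new Apify.CheerioCrawler({
    requestList,
    additionalMimeTypes: ['application/json'],
    handlePageFunction: async ({ request, body, contentType, $ }) => {
        if (contentType.type === 'application/json') {
            // JSON is not parsed by Cheerio, so work with the raw body instead.
            await Apify.pushData({ url: request.url, data: JSON.parse(body) });
            return;
        }
        // HTML pages come with the Cheerio handle as usual.
        await Apify.pushData({ url: request.url, title: $('title').text() });
    },
});
```
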
New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
[`AutoscaledPool`](../api/autoscaled-pool) class. All [`AutoscaledPool`](../api/autoscaled-pool) configuration options can be passed to the
`autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
[`AutoscaledPool`](../api/autoscaled-pool) options are available directly in the `CheerioCrawler` constructor.

**Example usage:**

```javascript
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, response, body, contentType, $ }) => {
        const data = [];

        // Do some data extraction from the page with Cheerio.
        $('.some-collection').each((index, el) => {
            data.push({
                title: $(el)
                    .find('.some-title')
                    .text(),
            });
        });

        // Save the data to dataset.
        await Apify.pushData({
            url: request.url,
            html: body,
            data,
        });
    },
});

await crawler.run();
```

## Properties

### `stats`

**Type**: [`Statistics`](../api/statistics)

Contains statistics about the current run.

---

### `requestList`

**Type**: [`RequestList`](../api/request-list)

A reference to the underlying [`RequestList`](../api/request-list) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `requestQueue`

**Type**: [`RequestQueue`](../api/request-queue)

A reference to the underlying [`RequestQueue`](../api/request-queue) class that manages the crawler's [`Request`](../api/request)s. Only available if
used by the crawler.

---

### `sessionPool`

**Type**: [`SessionPool`](../api/session-pool)

A reference to the underlying [`SessionPool`](../api/session-pool) class that manages the crawler's [`Session`](../api/session)s. Only available if
used by the crawler.

---

### `proxyConfiguration`

**Type**: [`ProxyConfiguration`](../api/proxy-configuration)

A reference to the underlying [`ProxyConfiguration`](../api/proxy-configuration) class that manages the crawler's proxies. Only available if used by
the crawler.

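For context, a brief sketch of how a proxy configuration is typically created and passed to the crawler; the proxy group name is a placeholder and assumes access to Apify Proxy:

```javascript
// A sketch of routing crawler traffic through Apify Proxy.
const proxyConfiguration = await Apify.createProxyConfiguration({
    groups: ['GROUP_NAME'], // placeholder group name
});

const crawler = new Apify.CheerioCrawler({
    requestList,
    proxyConfiguration,
    handlePageFunction: async ({ request, $ }) => {
        await Apify.pushData({ url: request.url, title: $('title').text() });
    },
});
```
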
---

### `autoscaledPool`

**Type**: [`AutoscaledPool`](../api/autoscaled-pool)

A reference to the underlying [`AutoscaledPool`](../api/autoscaled-pool) class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the [`CheerioCrawler.run()`](../api/cheerio-crawler#run) function. You can use it to change the concurrency settings on
the fly, to pause the crawler by calling [`AutoscaledPool.pause()`](../api/autoscaled-pool#pause) or to abort it by calling
[`AutoscaledPool.abort()`](../api/autoscaled-pool#abort).

---

<a name="cheeriocrawler"></a>

## `new CheerioCrawler(options)`

**Parameters**:

- **`options`**: [`CheerioCrawlerOptions`](../typedefs/cheerio-crawler-options) - All `CheerioCrawler` parameters are passed via an options object.

---

<a name="use"></a>

## `cheerioCrawler.use(extension)`

**EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers.

**Parameters**:

- **`extension`**: `CrawlerExtension` - Crawler extension that overrides the crawler configuration.

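Purely as an illustration, attaching an extension looks like the sketch below; `MyUnblockerExtension` is a hypothetical `CrawlerExtension` implementation, not part of the SDK:

```javascript
// A sketch of attaching a hypothetical CrawlerExtension.
const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, $ }) => {
        /* ... */
    },
});

crawler.use(new MyUnblockerExtension()); // MyUnblockerExtension is hypothetical
await crawler.run();
```
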
---