
Releases: apify/crawlee

v2.2.2

14 Feb 14:17

What's Changed

  • fix: ensure request.headers is set by @B4nan in #1281
  • fix: cookies setting in preNavigationHooks by @AndreyBykov in #1283
  • refactor: improve logging for fetching next request and timeouts by @B4nan in #1292

This release should help with the infamous zero-concurrency bug. The problem is probably still present, but it should be much less common. The main difference is that we now use shorter timeouts for API calls from RequestQueue.

Full Changelog: v2.2.1...v2.2.2

v2.2.1

03 Jan 15:01

What's Changed

  • fix: ignore requests that are no longer in progress by @B4nan in #1258
  • fix: do not use tryCancel() from inside sync callback by @B4nan in #1265
  • fix: revert to puppeteer 10.x by @B4nan in #1276
  • fix: wait when body is not available in infiniteScroll() from Puppeteer utils by @B4nan in #1277
  • fix: expose logger classes on the utils.log instance by @B4nan in #1278

Full Changelog: v2.2.0...v2.2.1

v2.2.0

17 Dec 13:26

Proxy per page

Up until now, browser crawlers used the same session (and therefore the same proxy) for
all requests from a single browser. Now each session gets its own proxy, which means
that with incognito pages, each page will get a new proxy, aligning the behaviour with
CheerioCrawler.

This feature is not enabled by default. To use it, enable the useIncognitoPages
flag in launchContext:

new Apify.PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true,
    },
    // ...
})

Note that there is currently a performance overhead when using useIncognitoPages.
Use this flag at your own discretion.

We are planning to enable this feature by default in SDK v3.0.

Abortable timeouts

Previously, when a page function timed out, the task kept running anyway. This could lead to requests being processed multiple times. v2.2 introduces abortable timeouts that cancel the task as early as possible.
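The idea behind an abortable timeout can be sketched in plain Node.js. This is an illustrative example, not the SDK's internal code: `addTimeout` is a hypothetical helper that races the task against a timer and signals the task to abort via an AbortSignal when the timer fires.

```javascript
// Sketch of an abortable timeout: the task receives an AbortSignal and
// should stop its work as soon as the timeout fires, instead of running on.
async function addTimeout(taskFn, timeoutMillis) {
    const controller = new AbortController();
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => {
            controller.abort(); // tell the task to stop as early as possible
            reject(new Error(`Task timed out after ${timeoutMillis}ms`));
        }, timeoutMillis);
    });
    try {
        return await Promise.race([taskFn(controller.signal), timeout]);
    } finally {
        clearTimeout(timer);
    }
}
```

A well-behaved task checks (or listens on) the signal and bails out when it is aborted, which is what prevents the duplicate processing described above.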

Mitigation of zero concurrency issue

Several new timeouts were added to the task function, which should help mitigate the zero concurrency bug. Specifically, fetching the next request and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300s (5 minutes), or handleRequestTimeoutSecs if that value is higher.
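The retry-with-timeout pattern described above can be sketched as follows. This is a minimal illustration, not the SDK's actual implementation; `withTimeoutAndRetries` and its parameters are hypothetical names.

```javascript
// Run an operation with a timeout, retrying up to maxRetries additional
// times before giving up (so maxRetries = 3 means up to 4 attempts total).
// In the scenario above, timeoutSecs would be Math.max(300, handleRequestTimeoutSecs).
async function withTimeoutAndRetries(operation, timeoutSecs, maxRetries = 3) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        let timer;
        try {
            return await Promise.race([
                operation(),
                new Promise((_, reject) => {
                    timer = setTimeout(
                        () => reject(new Error('Operation timed out')),
                        timeoutSecs * 1000,
                    );
                }),
            ]);
        } catch (err) {
            lastError = err;
        } finally {
            clearTimeout(timer);
        }
    }
    throw lastError;
}
```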

Full list of changes

  • fix RequestError: URI malformed in cheerio crawler (#1205)
  • only provide Cookie header if cookies are present (#1218)
  • handle extra cases for diffCookie (#1217)
  • implement proxy per page in browser crawlers (#1228)
  • add fingerprinting support (#1243)
  • implement abortable timeouts (#1245)
  • add timeouts with retries to runTaskFunction() (#1250)
  • automatically convert google spreadsheet URLs to CSV exports (#1255)
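The spreadsheet URL conversion from the last item can be illustrated with a small helper. This is a hypothetical sketch of the idea, not the SDK's code, and the `gviz/tq?tqx=out:csv` endpoint is one commonly used CSV export form; the exact URL the SDK produces may differ.

```javascript
// Rewrite a Google Sheets share URL to a CSV export URL; leave other URLs untouched.
function toCsvExportUrl(url) {
    const match = url.match(/docs\.google\.com\/spreadsheets\/d\/([^/]+)/);
    if (!match) return url; // not a spreadsheet URL
    return `https://docs.google.com/spreadsheets/d/${match[1]}/gviz/tq?tqx=out:csv`;
}
```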

v2.1.0

07 Oct 12:42

What's Changed

  • feat: warn if apify proxy is used in proxyUrls by @szmarczak in #1173
  • feat: use puppeteer emulating scrolls instead of window.scrollBy by @vladfrangu in #1170
  • feat: support channel and user links in YouTube regex by @vladfrangu in #1178
  • feat: add support for cgroups V2 to utils.getMemoryInfo by @mnmkng in #1177
  • feat: add purgeLocalStorage method by @vladfrangu in #1187
  • feat: allow passing forceCloud down to the KV store by @vladfrangu in #1186
  • fix: automatically convert gdoc share urls to csv download ones in request list by @B4nan in #1174
  • fix YOUTUBE_REGEX_STRING being too greedy by @B4nan in #1171
  • fix: incorrect offset in fixUrl function by @szmarczak in #1184
  • fix: catch errors inside request interceptors by @B4nan in #1192
  • fix: use encodeURIComponent instead of encodeURI by @szmarczak in #1198
  • fix: merge cookies provided by user with session cookies by @B4nan in #1201

Full Changelog: v2.0.7...v2.1.0

v2.0.7

08 Sep 07:57
  • Fix casting of int/bool environment variables (e.g. APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE), closes #956
  • Fix incognito pages and user data dir (#1145)
  • Add @ts-ignore comments to imports of optional peer dependencies (#1152)
  • Use config instance in sdk.openSessionPool() (#1154)
  • Add a breaking callback to infiniteScroll (#1140)

v2.0.6

27 Aug 11:50
  • Fix deprecation messages logged from ProxyConfiguration and CheerioCrawler.
  • Update got-scraping to receive multiple improvements.

v2.0.5

24 Aug 16:58


  • Fix error handling in puppeteer crawler

v2.0.4

23 Aug 14:41

This update introduces persistent browser headers when using got-scraping.

v2.0.3

20 Aug 13:48
  • chore: add aborting event to events docs [skip ci] c89f532
  • fix: refactor requestAsBrowser to Got 12 (#1111) ef9a4ad
  • fix: limit handleRequestTimeoutMillis to max valid value (#1116) 5948958
  • fix: disable SSL validation on MITM proxies (#1117) 853c5cd
  • fix: bump got-scraping to 3.0.1 (#1121) b9e99b7

This release improves the stability of the SDK.

v2.0.2

12 Aug 08:39
  • Fix serialization issues in CheerioCrawler caused by parser conflicts in recent versions of cheerio.