Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for Cloudflare-Protected Hosts #3616

Open
makoronius opened this issue Jan 7, 2025 · 1 comment

Comments

@makoronius

makoronius commented Jan 7, 2025

Analysis and Attempt to Resolve Website Scraping Errors

Out of curiosity and professional interest, I attempted to scrape my company's entire support website (customer knowledge base) but encountered the following error:

RuntimeError: Failed to fetch 'https://xxxxx/': Error accessing https://xxxxx/: Forbidden (403) for https://xxxx/

I can provide the full URL to any developer interested in addressing this issue, as it pertains to my company's public customer support website hosted on Zendesk behind Cloudflare (as I suppose based on the headers). However, I prefer not to share it publicly here.

Response headers from curl (obfuscated a bit):

< x-zendesk-origin-server: app-server-xxxx-xxxx
< x-zendesk-processed-host-header: xxxx
< X-Zendesk-Zorg: yes
< Set-Cookie: __cf_bm=xxxx; path=/; expires=Mon, 06-Jan-25 23:14:31 GMT; domain=xxxx; HttpOnly; Secure; SameSite=None
< Report-To: {"endpoints":[{"url":"https:\/\/xxxx.cloudflare.com\/report\/v4?s=xxxx"}],"group":"xxxx","max_age":604800}
< NEL: {"success_fraction":0.01,"report_to":"xxxx","max_age":604800}
< Set-Cookie: __cfruid=xxxx; path=/; domain=xxxx; HttpOnly; Secure; SameSite=None
< Set-Cookie: _cfuvid=xxxx; path=/; domain=xxxx; HttpOnly; Secure; SameSite=None
< Server: cloudflare

Initial Observations

Interestingly, a simple curl request worked without issue:

curl -L -v https://xxxxx/ -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"

Even with basic headers, the request succeeded:

> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
> Accept: */*

Debugging and Modifications

I began troubleshooting /backend/onyx/connectors/web/connector.py to identify the root cause. My initial modification involved adding User-Agent and Accept headers to all requests.get calls:

headers = {
    # Same User-Agent and Accept that the successful curl request sent
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "*/*",
}
response = requests.get(url, headers=headers)

Unfortunately, this change did not resolve the issue. The function check_internet_connection still returned the same 403 Forbidden error when fetching the first URL.
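For reference, the same failure can be reproduced outside the connector with a minimal standalone script (just a sketch; SUPPORT_URL is a placeholder for the obfuscated URL) that sends the exact headers that worked for curl:

# Minimal standalone reproduction, independent of connector.py.
# SUPPORT_URL is a placeholder for the obfuscated Zendesk/Cloudflare URL.
import requests

SUPPORT_URL = "https://xxxxx/"

headers = {
    # Same User-Agent and Accept that the successful curl request sent
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
    "Accept": "*/*",
}

response = requests.get(SUPPORT_URL, headers=headers, timeout=30)
print(response.status_code)  # still 403 here, while curl gets 200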

Playwright Integration

Suspecting that requests alone was insufficient, I used ChatGPT to refactor /backend/onyx/connectors/web/connector.py to leverage Playwright. The idea was to rely on Playwright's real-browser mimicking to get past the restrictions.
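For context, the refactor follows the standard synchronous Playwright pattern. Here is a simplified sketch (SUPPORT_URL is again a placeholder, and the real connector does more than this single fetch):

# Simplified sketch of the Playwright-based fetch, not the full connector logic.
from playwright.sync_api import sync_playwright

SUPPORT_URL = "https://xxxxx/"  # placeholder for the obfuscated URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
        )
    )
    page = context.new_page()
    response = page.goto(SUPPORT_URL, wait_until="domcontentloaded")
    print(response.status)  # still 403 behind Cloudflare
    html = page.content()
    browser.close()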

However, the problem persisted, and the 403 Forbidden error continued to occur.

Intercepting and Analyzing Headers

I hypothesized that Playwright might still be sending headers that Cloudflare identified as "bot-like." To investigate further, I intercepted and logged all headers with the following code:

page.on("request", lambda request: print(f"Request headers for {request.url}: {request.headers}"))
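For completeness, the same hook can be paired with a response listener so the outgoing headers and Cloudflare's response status can be compared side by side (a small sketch, assuming the page object from the Playwright setup above):

# Log outgoing request headers and incoming response statuses (debugging only).
page.on(
    "request",
    lambda request: print(f"Request headers for {request.url}: {request.headers}"),
)
page.on(
    "response",
    lambda response: print(f"Response {response.status} for {response.url}"),
)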

This revealed some suspicious client-hint headers that Playwright adds by default:

{'sec-ch-ua': '"Not A(Brand";v="99", "HeadlessChrome";v="121", "Chromium";v="121"', 
 'sec-ch-ua-mobile': '?0', 
 'sec-ch-ua-platform': '"Linux"'}

These headers (note the "HeadlessChrome" brand in sec-ch-ua) likely triggered Cloudflare's detection mechanisms.

Header Interception and Customization

Using ChatGPT's suggestions, I implemented a method to intercept and customize headers for outgoing requests. This approach replaced the problematic headers while retaining critical ones like User-Agent. The modified code is as follows:

# Intercept every outgoing request and replace its headers before it is sent.
# Note: passing headers= to route.continue_ replaces the whole header set,
# so the default sec-ch-ua* client hints are dropped along with everything else.
page.route("**/*", lambda route, request: route.continue_(
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }
))

(Note: The Accept header, while seemingly unnecessary, was included based on ChatGPT's recommendations and left unchanged since the code works as intended.)
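One refinement worth considering during review (I have not verified it against this particular site): since route.continue_(headers=...) replaces the full header set, a merge-based variant could keep Playwright's defaults and only drop the sec-ch-ua* client hints while overriding the User-Agent. A minimal sketch:

# Less invasive variant (untested against this site): keep Playwright's
# default headers, strip only the sec-ch-ua* client hints, override the UA.
def _rewrite_headers(route, request):
    headers = {
        k: v for k, v in request.headers.items()
        if not k.lower().startswith("sec-ch-ua")
    }
    headers["user-agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    )
    route.continue_(headers=headers)

page.route("**/*", _rewrite_headers)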

Current State and Next Steps

Currently, I have a functional prototype inspired by ChatGPT, but the code is not well-polished. I would greatly appreciate it if someone could thoroughly review and refine this implementation before integrating it into production. The concept is effective, but the code requires significant improvements. Apologies for its current state.

Attachment:
connector.zip

About the installation:
Software: locally hosted on Ubuntu 24.04.1 LTS + Docker + Ollama + llama3.2 / mistral
Hardware: 12th Gen Intel(R) Core(TM) i9-12900, 64 GB RAM, NVIDIA GeForce RTX 3060 Ti 8 GB VRAM, Samsung SSD 980 PRO 1TB, WDC WD30PURZ-85A 2.7 TB

@makoronius makoronius changed the title Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for CloudFront-Protected Hosts Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for Cloudflare-Protected Hosts Jan 7, 2025
@makoronius
Author

Next day: it failed again. Investigating. Since it turned out that these are Zendesk articles, I will use the Zendesk connector, which is specifically intended for this purpose.
