Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for Cloudflare-Protected Hosts #3616

Open
makoronius opened this issue Jan 7, 2025 · 1 comment

Comments

@makoronius

makoronius commented Jan 7, 2025

Analysis and Attempt to Resolve Website Scraping Errors

Out of curiosity and professional interest, I attempted to scrape my company's entire support website (customer knowledge base) but encountered the following error:

RuntimeError: Failed to fetch 'https://xxxxx/': Error accessing https://xxxxx/: Forbidden (403) for https://xxxx/

I can provide the full URL to any developer interested in addressing this issue, as it pertains to my company's public customer support website hosted on Zendesk behind Cloudflare (as I suppose based on the headers). However, I prefer not to share it publicly here.

Response headers from curl (obfuscated a bit):

< x-zendesk-origin-server: app-server-xxxx-xxxx
< x-zendesk-processed-host-header: xxxx
< X-Zendesk-Zorg: yes
< Set-Cookie: __cf_bm=xxxx; path=/; expires=Mon, 06-Jan-25 23:14:31 GMT; domain=xxxx; HttpOnly; Secure; SameSite=None
< Report-To: {"endpoints":[{"url":"https:\/\/xxxx.cloudflare.com\/report\/v4?s=xxxx"}],"group":"xxxx","max_age":604800}
< NEL: {"success_fraction":0.01,"report_to":"xxxx","max_age":604800}
< Set-Cookie: __cfruid=xxxx; path=/; domain=xxxx; HttpOnly; Secure; SameSite=None
< Set-Cookie: _cfuvid=xxxx; path=/; domain=xxxx; HttpOnly; Secure; SameSite=None
< Server: cloudflare

Initial Observations

Interestingly, a simple curl request worked without issue:

curl -L -v https://xxxxx/ -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"

Even with basic headers, the request succeeded:

> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
> Accept: */*

Debugging and Modifications

I began troubleshooting /backend/onyx/connectors/web/connector.py to identify the root cause. My initial modification involved adding User-Agent and Accept headers to all requests.get calls:

headers = {
    # Same User-Agent and Accept that the successful curl request sent
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "*/*",
}
response = requests.get(url, headers=headers)

Unfortunately, this change did not resolve the issue. The function check_internet_connection still returned the same 403 Forbidden error when fetching the first URL.
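For reference, the same failure can be reproduced outside the connector with a minimal standalone script (just a sketch; SUPPORT_URL is a placeholder for the obfuscated URL) that sends the exact headers that worked for curl:

# Minimal standalone reproduction, independent of connector.py.
# SUPPORT_URL is a placeholder for the obfuscated Zendesk/Cloudflare URL.
import requests

SUPPORT_URL = "https://xxxxx/"

headers = {
    # Same User-Agent and Accept that the successful curl request sent
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
    "Accept": "*/*",
}

response = requests.get(SUPPORT_URL, headers=headers, timeout=30)
print(response.status_code)  # still 403 here, while curl gets 200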

Playwright Integration

Suspecting that requests alone was insufficient, I used ChatGPT to refactor /backend/onyx/connectors/web/connector.py to leverage Playwright. The idea was to rely on Playwright's real-browser mimicking to get past the restrictions.
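For context, the refactor follows the standard synchronous Playwright pattern. Here is a simplified sketch (SUPPORT_URL is again a placeholder, and the real connector does more than this single fetch):

# Simplified sketch of the Playwright-based fetch, not the full connector logic.
from playwright.sync_api import sync_playwright

SUPPORT_URL = "https://xxxxx/"  # placeholder for the obfuscated URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
        )
    )
    page = context.new_page()
    response = page.goto(SUPPORT_URL, wait_until="domcontentloaded")
    print(response.status)  # still 403 behind Cloudflare
    html = page.content()
    browser.close()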

However, the problem persisted, and the 403 Forbidden error continued to occur.

Intercepting and Analyzing Headers

I hypothesized that Playwright might still be sending headers that Cloudflare identified as "bot-like." To investigate further, I intercepted and logged all headers with the following code:

page.on("request", lambda request: print(f"Request headers for {request.url}: {request.headers}"))
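For completeness, the same hook can be paired with a response listener so the outgoing headers and Cloudflare's response status can be compared side by side (a small sketch, assuming the page object from the Playwright setup above):

# Log outgoing request headers and incoming response statuses (debugging only).
page.on(
    "request",
    lambda request: print(f"Request headers for {request.url}: {request.headers}"),
)
page.on(
    "response",
    lambda response: print(f"Response {response.status} for {response.url}"),
)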

This revealed some suspicious client-hint headers that Playwright adds by default:

{'sec-ch-ua': '"Not A(Brand";v="99", "HeadlessChrome";v="121", "Chromium";v="121"', 
 'sec-ch-ua-mobile': '?0', 
 'sec-ch-ua-platform': '"Linux"'}

These headers (note the "HeadlessChrome" brand in sec-ch-ua) likely triggered Cloudflare's detection mechanisms.

Header Interception and Customization

Using ChatGPT's suggestions, I implemented a method to intercept and customize headers for outgoing requests. This approach replaced the problematic headers while retaining critical ones like User-Agent. The modified code is as follows:

# Intercept every outgoing request and replace its headers before it is sent.
# Note: passing headers= to route.continue_ replaces the whole header set,
# so the default sec-ch-ua* client hints are dropped along with everything else.
page.route("**/*", lambda route, request: route.continue_(
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }
))

(Note: The Accept header, while seemingly unnecessary, was included based on ChatGPT's recommendations and left unchanged since the code works as intended.)
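One refinement worth considering during review (I have not verified it against this particular site): since route.continue_(headers=...) replaces the full header set, a merge-based variant could keep Playwright's defaults and only drop the sec-ch-ua* client hints while overriding the User-Agent. A minimal sketch:

# Less invasive variant (untested against this site): keep Playwright's
# default headers, strip only the sec-ch-ua* client hints, override the UA.
def _rewrite_headers(route, request):
    headers = {
        k: v for k, v in request.headers.items()
        if not k.lower().startswith("sec-ch-ua")
    }
    headers["user-agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    )
    route.continue_(headers=headers)

page.route("**/*", _rewrite_headers)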

Current State and Next Steps

Currently, I have a functional prototype inspired by ChatGPT, but the code is not well-polished. I would greatly appreciate it if someone could thoroughly review and refine this implementation before integrating it into production. The concept is effective, but the code requires significant improvements. Apologies for its current state.

Attachment:
connector.zip

About the installation:
Software: locally hosted on Ubuntu 24.04.1 LTS + Docker + Ollama + llama3.2 / mistral
Hardware: 12th Gen Intel(R) Core(TM) i9-12900, 64 GB RAM, NVIDIA GeForce RTX 3060 Ti 8 GB VRAM, Samsung SSD 980 PRO 1TB, WDC WD30PURZ-85A 2.7 TB

@makoronius makoronius changed the title Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for CloudFront-Protected Hosts Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for Cloudflare-Protected Hosts Jan 7, 2025
@makoronius
Author

Next day: it failed again. Investigating. Since it turned out that these are Zendesk articles, I will use the Zendesk connector, which is specifically intended for this purpose.
