Analysis and Attempt to Resolve Website Scraping Errors
Out of curiosity and professional interest, I attempted to scrape my company's entire support website (a customer knowledge base) but encountered the following error:
RuntimeError: Failed to fetch 'https://xxxxx/': Error accessing https://xxxxx/: Forbidden (403) for https://xxxx/
I can provide the full URL to any developer interested in addressing this issue; it is my company's public customer support website, hosted on ZenDesk behind Cloudflare (as I suppose, based on the response headers). However, I prefer not to share it publicly here.
Initial Observations
Interestingly, a simple curl request worked without issue:
curl -L -v https://xxxxx/ -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
Even with basic headers, the request succeeded:
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
> Accept: */*
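For comparison, the same request can be reproduced from Python. This is a minimal sketch (not code from the connector itself; it assumes the `requests` library is available, and the URL is supplied by the caller):

```python
# Headers copied from the curl invocation that succeeded against the site
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
    "Accept": "*/*",
}

def fetch_like_curl(url: str):
    """GET the URL with browser-like headers, mirroring `curl -L -A "..."`."""
    import requests  # imported lazily so the snippet loads even without requests installed

    # allow_redirects=True mirrors curl's -L flag
    return requests.get(url, headers=BROWSER_HEADERS, allow_redirects=True, timeout=30)
```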
Debugging and Modifications
I began troubleshooting /backend/onyx/connectors/web/connector.py to identify the root cause. My initial modification involved adding User-Agent and Accept headers to all requests.get calls:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "*/*",
}
response = requests.get(url, headers=headers)
Unfortunately, this change did not resolve the issue. The function check_internet_connection still returned the same 403 Forbidden error when fetching the first URL.
Playwright Integration
Suspecting that requests alone was insufficient to bypass the restrictions, I used ChatGPT to refactor /backend/onyx/connectors/web/connector.py to leverage Playwright. The idea was to use Playwright's browser emulation to get past the restrictions.
However, the problem persisted, and the 403 Forbidden error continued to occur.
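For reference, the Playwright-based fetch looked roughly like this. This is a sketch rather than the exact refactor, and it assumes Playwright and its browsers are installed:

```python
def fetch_with_playwright(url: str) -> int:
    """Load the page in headless Chromium and return the HTTP status code."""
    # Imported lazily so the module can still be imported without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        response = page.goto(url, wait_until="domcontentloaded")
        status = response.status if response else 0
        browser.close()
        return status
```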
Intercepting and Analyzing Headers
I hypothesized that Playwright might still be sending headers that Cloudflare identified as "bot-like." To investigate further, I intercepted and logged all headers with the following code:
page.on("request", lambda request: print(f"Request headers for {request.url}: {request.headers}"))
This revealed some suspicious headers that Playwright added by default to mimic a browser more effectively:
These headers likely triggered Cloudflare's detection mechanisms.
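As an illustration (the exact header list from my logs is not reproduced here), headless Chromium typically injects `sec-ch-ua*` client-hint headers; the set below is an assumption, not the list I observed. A filter along these lines can strip them while keeping everything else:

```python
# Assumed set of "bot-like" headers, based on Chromium's default client hints.
# Playwright reports header names in lower case, so match in lower case.
SUSPICIOUS_HEADERS = {"sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform"}

def sanitize_headers(headers: dict) -> dict:
    """Drop automation-revealing headers, keep the rest (e.g. User-Agent, Accept)."""
    return {k: v for k, v in headers.items() if k.lower() not in SUSPICIOUS_HEADERS}
```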
Header Interception and Customization
Using ChatGPT's suggestions, I implemented a method to intercept and customize headers for outgoing requests. This approach replaced the problematic headers while retaining critical ones like User-Agent. The modified code is as follows:
(Note: The Accept header, while seemingly unnecessary, was included based on ChatGPT's recommendations and left unchanged since the code works as intended.)
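The interception idea can be sketched with Playwright's request routing. This is a simplified stand-in for the attached code, and the dropped header names are assumptions:

```python
def install_header_override(page, user_agent: str) -> None:
    """Intercept every outgoing request and rewrite bot-like headers before it leaves.

    `page` is a Playwright sync-API Page; the header values below are assumptions.
    """
    def handle(route, request):
        headers = dict(request.headers)
        # Drop Chromium client hints that may look automated (assumed list)
        for name in ("sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform"):
            headers.pop(name, None)
        headers["user-agent"] = user_agent
        headers["accept"] = "*/*"
        route.continue_(headers=headers)

    page.route("**/*", handle)
```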
Current State and Next Steps
Currently, I have a functional ChatGPT-assisted prototype, but the code is not well polished. I would greatly appreciate it if someone could thoroughly review and refine this implementation before integrating it into production. The concept is effective, but the code needs significant improvement. Apologies for its current state.
About installation:
Software: locally hosted on Ubuntu 24.04.1 LTS + Docker + Ollama + llama3.2 / mistral
Hardware: 12th Gen Intel(R) Core(TM) i9-12900, 64 GB RAM, NVIDIA GeForce RTX 3060 Ti 8 GB VRAM, Samsung SSD 980 PRO 1 TB, WDC WD30PURZ-85A 2.7 TB
makoronius changed the title from "Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for CloudFront-Protected Hosts" to "Web Connector: Inadequate Mimics in requests.get and Ineffective Playwright Mimics for Cloudflare-Protected Hosts" on Jan 7, 2025
Next day: it failed again. Investigating. Since these turned out to be ZenDesk articles, I will use the ZenDesk connector, which is specifically intended for this purpose.
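For reference, ZenDesk's Help Center REST API exposes articles directly, which sidesteps the scraping problem entirely. A minimal sketch (the subdomain is a placeholder, and `requests` is assumed to be installed):

```python
def list_articles(subdomain: str, locale: str = "en-us"):
    """Yield all Help Center articles via Zendesk's public REST API, following pagination."""
    import requests  # imported lazily so the snippet loads even without requests installed

    url = f"https://{subdomain}.zendesk.com/api/v2/help_center/{locale}/articles.json"
    while url:
        data = requests.get(url, timeout=30).json()
        yield from data.get("articles", [])
        url = data.get("next_page")  # None on the last page
```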
Attachment: connector.zip