[FEATURE] anti-captcha support. #211
Comments
I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether. My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that. In any case, I'll look into it. Thanks for the recommendation! |
Invidious implements it very nicely, I actually didn't know it was an option until talking with the developers. You'd just add a line in your config for your API key. I'm willing to pay if it means my instance can be used without worry of getting blocked, plus it's super cheap. |
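A minimal sketch of what that kind of opt-in configuration could look like in a Python app such as Whoogle. The variable names `WHOOGLE_ANTICAPTCHA_KEY` and `WHOOGLE_ANTICAPTCHA_URL` are hypothetical (they are not existing Whoogle or Invidious settings); the only point is that the feature stays off unless the instance maintainer supplies a key.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CaptchaConfig:
    api_key: str
    api_url: str


def load_captcha_config(env: Mapping[str, str] = os.environ) -> Optional[CaptchaConfig]:
    """Return solver settings, or None when no API key is configured."""
    # Hypothetical variable names, chosen only for this illustration.
    key = env.get("WHOOGLE_ANTICAPTCHA_KEY", "").strip()
    if not key:
        return None  # no key supplied: the feature stays disabled
    # Allow pointing at an API-compatible clone instead of the default service.
    url = env.get("WHOOGLE_ANTICAPTCHA_URL", "https://api.anti-captcha.com")
    return CaptchaConfig(api_key=key, api_url=url.rstrip("/"))
```

With this shape, private instances that never set a key behave exactly as before, and only public instance maintainers who choose to pay for a solver need to configure anything.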
You can implement the anti-captcha API. It's neither universal nor a standard, but it's very common and easy to clone. A lot of projects provide an anti-captcha API clone, like https://capmonster.cloud or mine, which I plan to release publicly as soon as I find it stable. Implementing an anti-captcha solution in Whoogle is a great way to give public instance maintainers the tools to offer a reliable service that works even when Google is trying to rate limit the server. |
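For readers unfamiliar with that API style, here is a rough sketch of the create-then-poll flow. It assumes the `createTask` / `getTaskResult` endpoints and JSON field names from anti-captcha.com's public documentation as I understand them; clones may differ in the task `type` they accept, so that is a parameter. The `post` callable and the function name are hypothetical, and this is not Whoogle or Invidious code.

```python
import time
from typing import Any, Callable, Dict

# Any function that POSTs a JSON payload to a URL and returns the decoded JSON reply.
PostJson = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def solve_recaptcha(post: PostJson, api_url: str, client_key: str,
                    website_url: str, website_key: str,
                    task_type: str = "RecaptchaV2TaskProxyless",
                    poll_interval: float = 5.0, max_polls: int = 24) -> str:
    """Submit a captcha task, then poll until the service returns a token."""
    created = post(f"{api_url}/createTask", {
        "clientKey": client_key,
        "task": {
            "type": task_type,
            "websiteURL": website_url,
            "websiteKey": website_key,
        },
    })
    if created.get("errorId", 1) != 0:
        raise RuntimeError(f"createTask failed: {created.get('errorCode')}")
    task_id = created["taskId"]

    for _ in range(max_polls):
        time.sleep(poll_interval)  # solving takes a while; wait before each poll
        result = post(f"{api_url}/getTaskResult",
                      {"clientKey": client_key, "taskId": task_id})
        if result.get("errorId", 1) != 0:
            raise RuntimeError(f"getTaskResult failed: {result.get('errorCode')}")
        if result.get("status") == "ready":
            return result["solution"]["gRecaptchaResponse"]
    raise TimeoutError("captcha task did not finish in time")
```

In practice `post` could be something like `lambda url, payload: requests.post(url, json=payload, timeout=30).json()`. Keeping the transport injectable means the same function works against any API-compatible clone by changing only `api_url`.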
Yea I'm also getting blocked by google... |
Also having this issue on a DigitalOcean droplet. |
Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed. |
That's not possible: you can't load the same Google reCAPTCHA key from another domain that isn't whitelisted. |
Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime |
I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain. |
I use a VPN and Whoogle docker queries Google through this VPN. Once in a while, Whoogle will get a captcha and I have to reconnect the VPN so that I get a new public IP address. I know this solves the problem. What is interesting is that when Whoogle chokes on the Google captcha, if I go to Google directly (also through the VPN, so my direct connection comes from the same public IP as Whoogle docker), Google does NOT show me a captcha. It seems Google is somehow detecting that the Whoogle query is "weird" while my direct query to Google from my computer is not weird. |
hmm.. Can it be different user agents? |
Whoogle is not a browser, it doesn't interpret JavaScript so it can't "show" you the CAPTCHA.
No, Google rate limits based on the IP address and that's it. |
I honestly think all of this can be solved by using a better scraping method. I worked on scrapers to get past "gated" sites such as GSRPs (Google Search Result Pages), paywalls, etc., and the reliability of scraping (i.e. not getting cockblocked by a captcha because they detected that you were a "bot" - making a request not from their frontend) comes down to these factors:
Now, 1 should be solved with proxies, 2 should be solved with careful rate limiting implementation (esp. w/ multiple proxies) within whoogle, 3 can be solved with careful HTTPS handshake implementation within whoogle, and 4 can be implemented using something like playwright plus browser stealth libraries that plug into playwright OR (given that this application is written in python and won't be able to use those stealth libraries that are typically written in js) use playwright to control a "stealthy" web browser instance, such as https://github.com/ulixee/secret-agent. Note that 4 is extreme overkill for most people's use cases (most bot solutions grade you on a sliding scale, so as long as you get 1, 2, and 3, your score will be still high enough to not require this bullshit)!! I literally never get blocked on Google this way (not from Whoogle, my private application), no matter how many requests I send. Whoogle should adopt at the very least 2 and 3, and really direct people towards using a proxy (instead of trying to remove the captcha - that is a losing solution). Honestly, that should suffice to close this issue once and for all. |
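A minimal sketch of how points 1 and 2 could fit together: rotate across several proxies while guaranteeing a minimum gap between requests on any single one. The class name, method name, and interval are hypothetical and are not taken from Whoogle; the caller is expected to sleep for the returned wait before sending the request through the returned proxy.

```python
import time
from typing import Callable, Dict, List, Tuple


class ProxyPool:
    """Hand out proxies so each one is used at most once per `min_interval` seconds."""

    def __init__(self, proxies: List[str], min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._min_interval = min_interval
        self._clock = clock
        # Earliest time each proxy may be used again; all are free at the start.
        self._ready_at: Dict[str, float] = {p: float("-inf") for p in proxies}

    def next_proxy(self) -> Tuple[str, float]:
        """Return (proxy, seconds_to_wait) for the proxy that becomes free soonest."""
        now = self._clock()
        proxy = min(self._ready_at, key=self._ready_at.get)
        wait = max(0.0, self._ready_at[proxy] - now)
        # Reserve the next slot starting from when the request will actually go out.
        self._ready_at[proxy] = now + wait + self._min_interval
        return proxy, wait
```

Usage would look like `proxy, wait = pool.next_proxy(); time.sleep(wait)` before each upstream query. Adding more proxies raises total throughput without raising the request rate that any one exit IP presents.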
Actually, ignore all that bullshit I said above, @unixfox's method is 10000 times easier. We should do that. |
Still happening at the end of 2024 |
With stuff that can get blocked easily, anti-captcha support would be huge. Invidious has it implemented perfectly, and it allows public instances to be used without any major rate-limiting. Just an idea. Thanks!