[FEATURE] anti-captcha support. #211

Open
ghost opened this issue Feb 26, 2021 · 17 comments · May be fixed by #915
Labels
enhancement New feature or request

Comments

@ghost

ghost commented Feb 26, 2021

With stuff that can get blocked easily, anti-captcha support would be huge. Invidious has it implemented perfectly, and it allows public instances to be used without any major rate-limiting. Just an idea. Thanks!

@ghost ghost added the enhancement New feature or request label Feb 26, 2021
@benbusby
Owner

benbusby commented Mar 1, 2021

I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.

My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.

In any case, I'll look into it. Thanks for the recommendation!

@ghost
Author

ghost commented Mar 1, 2021

Invidious implements it very nicely, I actually didn't know it was an option until talking with the developers. You'd just add a line in your config for your API key. I'm willing to pay if it means my instance can be used without worry of getting blocked, plus it's super cheap.
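
A minimal sketch of what "a line in your config for your API key" could look like on the Whoogle side, assuming the key is supplied through an environment variable and the feature stays off when it is absent. The variable name `WHOOGLE_ANTI_CAPTCHA_KEY` and both helper functions are hypothetical and do not come from Whoogle or Invidious.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, chosen for illustration only.
API_KEY_VAR = "WHOOGLE_ANTI_CAPTCHA_KEY"


def load_anti_captcha_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured key, or None when no key has been provided."""
    key = env.get(API_KEY_VAR, "").strip()
    return key or None


def anti_captcha_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The feature is opt-in: it is only active when the maintainer sets a key."""
    return load_anti_captcha_key(env) is not None
```

Keeping the feature opt-in this way would leave private instances unaffected while letting public instance maintainers bring their own paid key.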

@unixfox

unixfox commented Apr 25, 2021

> I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.
>
> My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.
>
> In any case, I'll look into it. Thanks for the recommendation!

You can implement the anti-captcha API. It's neither universal nor a standard, but it's very common and easy to clone.

A lot of projects provide an anti-captcha API clone, like https://capmonster.cloud or mine, which I plan to release publicly as soon as I find it stable.

Implementing an anti-captcha solution in Whoogle is a great way to give public instance maintainers the tools to offer a reliable service that works even when Google is trying to rate limit the server.
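
For readers unfamiliar with that API, here is a rough sketch of the create-then-poll flow, assuming the `createTask` / `getTaskResult` endpoints, the `RecaptchaV2TaskProxyless` task type, and the JSON field names as I recall them from anti-captcha.com's public documentation; verify them against the service you use. The function names and the injectable `send` transport are hypothetical, and a compatible clone would only need a different `api_base`.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

Transport = Callable[[str, Dict], Dict]


def post_json(url: str, payload: Dict) -> Dict:
    """Send a JSON POST request and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def solve_recaptcha(
    client_key: str,
    website_url: str,
    website_key: str,
    api_base: str = "https://api.anti-captcha.com",
    send: Transport = post_json,
    poll_interval: float = 5.0,
    max_polls: int = 24,
) -> str:
    """Create a solving task, then poll until the service returns a token."""
    created = send(f"{api_base}/createTask", {
        "clientKey": client_key,
        "task": {
            "type": "RecaptchaV2TaskProxyless",  # assumed task type
            "websiteURL": website_url,
            "websiteKey": website_key,
        },
    })
    if created.get("errorId", 1) != 0:
        raise RuntimeError(f"createTask failed: {created}")
    task_id = created["taskId"]

    for _ in range(max_polls):
        result = send(f"{api_base}/getTaskResult",
                      {"clientKey": client_key, "taskId": task_id})
        if result.get("errorId", 1) != 0:
            raise RuntimeError(f"getTaskResult failed: {result}")
        if result.get("status") == "ready":
            return result["solution"]["gRecaptchaResponse"]
        time.sleep(poll_interval)  # still "processing"; wait and retry
    raise TimeoutError("captcha was not solved in time")
```

The returned token would then have to be submitted back to the page that issued the challenge; how that step works for Google's block page is not covered here.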

@Albonycal
Contributor

Yeah, I'm also getting blocked by Google...
This would be cool..
Any updates?
Thank you :D

@maxdesalle

Also having this issue on a DigitalOcean droplet.

@randomwalk3141592

Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.

@unixfox

unixfox commented Jul 16, 2021

> Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.

That's not possible: you can't load the same Google reCAPTCHA site key from another domain that is not whitelisted.

@Albonycal
Contributor

Hmm.. I found a workaround for this on Heroku instances: restarting the Heroku dyno assigns the instance a new IP address, so I never faced the captcha issue after this. And AFAIK this doesn't cause any downtime.

@randomwalk3141592

> > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
>
> That's not possible: you can't load the same Google reCAPTCHA site key from another domain that is not whitelisted.

I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.

@randomwalk3141592

> Hmm.. I found a workaround for this on Heroku instances: restarting the Heroku dyno assigns the instance a new IP address, so I never faced the captcha issue after this. And AFAIK this doesn't cause any downtime.

I use a VPN and Whoogle docker queries Google through this VPN. Once in a while, Whoogle will get a captcha and I would have to reconnect the VPN connection so that I get a new public IP address. I know this solves the problem.

What is interesting is that when Whoogle chokes on the Google captcha, if I go to Google directly (also through the VPN, so my direct connection would come from the same public IP as the Whoogle docker), Google does NOT show me a captcha.

It seems Google is somehow detecting that the Whoogle query is "weird" while my direct query to Google from my computer is not weird.

@Albonycal
Contributor

hmm.. Can it be different user agents?
or fingerprint thing?

@unixfox

unixfox commented Jul 17, 2021

> > > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
> >
> > That's not possible: you can't load the same Google reCAPTCHA site key from another domain that is not whitelisted.
>
> I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.

Whoogle is not a browser, it doesn't interpret JavaScript so it can't "show" you the CAPTCHA.

> hmm.. Can it be different user agents?
> or fingerprint thing?

No, Google rate-limits based on the IP address and that's it.
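
Even without rendering the CAPTCHA, a proxy like Whoogle could at least recognise that it has been blocked and react (show an error, switch IP, or hand off to a solver). A small sketch, assuming that a blocked request comes back with HTTP 429 and/or ends up on a `/sorry/` path on Google's side; that behaviour is my assumption rather than something stated in this thread, and `looks_rate_limited` is a hypothetical helper.

```python
from urllib.parse import urlparse


def looks_rate_limited(status_code: int, final_url: str) -> bool:
    """Heuristic check for a block page, based on status code and final URL."""
    path = urlparse(final_url).path
    # Assumption: blocks surface as 429 responses or a redirect to /sorry/...
    return status_code == 429 or path.startswith("/sorry/")
```

With the `requests` library, the two arguments would typically be `response.status_code` and `response.url` after redirects have been followed.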

@unixfox

This comment was marked as outdated.

@JaneJeon

I honestly think all of this can be solved by using a better scraping method. I've worked on scrapers to get past "gated" sites such as GSRPs (Google Search Result Pages), paywalls, etc., and the reliability of scraping (i.e. not getting cockblocked by a captcha because they detected that you were a "bot", making a request that didn't come from their frontend) comes down to these factors:

  1. IP (holy shit people, this is the number 1 thing that gets you blocked by Google. USE THEM PROXIES!!)
  2. Rate limiting (how many requests per second/minute/hour are you sending to Google, per IP?)
  3. SSL fingerprinting (browsers make HTTPS requests in a different manner than just calling requests.get() does)
  4. Browser Fingerprinting (this is the big boy shit, and you almost never have to worry about it, except client-side rendered stuff, which is most definitely not GSRPs)

Now, 1 should be solved with proxies, 2 should be solved with a careful rate limiting implementation (esp. w/ multiple proxies) within Whoogle, and 3 can be solved with a careful HTTPS handshake implementation within Whoogle. 4 can be handled using something like Playwright plus browser stealth libraries that plug into Playwright OR (given that this application is written in Python and won't be able to use those stealth libraries, which are typically written in JS) by using Playwright to control a "stealthy" web browser instance, such as https://github.com/ulixee/secret-agent. Note that 4 is extreme overkill for most people's use cases (most bot solutions grade you on a sliding scale, so as long as you get 1, 2, and 3 right, your score will still be high enough to not require this bullshit)!!

I literally never get blocked on Google this way (with my private application, not Whoogle), no matter how many requests I send. Whoogle should adopt at the very least 2 and 3, and really direct people towards using a proxy (instead of trying to remove the captcha, which is a losing approach). Honestly, that should suffice to close this issue once and for all.
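
Points 1 and 2 above can be combined in one small component. The sketch below is only an illustration of the idea, not code from Whoogle or from the private application mentioned: `ProxyRotator` is a hypothetical class that hands out proxies round-robin and enforces a minimum gap between two uses of the same proxy. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import itertools
import time
from typing import Callable, Dict, Iterable


class ProxyRotator:
    """Round-robin over proxies with a minimum interval between uses of each."""

    def __init__(
        self,
        proxies: Iterable[str],
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(self._proxies)
        self._min_interval = min_interval
        self._last_used: Dict[str, float] = {}
        self._clock = clock
        self._sleep = sleep

    def acquire(self) -> str:
        """Return the next proxy, waiting first if it was used too recently."""
        proxy = next(self._cycle)
        last = self._last_used.get(proxy)
        if last is not None:
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy
```

A caller would take `proxy = rotator.acquire()` before each upstream request and pass it along, for example as `requests.get(url, proxies={"https": proxy})`. With N proxies the overall request rate can be roughly N times the per-IP budget.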

@JaneJeon

Actually, ignore all that bullshit I said above, @unixfox's method is 10000 times easier. We should do that.

@unixfox

This comment has been minimized.

@yannduran

Still happening at the end of 2024


7 participants