Accessing and downloading (XML, PDF) files from handelsregister API #23
Comments
Hello, were you able to figure out a way? |
Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked. |
Thanks |
Are you interested in other solutions, or do you want to use this one specifically? |
I am open to hearing about alternatives, if there are any. |
I don't know what your project is about or what you are gonna do with the data, but Selenium looks like a very good option. Contact me if you want further information. |
As I am using a Colab notebook, I think an API would be best, right? I don't think a Colab notebook supports Selenium. The project basically collects data to analyze different companies. |
Yeah, I agree. I already tried Playwright as an alternative to Selenium, but it has the limitations that come with scraping the website. |
What limitations do you mean? |
IP blocks, for example. |
Even if you are using this API, your IP will be blocked if you send more than 60 requests per hour. To avoid getting blocked, you have to rotate your proxies. How many requests are you planning to send in your project? |
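For anyone who would rather stay under the limit than rotate proxies, here is a minimal sketch of a client-side throttle. It assumes the 60-requests-per-hour figure quoted above is accurate; the class name `HourlyThrottle` and its parameters are made up for illustration and are not part of this repository.

```python
import time
from collections import deque


class HourlyThrottle:
    """Allow at most `limit` calls within any rolling `window` of seconds."""

    def __init__(self, limit=60, window=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.limit = limit      # assumed limit, taken from the comment above
        self.window = window
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.stamps = deque()   # timestamps of recent calls

    def wait(self):
        """Block until one more request is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the rolling window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.stamps[0]))
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)
```

Calling `throttle.wait()` before each request keeps a single client under the assumed limit. With these numbers, 200 requests would take about three hours: three full batches of 60 an hour apart, then the remaining 20.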
I already tried rotating proxies, but in that case the website does not load in a reasonable time. I am planning to send more than 200 requests periodically. |
Hmm, you can modify the code in handelsregister.py and send POST requests for the form "ergebnisseform" in .../ergebnisse.xhtml. But you also have to deal with javax.faces.ViewState, which is read-only, so you can't control it. |
Oh OK. I think the API just makes it a bit easier and is there for a reason, but I guess it is not maintained. |
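A rough sketch of the ViewState part, in case it helps: JSF pages normally carry the token in a hidden `<input name="javax.faces.ViewState">`, so it can be read from the HTML of the previous response and echoed back in the next POST. The helper names below are hypothetical, and the other form fields are placeholders because the real field names of "ergebnisseform" are not given in this thread.

```python
from html.parser import HTMLParser


class ViewStateParser(HTMLParser):
    """Collect the value of the first javax.faces.ViewState hidden input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.view_state is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "javax.faces.ViewState":
            self.view_state = attributes.get("value")


def extract_view_state(html):
    """Return the ViewState token found in `html`, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_payload(html, fields):
    """Merge caller-supplied form fields with the page's ViewState token."""
    state = extract_view_state(html)
    if state is None:
        raise ValueError("no javax.faces.ViewState input found in page")
    payload = dict(fields)
    payload["javax.faces.ViewState"] = state
    return payload
```

The token cannot be chosen by the client, but it can be copied from each response into the following request, which is usually enough as long as the same session is kept.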
I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible from JavaScript. Do you have any way to get the cookie? |
No, sorry, I actually have no clue. You managed to get the documents via the API? As in, all the documents related to a search? If so, it would be great if you could share your method. |
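One possible explanation, offered as an assumption rather than something confirmed for this site: a cookie that JavaScript cannot read is usually marked `HttpOnly`. That flag only hides it from `document.cookie`; the cookie still arrives in the `Set-Cookie` response header, so a non-browser HTTP client can read it there. The sketch below parses such a header with the standard library. The header string in the usage note is invented, and the cookie name is simply the one mentioned above.

```python
from http.cookies import SimpleCookie


def read_session_cookie(set_cookie_header, name="SESSIONID"):
    """Return the value and HttpOnly flag of cookie `name`, or None if missing."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    morsel = jar.get(name)
    if morsel is None:
        return None
    return {
        "value": morsel.value,
        # True when the server set the HttpOnly flag on this cookie.
        "http_only": bool(morsel["httponly"]),
    }
```

For example, `read_session_cookie("SESSIONID=abc123; Path=/; HttpOnly")` reports the value together with `http_only` set to true. A client that keeps cookies across requests would then send it back automatically.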
I have built a complete solution for this problem and will see how and whether I can share the code and approach somehow. As mentioned above, it uses full browser rendering as opposed to this API, which I think is rather a dead end when it comes to actually downloading the documents. I do hope that the Handelsregister will at some point publish a proper API. |
Did you find a way to resolve this issue? I was looking into Playwright but so far have not managed to resolve it. |
For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution. |
Small project for about 15-45 documents for NLP analysis |
Hi @wirthual,
I am trying the following workflow.
Using the advanced search query at https://www.handelsregister.de/rp_web/erweitertesuche.xhtml, I do the following:
a) Choose Federal States - Berlin / Bavaria
b) Company / search words - "Hallo"
c) Choose the option "contain all words"
d) Type of register: "HRA"
e) Choose 100 hits per list.
This gives a search list as shown in the screenshot.
I then want to download the SI or DK content of all the hits.
I am wondering whether that is possible.
Screenshot :
Please let me know what the best way forward could be.