Accessing and downloading (xml, pdf) files from handelsregister API #23

Open
timtensor opened this issue Jul 27, 2023 · 20 comments

@timtensor

Hi @wirthual,
I am trying the following workflow:

  • Using the advanced search at https://www.handelsregister.de/rp_web/erweitertesuche.xhtml, I do the following:
    a) Choose the federal states: Berlin / Bavaria
    b) Company / search words: "Hallo"
    c) Choose the option "contain all words"
    d) Type of register: "HRA"
    e) Choose 100 hits per list.

  • This gives a search list as shown in the screenshot.

  • I then want to download the SI or DK content of all the hits.
    Is that possible?
    Screenshot: [image: search results list]

Please let me know what the best way forward would be.
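
For reference, the search above can be written down as a small data structure so that it is easy to reproduce later. This is only a sketch: the class and field names are hypothetical and are not the form field names of the Handelsregister website or the options of handelsregister.py.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for the advanced-search settings listed above."""
    federal_states: tuple  # e.g. ("Berlin", "Bavaria")
    keywords: str          # company / search words
    keyword_mode: str      # "all" stands for "contain all words"
    register_type: str     # e.g. "HRA"
    hits_per_list: int     # number of hits shown per list


# The search from this issue, written out explicitly.
request = SearchRequest(
    federal_states=("Berlin", "Bavaria"),
    keywords="Hallo",
    keyword_mode="all",
    register_type="HRA",
    hits_per_list=100,
)

print(asdict(request))
```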

@melihsunbul

Hello, were you able to figure out a way?

@timtensor
Author

Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

@melihsunbul

> Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

Thanks

@monkeygopro

Are you interested in other solutions, or do you want to use this one specifically?

@melihsunbul

> Are you interested in other solutions, or do you want to use this one specifically?

I am open to hearing about other alternatives if there are any.

@monkeygopro

I don't know what your project is about or what you are gonna do with the data, but Selenium looks like a very good option. Contact me if you want further information.

@timtensor
Author

As I am using a Colab notebook, I think an API would be best, right? I think the Colab notebook doesn't support Selenium.

The project basically collects data to analyze different companies.

@melihsunbul

Yeah, I agree. I already tried Playwright, an alternative to Selenium, but it has the limitations that come with scraping the website.

@monkeygopro

What limitations do you mean?

@melihsunbul

IP blocks, for example.

@monkeygopro

Even if you are using this API, your IP will be blocked if you send more than 60 requests per hour. To avoid getting blocked, you have to rotate your proxies. How many requests are you planning to send in your project?
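
A client-side throttle is one way to stay under such a limit from a single IP. The sketch below is a sliding-window limiter; the 60-per-hour figure is simply the number quoted in this comment, the class and method names are made up for illustration, and nothing here guarantees that the site will not block you anyway.

```python
import time
from collections import deque


class HourlyRateLimiter:
    """Allow at most `max_requests` calls in any sliding `window_seconds` window."""

    def __init__(self, max_requests=60, window_seconds=3600.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be sent now, else the wait time in seconds."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record_request(self):
        """Call this right after a request has actually been sent."""
        self._sent.append(self.clock())


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
limiter = HourlyRateLimiter(clock=lambda: fake_now[0])
for _ in range(60):
    limiter.record_request()
wait_when_full = limiter.seconds_until_allowed()

fake_now[0] += wait_when_full  # pretend we waited that long
wait_after_window = limiter.seconds_until_allowed()
print(wait_when_full, wait_after_window)
```

In a real script you would call `time.sleep(limiter.seconds_until_allowed())` before each request and `limiter.record_request()` after it.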

@melihsunbul

I already tried rotating proxies, but in that case the website does not load in a reasonable time. I am planning to send more than 200 requests periodically.
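
Without proxies, the arithmetic is simple: if the limit of 60 requests per hour mentioned earlier in this thread is accurate, 200 requests from one IP have to be spread over four hourly batches. A minimal sketch, with a hypothetical helper name:

```python
def plan_batches(total_requests, max_per_hour=60):
    """Split `total_requests` into hourly batches of at most `max_per_hour`."""
    if total_requests < 0 or max_per_hour <= 0:
        raise ValueError("total_requests must be >= 0 and max_per_hour > 0")
    full_batches, remainder = divmod(total_requests, max_per_hour)
    return [max_per_hour] * full_batches + ([remainder] if remainder else [])


batches = plan_batches(200, max_per_hour=60)
print(batches)       # [60, 60, 60, 20]
print(len(batches))  # 4
```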

@monkeygopro

Hmm, you can modify the code in handelsregister.py and send POST requests for the form "ergebnisseform" in .../ergebnisse.xhtml. But you also have to deal with javax.faces.ViewState: it is read-only and you can't control it.
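
In JSF applications the ViewState value is issued by the server in a hidden form field, so although the client cannot choose it, it can usually be read from the previous response and echoed back in the next POST. The sketch below shows that idea with the standard library only. The sample HTML and the extra field name are made up; the real page will contain different fields.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Echo the hidden fields (including javax.faces.ViewState) back, plus our own."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    fields = dict(parser.hidden_fields)
    fields.update(extra_fields)
    return fields, urlencode(fields)


# Made-up sample markup, only shaped like a JSF results page.
SAMPLE_HTML = """
<form id="ergebnisseform" method="post">
  <input type="hidden" name="javax.faces.ViewState" value="token-123:456" />
  <input type="text" name="visibleField" value="ignored" />
</form>
"""

# "exampleButton" is a hypothetical field standing in for whatever the form submits.
fields, body = build_post_body(SAMPLE_HTML, {"exampleButton": "exampleButton"})
print(fields["javax.faces.ViewState"])
print(body)
```

The ViewState has to come from a page fetched in the same session, so the GET that returns the form and the POST that submits it need to share cookies.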

@timtensor
Author

Oh OK. I think the API just makes it a bit easier and is there for a reason, I guess, but I think it is not maintained.

@muhmtayyab

I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible from JavaScript. Do you have any way to get the cookie?
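
A cookie that JavaScript cannot read is typically one flagged HttpOnly. That flag only hides it from `document.cookie` in the browser; the value still arrives in the `Set-Cookie` response header, which an HTTP client running outside the browser can read. The sketch below assumes that is the situation here, uses the cookie name from this comment, and parses a made-up header value.

```python
from http.cookies import SimpleCookie


def read_cookie(set_cookie_header, name):
    """Return (value, is_http_only) for cookie `name`, or None if it is absent."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    if name not in jar:
        return None
    morsel = jar[name]
    return morsel.value, bool(morsel["httponly"])


# Made-up header value; a real one would come from the server's response.
header = "SESSIONID=abc123; Path=/; HttpOnly"
result = read_cookie(header, "SESSIONID")
print(result)
```

HTTP client libraries that keep a cookie jar per session do this parsing for you and send the cookie back on later requests.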

@timtensor
Author

No, sorry, I actually have no clue. You managed to get the documents via the API? As in, all the documents related to a search? If so, it would be great if you could share your method.

@mauhai

mauhai commented Mar 10, 2024

I have built a complete solution for this problem and will see how and whether I can share the code and approach somehow. As mentioned above, it uses full browser rendering as opposed to this API, which I think is rather a dead end when it comes to actually downloading the documents.

I do hope that the Handelsregister will at some point publish a proper API.

@timtensor
Author

> I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible from JavaScript. Do you have any way to get the cookie?

Did you find a way to resolve this issue? I was looking into Playwright but so far have not managed to find the issue.
How are you using the API to download the docs?

@monkeygopro

download

Which project do you want to use the API for? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

@timtensor
Author

download

> Which project do you want to use the API for? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

A small project: about 15-45 documents for NLP analysis.


5 participants