Accessing and downloading (xml, pdf) files from handelsregister API #23

timtensor · 2023-07-27T10:01:25Z

Hi @wirthual,
I am trying the following workflow.

Using the advanced search at https://www.handelsregister.de/rp_web/erweitertesuche.xhtml, I do the following:
a) Choose the federal states: Berlin / Bavaria
b) Company / search words: "Hallo"
c) Choose the option "contain all words"
d) Type of register: "HRA"
e) Choose 100 hits per list
This gives a result list as shown in the screenshot.
I then want to download the SI or DK content of all the hits.
Is that possible?
Screenshot:

Please let me know what the best way forward would be.

melihsunbul · 2024-02-20T11:45:45Z

Hello, were you able to figure out a way?

timtensor · 2024-02-20T15:11:34Z

Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

melihsunbul · 2024-02-20T20:24:58Z

Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

Thanks

monkeygopro · 2024-02-20T22:44:21Z

Are you interested in other solutions, or do you want to use this one specifically?

melihsunbul · 2024-02-20T22:50:03Z

Are you interested in other solutions, or do you want to use this one specifically?

I am open to hearing about other alternatives, if there are any.

monkeygopro · 2024-02-21T00:43:14Z

I don't know what your project is about or what you are going to do with the data, but Selenium looks like a very good option. Contact me if you want further information.

timtensor · 2024-02-21T09:44:08Z

As I am using a Colab notebook, I think an API would be best, right? I think the Colab notebook doesn't support Selenium.

The project basically collects data to analyze different companies.

melihsunbul · 2024-02-21T09:59:12Z

Yeah, I agree. I already tried Playwright as an alternative to Selenium, but it has the limitations that come with scraping the website.

monkeygopro · 2024-02-21T14:53:37Z

What limitations do you mean?

melihsunbul · 2024-02-21T15:29:29Z

IP blocks, for example.

monkeygopro · 2024-02-21T19:48:36Z

Even if you are using this API, your IP will be blocked if you send more than 60 requests per hour. To avoid getting blocked, you have to rotate your proxies. How many requests are you planning to send in your project?
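A client that stays under the limit mentioned above can be sketched as a sliding-window throttle. This is only a minimal sketch: the figure of 60 requests per hour is taken from the comment above and is not confirmed anywhere else in this thread, and `HourlyThrottle` is a hypothetical helper name, not part of handelsregister.py.

```python
import time
from collections import deque


class HourlyThrottle:
    """Allow at most `max_requests` calls per `window_seconds` (sliding window)."""

    def __init__(self, max_requests=60, window_seconds=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        # 60 per hour is the limit claimed in the thread, not a documented value.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests still inside the window

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = self._clock()
        # Forget requests that have already left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)
```

Calling `throttle.wait()` right before every request keeps a single IP under the assumed limit. At that rate 200 requests need a bit more than three hours, so this trades speed for not needing proxies.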

melihsunbul · 2024-02-21T19:54:14Z

I already tried rotating proxies, but in that case the website does not load in a reasonable time. I am planning to send more than 200 requests periodically.

monkeygopro · 2024-02-21T21:22:04Z

Hmm, you can modify the code in handelsregister.py and send POST requests for the form "ergebnisseform" in .../ergebnisse.xhtml. But you also have to deal with javax.faces.ViewState: it is read-only and you can't control it.
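In JSF applications the view state is normally rendered by the server as a hidden input named `javax.faces.ViewState`, so a client cannot choose its value but can read it from the page and send it back unchanged with the POST. The sketch below shows only that extraction step, using the standard library. The sample HTML is made up for illustration, and it assumes the form carries `id="ergebnisseform"` as spelled in the comment above; the real page may differ.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the <input> name/value pairs of one form, selected by its id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.fields = {}
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self._inside = True
        elif tag == "input" and self._inside and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def form_fields(html, form_id="ergebnisseform"):
    """Return all input fields of the given form as a dict ready to POST back."""
    parser = FormFieldParser(form_id)
    parser.feed(html)
    return parser.fields


# Made-up page fragment; the token value is a placeholder, not real data.
SAMPLE_PAGE = """
<form id="ergebnisseform" method="post">
  <input type="hidden" name="ergebnisseform" value="ergebnisseform" />
  <input type="hidden" name="javax.faces.ViewState" value="made-up-token-123" />
</form>
"""

payload = form_fields(SAMPLE_PAGE)
print(payload["javax.faces.ViewState"])
```

The returned dict would then be extended with whatever fields the download action expects and posted in the same session that delivered the page, since the token is tied to that session.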

timtensor · 2024-02-22T09:20:24Z

Oh OK. I think the API just makes it a bit easier and is there for a reason, but I guess it is not maintained.

muhmtayyab · 2024-02-24T21:07:38Z

I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible from JavaScript. Do you have any way to get the cookie?
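A cookie that page scripts cannot read is usually one the server has marked `HttpOnly`. That flag only hides the cookie from `document.cookie` in the browser; an HTTP client outside the browser still receives it in the `Set-Cookie` response header. The sketch below parses such a header with the standard library. The header value is made up, and whether the site's SESSIONID cookie is really flagged this way is an assumption.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Return {cookie name: {"value", "httponly", "path"}} for one Set-Cookie header."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {
        name: {
            "value": morsel.value,
            "httponly": bool(morsel["httponly"]),
            "path": morsel["path"],
        }
        for name, morsel in jar.items()
    }


# Made-up header; the cookie name follows the comment above, the value is a placeholder.
SAMPLE_HEADER = "SESSIONID=made-up-value-42; Path=/; HttpOnly"

cookies = parse_set_cookie(SAMPLE_HEADER)
print(cookies["SESSIONID"])
```

In practice a client such as `requests.Session` stores these cookies and resends them automatically, so the manual parsing here only illustrates that the value is visible outside the browser.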

timtensor · 2024-02-25T01:27:06Z

No, sorry, I actually have no clue. You managed to get the documents via the API? As in, all the documents related to a search? If so, it would be great if you could share your method.

mauhai · 2024-03-10T00:34:29Z

I have built a complete solution for this problem and will see how and whether I can share the code and approach somehow. As mentioned above, it uses full browser rendering as opposed to this API, which I think is rather a dead end when it comes to actually downloading the documents.

I do hope that the Handelsregister will at some point publish a proper API.

timtensor · 2024-04-23T13:22:45Z

I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible from JavaScript. Do you have any way to get the cookie?

Did you find a way to resolve this issue? I was looking into Playwright but so far have not managed to find the issue.
How are you using the API to download the docs?

monkeygopro · 2024-04-25T00:38:09Z

download

For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

timtensor · 2024-04-25T09:10:31Z

download

For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

A small project: about 15-45 documents for NLP analysis.
