
Create a dummy spider #169

Open
Glutexo opened this issue Jan 24, 2023 · 0 comments
Glutexo commented Jan 24, 2023

#89 mentions a very simple spider that uses a plain list of URLs as its parsing result. Gopher used to work this way, returning just a list of files. A dummy spider like this would demonstrate the complete Onigumo workflow without any site-specific details.

  1. The spider Operator provides a URL pointing to a plain text file with a list of URLs
  2. Onigumo Downloader fetches and saves the file.
  3. The spider Parser reads the text file and returns an Elixir list of URLs.
  4. The Onigumo Parser saves the list of URLs in a file.
  5. The spider Operator considers the crawl finished.
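The spider-side half of the workflow above (steps 3 and 4) can be sketched roughly as follows. Python is used here only for brevity — Onigumo itself is Elixir — and all function names and the one-URL-per-line format are hypothetical:

```python
# Illustrative sketch of steps 3-4; names and file format are assumptions,
# not Onigumo's actual API.

def spider_parse(downloaded_body: str) -> list[str]:
    """Dummy spider's Parser: treat the fetched file as one URL per line."""
    return [line.strip() for line in downloaded_body.splitlines() if line.strip()]

def onigumo_save_urls(urls: list[str], path: str) -> None:
    """Onigumo persists the parsed URLs, currently as plain text."""
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")

# Step 2's downloaded file might look like this:
downloaded = "https://example.com/a\nhttps://example.com/b\n"
urls = onigumo_urls = spider_parse(downloaded)   # step 3
onigumo_save_urls(urls, "urls.txt")              # step 4
```

The point of the sketch is that the spider only knows its own input format; Onigumo only consumes the resulting list.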

Although the URL file produced by the parsing is identical to the fetched one, that is only a coincidence. The format of the fetched file is handled by the spider and is arbitrary. The parsed output is consumed by Onigumo, and although it is plain text for now, it may later become a more complex (and standardized) structure with metadata.

An open question is how to make the spider download the URL list. Ideas:

  • Mock the HTTP client.
  • Hook up a simple HTTP server.
  • Use a fake, local, or data: URL.
  • Use an HTTP recording library.
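The second idea, hooking up a simple HTTP server, can be sketched with a throwaway local server that serves the URL list, so the Downloader never touches the network. This is hypothetical test-fixture code (again in Python for illustration), not part of Onigumo:

```python
# Sketch of the "simple HTTP server" idea: serve the URL list on a
# loopback port and fetch it back. All names here are assumptions.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

URL_LIST = b"https://example.com/a\nhttps://example.com/b\n"

class UrlListHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(URL_LIST)

    def log_message(self, *args):
        pass  # keep test output quiet

server = HTTPServer(("127.0.0.1", 0), UrlListHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urlopen(f"http://127.0.0.1:{server.server_port}/urls.txt").read()
server.shutdown()
```

The data: URL idea would avoid even this much setup, at the cost of requiring the HTTP client to support the data: scheme.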