#89 mentions a very simple spider that uses a plain list of URLs as its parsing result. Gopher used to work this way, returning just a list of files. Such a dummy spider demonstrates the complete Onigumo workflow without any site-specific details.
The spider Operator provides a URL pointing to a plain text file with a list of URLs.
Onigumo Downloader fetches and saves the file.
The spider Parser reads the text file and returns an Elixir list of URLs.
The Onigumo Parser saves the list of URLs in a file.
The spider Operator considers the crawl finished.
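The parsing step of this workflow could be sketched as follows. This is only a minimal illustration: the module name `DummySpider` and the function `parse/1` are hypothetical and not part of Onigumo's actual API, and the sketch assumes the fetched file holds one URL per line.

```elixir
defmodule DummySpider do
  # Hypothetical module and function names, used for illustration only.
  # Assumes the fetched plain text file contains one URL per line.

  @doc """
  Turns the content of the fetched text file into an Elixir list of URLs.
  Surrounding whitespace is trimmed and blank lines are skipped.
  """
  def parse(text) do
    text
    |> String.split(["\r\n", "\n"])
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end
end
```

For example, `DummySpider.parse("https://example.com/a\nhttps://example.com/b\n")` returns `["https://example.com/a", "https://example.com/b"]`. Saving that list to a file would remain Onigumo's job, as in the steps above.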
The URL file produced by parsing is identical to the fetched one, but only by coincidence. The format of the fetched file is arbitrary and is the spider's concern. The parsed file is what Onigumo consumes; it is plain text now, but it may later become a more complex (and standardized) structure with metadata.
An open question is how to make the spider download the URL list. Ideas:
Mock the HTTP client.
Hook up a simple HTTP server.
Use a fake, local or data URL.
Use an HTTP recording library.
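As one illustration of the data URL idea, the URL list could be embedded directly in the URL the Operator provides. The sketch below only shows the encoding round trip; the module `DataUrlSketch` and its functions are hypothetical, and whether Onigumo's HTTP client accepts `data:` URLs at all is an open assumption, not something this issue establishes.

```elixir
defmodule DataUrlSketch do
  # Hypothetical helper, used for illustration only.
  # Assumes a base64-encoded text/plain data URL.

  # Wraps the URL list text in a data URL.
  def encode(text) do
    "data:text/plain;base64," <> Base.encode64(text)
  end

  # Extracts the text back; returns {:ok, text} or :error.
  def decode("data:text/plain;base64," <> payload) do
    Base.decode64(payload)
  end

  # Anything that is not a base64 text/plain data URL is rejected.
  def decode(_other), do: :error
end
```

With this approach, a test needs neither a network connection nor a running server, but the Downloader would have to handle the `data:` scheme itself rather than pass it to the HTTP client.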