
Create a dummy spider #169

Open
Glutexo opened this issue Jan 24, 2023 · 0 comments
Glutexo commented Jan 24, 2023

#89 mentions a very simple spider that uses a plain list of URLs as its parsing result. Gopher used to work this way, returning just a list of files. A dummy spider like this would demonstrate the complete Onigumo workflow without any site-specific details.

  1. The spider Operator provides a URL pointing to a plain text file with a list of URLs
  2. Onigumo Downloader fetches and saves the file.
  3. The spider Parser reads the text file and returns an Elixir list of URLs.
  4. The Onigumo Parser saves the list of URLs in a file.
  5. The spider Operator considers the crawl finished.
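The spider-side half of the workflow above (steps 3 and 4) can be sketched roughly as follows. Python is used here only for brevity — Onigumo itself is Elixir — and all function names and the one-URL-per-line format are hypothetical:

```python
# Illustrative sketch of steps 3-4; names and file format are assumptions,
# not Onigumo's actual API.

def spider_parse(downloaded_body: str) -> list[str]:
    """Dummy spider's Parser: treat the fetched file as one URL per line."""
    return [line.strip() for line in downloaded_body.splitlines() if line.strip()]

def onigumo_save_urls(urls: list[str], path: str) -> None:
    """Onigumo persists the parsed URLs, currently as plain text."""
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")

# Step 2's downloaded file might look like this:
downloaded = "https://example.com/a\nhttps://example.com/b\n"
urls = onigumo_urls = spider_parse(downloaded)   # step 3
onigumo_save_urls(urls, "urls.txt")              # step 4
```

The point of the sketch is that the spider only knows its own input format; Onigumo only consumes the resulting list.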

Although the URL file produced by the parsing is identical to the fetched one, that is only a coincidence. The format of the fetched file is handled by the spider and is arbitrary. The parsed output is consumed by Onigumo, and although it is plain text for now, it may later become a more complex (and standardized) structure with metadata.

An open question is how to make the spider download the URL list. Ideas:

  • Mock the HTTP client.
  • Hook up a simple HTTP server.
  • Use a fake, local, or data: URL.
  • Use an HTTP recording library.
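The second idea, hooking up a simple HTTP server, can be sketched with a throwaway local server that serves the URL list, so the Downloader never touches the network. This is hypothetical test-fixture code (again in Python for illustration), not part of Onigumo:

```python
# Sketch of the "simple HTTP server" idea: serve the URL list on a
# loopback port and fetch it back. All names here are assumptions.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

URL_LIST = b"https://example.com/a\nhttps://example.com/b\n"

class UrlListHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(URL_LIST)

    def log_message(self, *args):
        pass  # keep test output quiet

server = HTTPServer(("127.0.0.1", 0), UrlListHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urlopen(f"http://127.0.0.1:{server.server_port}/urls.txt").read()
server.shutdown()
```

The data: URL idea would avoid even this much setup, at the cost of requiring the HTTP client to support the data: scheme.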