Add option to use find links with many selectors #171

Open
nappex opened this issue Feb 5, 2023 · 1 comment
nappex (Collaborator) commented Feb 5, 2023

Currently, our Spider.HTML.find_links searches for href only in a tags. It would be handy to be able to call find_links with more selectors, such as "a, link" or "a, link, area, base".
We should also consider the case where no selector is specified, whether that happens by default or explicitly. The default could be just a, or nothing at all. If nothing is specified, href would be searched for on every element.

Possible definitions of find_links:

  • find_links(parsed_document, selectors \\ "")
  • find_links(parsed_document, selectors \\ "a")

The "no selector" case could be expressed as None, NULL, "" or "*".

The default could be "a, area".
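To make the proposal concrete, here is a minimal sketch of how a selector argument with an "a, area" default might behave. It is not the current Spider.HTML code: the module name FindLinksSketch is made up, the parsed document is assumed to be a list of {tag, attributes, children} tuples with text nodes as binaries, and only plain tag-name selectors separated by commas are handled.

```elixir
defmodule FindLinksSketch do
  # Hypothetical stand-in for Spider.HTML.find_links/2.
  # Assumption: parsed_document is a list of {tag, attributes, children}
  # tuples, with text nodes represented as binaries.
  @default_selectors "a, area"

  def find_links(parsed_document, selectors \\ @default_selectors) do
    tags = parse_selectors(selectors)

    parsed_document
    |> List.wrap()
    |> Enum.flat_map(&collect(&1, tags))
  end

  # nil, "" and "*" all mean "do not filter by tag".
  defp parse_selectors(selectors) when selectors in [nil, "", "*"], do: :any

  # Only comma-separated tag names are supported in this sketch.
  defp parse_selectors(selectors) do
    selectors
    |> String.split(",", trim: true)
    |> Enum.map(&String.trim/1)
  end

  defp collect({tag, attributes, children}, tags) do
    own =
      if (tags == :any) or (tag in tags) do
        # Keep only the href values of this element.
        for {"href", value} <- attributes, do: value
      else
        []
      end

    own ++ Enum.flat_map(children, &collect(&1, tags))
  end

  # Text nodes, comments and anything else carry no links.
  defp collect(_other, _tags), do: []
end

document = [
  {"html", [], [
    {"head", [], [{"link", [{"rel", "stylesheet"}, {"href", "/style.css"}], []}]},
    {"body", [], [
      {"a", [{"href", "/about"}], ["About"]},
      {"area", [{"href", "/map"}], []}
    ]}
  ]}
]

IO.inspect(FindLinksSketch.find_links(document))
IO.inspect(FindLinksSketch.find_links(document, ""))
IO.inspect(FindLinksSketch.find_links(document, "a, link"))
```

With the default, only the a and area links are returned. Passing "" (or nil, or "*") returns every href in document order, including the one on the link tag.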

Glutexo (Owner) commented Feb 6, 2023

In the end, we may not want to filter the tags at all. The tool is intended to be as generic as possible and href attributes on other elements can appear in the real world.

We may, however, omit link tags, or perhaps anything in the head, because those are not user-followable links to other pages. Downloading them would only rarely provide any data worth collecting by a spider. Still, it would make sense to have an option to override this behavior.
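One way this could look, as a rough sketch rather than a decided design: collect href from every element, skip the whole head subtree by default, and let the caller opt back in. The module name HeadFilterSketch and the :include_head option are invented for illustration, and the parsed document is again assumed to be a tree of {tag, attributes, children} tuples. This takes the "anything in the head" variant, so a link tag placed in the body would still be collected.

```elixir
defmodule HeadFilterSketch do
  # Hypothetical helper, not part of Spider.
  # Assumption: parsed_document is a list of {tag, attributes, children}
  # tuples, with text nodes represented as binaries.
  def find_links(parsed_document, opts \\ []) do
    # :include_head is an invented option name; it defaults to false.
    include_head? = Keyword.get(opts, :include_head, false)

    parsed_document
    |> List.wrap()
    |> Enum.flat_map(&collect(&1, include_head?))
  end

  # Drop the whole <head> subtree unless the caller opted in.
  defp collect({"head", _attributes, _children}, false), do: []

  # No tag filter otherwise: any element may carry an href.
  defp collect({_tag, attributes, children}, include_head?) do
    own = for {"href", value} <- attributes, do: value
    own ++ Enum.flat_map(children, &collect(&1, include_head?))
  end

  # Text nodes, comments and anything else carry no links.
  defp collect(_other, _include_head?), do: []
end

document = [
  {"html", [], [
    {"head", [], [{"link", [{"rel", "stylesheet"}, {"href", "/style.css"}], []}]},
    {"body", [], [
      {"a", [{"href", "/about"}], ["About"]},
      {"div", [{"href", "/odd"}], []}
    ]}
  ]}
]

IO.inspect(HeadFilterSketch.find_links(document))
IO.inspect(HeadFilterSketch.find_links(document, include_head: true))
```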

We should also check how the base tag works and whether we should take it into account when finding links. If I remember correctly, such a tag changes the target of relative URLs.
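For reference, the HTML base element does set the base URL that relative URLs in the document resolve against, and its own href may itself be relative to the page URL. A small sketch of what taking it into account could mean, using URI.merge/2 from the Elixir standard library. The module and function names are hypothetical, and pulling the base href out of the parsed document is left out: it is simply passed in.

```elixir
defmodule BaseUrlSketch do
  # Hypothetical helper, not part of Spider.
  # hrefs:     raw href values as found in the document
  # page_url:  absolute URL the document was fetched from
  # base_href: href of the document's <base> tag, or nil if there is none
  def absolute_links(hrefs, page_url, base_href \\ nil) do
    base =
      case base_href do
        # Without a <base> tag, relative URLs resolve against the page URL.
        nil -> page_url
        # A <base> href may be relative, so resolve it against the page URL first.
        href -> page_url |> URI.merge(href) |> URI.to_string()
      end

    # Absolute hrefs pass through URI.merge/2 unchanged.
    Enum.map(hrefs, fn href -> base |> URI.merge(href) |> URI.to_string() end)
  end
end

page = "https://example.com/docs/page.html"

IO.inspect(BaseUrlSketch.absolute_links(["img.html"], page))
IO.inspect(BaseUrlSketch.absolute_links(["img.html", "https://other.example/x"], page, "/static/"))
```

Without a base tag, "img.html" resolves next to the page under /docs/. With a base href of "/static/", it resolves under /static/ instead, which is why ignoring the tag could send the spider to the wrong URLs.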

Let’s move on in small steps, not putting all of the logic in place in a single pull request.
