Spider.HTML.find_links should not collect links to headers #170

nappex · 2023-02-05T16:53:30Z

An href can point to a part of the page itself, for example a header. You can use href="#top" or href="#" to link to the top of the current page.

Such a link does not lead to another page that we want to crawl; the page it points to has already been crawled.
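
To illustrate the kind of link meant here, a minimal Python sketch follows. It is not the project's code: the function name is_same_page_link is hypothetical, and it assumes that "link to a part of the page itself" means an href with no scheme, host, path, or query, so only a fragment (possibly empty) is left.

```python
from urllib.parse import urlsplit


def is_same_page_link(href: str) -> bool:
    """Return True if href only points within the current page, e.g. "#top" or "#".

    Hypothetical helper for illustration; not part of Spider.HTML.
    """
    parts = urlsplit(href.strip())
    # Nothing but a fragment (possibly empty) remains, so following the link
    # would land on the page that is already being crawled.
    return not (parts.scheme or parts.netloc or parts.path or parts.query)
```

Under this assumption an empty href is also treated as a same-page link, while something like "/about#team" is not, because it carries a path.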


Glutexo · 2023-02-05T17:16:32Z

Shouldn’t it? Although I can’t come up with a real example, there may be a use case. That calls for a switch that enables collecting links without a path (and domain, protocol…).
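
One way such a switch could look, sketched in Python with only the standard library. This is an assumption for discussion, not the actual Spider.HTML.find_links: the parameter name include_fragment_only and the default of dropping these links are both hypothetical, and the check is deliberately simple (an href that starts with "#").

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def find_links(html: str, include_fragment_only: bool = False) -> list[str]:
    """Return hrefs found in html.

    include_fragment_only is a hypothetical switch: when False (the assumed
    default), hrefs such as "#top" or "#" are skipped.
    """
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    if include_fragment_only:
        return collector.hrefs
    # Drop links that only point inside the current page.
    return [h for h in collector.hrefs if not h.strip().startswith("#")]
```

With the switch off, a page containing "#top", "/about" and "#" would yield only "/about"; with it on, all three are returned unchanged.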

Glutexo mentioned this issue Feb 5, 2023

Add parser of html links #87

Merged

Glutexo assigned Glutexo and nappex and unassigned Glutexo Feb 6, 2023
