Equivalent but differently encoded URLs break no-parent recursion #469

JustAnotherArchivist · 2022-01-10T20:03:30Z

When running a recursive crawl with --no-parent for https://example.org/~foo/, links to https://example.org/%7Efoo/bar are not followed (and vice-versa) because there is no normalisation of ~ and %7E to either value. I think this should be considered a bug. I assume a similar thing might be true for other characters but have only seen the tilde in the wild.

I'm not entirely sure about the correct solution here. We could force it to either value (probably the encoded one to be safe as some ancient servers might not support literal tildes, cf. RFC 1738). This would change the URL and might in some very rare cases cause issues. The alternative is to keep URLs as is but do an equivalence check. This would however require extra handling for deduplication of equivalent URLs, and I'm not sure there is a good way to do that (which doesn't involve e.g. a separate DB column for a normalised URL).

I haven't checked what wget does in this case.

The text was updated successfully, but these errors were encountered:

JustAnotherArchivist added the bug label Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Equivalent but differently encoded URLs break no-parent recursion #469

Equivalent but differently encoded URLs break no-parent recursion #469

JustAnotherArchivist commented Jan 10, 2022

Equivalent but differently encoded URLs break no-parent recursion #469

Equivalent but differently encoded URLs break no-parent recursion #469

Comments

JustAnotherArchivist commented Jan 10, 2022