Equivalent but differently encoded URLs break no-parent recursion #469

Open
JustAnotherArchivist opened this issue Jan 10, 2022 · 0 comments
@JustAnotherArchivist
Contributor

When running a recursive crawl with --no-parent for https://example.org/~foo/, links to https://example.org/%7Efoo/bar are not followed (and vice versa) because ~ and %7E are not normalised to a single form. I think this should be considered a bug. I assume the same is true for other characters, but I have only seen the tilde in the wild.

I'm not entirely sure about the correct solution here. We could force it to either value (probably the encoded one to be safe, as some ancient servers might not support literal tildes, cf. RFC 1738). This would change the URL and might, in some very rare cases, cause issues. The alternative is to keep URLs as is but do an equivalence check. This would, however, require extra handling for deduplication of equivalent URLs, and I'm not sure there is a good way to do that (one which doesn't involve e.g. a separate DB column for a normalised URL).
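
For illustration, here is a rough sketch of what the equivalence-check variant could look like. It is not taken from the project's code: the names `normalise_path`, `comparison_key` and `is_under_parent` are hypothetical. It assumes the comparison follows the percent-encoding normalisation from RFC 3986 section 6.2.2.2 (decode escapes of unreserved characters, uppercase the hex digits of the rest), which is the opposite direction from the "force the encoded form" option. The key is only used for comparison; the URL that gets fetched is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalise_escape(match):
    """Decode an escape of an unreserved character, else uppercase its hex."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalise_path(path):
    """Hypothetical helper: normalise percent-escapes in a URL path."""
    return _PCT_ESCAPE.sub(_normalise_escape, path)


def comparison_key(url):
    """Hypothetical helper: a key for equivalence checks, never fetched."""
    parts = urlsplit(url)
    # Drop the fragment; leave the query alone in this sketch.
    return urlunsplit(
        (parts.scheme, parts.netloc, normalise_path(parts.path), parts.query, "")
    )


def is_under_parent(url, parent):
    """Hypothetical no-parent check done on normalised keys."""
    u = urlsplit(comparison_key(url))
    p = urlsplit(comparison_key(parent))
    # Directory of the parent URL: everything up to the last slash.
    parent_dir = p.path[: p.path.rfind("/") + 1]
    same_origin = (u.scheme, u.netloc) == (p.scheme, p.netloc)
    return same_origin and u.path.startswith(parent_dir)
```

With this, `is_under_parent("https://example.org/%7Efoo/bar", "https://example.org/~foo/")` is true, and so is the reverse case. The same key could in principle also serve for deduplication, but that still means storing or recomputing it somewhere, so it does not avoid the extra-column concern.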

I haven't checked what wget does in this case.
