When running a recursive crawl with --no-parent for https://example.org/~foo/, links to https://example.org/%7Efoo/bar are not followed (and vice-versa) because there is no normalisation of ~ and %7E to either value. I think this should be considered a bug. I assume a similar thing might be true for other characters but have only seen the tilde in the wild.
I'm not entirely sure about the correct solution here. We could force URLs to one of the two forms (probably the encoded one to be safe, as some ancient servers might not support literal tildes, cf. RFC 1738). This would change the URL and might, in some very rare cases, cause issues. The alternative is to keep URLs as-is but do an equivalence check. This would, however, require extra handling for deduplication of equivalent URLs, and I'm not sure there is a good way to do that (one which doesn't involve e.g. a separate DB column for a normalised URL).
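To make the equivalence-check option concrete, here is a rough Python sketch. It is not taken from the codebase: the function names `normalise_path` and `is_under_parent` are hypothetical, and it only compares paths. It normalises towards the literal form, i.e. it decodes percent-encoded unreserved characters as RFC 3986 section 6.2.2.2 describes, rather than towards the encoded form suggested above. Either direction would work for the comparison itself.

```python
import re
from urllib.parse import urlsplit

# Unreserved characters per RFC 3986 section 2.3.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalise_escape(match):
    """Decode an escape if it encodes an unreserved character,
    otherwise keep it and upper-case its hex digits."""
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalise_path(path):
    """Hypothetical helper: return a comparison-only form of a URL path."""
    return _PCT.sub(_normalise_escape, path)


def is_under_parent(url, parent):
    """Hypothetical --no-parent style check that treats ~ and %7E as equal.

    The original URL strings are left untouched; only the normalised
    paths are compared.
    """
    u, p = urlsplit(url), urlsplit(parent)
    if (u.scheme, u.netloc) != (p.scheme, p.netloc):
        return False
    return normalise_path(u.path).startswith(normalise_path(p.path))


print(is_under_parent("https://example.org/%7Efoo/bar", "https://example.org/~foo/"))
print(is_under_parent("https://example.org/~foo/bar", "https://example.org/%7Efoo/"))
```

This only covers the scope check. Deduplication would still need the normalised form stored or indexed somewhere, which is the open question above.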
I haven't checked what wget does in this case.