can't parse urls starting with xn-- #438
Comments
Thank you for reporting this. Unfortunately, I suspect you're correct and this is not something we have adequately tested for thus far. Interesting cases (Live URL Viewer links for comparison purposes):
@macchiati I suspect this might require further adjustments to TR46 in due course, once we figure out the full details. Part of the problem here is that browsers have been slow on aligning with requirements in general and making the necessary adjustments. |
Digging a bit further, another corner case that seems related to this is:

const x = new URL('http://example.com');
x.host = 'xn--a';
console.log(x.href);
// node: http://example.com/
// browser: http://xn--a/

While:

const x = new URL('http://example.com');
x.href = 'http://xn--a/';
console.log(x.href);
// node: throws "Invalid URL: http://xn--a/"
// browser: http://xn--a/

Wouldn't it make sense to make all setters throw when it results in an invalid

EDIT: And here's another funny one:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
console.log(x.href);
// node: http://example.com/
// chrome: http://xn--%C3%9F.com/
// firefox: http://xn--xn---yna.com/
// safari: http://example.com/

Only Firefox seems to be consistent with itself:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
const y = new URL('http://xn--ß.com');
console.log(x.href === y.href);
// node: exception on line 3
// chrome: exception on line 3
// firefox: true
// safari: exception on line 3 |
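For callers who would rather get an exception than a silently ignored assignment, one possible workaround is to parse the candidate host first and only then assign it. The following is a minimal sketch, not part of the URL API: `setHostStrict` is a hypothetical helper name, and rejecting delimiter characters up front is an assumption made here because the host setter and the full parser treat them differently.

```javascript
// Hypothetical helper (not part of the URL API): a host setter that throws
// instead of silently ignoring input the parser rejects.
function setHostStrict(url, host) {
  // Delimiters are handled differently by the host setter and by the full
  // parser, so this sketch simply refuses them.
  if (/[\/\\?#@]/.test(host)) {
    throw new TypeError(`Unsupported character in host: ${host}`);
  }
  // new URL() throws a TypeError when the host cannot be parsed.
  const probe = new URL(`${url.protocol}//${host}/`);
  // The probe's serialized host has already been accepted by the parser.
  url.host = probe.host;
  return url;
}

const u = new URL('http://example.com/');
setHostStrict(u, 'example.org');
console.log(u.href); // http://example.org/
```

Whether a host such as `xn--a` is accepted still depends on the underlying parser, so this only makes the failure visible; it does not change which hosts are considered valid.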
That's how the |
Ok I see. fwiw, I can't think of a situation where I want this to just ignore my input, rather than complain. But I guess that would be a breaking change by now. |
Unfortunately it would be. There's been some talk about a dedicated host API, which will not have this problem. |
I'd like to point out that the current rev of the IDNA RFC [IDNA2008] encourages applications that do DNS lookup to be liberal in what they accept, and in particular to "rely on the assumption that names that are present in the DNS are valid" except for specific cases which are known to cause "serious problems". In particular, note the text at the end of section 5.4:
where "all other strings" means "all strings that have passed the sequence of checks for 'serious problems' described in sections 5.3 and 5.4". Here are some examples of URLs that I have personally observed in the wild (during my research, which involves Web crawling) to contain hostnames which are formally invalid per some RFC or other, but do not rise to the level of a 'serious problem', and which I think should probably be accepted by the URL standard, if only for interop's sake:
|
I wonder if we should consider enshrining browsers' "ASCII fast path", where they don't perform ToASCII on ASCII inputs. In https://bugs.chromium.org/p/chromium/issues/detail?id=724018 @annevk seemed to think that was a bad idea, but I'm not sure I fully understood the negative consequences of that direction. |
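To make the "ASCII fast path" idea concrete, here is a rough sketch of what it could look like. The names `hostToAsciiFastPath` and `viaUrlParser` are made up for illustration, and the full IDNA step is passed in as a parameter rather than tied to any particular library. A real implementation would presumably still check forbidden host code points on the fast path; that is left out here.

```javascript
// True when every code unit is in the ASCII range.
const isAscii = (s) => /^[\x00-\x7F]*$/.test(s);

// Hypothetical sketch: ASCII input skips ToASCII entirely (only lowercased),
// while non-ASCII input goes through the full conversion supplied by the caller.
function hostToAsciiFastPath(domain, fullToAscii) {
  if (isAscii(domain)) {
    return domain.toLowerCase();
  }
  return fullToAscii(domain);
}

// Assumption for the demo: borrow the platform URL parser as the "full" path.
const viaUrlParser = (d) => new URL(`http://${d}/`).hostname;

console.log(hostToAsciiFastPath('XN--A', viaUrlParser)); // xn--a
console.log(hostToAsciiFastPath('bücher.example', viaUrlParser)); // xn--bcher-kva.example
```

Note how `xn--a` passes through untouched: the label is never decoded or validated, which is exactly the inconsistency between ASCII and non-ASCII input discussed in this thread.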
Yeah, I think that's probably needed given the number of existing systems that seem to rely on this to varying degrees. I think my concerns were mostly design-wise, that it seems somewhat bad to have a different set of restrictions on non-ASCII and ASCII input, e.g., with regards to length. If it wasn't already the case it might also lead to certain security issues I suppose, as you can smuggle invalid |
Example:
I think if a URL's ASCII form is valid, then its converted Unicode form must be valid too. And vice versa: if a URL is invalid in one form, then it must be invalid in the other form too. Otherwise we get weird results. So I am opposed to the "ASCII fast path". |
Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search. I agree that this leads to weirdly inconsistent rules though, so if we go down this path we should be very explicit about it and document these side effects. |
Yes, you are right. Anyway converting valid URL to not valid (even visually) is weird. |
A fix is being proposed for tr46. @markus Scherer
On Thu, Oct 1, 2020, 05:49 Rimas Misevičius wrote:
> Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search.
> Yes, you are right. Anyway converting valid URL to not valid (even visually) is weird.
|
The fix proposed for More generally, the strategy is to report errors but still produce output (unless, for example, the Punycode string is ill-formed and thus not decodable), because different users/callers may ignore different types of errors. |
Thanks @markusicu! However, what about |
I assume you mean Without the additional hyphen the "ASCII" substring is not actually ASCII at all but it's all-non-ASCII Punycode. I don't think that It might be ill-formed Punycode, and the spec says to just record an error for that label. If it's well-formed, then the decoded string is subjected to validation, which in turn might record an error if there is a disallowed character or something else wrong. |
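The two stages mentioned above (first decode the Punycode, then validate whatever came out) can be illustrated with a small decoder. This is a sketch of the RFC 3492 decoding procedure only, without overflow checks and without any UTS 46 validation; `decodePunycode` and `inspectLabel` are names chosen for this example.

```javascript
// RFC 3492 parameters for Punycode.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Bias adaptation function from RFC 3492, section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = Math.floor(delta / (firstTime ? DAMP : 2));
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > ((BASE - TMIN) * TMAX) / 2) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Maps a-z / A-Z to 0..25 and 0-9 to 26..35; anything else is not a digit.
function digitValue(ch) {
  const c = ch.charCodeAt(0);
  if (c >= 0x30 && c <= 0x39) return c - 0x30 + 26;
  if (c >= 0x41 && c <= 0x5a) return c - 0x41;
  if (c >= 0x61 && c <= 0x7a) return c - 0x61;
  return -1;
}

// Decodes the part of a label after "xn--"; throws RangeError when ill-formed.
function decodePunycode(input) {
  if (/[^\x00-\x7F]/.test(input)) throw new RangeError('non-ASCII input');
  const d = input.lastIndexOf('-');
  const output = d > 0 ? Array.from(input.slice(0, d), (c) => c.codePointAt(0)) : [];
  let pos = d > 0 ? d + 1 : 0;
  let n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
  while (pos < input.length) {
    const oldi = i;
    let w = 1;
    for (let k = BASE; ; k += BASE) {
      if (pos >= input.length) throw new RangeError('truncated input');
      const digit = digitValue(input[pos++]);
      if (digit < 0) throw new RangeError('invalid digit');
      i += digit * w;
      const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
      if (digit < t) break;
      w *= BASE - t;
    }
    bias = adapt(i - oldi, output.length + 1, oldi === 0);
    n += Math.floor(i / (output.length + 1));
    i %= output.length + 1;
    if (n > 0x10ffff) throw new RangeError('code point out of range');
    output.splice(i++, 0, n);
  }
  return String.fromCodePoint(...output);
}

// Stage 1 only: report either the decoded label or a decoding error.
function inspectLabel(label) {
  if (!label.toLowerCase().startsWith('xn--')) return { decoded: label };
  try {
    return { decoded: decodePunycode(label.slice(4)) };
  } catch (err) {
    return { error: `ill-formed Punycode: ${err.message}` };
  }
}

console.log(inspectLabel('xn--bcher-kva')); // { decoded: 'bücher' }
console.log(inspectLabel('xn--ß')); // { error: 'ill-formed Punycode: non-ASCII input' }
```

With this decoder, `xn--a` is well-formed Punycode: it decodes to the single code point U+0080, so any error for that label would come from the later validation stage, not from decoding.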
This issue demonstrates a need for URLs such as
I propose allowing ASCII labels with Punycode decoding errors to remain, but still forbid other types of UTS 46 error. So we have the following matrix:
There's already precedent (Safari) for treating Punycode decoding errors differently from other UTS 46 failures, as one can see by comparing One way to get this is adding an IgnoreInvalidPunycode boolean flag to UTS 46, and in Processing's
|
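As a rough illustration of the proposal above, the sketch below treats the IDNA step as something that reports errors alongside its output and lets the caller decide which error kinds are fatal. Everything here is hypothetical: the error kind names, the shape of the `result` object, and the `hostFromResult` function are assumptions for this example, and the flag is applied on the caller side rather than inside UTS 46 processing.

```javascript
// Hypothetical error kinds for this sketch; UTS 46 does not define these names.
const PUNYCODE_ERROR = 'invalid-punycode';
const VALIDITY_ERROR = 'validity';

// `result` mimics "report errors but still produce output":
// { output: string, errors: Array<{ label: string, kind: string }> }
function hostFromResult(result, { ignoreInvalidPunycode = false } = {}) {
  // Punycode decoding errors are tolerated only when the flag is set;
  // every other kind of error stays fatal.
  const fatal = result.errors.filter(
    (e) => !(ignoreInvalidPunycode && e.kind === PUNYCODE_ERROR)
  );
  return fatal.length > 0 ? null : result.output;
}

// Hand-written sample result, standing in for real UTS 46 processing output.
const sample = {
  output: 'xn--abc.com',
  errors: [{ label: 'xn--abc', kind: PUNYCODE_ERROR }],
};
console.log(hostFromResult(sample)); // null
console.log(hostFromResult(sample, { ignoreInvalidPunycode: true })); // xn--abc.com
```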
@markusicu @macchiati thoughts on #438 (comment)? Especially for the |
Hmm, it seems that only Chromium-based browsers still have a problem here studying the results of https://wpt.fyi/results/url/toascii.window.html so maybe no change is needed. @foolip are you all planning on fixing those remaining failures? |
@ricea can you make a judgment about these failures and the linked bugs? |
@foolip I think we want to fix these. I think the linked bug for these issues is slightly different, so I newly filed https://crbug.com/1406728 |
Can't seem to parse urls like http://xn--abc.com. This seems to work in browsers though.
I've been digging through the code and specs a bit.
It looks like tr46.toASCII returns an error. Digging further, it looks like it should implement this spec: https://www.unicode.org/reports/tr46/#Processing. But that seems to say:
And it says
The url spec seems to dictate (https://url.spec.whatwg.org/#idna)
I feel like this should be possible though, tr46 seems quite ambiguous as to what's recoverable and what not.
I came across an example that renders and parses in the browser but seems to fail the parsing algorithm: http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com (disclaimer: I am in no way connected to this url or the content of the site, it just passed by our systems)
In any case, I'm not super experienced in reading these specs, so take the previous with an appropriate grain of salt. It just seems strange to me that urls can render in a browser, but fail parsing them according to the spec.
EDIT:
forgot to mention, when I say "I've been digging through the code" I'm talking about https://github.com/jsdom/whatwg-url. FWIW, the node.js native url parser seems to behave the same way.
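For anyone who wants to check how their own runtime behaves, a small reproduction along these lines should do. `tryParse` is just a helper name for this sketch; the results for the two URLs from this report are deliberately not listed, since they depend on the runtime and its version.

```javascript
// Returns the serialized URL, or null when the parser rejects the input.
function tryParse(input) {
  try {
    return new URL(input).href;
  } catch (err) {
    return null;
  }
}

const reported = [
  'http://xn--abc.com',
  'http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com',
];
for (const input of reported) {
  console.log(input, '->', tryParse(input) ?? 'parse failure');
}
```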