Punycode behavior for labels exceeding DNS length is ill-defined #824
Comments
@annevk, do you recall the motivation for setting `VerifyDNSLength` to `false`?
@markusicu, do you recall if you had a more concrete reason than “people sometimes use libraries and protocols with non-standard inputs” for not using the tight bounds in ICU4C?
CC @valenting who removed the DNS-oriented length limit from Gecko, AFAICT, due to WPTs and not due to actual Web compat.
Hmm. Even with the error condition of counting encode output units, there should still be a limit on encode input units to avoid denial of service. I haven’t even tried doing the math of what the tight bound there would be.
I think conceptually the idea is that we don't want to couple the hostnames of special URLs too tightly with DNS, although I don't think it is well defined (long-standing issue #397). That said, a length limit on the encode function to prevent overflow is a practical necessity. My library uses 32-bit integers, so to prevent overflow we refuse to encode input sequences longer than 3854 scalars. That's not even for DoS protection; it's just for correctness. I wouldn't necessarily mind a tighter bound to keep the n^2 encoding algorithm under control. I'm not sure it makes sense to use length limits from DNS, though.
I think one example why you wouldn't want to enforce a DNS limit on URLs is something like Tor onion URLs.
#397 is certainly an open question that's relevant here. I also don't think implementations ever enforced the limit for pure ASCII inputs. I'm not 100% positive that I also tested long non-ASCII inputs, but I would not be comfortable with having different length requirements for ASCII and non-ASCII.
Given that ASCII labels have O(n) behavior and non-ASCII labels have O(n^2) behavior, I think it's not good for an implementation to take the position that a) the length of ASCII labels is unconstrained and b) ASCII and non-ASCII labels have the same length (non-)requirements. It follows that either the specs give precise interoperable constraints or implementations make something up under https://infra.spec.whatwg.org/#algorithm-limits as ICU4C has done (or risk denial of service). It’s a bit unclear to me what the foreseeable space of naming systems is. #397 points to NetBIOS, but, AFAICT, the URL Standard’s definition of "forbidden domain code point" does not appear to arise directly from NetBIOS. In any case, NetBIOS seems to be limited to 15 ASCII characters, so the limit is a) irrelevant to non-ASCII labels and b) more constraining on length than DNS’s length constraint.
I think agreeing on a maximum label length and a maximum number of labels would be reasonable. Presumably these would be enforced after ToASCII is performed, though the checks could potentially be inlined? Do we also want to agree on limits for other fields of a URL? All implementations probably have them.
What is the issue with the URL Standard?
The URL Standard, UTS 46, and RFC 3492 don’t specify interoperable behavior for Punycode encode and decode failures when a label is longer than what actually makes sense for DNS purposes.
If the input is too long, at some point an integer internal to the Punycode algorithm overflows. See https://datatracker.ietf.org/doc/html/rfc3492.html#section-6.4
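As a rough illustration of the overflow in question, here is a minimal Python sketch of the guarded form of the encoder's delta update, modeled on the check in the RFC 3492 sample code. The function name `checked_delta_step` and the choice of an unsigned 32-bit `MAX_INT` are assumptions for illustration only, not something any of the specs mandate.

```python
# Assumption: the implementation's internal integer is unsigned 32-bit.
MAX_INT = 2**32 - 1


def checked_delta_step(delta, m, n, h, max_int=MAX_INT):
    """Return delta + (m - n) * (h + 1), failing instead of wrapping.

    m is the next code point to handle, n the current one, and h the
    number of code points handled so far (names as in RFC 3492).
    """
    # Rearranged so that no intermediate value can exceed max_int.
    if m - n > (max_int - delta) // (h + 1):
        raise OverflowError("Punycode delta would overflow")
    return delta + (m - n) * (h + 1)
```

Python integers do not wrap, so the explicit comparison stands in for what a fixed-width implementation has to detect before it multiplies.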
One way to specify this would be to specify that the internal integer size be 32 bits, but that can lead to denial of service attacks with unreasonably long inputs. (Apparently Chrome’s fuzzers managed to time out when fuzzing Punycode.) For this reason, ICU4C has somewhat arbitrary length limits for the inputs to Punycode decode and encode. https://unicode-org.atlassian.net/browse/ICU-13727 https://searchfox.org/mozilla-central/rev/6bc0f370cc459bf79e1330ef74b21009a9848c91/intl/icu/source/common/punycode.cpp#173-176
The rationale from the issue is:
The non-arbitrary tight bound would be to fail before decoding Punycode if the decoder input (not counting the `xn--` prefix) would exceed 59 (ASCII) characters and to fail during encoding if the encoder is (not counting the `xn--` prefix) about to output a 60th (ASCII) character.
Using the tight bound would come pretty close to setting `VerifyDNSLength` to true (close, but not exactly: It would still not place a limit for ASCII-only labels and the domain name as a whole). Yet, the URL Standard sets `VerifyDNSLength` to `false`. This comes from 3bec3b8, which does not state motivation.
Without knowing the motivation for setting `VerifyDNSLength` to `false`, it’s hard to assess if placing the tight bounds on Punycode would work.
I think the specs should make the behavior here well defined even if it’s not a particularly pressing issue, since it only concerns labels that are too long for DNS anyway. (This probably belongs in UTS 46, but filing this here for discussion before sending UTS 46 feedback.)
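To make the proposed tight bound concrete, here is a minimal Python sketch of an RFC 3492 encoder that fails when it is about to output a 60th character, plus a pre-check for decoder input. The names `encode_bounded`, `check_decoder_input`, `LabelTooLong`, and `MAX_BODY` are hypothetical; this is not how ICU4C or any browser implements it, only one possible reading of the bound described above.

```python
# Parameters from RFC 3492, section 5.
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
MAX_BODY = 59  # 63-octet DNS label minus the 4-octet "xn--" prefix


class LabelTooLong(ValueError):
    """Hypothetical error type for the tight bound."""


def _adapt(delta, num_points, first_time):
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def _digit(d):
    # 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
    return chr(ord("a") + d) if d < 26 else chr(ord("0") + d - 26)


def check_decoder_input(body, max_len=MAX_BODY):
    """Fail before decoding if the input (without "xn--") is too long."""
    if len(body) > max_len:
        raise LabelTooLong("decoder input exceeds %d characters" % max_len)


def encode_bounded(label, max_out=MAX_BODY):
    """Punycode-encode label (result excludes "xn--"), capping the output."""
    out = []

    def emit(ch):
        # Fail when about to produce character number max_out + 1.
        if len(out) >= max_out:
            raise LabelTooLong("encoder output exceeds %d characters" % max_out)
        out.append(ch)

    cps = [ord(c) for c in label]
    for c in cps:
        if c < INITIAL_N:
            emit(chr(c))
    b = h = len(out)  # number of basic (ASCII) code points
    if b > 0:
        emit("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while h < len(cps):
        m = min(c for c in cps if c >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in cps:
            if c < n:
                delta += 1
            elif c == n:
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    emit(_digit(t + (q - t) % (BASE - t)))
                    q = (q - t) // (BASE - t)
                    k += BASE
                emit(_digit(q))
                bias = _adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)
```

Under these assumptions, `encode_bounded("bücher")` returns the usual `bcher-kva`, while a label of 58 repetitions of `ü` raises `LabelTooLong`. Because every handled code point emits at least one character, the cap also keeps the amount of work done before failure small, which is the denial-of-service concern raised above.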