string.utf_codepoint returns an error for valid codepoints U+FFFE and U+FFFF #778

Closed
mooreryan opened this issue Dec 24, 2024 · 5 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mooreryan
Contributor

The string.utf_codepoint function returns an error for the valid codepoints U+FFFE (65,534) and U+FFFF (65,535).

The line of code is here.

These two codepoints are two of the 66 so-called "noncharacter code points". Here are some excerpts from 23.7.1 Noncharacters: U+FFFE, U+FFFF, and Others of the Unicode core spec:

...they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.... They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when “interchanged” internally, as when used in strings in APIs, in other interprocess protocols, or when stored.... It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.

If the goal is for string.utf_codepoint to return Error(Nil) for noncharacter codepoints, then the current check is incomplete: many other noncharacter codepoints, e.g., U+1FFFF, do not return an error.
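To make the "66 noncharacters" figure concrete, here is a minimal sketch in Rust (chosen only because the issue cites Rust as a reference point; it is not the stdlib's code). The helper name is_noncharacter is hypothetical. It assumes the standard definition of the set: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes.

```rust
/// Hypothetical helper, not part of any library: returns true if `cp` is one
/// of the Unicode noncharacter code points, i.e. U+FDD0..=U+FDEF (32 values)
/// or the last two code points of any of the 17 planes (34 values).
fn is_noncharacter(cp: u32) -> bool {
    cp <= 0x10FFFF
        && ((0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE)
}

fn main() {
    // Count every noncharacter in the full code point range.
    let count = (0..=0x10FFFFu32)
        .filter(|&cp| is_noncharacter(cp))
        .count();
    assert_eq!(count, 66);

    // U+FFFE and U+FFFF are only two members of the set; U+1FFFF is another.
    for cp in [0xFFFEu32, 0xFFFF, 0x1FFFF, 0xFDD0, 0x41] {
        println!("U+{cp:04X} noncharacter: {}", is_noncharacter(cp));
    }
}
```

A check that rejects only U+FFFE and U+FFFF therefore covers 2 of the 66 values, which is the inconsistency described above.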

However, I think the correct behavior is for the string.utf_codepoint function to return Ok for U+FFFE and U+FFFF. (For example, Gleam accepts the literals "\u{FFFE}" and "\u{FFFF}" as valid Unicode code points, and both Elixir and Rust also accept them as valid code points.)
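The Rust behavior mentioned here can be checked with the standard library's char::from_u32, which accepts any Unicode scalar value and rejects only surrogates and out-of-range integers. The wrapper to_codepoint below is a hypothetical name used to mirror the Ok/Error shape under discussion; it is a sketch of the proposed semantics, not the Gleam implementation.

```rust
/// Hypothetical wrapper: accepts every Unicode scalar value, including
/// noncharacters, and returns Err(()) only for surrogates (U+D800..=U+DFFF)
/// and for values above U+10FFFF.
fn to_codepoint(value: u32) -> Result<char, ()> {
    char::from_u32(value).ok_or(())
}

fn main() {
    // Noncharacters are still valid scalar values, so these succeed.
    assert!(to_codepoint(0xFFFE).is_ok());
    assert!(to_codepoint(0xFFFF).is_ok());
    assert!(to_codepoint(0x1FFFF).is_ok());

    // Surrogates and out-of-range values are the only rejections.
    assert!(to_codepoint(0xD800).is_err());
    assert!(to_codepoint(0x110000).is_err());

    println!("{:?}", to_codepoint(0xFFFE));
}
```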

@lpil lpil added good first issue Good for newcomers help wanted Extra attention is needed labels Dec 28, 2024
@lpil
Member

lpil commented Dec 28, 2024

Thank you

@jooaf

jooaf commented Dec 29, 2024

Hello, if possible, may I pick this up? For building out a solution and testing locally, should I just run gleam build and gleam test? Sorry if this is a naive question.

mooreryan added a commit to mooreryan/stdlib that referenced this issue Dec 29, 2024
Treats U+FFFE and U+FFFF as valid unicode codepoints rather than errors.  See gleam-lang#778.
@mooreryan
Contributor Author

mooreryan commented Dec 29, 2024

Hey @jooaf, I apologize. I did not notice that you had commented on this issue asking to work on it before I went ahead and fixed it. If you want, I could close my PR and you could take it (since it is a good first issue!).

@jooaf

jooaf commented Dec 29, 2024

Hey @mooreryan, no worries at all! That's so kind of you! Since you have already done the work, I think you should go ahead with your PR. I will just be on the lookout for another good first issue :)

mooreryan added a commit to mooreryan/stdlib that referenced this issue Dec 31, 2024
Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors.  See gleam-lang#778.
lpil pushed a commit to mooreryan/stdlib that referenced this issue Jan 3, 2025
Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors.  See gleam-lang#778.
lpil pushed a commit that referenced this issue Jan 3, 2025
Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors.  See #778.
@mooreryan
Contributor Author

Completed in #781.


3 participants