`string.utf_codepoint` returns an error for valid codepoints `U+FFFE` and `U+FFFF` #778

mooreryan · 2024-12-24T20:44:25Z

`string.utf_codepoint` returns an error for the valid codepoints `U+FFFE` (65,534) and `U+FFFF` (65,535).

The line of code is here.

These two codepoints are among the 66 so-called "noncharacter code points". Here are some excerpts from section 23.7.1, "Noncharacters: U+FFFE, U+FFFF, and Others", of the Unicode core spec:

> ...they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.... They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when “interchanged” internally, as when used in strings in APIs, in other interprocess protocols, or when stored.... It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.

If the goal is for `string.utf_codepoint` to return `Error(Nil)` for noncharacter codepoints, then the current check is incomplete: many other noncharacter codepoints do not return an error, e.g., `U+1FFFF`.
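To make the "66 noncharacters" figure concrete, here is a small sketch in Rust (not Gleam, and not the stdlib's code). It assumes the usual definition of the set: `U+FDD0` through `U+FDEF`, plus the last two code points of each of the 17 planes. The names `is_noncharacter` and `count_noncharacters` are made up for illustration.

```rust
/// Hypothetical helper: true if `cp` is a Unicode noncharacter code point.
/// Assumed definition: U+FDD0..=U+FDEF (32 code points) plus the last two
/// code points of each of the 17 planes (U+nFFFE and U+nFFFF, 34 in total).
pub fn is_noncharacter(cp: u32) -> bool {
    cp <= 0x10FFFF && ((0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE)
}

/// Hypothetical helper: counts noncharacters across the whole code space.
pub fn count_noncharacters() -> usize {
    (0..=0x10FFFFu32).filter(|&cp| is_noncharacter(cp)).count()
}
```

Under that definition `count_noncharacters()` comes out to 66, and `is_noncharacter(0x1FFFF)` is true, so a check that rejects only `U+FFFE` and `U+FFFF` covers just 2 of the 66.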

However, I think that the correct behavior would be for `string.utf_codepoint` to return `Ok` for `U+FFFE` and `U+FFFF`. (As examples, Gleam accepts the literals `"\u{FFFE}"` and `"\u{FFFF}"` as valid Unicode code points, and both Elixir and Rust also accept them as valid code points.)
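As a point of comparison, the Rust behavior mentioned above can be checked with the standard library's `char::from_u32`, which accepts any Unicode scalar value and rejects only surrogates and out-of-range integers. This is a sketch of Rust's rule, not of the Gleam implementation; `is_valid_codepoint` is a hypothetical wrapper name.

```rust
/// Hypothetical wrapper around `char::from_u32`: true when `cp` is a
/// Unicode scalar value (0..=0x10FFFF, excluding surrogates
/// 0xD800..=0xDFFF). Noncharacters such as U+FFFE count as valid here.
pub fn is_valid_codepoint(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Rust also accepts noncharacters in `char` literals.
pub const NONCHAR_LITERALS: [char; 2] = ['\u{FFFE}', '\u{FFFF}'];
```

With this rule, `is_valid_codepoint(0xFFFE)` and `is_valid_codepoint(0xFFFF)` are both true, while `is_valid_codepoint(0xD800)` (a surrogate) and `is_valid_codepoint(0x110000)` (out of range) are false.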

lpil · 2024-12-28T17:03:20Z

Thank you

jooaf · 2024-12-29T05:54:33Z

Hello, if it is possible, may I pick this up? For building out a solution and testing locally, should I just run `gleam build` and `gleam test`? Sorry if this is a naive question.

mooreryan · 2024-12-29T06:15:49Z

Hey @jooaf, I apologize--I did not notice you commented on this issue asking to work on it before I went ahead and fixed it. If you want I could close my PR and you could take it (since it is a good first issue!)

jooaf · 2024-12-29T06:26:53Z

Hey @mooreryan, no worries at all! That's so kind of you! Since you have already done the work, I think you should go ahead with your PR. I will just be on the lookout for another good first issue :)

mooreryan · 2025-01-21T06:30:23Z

Completed in #781.

lpil added the `good first issue` and `help wanted` labels Dec 28, 2024

mooreryan added a commit to mooreryan/stdlib that referenced this issue Dec 29, 2024

Fix non-character handling in string.utf_codepoint

4936620

Treats U+FFFE and U+FFFF as valid unicode codepoints rather than errors. See gleam-lang#778.

mooreryan mentioned this issue Dec 29, 2024

Fix non-character handling in string.utf_codepoint #781

Merged

mooreryan added a commit to mooreryan/stdlib that referenced this issue Dec 31, 2024

Fix non-character handling in string.utf_codepoint

bb8ec2e

Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors. See gleam-lang#778.

lpil pushed a commit to mooreryan/stdlib that referenced this issue Jan 3, 2025

Fix non-character handling in string.utf_codepoint

2ba40c5

Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors. See gleam-lang#778.

lpil pushed a commit that referenced this issue Jan 3, 2025

Fix non-character handling in string.utf_codepoint

6f44f83

Treats `U+FFFE` and `U+FFFF` as valid Unicode codepoints rather than errors. See #778.

mooreryan closed this as completed Jan 21, 2025
