Indicate discrepancies with Unicode specifications for UTF-16/32 schemes #128571

Open
youkidearitai opened this issue Jan 7, 2025 · 5 comments
Labels
docs Documentation in the Doc dir easy topic-unicode

Comments

@youkidearitai

youkidearitai commented Jan 7, 2025

Bug report

Bug description:

b"ab".decode("UTF-16")

According to https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G28070, when UTF-16 data has no BOM and no higher-level protocol specifies the byte order, UTF-16 is big-endian:

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

However, CPython's actual behavior appears to depend on the CPU architecture.

I tested on x86_64 (WSL Ubuntu) and on aarch64 (Raspberry Pi with Raspbian, and macOS).

The x86_64 result is 慢 (U+6162); the aarch64 result is 扡 (U+6261).
I think the byte order should be big-endian for UTF-16.
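A minimal sketch for reproducing the report on any machine. It decodes the same two bytes with the explicit `utf-16-be` and `utf-16-le` codecs, which are platform-independent, and then with plain `utf-16`, which CPython decodes in the native byte order when no BOM is present. The variable names are only illustrative.

```python
import sys

data = b"ab"  # bytes 0x61 0x62, no BOM

# Explicit byte orders give the same answer on every platform.
big = data.decode("utf-16-be")     # code point 0x6162
little = data.decode("utf-16-le")  # code point 0x6261

# Plain "utf-16" without a BOM falls back to the platform's native order.
native = data.decode("utf-16")

print(sys.byteorder, f"U+{ord(native):04X}")
```

Comparing the printed code point with `sys.byteorder` shows which of the two interpretations a given machine picked.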

CPython versions tested on:

3.10, 3.12

Operating systems tested on:

Linux, macOS

@youkidearitai youkidearitai added the type-bug An unexpected behavior, bug, or error label Jan 7, 2025
@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 7, 2025
@picnixz
Member

picnixz commented Jan 7, 2025

The codecs docs say:

These constants define various byte sequences, being Unicode byte order marks (BOMs) for several encodings. They are used in UTF-16 and UTF-32 data streams to indicate the byte order used, and in UTF-8 as a Unicode signature. BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform’s native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE. The others represent the BOM in UTF-8 and UTF-32 encodings.

The important part is:

BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform’s native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE.

AFAIU, BOM_UTF16 is platform-dependent and the UTF-16 codec uses BOM_UTF16, so its behaviour is platform-dependent too. Finally, this is backed by the following statement (4th paragraph of https://docs.python.org/3/library/codecs.html#encodings-and-unicode).

All of these encodings can only encode 256 of the 1114112 code points defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each code point as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called UTF-32-BE and UTF-32-LE respectively. Their disadvantage is that if e.g. you use UTF-32-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-32 avoids this problem: bytes will always be in natural endianness

Thus, AFAIK, the behaviour is correct. But we could definitely improve the docs so that this information is not buried across multiple pages.
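To make the quoted documentation concrete, here is a small sketch that checks the `codecs` BOM constants against `sys.byteorder` and confirms that the generic `utf-16` and `utf-32` encoders start their output with the native BOM. It is an illustration of the documented behaviour, not a statement about how the codecs are implemented internally.

```python
import codecs
import sys

# Pick the constants that correspond to this machine's byte order.
if sys.byteorder == "little":
    native_bom16, native_bom32 = codecs.BOM_UTF16_LE, codecs.BOM_UTF32_LE
else:
    native_bom16, native_bom32 = codecs.BOM_UTF16_BE, codecs.BOM_UTF32_BE

# BOM_UTF16 and BOM_UTF32 are the native-order variants.
bom16_matches = codecs.BOM_UTF16 == native_bom16
bom32_matches = codecs.BOM_UTF32 == native_bom32

# The generic encoders prepend that native BOM, so their output round-trips.
encoded16 = "a".encode("utf-16")
encoded32 = "a".encode("utf-32")

print(bom16_matches, encoded16.startswith(codecs.BOM_UTF16))
print(bom32_matches, encoded32.startswith(codecs.BOM_UTF32))
```

Because the encoder always writes a BOM, the platform dependence only becomes visible when decoding data that has none.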

@picnixz
Member

picnixz commented Jan 7, 2025

cc @serhiy-storchaka

@picnixz picnixz added the pending The issue will be closed if no feedback is provided label Jan 7, 2025
@picnixz
Member

picnixz commented Jan 7, 2025

I also don't think our UTF-16 codec needs to match UTF-16 as defined in the Unicode specs. And if we wanted to match it, I think it would cause a lot of breaking changes, so I'm not sure we'll ever be able to change this. An alternative is to create a new encoding, named utf16-ces for "UTF-16 canonical encoding scheme", that exactly matches the specification.
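For illustration only, the spec-conforming behaviour could be sketched as a plain helper function. `decode_utf16_ces` is a hypothetical name, not an existing CPython codec or API; it honours a BOM if one is present and otherwise assumes big-endian, as the Unicode text quoted in the report describes.

```python
import codecs


def decode_utf16_ces(data: bytes, errors: str = "strict") -> str:
    """Hypothetical helper: decode per the Unicode UTF-16 encoding scheme."""
    # A leading BOM selects the byte order and is not part of the text.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be", errors)
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le", errors)
    # No BOM and no higher-level protocol: the spec says big-endian.
    return data.decode("utf-16-be", errors)


print(f"U+{ord(decode_utf16_ces(b'ab')):04X}")
```

Unlike `b"ab".decode("utf-16")`, this helper returns the same string on every platform, which is the property the proposed separate encoding would provide.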

@serhiy-storchaka
Member

I am afraid that this is a case where the Unicode specification contradicts practice, and Python chose to follow practice. For example, on Linux, UTF-16 without a BOM is treated as little-endian on a little-endian machine.

$ echo abc | iconv -t utf-16le | iconv -f utf-16
abc
$ echo abc | iconv -t utf-16be | iconv -f utf-16
愀戀挀਀
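The garbled second output is what a byte-order mismatch looks like. A rough Python equivalent, using the explicit `utf-16-be` and `utf-16-le` codecs so that the result does not depend on the machine running it:

```python
text = "abc\n"

# Encode big-endian, then decode as little-endian: every 16-bit unit
# is read with its two bytes swapped.
be_bytes = text.encode("utf-16-be")    # 00 61 00 62 00 63 00 0A
swapped = be_bytes.decode("utf-16-le")

print([f"U+{ord(ch):04X}" for ch in swapped])
```

Each ASCII character ends up in the high byte of its code unit, which is why `a`, `b`, `c` and the newline turn into U+6100, U+6200, U+6300 and U+0A00.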

I think that Windows also uses little-endian, as it is natural on little-endian machines.

Changing UTF-16 now would be a major breaking change. But we should clarify the difference from the Unicode specification more explicitly in the documentation.

@picnixz
Member

picnixz commented Jan 7, 2025

But we should clarify more explicitly the difference with the Unicode specification in the documentation.

I will categorize this issue as a doc issue instead.

@picnixz picnixz added docs Documentation in the Doc dir and removed type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) pending The issue will be closed if no feedback is provided labels Jan 7, 2025
@picnixz picnixz changed the title b"ab".decode("UTF-16") result is depends on CPU architecture endians Indicate discrepencies with Unicode specifications for BOM-dependent schemes Jan 7, 2025
@picnixz picnixz changed the title Indicate discrepencies with Unicode specifications for BOM-dependent schemes Indicate discrepencies with Unicode specifications for UTF-16/32 schemes Jan 7, 2025
@picnixz picnixz added the easy label Jan 14, 2025