You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Extend the RACP specification to include support for unicode a la the UTF standards.
Proposal
We define the current RACP characters as being the Basic Characters and introduce Unicode Characters. When referencing a unicode codepoint we use the U+ method. Likewise, for the basic characters we will use B+ as a reference to a basic character codepoint.
The basic characters are encoded as previously defined using the 1 byte values 0-127.
The unicode characters are then encoded using 2-4 bytes, where all the bytes have the highest order bit set to 1. Therefore, we can differentiate between a basic and unicode characters by only looking at the highest order bit. The Unicode characters are encoded using the following specification:
We define a 'character sequence' as being 1-4 consecutive bytes that encoded a basic characters or a valid unicode character. A string of characters is therefore a sequence of character sequences:
Number of bytes
Bits for code point
First code point
Last code point
Capacity
Header
Byte 2
Byte 3
Byte 4
1
7
B+0
B+7F
128
0xxx xxxx
2
11
U+0
U+7FF
2048
110x xxxx
10xx xxxx
3
16
U+800
U+107FF
65535
1110 xxxx
10xx xxxx
10xx xxxx
3
14
U+10800
U+147FF
16383
1111 00xx
10xx xxxx
10xx xxxx
3
14
U+14800
U+187FF
16383
1111 01xx
10xx xxxx
10xx xxxx
3
14
U+18800
U+1C7FF
16383
1111 10xx
10xx xxxx
10xx xxxx
4
20
U+1C800
U+110000
997376
1111 11xx
10xx xxxx
10xx xxxx
10xx xxxx
The first byte in any character sequence encodes which type the sequence has and is called the header byte. As specified before, if the header byte has its highest order bit set to 0, then the sequence has the length 1 (meaning the sequence is 1 byte long, containing only the header byte) and encodes a basic character.
If a byte starts with '11' then it is the header byte in a sequence. If it starts with '10' it is not the header. In the header, the bits 5 and 6 (where 8 is the highest order bit) define the type of the sequence the header is part of. We'll call those two bits the type bits.
If the highest type bit is 0, the sequence has a length of two and the remaining 11 bits are used to encode the codepoints U+3FF to U+7FF as we will see below.
If the type bits are '10', the sequence has a length of 3 and the remaining 16 bits are used to encode the code points U+800 to U+107FF.
If the type bits are '11' we define the bits 3 and 4 as the specialization bits. If these bits are '00', '01', or '10' the sequence also has a length of 3 and the remaining 12 bits are used to encode the code points U+10800 to U+1C7FF.
If the specialization bits are '11' the sequence has a length of 4 and the remaining 20 bits are used to encode the code points U+1C800 to U+110000. There is room for up to U+11C7FF, but since Unicode is restricted to U+110000 so is this encoding.
Encoding the Unicode characters
Much like UTF-8 the integer value of a code point is spead out into the sequence bytes. First, the capacity of the type of sequence the codepoint goes into is subtracted from the codepoints integer value.
The lowest order bits in the resulting value are inserted, in order, into the remaining bits of the last byte. The next remaining lowest order bits are likewise inserted into the next to last byte, and so on until all the bits are inserted in the unassigned bits of each byte in the sequence. Examples:
Codepoint
Encoded value
u32 bitwise encoded value
encoding
B+E
14
... 0000 1110
0 000 1110
U+E
14
... 0000 1110
1100 0000 l 10 00 1110
U+44c
76
... 0100 1100
1101 0001 l 10 00 1100
U+270F
7951
... 0001 1111 l 0000 1111
1110 0001 l 10 11 1100 l 10 00 1111
U+15B38
4920
... 0001 0011 l 0011 1000
1111 01 01 l 10 00 1100 l 10 11 1000
U+1B207
10759
... 0010 1010 l 0000 0111
1111 10 10 l 10 10 1000 l 10 00 0111
U+10F447
994375
... 0000 1111 l 0010 1100 l 0100 0111
1111 11 11 l 10 11 0010 l 10 11 0001 l 10 00 0111
Features of the encoding
Compatability with current RACP
The current encoding of the basic RACP characters is also a valid encoding in the proposal.
Error detection
Receiving an incomplete character sequence can be detected by looking at the first two bits in the first byte. If its not a valid header or a basic character then the sequence is incomplete and the decoder can report the error. Since the header can be looked at to see how many bytes a sequence should have, receiving too few can also be detected.
Streamed
The header byte contains all the information about the sequence, meaning a sequence can be instantly decoded without having to look for the header of the next sequence or end of stream.
Reversible and Searchable
The start of a character sequence can be found by reversing until a header is found. This can be used in jumping searches to ensure that you can always find the start of a character sequence you land on.
Sorting order
A list of encoded unicode characters can be sorted in code point order by directly sorting the encoded value. Therefore, sorting can be done without having to decode the character. Seudo-proof:
Unicode point
Encoding
Encoded value
0
1100 0000 1000 0000
49280
1023
1100 1111 1011 1111
53183
1024
1101 0000 1000 0000
53376
2047
1101 1111 1011 1111
57279
2048
1110 0000 1000 0000 1000 0000
14712960
67583
1110 1111 1011 1111 1011 1111
15712191
67584
1111 0000 1000 0000 1000 0000
15761536
83967
1111 0011 1011 1111 1011 1111
15974335
83968
1111 0100 1000 0000 1000 0000
16023680
100351
1111 0111 1011 1111 1011 1111
16236479
100352
1111 1000 1000 0000 1000 0000
16285824
116735
1111 1011 1011 1111 1011 1111
16498623
116736
1111 1100 1000 0000 1000 0000 1000 0000
4236279936
1165311
1111 1111 1011 1111 1011 1111 1011 1111
4290756543
Negatives
2-byte unicode
Since the basic characters take up all the space in one byte a unicode character is encoded as at least 2 byte. This is constrasted with UTF8 where the ASCII-equivalent characters are also 8 bytes. On the other hand, this does allow the basic characters to be only 1 byte, which can be seen as equivalent to UTF-8.
Complexity
Compared to UTF-8, which has only 4 sequence types, this encoding is more complex, having 5 effective sequence types (Only 1 test is needed to identify the specialized 3 byte sequences). This will incur some performance hit regarding encoding/decoding. The increased number of sequence types was chosen to allow almost double the number of codepoint to be encoded using 3 or less bytes. This allows most characters in the supplementary multilingual plane to be 3 bytes, where they need to be 4 bytes in UTF-8. An additional subtraction/addition is needed in when encoding/decoding that is not needed by UTF-8.
On the other hand, since UTF-8 requires a valid encoding use the minimum number of bytes for a given code point, there might be some work there that this encoding does not need to do.
An experimental analyses should be done to measure the impact of encode/decode compared to UTF-8.
The text was updated successfully, but these errors were encountered:
Extend the RACP specification to include support for unicode a la the UTF standards.
Proposal
We define the current RACP characters as being the
Basic Characters
and introduceUnicode Characters
. When referencing a unicode codepoint we use theU+
method. Likewise, for the basic characters we will useB+
as a reference to a basic character codepoint.The basic characters are encoded as previously defined using the 1 byte values 0-127.
The unicode characters are then encoded using 2-4 bytes, where all the bytes have the highest order bit set to 1. Therefore, we can differentiate between a basic and unicode characters by only looking at the highest order bit. The Unicode characters are encoded using the following specification:
We define a 'character sequence' as being 1-4 consecutive bytes that encoded a basic characters or a valid unicode character. A string of characters is therefore a sequence of character sequences:
B+0
B+7F
U+0
U+7FF
U+800
U+107FF
U+10800
U+147FF
U+14800
U+187FF
U+18800
U+1C7FF
U+1C800
U+110000
The first byte in any character sequence encodes which type the sequence has and is called the header byte. As specified before, if the header byte has its highest order bit set to 0, then the sequence has the length 1 (meaning the sequence is 1 byte long, containing only the header byte) and encodes a basic character.
If a byte starts with '11' then it is the header byte in a sequence. If it starts with '10' it is not the header. In the header, the bits 5 and 6 (where 8 is the highest order bit) define the type of the sequence the header is part of. We'll call those two bits the type bits.
If the highest type bit is
0
, the sequence has a length of two and the remaining 11 bits are used to encode the codepointsU+3FF
toU+7FF
as we will see below.If the type bits are '10', the sequence has a length of 3 and the remaining 16 bits are used to encode the code points
U+800
toU+107FF
.If the type bits are '11' we define the bits 3 and 4 as the specialization bits. If these bits are '00', '01', or '10' the sequence also has a length of 3 and the remaining 12 bits are used to encode the code points
U+10800
toU+1C7FF
.If the specialization bits are '11' the sequence has a length of 4 and the remaining 20 bits are used to encode the code points
U+1C800
toU+110000
. There is room for up toU+11C7FF
, but since Unicode is restricted toU+110000
so is this encoding.Encoding the Unicode characters
Much like UTF-8 the integer value of a code point is spead out into the sequence bytes. First, the capacity of the type of sequence the codepoint goes into is subtracted from the codepoints integer value.
The lowest order bits in the resulting value are inserted, in order, into the remaining bits of the last byte. The next remaining lowest order bits are likewise inserted into the next to last byte, and so on until all the bits are inserted in the unassigned bits of each byte in the sequence. Examples:
Features of the encoding
Compatability with current RACP
The current encoding of the basic RACP characters is also a valid encoding in the proposal.
Error detection
Receiving an incomplete character sequence can be detected by looking at the first two bits in the first byte. If its not a valid header or a basic character then the sequence is incomplete and the decoder can report the error. Since the header can be looked at to see how many bytes a sequence should have, receiving too few can also be detected.
Streamed
The header byte contains all the information about the sequence, meaning a sequence can be instantly decoded without having to look for the header of the next sequence or end of stream.
Reversible and Searchable
The start of a character sequence can be found by reversing until a header is found. This can be used in jumping searches to ensure that you can always find the start of a character sequence you land on.
Sorting order
A list of encoded unicode characters can be sorted in code point order by directly sorting the encoded value. Therefore, sorting can be done without having to decode the character. Seudo-proof:
Negatives
2-byte unicode
Since the basic characters take up all the space in one byte a unicode character is encoded as at least 2 byte. This is constrasted with UTF8 where the ASCII-equivalent characters are also 8 bytes. On the other hand, this does allow the basic characters to be only 1 byte, which can be seen as equivalent to UTF-8.
Complexity
Compared to UTF-8, which has only 4 sequence types, this encoding is more complex, having 5 effective sequence types (Only 1 test is needed to identify the specialized 3 byte sequences). This will incur some performance hit regarding encoding/decoding. The increased number of sequence types was chosen to allow almost double the number of codepoint to be encoded using 3 or less bytes. This allows most characters in the supplementary multilingual plane to be 3 bytes, where they need to be 4 bytes in UTF-8. An additional subtraction/addition is needed in when encoding/decoding that is not needed by UTF-8.
On the other hand, since UTF-8 requires a valid encoding use the minimum number of bytes for a given code point, there might be some work there that this encoding does not need to do.
An experimental analyses should be done to measure the impact of encode/decode compared to UTF-8.
The text was updated successfully, but these errors were encountered: