The four levels of the Unicode Character Encoding Model can be summarized as:
- ACR: Abstract Character Repertoire
  the set of characters to be encoded, for example, some alphabet or symbol set
- CCS: Coded Character Set
  a mapping from an abstract character repertoire to a set of nonnegative integers
- CEF: Character Encoding Form
  a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
- CES: Character Encoding Scheme
  a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
In addition to the four individual levels, there are two other useful concepts:
- CM: Character Map
  a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
- TES: Transfer Encoding Syntax
  a reversible transform of encoded data, which may or may not contain textual data
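- The four levels
  - Abstract Character Repertoire (ACR). For example, the letters and symbols we use in everyday writing are abstract characters.
  - Coded Character Set (CCS). Every abstract character is mapped to a nonnegative integer (its code point).
  - Character Encoding Form (CEF). Converts the integers from the previous level into sequences of code units.
  - Character Encoding Scheme (CES). Serializes the sequences of code units into bytes. Examples include UTF-8 and UTF-16.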
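As a rough illustration of the four levels, the sketch below walks a single character through them using only the Python standard library. Choosing UTF-16 as the encoding form and big-endian as the encoding scheme is an assumption made for the example, and the helper name `walk_levels` is hypothetical.

```python
import struct


def walk_levels(ch: str) -> dict:
    """Trace one abstract character through CCS, CEF, and CES (UTF-16 assumed)."""
    # CCS: the abstract character maps to a nonnegative integer (code point).
    code_point = ord(ch)
    # CES: the code units are serialized to bytes, here in big-endian order.
    serialized = ch.encode("utf-16-be")
    # CEF: recover the 16-bit code units by reading the bytes two at a time.
    count = len(serialized) // 2
    code_units = struct.unpack(f">{count}H", serialized)
    return {
        "character": ch,
        "code_point": f"U+{code_point:04X}",
        "code_units": [f"0x{unit:04X}" for unit in code_units],
        "bytes": serialized.hex(),
    }


print(walk_levels("A"))
print(walk_levels("\U00010437"))
```

For `"A"` the code point, the single code unit, and the two bytes all carry the same value, while `U+10437` needs two code units and four bytes.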
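- code point
  - Notation: `U+1FFFF` (`U+` followed by a hexadecimal number).
  - A code point is a number that identifies a character.
  - The same code point can occupy a different amount of space in different encodings. For example, every code point takes 4 bytes in UTF-32, whereas in UTF-8 a code point has a variable length of 1 to 4 bytes.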
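To make the size difference concrete, here is a small sketch that prints a character's code point in `U+` notation together with its encoded size. The helper name `describe` is hypothetical, and the big-endian codec variants are used only so that Python does not prepend a BOM to the output.

```python
def describe(ch: str) -> str:
    """Format a character's code point and its size in three encodings."""
    codecs_to_try = [
        ("UTF-8", "utf-8"),
        ("UTF-16", "utf-16-be"),  # "-be" variants avoid an automatic BOM
        ("UTF-32", "utf-32-be"),
    ]
    sizes = " ".join(f"{name}={len(ch.encode(codec))}B" for name, codec in codecs_to_try)
    return f"U+{ord(ch):04X} {sizes}"


for ch in ["A", "\u00e9", "\u4e2d", "\U00010437"]:
    print(describe(ch))
```

UTF-32 stays at 4 bytes for every character, while the UTF-8 size grows from 1 to 4 bytes as the code point gets larger.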
- code unit
  - Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
  - In other words, the smallest group of bits an encoding form works with: 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.
- UCS-2
  - Fixed width of 2 bytes (16 bits). It can only represent the BMP.
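- UTF-16
  - Its predecessor is UCS-2, which cannot represent code points in the supplementary planes (such as the SMP). UTF-16 was created to remove that limitation.
  - Each code point takes 2 or 4 bytes (16 or 32 bits), that is, one or two code units.
  - Layout:
    - `U+0000..U+D7FF` and `U+E000..U+FFFF`: BMP code points, each stored directly as a single 2-byte code unit.
    - `U+D800..U+DFFF`: reserved for surrogates. A surrogate pair (two 2-byte code units) represents one code point outside the BMP.
      - high surrogate: the first 2 bytes, in the range `0xD800..0xDBFF`.
      - low surrogate: the second 2 bytes, in the range `0xDC00..0xDFFF`.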
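  - Calculation (example: `0x10437`):
    - Subtract `0x10000`. result = `0x00437`, in binary `0000 0000 0100 0011 0111`.
    - Split the result into the high 10 bits `0x0001` and the low 10 bits `0x0037`.
    - high surrogate = `0x0001 + 0xD800 = 0xD801`
    - low surrogate = `0x0037 + 0xDC00 = 0xDC37`
    - So the UTF-16 representation of `0x10437` is the code unit pair `0xD801 0xDC37` (bytes `D8 01 DC 37` in big-endian order).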
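The calculation above can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helpers written for these notes, not part of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse operation: rebuild the code point from a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10437)
print(hex(high), hex(low))                   # 0xd801 0xdc37
print(hex(from_surrogate_pair(high, low)))   # 0x10437
```

The result agrees with Python's built-in codec: `"\U00010437".encode("utf-16-be")` yields the bytes `D8 01 DC 37`.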
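  - Storage
    - Because each code unit spans more than one byte, there are two possible byte orders:
      - UTF-16BE: big-endian (the default assumed when no BOM or higher-level protocol says otherwise)
      - UTF-16LE: little-endian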
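A quick way to see the two byte orders side by side is to encode the same character with Python's `utf-16-be` and `utf-16-le` codecs. This is only a sketch: the sample character `U+10437` is reused from the calculation earlier in these notes.

```python
text = "\U00010437"

big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

print(big_endian.hex())     # d801dc37
print(little_endian.hex())  # 01d837dc

# Reading the bytes with the wrong byte order silently produces different text.
misread = big_endian.decode("utf-16-le")
print(misread == text)      # False
```

Each 16-bit code unit keeps its position in the sequence. Only the two bytes inside each code unit are swapped.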
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:
- What byte order, or endianness, the text stream is stored in;
- The fact that the text stream is Unicode, to a high level of confidence;
- Which of several Unicode encodings that text stream is encoded as.
-- from Byte order mark (BOM)
- Avoid using a BOM where possible.
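For input that does carry a BOM, a consumer can check the first two bytes to pick the byte order. The sketch below assumes the data is UTF-16 and deliberately ignores the UTF-8 and UTF-32 BOMs (the UTF-32LE BOM also starts with `FF FE`). The function name `sniff_utf16_bom` is hypothetical.

```python
import codecs
from typing import Optional, Tuple


def sniff_utf16_bom(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (codec_name, payload_without_bom), or (None, data) if no UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    return None, data


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
codec_name, payload = sniff_utf16_bom(sample)
print(codec_name, payload.decode(codec_name))  # utf-16-le hi
```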