.de EX
.if t .ft 5
.nf
..
.de EE
.ft 1
.fi
..
.TH UTF 7
.SH NAME
UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Plan 9 character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
The Unicode Standard represents its characters in 16
bits;
.SM UTF-8
represents such
values in an 8-bit byte stream.
Throughout this manual,
.SM UTF-8
is shortened to
.SM UTF.
.PP
In Plan 9, a
.I rune
is a 16-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
machine-independent, byte-stream encoding called
.SM UTF.
.PP
.SM UTF
is designed so that characters in the 7-bit
.SM ASCII
set (values hexadecimal 00 to 7F)
appear only as themselves
in the encoding.
Runes with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.
.PP
The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
:
programs presented only with
.SM ASCII
work on Plan 9
even if not written to deal with
.SM UTF,
as do
programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
.SM ASCII
graphic
characters must convert from
.SM UTF
to runes
in order to work properly with non-\c
.SM ASCII
input.
See
.IR rune (3).
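.PP
As a hedged illustration of why byte-oriented code needs care,
the sketch below counts runes rather than bytes.
It assumes well-formed
.SM UTF
and relies on the byte layout given later in this page,
in which every byte of the form 10bbbbbb continues a rune
begun by an earlier byte.
The name
.I utfcount
is hypothetical and is not part of any Plan 9 library.
.EX
#include <stddef.h>

/*
 * Count the runes in the n bytes at s.
 * Assumption: s holds well-formed UTF, so every byte that is
 * not a continuation byte (10bbbbbb) starts a new rune.
 */
size_t
utfcount(const unsigned char *s, size_t n)
{
	size_t i, count;

	count = 0;
	for(i = 0; i < n; i++)
		if((s[i] & 0xC0) != 0x80)
			count++;
	return count;
}
.EE
.PP
For the six bytes 68 C3 A9 6C 6C 6F this sketch returns 5,
since C3 A9 together encode a single rune.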
.PP
Letting numbers be binary,
a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
.PP
01. x in [00000000.0bbbbbbb] → 0bbbbbbb
.br
10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
Conversions 10 and 11 represent higher-valued characters
as sequences of two or three bytes with the high bit set.
Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
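.PP
The three conversions above can be sketched in C as follows.
This is a minimal illustration, not the library routine
described in
.IR rune (3);
the names
.I Rune16
and
.I utfencode
are hypothetical, and the rune is assumed to fit in 16 bits
as the text states.
.EX
typedef unsigned short Rune16;	/* assumption: a 16-bit rune */

/*
 * Encode rune r into buf, which must have room for 3 bytes,
 * using the shortest of conversions 01, 10 and 11.
 * Returns the number of bytes written.
 */
int
utfencode(unsigned char *buf, Rune16 r)
{
	if(r <= 0x7F){			/* 01: 0bbbbbbb */
		buf[0] = r;
		return 1;
	}
	if(r <= 0x7FF){			/* 10: 110bbbbb 10bbbbbb */
		buf[0] = 0xC0 | (r >> 6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	/* 11: 1110bbbb 10bbbbbb 10bbbbbb */
	buf[0] = 0xE0 | ((r >> 12) & 0x0F);
	buf[1] = 0x80 | ((r >> 6) & 0x3F);
	buf[2] = 0x80 | (r & 0x3F);
	return 3;
}
.EE
.PP
With this sketch, rune hexadecimal 00E9 encodes as the two bytes C3 A9
and rune 20AC as the three bytes E2 82 AC.
Testing the ranges in increasing order is what selects the shortest encoding.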
.PP
In the inverse mapping,
any sequence except those described above
is incorrect and is converted to rune hexadecimal 0080.
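.PP
The inverse mapping might be sketched as below.
The names
.I Rune16
and
.I utfdecode
are hypothetical.
Consuming exactly one byte for an incorrect sequence is an assumption
of this sketch; the text above specifies only the resulting rune.
.EX
typedef unsigned short Rune16;	/* assumption: a 16-bit rune */

enum { Runeerror = 0x80 };	/* rune produced for incorrect input */

/*
 * Decode one rune from the n bytes at s into *r.
 * Returns the number of bytes consumed (0 if n < 1).
 * Incorrect, truncated or non-shortest sequences yield Runeerror
 * and consume one byte (an assumption of this sketch).
 */
int
utfdecode(Rune16 *r, const unsigned char *s, int n)
{
	unsigned int c;

	*r = Runeerror;
	if(n < 1)
		return 0;
	if(s[0] < 0x80){		/* 0bbbbbbb */
		*r = s[0];
		return 1;
	}
	if((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80){
		c = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		if(c < 0x80)		/* not the shortest encoding */
			return 1;
		*r = c;
		return 2;
	}
	if((s[0] & 0xF0) == 0xE0 && n >= 3
	&& (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80){
		c = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if(c < 0x800)		/* not the shortest encoding */
			return 1;
		*r = c;
		return 3;
	}
	return 1;			/* anything else is incorrect */
}
.EE
.PP
Under these assumptions the bytes E2 82 AC decode to rune 20AC,
while the two-byte form C0 80 of rune 0 is rejected
because a shorter encoding exists.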
.SH "SEE ALSO"
.IR ascii (1),
.IR tcs (1),
.IR rune (3),
.IR "The Unicode Standard" .