.\" Alexander Gutkin | 439f3d1 | 2014-02-28 11:33:45 +0000
.de EX
.if t .ft 5
.nf
..
.de EE
.ft 1
.fi
..
.TH UTF 7
.SH NAME
UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Plan 9 character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
The Unicode Standard represents its characters in 16
bits;
.SM UTF-8
represents such
values in an 8-bit byte stream.
Throughout this manual,
.SM UTF-8
is shortened to
.SM UTF.
.PP
In Plan 9, a
.I rune
is a 16-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
machine-independent, byte-stream encoding called
.SM UTF.
.PP
.SM UTF
is designed so that the characters of the 7-bit
.SM ASCII
set (values hexadecimal 00 to 7F)
appear only as themselves
in the encoding.
Runes with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.
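.PP
Because every byte of a multibyte sequence lies in the range 80 to FF,
a byte-wise search for an
.SM ASCII
character cannot match in the middle of another character.
The C fragment below is a minimal sketch of that property, not part of any Plan 9 library;
the name
.I lastelem
is hypothetical.
.EX
```c
#include <string.h>

/*
 * Hypothetical helper: return the text after the final slash
 * in a NUL-terminated UTF path, or the whole path if it has
 * no slash.  A plain byte search is enough because the ASCII
 * byte for slash never occurs inside a multibyte sequence.
 */
const char *
lastelem(const char *path)
{
	const char *p = strrchr(path, '/');

	return p != NULL ? p + 1 : path;
}
```
.EE
Applied to a path whose directory names contain non-\c
.SM ASCII
runes, the function still returns the text after the last slash.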
.PP
The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
:
programs presented only with
.SM ASCII
work on Plan 9
even if not written to deal with
.SM UTF,
as do
programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
.SM ASCII
graphic
characters must convert from
.SM UTF
to runes
in order to work properly with non-\c
.SM ASCII
input.
See
.IR rune (3).
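.PP
As a small illustration of why byte-oriented processing falls short,
the sketch below counts runes rather than bytes in a NUL-terminated
.SM UTF
string by skipping bytes of the form 10bbbbbb.
It assumes well-formed input, and the name
.I utfcount
is hypothetical; it is not one of the routines documented in
.IR rune (3).
.EX
```c
#include <stddef.h>

/*
 * Hypothetical helper: count the runes in a NUL-terminated,
 * well-formed UTF string.  Every rune contributes exactly one
 * byte that is not a continuation byte (10bbbbbb), so counting
 * those bytes counts the runes.
 */
size_t
utfcount(const char *s)
{
	size_t n = 0;

	for (; *s != 0; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```
.EE
For pure
.SM ASCII
input the result equals the byte length; for other input it is smaller.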
.PP
Letting numbers be binary,
a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
.PP
01. x in [00000000.0bbbbbbb] → 0bbbbbbb
.br
10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
Conversions 10 and 11 represent higher-valued characters
as sequences of two or three bytes with the high bit set.
Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X/Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
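.PP
The three conversions can be sketched in C as follows.
This is an illustration under stated assumptions, not the Plan 9 library routine:
the type
.I Rune16
and the function
.I sketch_runetoutf
are hypothetical names, and the caller is assumed to supply a buffer of at least three bytes.
.EX
```c
typedef unsigned short Rune16;	/* assumed 16-bit rune type */

/*
 * Hypothetical encoder: write the UTF form of rune x into buf
 * and return the number of bytes written (1, 2, or 3).
 * Testing the ranges in increasing order always selects the
 * shortest encoding.
 */
int
sketch_runetoutf(unsigned char *buf, Rune16 x)
{
	unsigned v = x & 0xFFFF;	/* keep exactly 16 bits */

	if (v <= 0x7F) {		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)v;
		return 1;
	}
	if (v <= 0x7FF) {		/* conversion 10: 110bbbbb 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (v >> 6));
		buf[1] = (unsigned char)(0x80 | (v & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (v >> 12));
	buf[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (v & 0x3F));
	return 3;
}
```
.EE
For example, rune hexadecimal 20AC is written as the three bytes E2, 82, AC,
and rune 00E9 as the two bytes C3, A9.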
.PP
In the inverse mapping,
any sequence except those described above
is incorrect and is converted to rune hexadecimal 0080.
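.PP
A decoder following this rule might look like the sketch below.
The names
.I Rune16
and
.I sketch_utftorune
are hypothetical.
Consuming a single byte after an incorrect sequence, and treating
encodings longer than the shortest form as incorrect, are assumptions
of this sketch rather than statements from the text.
.EX
```c
#include <stddef.h>

typedef unsigned short Rune16;	/* assumed 16-bit rune type */

enum { SKETCH_RUNEERROR = 0x0080 };	/* error rune named in the text */

/*
 * Hypothetical decoder: read one rune from the n bytes at s,
 * store it in *r, and return the number of bytes consumed.
 * An incorrect sequence yields rune 0x0080 and consumes one
 * byte (an assumption); empty input consumes nothing.
 */
int
sketch_utftorune(Rune16 *r, const unsigned char *s, size_t n)
{
	unsigned c, x;

	*r = SKETCH_RUNEERROR;
	if (n == 0)
		return 0;
	c = s[0];
	if (c < 0x80) {			/* 0bbbbbbb */
		*r = (Rune16)c;
		return 1;
	}
	if (n < 2 || (s[1] & 0xC0) != 0x80)
		return 1;		/* missing continuation byte */
	if ((c & 0xE0) == 0xC0) {	/* 110bbbbb 10bbbbbb */
		x = ((c & 0x1F) << 6) | (s[1] & 0x3F);
		if (x < 0x80)
			return 1;	/* not the shortest encoding */
		*r = (Rune16)x;
		return 2;
	}
	if ((c & 0xF0) == 0xE0 && n >= 3 && (s[2] & 0xC0) == 0x80) {
		/* 1110bbbb 10bbbbbb 10bbbbbb */
		x = ((c & 0x0F) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3F);
		if (x < 0x800)
			return 1;	/* not the shortest encoding */
		*r = (Rune16)x;
		return 3;
	}
	return 1;			/* any other lead byte is incorrect */
}
```
.EE
Note that the error value coincides with a legitimate rune,
so a caller that needs to tell the two apart must compare the returned length
with the bytes actually present.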
.SH "SEE ALSO"
.IR ascii (1),
.IR tcs (1),
.IR rune (3),
.IR "The Unicode Standard" .