Language(s) | International |
---|---|
Standard | Unicode Standard |
Classification | Unicode Transformation Format, extended ASCII, variable-width encoding |
Extends | US-ASCII |
Transforms / Encodes | ISO 10646 (Unicode) |
Preceded by | UTF-1 |
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | |||
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | ||
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
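
The four rows above translate fairly directly into bit operations. The following is a minimal sketch, not a reference implementation: the function name `utf8_encode` is hypothetical, and the rejection of surrogate code points (U+D800 to U+DFFF) is an assumption taken from the UTF-8 definition in general rather than from the table itself.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte layout in the table above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # Assumption: surrogates are rejected, as standard UTF-8 requires.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Because each range is tested in ascending order, a code point always receives the shortest encoding that can hold it. The results can be compared with Python's built-in codec, for example `utf8_encode(0x20AC) == "€".encode("utf-8")`.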
Character | Code point | Octal code point | Binary code point | Binary UTF-8 | Octal UTF-8 | Hexadecimal UTF-8 |
---|---|---|---|---|---|---|
$ | U+0024 | 044 | 010 0100 | 00100100 | 044 | 24 |
¢ | U+00A2 | 0242 | 000 1010 0010 | 11000010 10100010 | 302 242 | C2A2 |
ह | U+0939 | 004471 | 00001001 0011 1001 | 11100000 10100100 10111001 | 340 244 271 | E0A4B9 |
€ | U+20AC | 020254 | 00100000 1010 1100 | 11100010 10000010 10101100 | 342 202 254 | E282AC |
𐍈 | U+10348 | 0201510 | 0 0001 00000011 0100 1000 | 11110000 10010000 10001101 10001000 | 360 220 215 210 | F0908D88 |
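
The rows above can also be read in reverse, from bytes back to a code point. The sketch below decodes a single sequence. The name `utf8_decode_one` is hypothetical, and the checks for overlong forms, surrogates and values above U+10FFFF are assumptions based on the general UTF-8 rules rather than on anything stated in the table.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence and return its code point."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    # Pick length, payload bits of the lead byte, and smallest legal value.
    if b0 < 0x80:
        length, cp, minimum = 1, b0, 0x0
    elif 0xC0 <= b0 < 0xE0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:            # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # Assumed validity checks: no overlong forms, surrogates, or > U+10FFFF.
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("ill-formed sequence")
    return cp
```

For instance, `utf8_decode_one(bytes.fromhex("E282AC"))` returns `0x20AC`, matching the euro-sign row.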
 | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 | _A | _B | _C | _D | _E | _F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0_ | NUL 0000 | SOH 0001 | STX 0002 | ETX 0003 | EOT 0004 | ENQ 0005 | ACK 0006 | BEL 0007 | BS 0008 | HT 0009 | LF 000A | VT 000B | FF 000C | CR 000D | SO 000E | SI 000F |
1_ | DLE 0010 | DC1 0011 | DC2 0012 | DC3 0013 | DC4 0014 | NAK 0015 | SYN 0016 | ETB 0017 | CAN 0018 | EM 0019 | SUB 001A | ESC 001B | FS 001C | GS 001D | RS 001E | US 001F |
2_ | SP 0020 | ! 0021 | " 0022 | # 0023 | $ 0024 | % 0025 | & 0026 | ' 0027 | ( 0028 | ) 0029 | * 002A | + 002B | , 002C | - 002D | . 002E | / 002F |
3_ | 0 0030 | 1 0031 | 2 0032 | 3 0033 | 4 0034 | 5 0035 | 6 0036 | 7 0037 | 8 0038 | 9 0039 | : 003A | ; 003B | < 003C | = 003D | > 003E | ? 003F |
4_ | @ 0040 | A 0041 | B 0042 | C 0043 | D 0044 | E 0045 | F 0046 | G 0047 | H 0048 | I 0049 | J 004A | K 004B | L 004C | M 004D | N 004E | O 004F |
5_ | P 0050 | Q 0051 | R 0052 | S 0053 | T 0054 | U 0055 | V 0056 | W 0057 | X 0058 | Y 0059 | Z 005A | [ 005B | \ 005C | ] 005D | ^ 005E | _ 005F |
6_ | ` 0060 | a 0061 | b 0062 | c 0063 | d 0064 | e 0065 | f 0066 | g 0067 | h 0068 | i 0069 | j 006A | k 006B | l 006C | m 006D | n 006E | o 006F |
7_ | p 0070 | q 0071 | r 0072 | s 0073 | t 0074 | u 0075 | v 0076 | w 0077 | x 0078 | y 0079 | z 007A | { 007B | \| 007C | } 007D | ~ 007E | DEL 007F |
8_ | • +00 | • +01 | • +02 | • +03 | • +04 | • +05 | • +06 | • +07 | • +08 | • +09 | • +0A | • +0B | • +0C | • +0D | • +0E | • +0F |
9_ | • +10 | • +11 | • +12 | • +13 | • +14 | • +15 | • +16 | • +17 | • +18 | • +19 | • +1A | • +1B | • +1C | • +1D | • +1E | • +1F |
A_ | • +20 | • +21 | • +22 | • +23 | • +24 | • +25 | • +26 | • +27 | • +28 | • +29 | • +2A | • +2B | • +2C | • +2D | • +2E | • +2F |
B_ | • +30 | • +31 | • +32 | • +33 | • +34 | • +35 | • +36 | • +37 | • +38 | • +39 | • +3A | • +3B | • +3C | • +3D | • +3E | • +3F |
2 C_ | 2 0000 | 2 0040 | Latin 0080 | Latin 00C0 | Latin 0100 | Latin 0140 | Latin 0180 | Latin 01C0 | Latin 0200 | IPA 0240 | IPA 0280 | IPA 02C0 | accents 0300 | accents 0340 | Greek 0380 | Greek 03C0 |
2 D_ | Cyril 0400 | Cyril 0440 | Cyril 0480 | Cyril 04C0 | Cyril 0500 | Armeni 0540 | Hebrew 0580 | Hebrew 05C0 | Arabic 0600 | Arabic 0640 | Arabic 0680 | Arabic 06C0 | Syriac 0700 | Arabic 0740 | Thaana 0780 | N'Ko 07C0 |
3 E_ | Indic 0800 | Misc. 1000 | Symbol 2000 | Kana… 3000 | CJK 4000 | CJK 5000 | CJK 6000 | CJK 7000 | CJK 8000 | CJK 9000 | Asian A000 | Hangul B000 | Hangul C000 | Hangul D000 | PUA E000 | Forms F000 |
4 F_ | SMP… 10000 | ? 40000 | ? 80000 | SSP… C0000 | SPU… 100000 | 4 140000 | 4 180000 | 4 1C0000 | 5 200000 | 5 1000000 | 5 2000000 | 5 3000000 | 6 4000000 | 6 40000000 |
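
The layout above can be summarized as a classification of each byte value by its high bits. The following is an illustrative sketch only: the names `classify_byte` and `first_code_point` and the label strings are hypothetical. Treating C0, C1 and F5 to FF as invalid is an assumption based on current UTF-8 rules (overlong two-byte leads, and leads for values beyond U+10FFFF), not something the table states outright.

```python
def classify_byte(b: int) -> str:
    """Return the role a single byte value plays in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"            # rows 0_ to 7_
    if b <= 0xBF:
        return "continuation"     # rows 8_ to B_
    if b <= 0xC1:
        return "invalid"          # C0, C1: could only start overlong forms
    if b <= 0xDF:
        return "lead-2"           # rows C_ and D_
    if b <= 0xEF:
        return "lead-3"           # row E_
    if b <= 0xF4:
        return "lead-4"           # F0 to F4
    return "invalid"              # F5 to FF: beyond U+10FFFF


# lead kind -> (payload mask, number of continuation bytes, smallest code point)
_LEADS = {
    "lead-2": (0x1F, 1, 0x80),
    "lead-3": (0x0F, 2, 0x800),
    "lead-4": (0x07, 3, 0x10000),
}


def first_code_point(lead: int) -> int:
    """Smallest code point whose encoding starts with this lead byte."""
    kind = classify_byte(lead)
    if kind not in _LEADS:
        raise ValueError("not a valid lead byte")
    mask, trailing, minimum = _LEADS[kind]
    # Clamp to the minimum so E0 and F0 skip the overlong range.
    return max((lead & mask) << (6 * trailing), minimum)
```

The values agree with the cells in the table: `first_code_point(0xCE)` is `0x380` (Greek) and `first_code_point(0xE1)` is `0x1000`.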
In normal usage, Java supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform's default character set or as requested by the program). However, it uses Modified UTF-8 for object serialization[32] among other applications of DataInput and DataOutput, for the Java Native Interface,[33] and for embedding constant strings in class files.[34] The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values.[35] Tcl also uses the same modified UTF-8[36] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.

Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
---|---|---|---|---|---|---|---|
1 | U+0000 | U+009F | 00–9F | ||||
2 | U+00A0 | U+00FF | A0 | A0–FF | |||
2 | U+0100 | U+4015 | A1–F5 | 21–7E, A0–FF | |||
3 | U+4016 | U+38E2D | F6–FB | 21–7E, A0–FF | 21–7E, A0–FF | ||
5 | U+38E2E | U+7FFFFFFF | FC–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF |
Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
---|---|---|---|---|---|---|---|
1 | U+0000 | U+007F | 0xxxxxxx | ||||
2 | U+0080 | U+207F | 10xxxxxx | 1xxxxxxx | |||
3 | U+2080 | U+8207F | 110xxxxx | 1xxxxxxx | 1xxxxxxx | ||
4 | U+82080 | U+208207F | 1110xxxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx | |
5 | U+2082080 | U+7FFFFFFF | 11110xxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx |
Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|---|
1 | U+0000 | U+007F | 0xxxxxxx | |||||
2 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | ||||
3 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |||
4 | U+10000 | U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | ||
5 | U+200000 | U+3FFFFFF | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
6 | U+4000000 | U+7FFFFFFF | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
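
The six-row layout above follows one pattern: the lead byte carries a run of 1 bits equal to the sequence length, and every later byte carries six payload bits. A table-driven sketch of that pattern is shown below. The names `LAYOUT` and `encode_31bit` are hypothetical, and the code mirrors the table only: it makes no attempt to apply the U+10FFFF limit or the surrogate exclusion of present-day UTF-8.

```python
# (last code point, total bytes, lead-byte prefix) for rows 2 to 6 of the table
LAYOUT = [
    (0x7FF,      2, 0xC0),   # 110xxxxx
    (0xFFFF,     3, 0xE0),   # 1110xxxx
    (0x1FFFFF,   4, 0xF0),   # 11110xxx
    (0x3FFFFFF,  5, 0xF8),   # 111110xx
    (0x7FFFFFFF, 6, 0xFC),   # 1111110x
]


def encode_31bit(cp: int) -> bytes:
    """Encode a value of up to 31 bits using the 1- to 6-byte layout above."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp <= 0x7F:
        return bytes([cp])
    # Find the first row whose range contains the value.
    for limit, length, prefix in LAYOUT:
        if cp <= limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))   # low six bits -> 10xxxxxx
        cp >>= 6
    # The bits left over go into the lead byte; tail was built low-to-high.
    return bytes([prefix | cp] + tail[::-1])
```

For values up to U+10FFFF outside the surrogate range, the output coincides with standard UTF-8, so `encode_31bit(0x20AC)` gives the same three bytes as `"€".encode("utf-8")`.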
The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
The Basic Multilingual Plane (BMP, or Plane 0) contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters. By far the majority of all Unicode characters for almost all textual data can be found in the BMP.
it looks like Win7 silently enhanced support for codepage 65001. Significant limitations do remain - in particular redirection and piping still fail under codepage 65001. Nevertheless, the added support opens up some new exciting possibilities.
Java virtual machine UTF-8 strings never have embedded nulls.
[…] encoded in modified UTF-8.
The JNI uses modified UTF-8 strings to represent various string types.
[…] differences between this format and the 'standard' UTF-8 format.
[T]he dex format encodes its string data in a de facto standard modified UTF-8 form, hereafter referred to as MUTF-8.
In orthodox UTF-8, a NUL byte (\x00) is represented by a NUL byte. […] But […] we […] want NUL bytes inside […] strings […]