Unicode

From Wikipedia


The California-based Unicode Consortium first published "The Unicode Standard" in 1991, and continues to develop standards based on that original work. The goal of Unicode is to assign a single unique integer, called a code point, to every character needed by every human language. These code points can be used to define character encodings and to facilitate conversion between other encodings. The Unicode character set is kept synchronized with ISO 10646, the corresponding standard of the International Organization for Standardization.
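The mapping between characters and integers can be seen directly in a language with built-in Unicode strings. The following is a minimal Python sketch; the helper name code_point_label is invented here for illustration and is not part of any standard.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the integer that Unicode assigns to the character.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))        # U+0041
print(code_point_label("\u00e9"))   # U+00E9 (e with acute accent)
print(code_point_label("\u4e2d"))   # U+4E2D (a Chinese character)

# chr() is the inverse mapping, from integer back to character.
assert chr(0x41) == "A"
```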

The character set is divided into 17 planes, each of which holds 65536 code points. Most characters in everyday use are in the first plane, the Basic Multilingual Plane (BMP). (The remaining planes are mainly for ancient Egyptian hieroglyphics, rare Chinese characters, and other specialized uses.) The Unicode standard allows for 1,114,112 code points overall, numbered 0 through 0x10FFFF. The first 256 code points precisely match those of ISO 8859-1, the most popular 8-bit character encoding.
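Because each plane holds 65536 (0x10000) code points, the plane of a character is simply its code point divided by 65536. The sketch below illustrates this in Python and also checks the correspondence with ISO 8859-1; plane_of is a hypothetical helper name, and "latin-1" is Python's codec name for ISO 8859-1.

```python
def plane_of(ch: str) -> int:
    """Return the plane number of a character (plane 0 is the BMP)."""
    return ord(ch) >> 16  # same as ord(ch) // 0x10000


print(plane_of("A"))            # 0
print(plane_of("\U00013000"))   # 1 (an Egyptian hieroglyph)
print(plane_of("\U00020000"))   # 2 (a rare Chinese character)

# Decoding every possible ISO 8859-1 byte yields code points 0..255 in order.
decoded = bytes(range(256)).decode("latin-1")
assert [ord(c) for c in decoded] == list(range(256))
```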

Several encodings of Unicode have been defined. One of these is UCS-2, which is a 16-bit encoding, sufficient to encode every code point in the BMP in one 16-bit word. This encoding is what is often meant by "Unicode". UCS-2 is fixed-width and cannot represent code points outside the BMP. UTF-16 extends it: code points in the BMP are encoded exactly as in UCS-2, while code points from other planes require two 16-bit words, known as a surrogate pair. UCS-2 comes from the ISO 10646 standard, while UTF-16 is defined by both ISO 10646 and the Unicode Consortium standard.
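Representing a code point from outside the BMP as two 16-bit words can be sketched as follows. This is an illustrative Python version of the UTF-16 arithmetic, not a production encoder: the function name utf16_units is an assumption, and the sketch does not reject the reserved surrogate range 0xD800 to 0xDFFF as a real encoder would.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit words that encode one character in UTF-16."""
    cp = ord(ch)
    if cp < 0x10000:
        # BMP code point: a single 16-bit word holds it directly.
        return [cp]
    # Other planes: subtract 0x10000, leaving a 20-bit value, then split it
    # into two 10-bit halves carried by a high and a low surrogate.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


print([hex(u) for u in utf16_units("A")])            # ['0x41']
print([hex(u) for u in utf16_units("\U0001D11E")])   # ['0xd834', '0xdd1e']
```

The result can be compared with Python's built-in codec: "\U0001D11E".encode("utf-16-be") produces the bytes D8 34 DD 1E.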

Another encoding is UCS-4, which is a 32-bit encoding. This encoding is capable of expressing every Unicode code point, from any plane, in one 32-bit word. It is not often used externally because it spends four bytes on every character, but many programs use it internally since it is the easiest representation to manipulate (if full Unicode support, including non-BMP planes, is sought). UTF-32 is another name for this encoding: UCS-4 implies the ISO 10646 standard, while UTF-32 implies the Unicode Consortium standard; but the two differ only on a few minor points.
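Since every code point fits in one 32-bit word, a 32-bit encoder needs no special cases. The Python sketch below packs each code point as a big-endian 32-bit integer; utf32_be is a hypothetical helper name, and the big-endian byte order is chosen here only for the sake of the example.

```python
import struct


def utf32_be(text: str) -> bytes:
    """Encode text as one big-endian 32-bit word per code point."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


sample = "A\U0001D11E"           # one BMP and one non-BMP character
encoded = utf32_be(sample)
print(len(encoded))              # 8: four bytes per character

# The hand-written version agrees with Python's built-in codec.
assert encoded == sample.encode("utf-32-be")
```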

Another common encoding, UTF-8, expresses each Unicode character as a sequence of one to four 8-bit bytes. This encoding has the property of being identical to ASCII if only the first 128 code points are used.
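The compatibility with ASCII, and the way the byte count grows for higher code points, can be checked with Python's built-in codecs. In this short sketch, utf8_length is a helper name invented for the example.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# Text limited to the first 128 code points encodes to the same bytes
# in ASCII and in UTF-8.
text = "plain ASCII"
assert text.encode("utf-8") == text.encode("ascii")

print(utf8_length("A"))            # 1
print(utf8_length("\u00e9"))       # 2
print(utf8_length("\u4e2d"))       # 3
print(utf8_length("\U0001D11E"))   # 4
```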

Any of these encodings can be used to represent text in any language (or in any mixture of languages). In addition to the character set and its encodings, the consortium produces related standards and tools for computer programming.

Recent web browsers display web pages using Unicode if an appropriate font is installed.
