Unicode and HTML

HTML 4.0 uses Unicode as its document character set. In practice, though, documents are usually stored in an 8-bit character encoding that can represent only a small slice of that set. Characters from the whole of Unicode can still appear in an HTML document through a numeric character reference: &#N;, where N is the decimal number of the Unicode code point, or &#xN;, where N is the same number in hexadecimal. (Hexadecimal references are a more recent addition and are therefore less widely supported than decimal ones.) There is also a standard set of named character entity references for commonly used symbols that some character encodings lack, so one can write &mdash;, for example, to produce an em dash—like this—even if the document's character encoding does not contain that character.
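
As a rough illustration of the paragraph above, the following Python sketch replaces every character that a chosen 8-bit encoding cannot represent with a decimal, hexadecimal, or named reference. The function names char_ref and escape_for_encoding, and the default of ISO-8859-1, are assumptions made for this example and do not come from the HTML specification; the table of HTML 4 entity names is taken from Python's standard html.entities module. The sketch does not escape markup characters such as & and <.

```python
from html.entities import codepoint2name  # HTML 4 entity names keyed by code point


def char_ref(ch, prefer_named=False, hexadecimal=False):
    """Return a character reference for the single character `ch`."""
    code = ord(ch)
    if prefer_named and code in codepoint2name:
        return f"&{codepoint2name[code]};"   # named form, e.g. &mdash;
    if hexadecimal:
        return f"&#x{code:X};"               # hexadecimal form, e.g. &#x2014;
    return f"&#{code};"                      # decimal form, e.g. &#8212;


def escape_for_encoding(text, encoding="iso-8859-1", **options):
    """Keep characters the encoding can represent; replace the rest with references."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)              # representable: keep as-is
            pieces.append(ch)
        except UnicodeEncodeError:           # not representable: use a reference
            pieces.append(char_ref(ch, **options))
    return "".join(pieces)


print(escape_for_encoding("a\u2014b"))                     # a&#8212;b
print(escape_for_encoding("a\u2014b", hexadecimal=True))   # a&#x2014;b
print(escape_for_encoding("a\u2014b", prefer_named=True))  # a&mdash;b
```

When only decimal references are wanted, Python's built-in "xmlcharrefreplace" error handler gives the same result in one step: "a\u2014b".encode("iso-8859-1", "xmlcharrefreplace") yields the bytes a&#8212;b.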

Many browsers, though, can display only a small subset of the full UCS-2 repertoire. For example, the references &#916; &#1049; &#1511; &#1605; &#3671; &#12353; &#21494; &#33865; &#45307; display in your browser as Δ, Й, ק, م, ๗, ぁ, 叶, 葉, and 냻, which ideally look like the Greek letter "Delta", the Cyrillic letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana small "A", the simplified Chinese character for "leaf", the traditional Chinese character for "leaf", and a Korean syllable, respectively.

Some multilingual web browsers merge the required font sets dynamically on demand. Microsoft's Internet Explorer 5.5 on Windows, for example, can display all the Unicode characters on this page simultaneously once the appropriate "text display support packs" have been downloaded, and its "install on demand" feature prompts the user whenever a new font is needed. Other browsers, such as Netscape Navigator 4.77, can display only the text supported by the current font associated with the character encoding of the page.

With the latter type of browser, it is unlikely that your computer has all of the necessary fonts, and the browser cannot use all available fonts on the same page in any case. As a result, the browser will not display all of the text above correctly, though it may display a subset of it. Because the characters are encoded according to the standard, though, they will display correctly on any system that is compliant and has the characters available. Further, characters that have been given names for use in named entity references are likely to be more commonly available than others.
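
To check which code point each of the nine sample characters above corresponds to, independently of whether a given browser has a font for it, the decimal references can be decoded programmatically. The following Python sketch is one way to do that with the standard html and unicodedata modules; the names SAMPLE_REFS and describe are hypothetical, chosen for this example only.

```python
import html
import unicodedata

# Decimal references for the nine sample characters discussed above.
SAMPLE_REFS = ("&#916; &#1049; &#1511; &#1605; &#3671; "
               "&#12353; &#21494; &#33865; &#45307;")


def describe(refs):
    """Decode each reference and return (reference, code point, Unicode name) rows."""
    rows = []
    for ref in refs.split():
        ch = html.unescape(ref)                      # "&#916;" becomes the character itself
        name = unicodedata.name(ch, "<no name available>")
        rows.append((ref, f"U+{ord(ch):04X}", name))
    return rows


for ref, code_point, name in describe(SAMPLE_REFS):
    print(f"{ref:10} {code_point}  {name}")
```

The first row printed is &#916;, U+0394, GREEK CAPITAL LETTER DELTA. The names come from the Unicode character database bundled with Python, so they do not depend on any installed fonts.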

See also: Wiki_special_characters