Text encoding

A text encoding is a method of representing a piece of text as a sequence of codes (drawn from a character encoding) for the purpose of computer storage or electronic communication of that text. While character encodings like ASCII represent the individual characters of a language, a text encoding has to represent much larger things like articles and books. It must represent not only the characters they contain but also the structure and organization of the text, and perhaps information about the text or its appearance. Common examples are HTML and RTF, which represent texts in natural languages, and XML, which can represent many kinds of text not necessarily intended to be human-readable (the contents of a database, for example).

Though character encodings like ASCII and Unicode are not, strictly speaking, text encodings in their own right, they may serve as very simple text encodings if one wishes to preserve only the textual content of a document and not necessarily its formatting. By far the most common text encoding now in use is what might informally be called "Plain ASCII", which involves simply encoding a text as a stream of ASCII characters. The specifics of how this is done vary greatly. For example, the end of a text line might be encoded as ASCII code 10 ("line feed" or "new line"), as is common practice on Unix machines; as ASCII code 13 ("carriage return"), as was common on older Apple machines; or as both (the sequence <13, 10> is used to end lines on MS-DOS-based machines and many others, while the rather rare sequence <10, 13> was used by some Acorn machines). Some texts also use this line-end sequence inside paragraphs (with a blank line between paragraphs), while others end a line only at the end of a paragraph. Various texts in this form also interpret code 9 ("tab") and other control characters differently. None of these conventions specifies how to identify text structure such as headings and tables, or special text forms such as italics. Text in this format is readable by virtually any computer, though some work might be needed to accommodate local variations, and all information besides the actual words of the text is lost.
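
The "work needed to accommodate local variations" can be illustrated with a short Python sketch. It is only one possible approach, not a standard procedure: the function names normalize_line_ends and paragraphs are hypothetical, and the sketch assumes that each of the four line-end conventions mentioned above should count as a single line end and that paragraphs are separated by blank lines. Text that mixes conventions can be ambiguous (a line feed followed by a carriage return could be one Acorn-style line end or two separate ones), so the result is a heuristic.

```python
import re

# Try the two-code sequences first so that <13, 10> and <10, 13>
# are each treated as ONE line end rather than two.
_LINE_END = re.compile(r"\r\n|\n\r|\r|\n")


def normalize_line_ends(text: str, line_end: str = "\n") -> str:
    """Rewrite every recognized line-end sequence as `line_end`."""
    # A function is used as the replacement so that `line_end` is
    # inserted literally, with no escape processing by `re`.
    return _LINE_END.sub(lambda _match: line_end, text)


def paragraphs(text: str) -> list[str]:
    """Split text into paragraphs, assuming blank lines separate them.

    Line ends inside a paragraph are treated as ordinary spaces, and
    tabs and repeated spaces are collapsed to a single space.
    """
    normalized = normalize_line_ends(text).strip("\n")
    blocks = re.split(r"\n\s*\n", normalized)
    return [" ".join(block.split()) for block in blocks if block.strip()]


if __name__ == "__main__":
    sample = "First line\r\nsame paragraph\r\n\r\nSecond paragraph\r\n"
    for number, paragraph in enumerate(paragraphs(sample), start=1):
        print(number, paragraph)
```

Note that the sketch recovers only lines and paragraphs. As the text says, headings, tables and italics cannot be recovered, because Plain ASCII never recorded them.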