# Regular expression

Regular expressions are a family of compact and powerful languages for describing sets of strings. They are used by many text editors and utilities, especially under the Unix operating system (e.g. sed and awk), to search bodies of text for certain patterns and, for example, to replace the strings found with some other string.

The origin of regular expressions lies in automata theory and formal language theory, both part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. A formal language is nothing but a set of strings. In the 1940s, Warren McCulloch and Walter Pitts described the nervous system by modelling neurons as small simple automata. The mathematician Stephen Kleene later described these models using a mathematical notation he called regular sets. Ken Thompson built this notation into qed (the predecessor of the Unix editor ed) and eventually into grep. Ever since, regular expressions have been widely used in Unix and Unix-like utilities such as expr, awk, Emacs and Perl, most of them using the regex library built by Henry Spencer.

### Regular expressions in Formal Language Theory

Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ, the following constants are defined:

1. (empty set) ∅ denoting the set ∅
2. (empty string) ε denoting the set {ε}
3. (literal character) a in Σ denoting the set {"a"}

and the following operations:

1. (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.
2. (set union) R U S denoting the set union of R and S.
3. (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", ... }.
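These three operations can be sketched for finite sets of strings in a few lines of Python (the function names and the length cutoff for the star are our own illustrative choices, since the true R* is infinite):

```python
def concat(r, s):
    """Concatenation RS = { alpha + beta | alpha in R, beta in S }."""
    return {a + b for a in r for b in s}

def union(r, s):
    """Set union R U S."""
    return r | s

def star(r, max_len):
    """Truncated Kleene star: all concatenations of strings in R
    of length at most max_len (the true R* is infinite)."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {a + b for a in frontier for b in r
                    if 0 < len(a + b) <= max_len} - result
        result |= frontier
    return result

# The concatenation example from the text:
assert concat({"ab", "c"}, {"d", "ef"}) == {"abd", "abef", "cd", "cef"}
```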

To avoid brackets it is assumed that the Kleene star has the highest priority, followed by concatenation and then set union. If there is no ambiguity, brackets may be omitted. For example, (ab)c is written as abc, and a U (b(c*)) can be written as a U bc*.

Sometimes the '+' operator is also included such that R+ is the smallest superset of R that is closed under string concatenation, but this operator is redundant because R+ = RR*. Another redundant operator that is often added is the complement operator ~ such that ~R denotes the set of all strings over Σ that are not in R. In that case the resulting operators form a Kleene algebra.
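The identity R+ = RR* can be spot-checked with Python's re module, taking R to be the single letter "a" (a sanity check on a few sample strings, not a proof):

```python
import re

# For R = "a": "a+" should accept exactly the same strings as "aa*".
for s in ["", "a", "aa", "aaa", "b", "ab", "ba"]:
    assert bool(re.fullmatch("a+", s)) == bool(re.fullmatch("aa*", s))
```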

Examples:

• a U b* denotes {"a", ε, "b", "bb", "bbb", ...}
• (a U b)* denotes the set of all strings consisting of 'a's and 'b's, including the empty string
• (a*b*)* denotes the same set
• ab*(c U ε) denotes the set of strings starting with 'a', then zero or more 'b's and finally optionally a 'c'.
• a*(ba*ba*)* denotes the set of all strings containing an even number of 'b's
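These examples can be checked with any regex engine that supports full-string matching; below they are rendered in Python's re syntax, where set union U is written '|' and ε can be expressed as an empty alternative:

```python
import re

assert re.fullmatch(r"a|b*", "bbb")        # in b*
assert re.fullmatch(r"a|b*", "")           # b* contains the empty string
assert not re.fullmatch(r"a|b*", "ab")     # neither alternative matches
assert re.fullmatch(r"(a|b)*", "abba")     # any mix of 'a's and 'b's
assert re.fullmatch(r"ab*(c|)", "abbc")    # 'a', some 'b's, optional 'c'
assert re.fullmatch(r"ab*(c|)", "a")
```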

Regular expressions can express exactly the class of languages accepted by finite state automata: the regular languages. There is, however, a significant difference in compactness: some families of regular languages can only be described by deterministic automata that grow exponentially in size, while the required regular expressions grow only linearly.
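A standard textbook family illustrating this gap (our choice of example, not taken from the text above) is "the n-th symbol from the end is an 'a'": the regular expression below grows linearly with n, while a deterministic automaton for the same language needs about 2^n states, because it must remember the last n input symbols.

```python
import re

def nth_from_end_is_a(n):
    # (a|b)* a (a|b){n-1}: anything, then 'a', then exactly n-1 symbols,
    # so the 'a' sits exactly n positions from the end of the string.
    return r"(a|b)*a(a|b){%d}" % (n - 1)

pattern = re.compile(nth_from_end_is_a(3))
assert pattern.fullmatch("bbabb")       # third symbol from the end is 'a'
assert not pattern.fullmatch("babbb")   # third from the end is 'b'
```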

We can also study expressive power within the formalism. As the examples above show, different regular expressions can express the same language: the formalism is redundant. To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and set union are obviously required, but perhaps we can restrict their use. This turns out to be a surprisingly difficult problem. As simple as regular expressions are, there is no method to systematically rewrite them to some normal form; they are not finitely axiomatizable. So we have to resort to other methods. This leads to the star height problem.

### Regular expressions in Unix

Basic Unix regexps omit set union (usually called the "choice operator" in this context), but add

• '.' (the wildcard character, matching any character)
• character classes, with negation: [a-z] matches any lowercase letter, and [^a-z] matches any character except a lowercase letter.
• '^' (start of line) and '$' (end of line)

Examples:

".at" matches any three-character string ending with "at"
"[hc]at" matches "hat" and "cat"
"[^b]at" matches any three-character string ending with "at" whose first character is not 'b'
"^[hc]at" matches "hat" and "cat", but only at the beginning of a line
"[hc]at$" matches "hat" and "cat", but only at the end of a line
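The examples above can be reproduced in Python, whose re module supports the same constructs (re.fullmatch tests a whole string, re.search finds the pattern anywhere, with '^' and '$' anchoring as described):

```python
import re

assert re.fullmatch(r".at", "bat")            # '.' matches any character
assert re.fullmatch(r"[hc]at", "cat")
assert not re.fullmatch(r"[^b]at", "bat")     # negated class excludes 'b'
assert re.search(r"^[hc]at", "cat on a mat")  # anchored to line start
assert re.search(r"[hc]at$", "a black hat")   # anchored to line end
assert not re.search(r"[hc]at$", "a hatter")
```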

The egrep utility extends this with

• '+' to express 'the preceding 1 or more times'
• '?' (0 or 1 times the preceding)
• '|' (the choice operator)

Examples:

"[hc]+at" matches "hat", "cat", "hhat", "chat", "hcat", "ccat", et cetera
"[hc]?at" matches "hat", "cat" and "at"
"([cC]at|[dD]og)" matches "cat", "Cat", "dog" and "Dog"
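In Python's re syntax, which includes the egrep extensions, these behave as described:

```python
import re

assert re.fullmatch(r"[hc]+at", "chat")       # '+': one or more
assert not re.fullmatch(r"[hc]+at", "at")     # ... but at least one
assert re.fullmatch(r"[hc]?at", "at")         # '?': zero or one
assert re.fullmatch(r"[cC]at|[dD]og", "Dog")  # '|': the choice operator
assert not re.fullmatch(r"[cC]at|[dD]og", "dOg")
```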

Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols, they have to be "escaped" somehow if they are meant literally. This is done by preceding them with '\', which therefore also has to be "escaped" this way when meant literally.

Examples:

"\.at" matches the string ".at"
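Python's re module follows the same escaping rule, and its re.escape inserts these backslashes mechanically for a whole string (shown here as an aside, since the text above describes the Unix tools rather than Python):

```python
import re

assert re.fullmatch(r"\.at", ".at")       # escaped '.' is a literal dot
assert not re.fullmatch(r"\.at", "bat")   # ... and no longer a wildcard
assert re.fullmatch(re.escape("a+b"), "a+b")    # '+' escaped, so literal
assert not re.fullmatch(re.escape("a+b"), "aab")
```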

Other utilities often add their own conventions. Perl, for example, has a particularly rich set of extensions that even allows bracket matching, which is usually not possible with more conventional regular expressions. POSIX standardization tried to remedy this fragmentation: it offers regex matching that is configurable with many binary flags to enable or disable features or change the syntax (on many Unix systems, such as Linux, 'man regex' has the details). But many new features have been added to Perl, for instance, that the standard does not encompass.