Computing Science 2355 | Scripting Languages

Regular Expressions

Regular expressions were developed originally to define members of one particular class of "formal languages". They can be used to form "patterns" that describe specific strings or categories of strings. Though there are some differences from one language to another, particularly in the built-in functions that are used to perform "pattern matching", the basic ideas underlying regular expressions are largely the same across most programming languages.

This page assumes you already know something about regular expressions and just want to quickly look up something you can't quite remember. For much more detail, explanation and lots of examples see here.

Metacharacters and Ordinary Characters

These metacharacters do not match themselves unless they are escaped:
\ | ( ) [ ] { } ^ $ * + ? .

These "ordinary" characters match themselves:

The alphanumeric characters:
A B C ... Z a b c ... z 0 1 2 ... 9
These non-metacharacter punctuation characters:
! " # % & ' , - / : ; < = > @ _ ` ~
The blank space character

The period metacharacter (.) matches any single character except a newline.
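
As a minimal sketch of the difference between a metacharacter and its escaped form, the following assumes Python and its standard re module (this page does not name a language, so that choice is an assumption). The sample strings are made up for illustration.

```python
import re

# An unescaped period matches any single character except a newline,
# so "a\nc" is not matched here.
dot_matches = re.findall(r"a.c", "abc a-c a\nc")

# An escaped period matches only a literal period.
literal_dot = re.findall(r"a\.c", "abc a.c")

# re.escape() puts a backslash in front of every metacharacter in a
# string, which is handy when the text to find is not known in advance.
price_pattern = re.escape("$5.00 (approx)")
price_found = re.search(price_pattern, "cost: $5.00 (approx)") is not None
```

Here dot_matches holds the two three-character strings "abc" and "a-c", while literal_dot holds only "a.c".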

Character Classes (and their "negations")

[adps] [^adps]
[246] [^246]
[a-z] [^a-z]
[3-7] [^3-7]

Note that in character classes the dash (-) is a metacharacter (indicating a range) unless it appears as the first or last character in the class, where it cannot indicate a range, and the caret (^) is a metacharacter (indicating negation) only when it appears as the first character.
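
The classes listed above can be tried out with a short sketch like this one, which assumes Python's re module (an assumption, since the page is language-neutral). It also shows the dash and the caret acting as ordinary characters when their position rules out the special meaning.

```python
import re

# A class matches any one of the listed characters.
in_class = re.findall(r"[adps]", "spade")

# A leading caret negates the class.
not_in_class = re.findall(r"[^adps]", "spade")

# A dash between two characters denotes a range.
in_range = re.findall(r"[3-7]", "0123456789")

# A dash placed first cannot denote a range, so it matches itself.
dash_literal = re.findall(r"[-a]", "a-b")

# A caret that is not first is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")
```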

Predefined Character Classes

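\d (a digit)
\D (any character that is not a digit)
\w (a "word" character: a letter, a digit or an underscore)
\W (any character that is not a word character)
\s (a whitespace character, such as a space, tab or newline)
\S (any character that is not whitespace)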
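
A brief sketch of the predefined classes in use, assuming Python's re module (an assumption; other languages spell the function calls differently). The sample text is invented. Note that in Python these classes also match non-ASCII letters and digits by default.

```python
import re

sample = "Room 42, floor_3!"

# \d matches each digit on its own.
digits = re.findall(r"\d", sample)

# \w+ matches runs of letters, digits and underscores.
words = re.findall(r"\w+", sample)

# \W matches everything that \w does not (spaces and punctuation here).
non_word = re.findall(r"\W", sample)

# \s+ matches a run of whitespace, so it works well as a separator.
pieces = re.split(r"\s+", "a  b\tc\nd")
```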

Alternations

a|b|c
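
The following sketch, again assuming Python's re module, shows alternation between whole words and illustrates that | binds more loosely than concatenation, so parentheses are needed to limit its scope.

```python
import re

# Each alternative is tried at every position in the text.
animals = re.findall(r"cat|dog|bird", "a dog chased a cat")

# Without parentheses the alternatives are "gra" and "ey",
# so the whole string "grey" is not matched.
unscoped = re.fullmatch(r"gra|ey", "grey")

# With parentheses the choice is only between "a" and "e".
scoped = re.fullmatch(r"gr(a|e)y", "grey")
```

In this sketch unscoped is None while scoped is a match object.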

Boundaries

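The "boundary pattern" \b matches the boundary (that is, the position, not a character) between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string, while \B matches any position that is not such a boundary.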
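
A small sketch of both boundary patterns, assuming Python's re module. In Python the patterns should be written as raw strings (r"..."), because in an ordinary string literal \b means a backspace character.

```python
import re

text = "cat concat cats cat."

# \bcat\b finds "cat" only as a whole word: the first word and the
# one before the final period.
whole_words = re.findall(r"\bcat\b", text)

# \Bcat finds "cat" only when it does not start a word,
# which here is the "cat" inside "concat".
inside_word = re.findall(r"\Bcat", "cat concat")
```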

Quantifiers (appended to a pattern)

* (zero or more)
+ (one or more)
? (zero or one)
{n} (exactly n)
{m,n} (any number from m to n, inclusive, assuming m<=n)
{n,} (at least n)
{0,n} (at most n)
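
The quantifiers above can be checked one at a time with a sketch like the following, which assumes Python's re module. re.fullmatch succeeds only if the pattern matches the entire string, which makes the counts easy to see.

```python
import re

star_ok = re.fullmatch(r"ab*", "a") is not None         # zero b's allowed
plus_ok = re.fullmatch(r"ab+", "a") is not None         # needs at least one b
optional_ok = re.fullmatch(r"colou?r", "color") is not None

exactly_three = re.fullmatch(r"\d{3}", "123") is not None
too_short = re.fullmatch(r"\d{3}", "12") is not None

numbers = "1 22 333 4444 55555"
two_to_four = re.findall(r"\d{2,4}", numbers)    # takes at most 4 digits
two_or_more = re.findall(r"\d{2,}", numbers)     # takes the whole run

at_most_two = re.fullmatch(r"a{0,2}", "aaa") is not None
```

Quantifiers are greedy by default, which is why \d{2,4} takes four of the five 5s rather than stopping at two.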

Other Pattern Modifiers

/pattern/g causes all instances to be matched, not just the first
/pattern/i ignores case in the match
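
The /pattern/g and /pattern/i notation belongs to languages with regular expression literals. As a rough equivalent, and purely as an assumption about how one might translate it, Python's re module expresses "first match only" versus "all matches" through the choice of function, and case-insensitivity through the re.IGNORECASE flag.

```python
import re

# Like a pattern without g: only the first match is reported.
first = re.search(r"na", "banana")
first_span = first.span()

# Like /na/g: every non-overlapping match is reported.
every = re.findall(r"na", "banana")

# Like /na/gi: all matches, ignoring case.
any_case = re.findall(r"na", "Na NA na", flags=re.IGNORECASE)

# The same distinction applies when replacing.
replace_first = re.sub(r"a", "o", "banana", count=1)
replace_all = re.sub(r"a", "o", "banana")
```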

Anchors

^pattern matches pattern only at the beginning of the string
pattern$ matches pattern only at the end of the string

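Note that the treatment of a ^ that is not at the beginning of the pattern, or a $ that is not at the end of the pattern, varies from one dialect to another. In some dialects (POSIX basic regular expressions, for example) these two characters are then just "ordinary" characters that match themselves, but in most scripting languages (Perl, JavaScript and Python among them) they are still anchors and must be escaped, as \^ and \$, to match themselves.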
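
A short sketch of the two anchors, assuming Python's re module. Python in particular keeps ^ as an anchor wherever it appears in a pattern, so the last two lines escape it in order to match a literal caret; other dialects may behave differently.

```python
import re

starts_with_cat = re.search(r"^cat", "cat nap") is not None
cat_not_at_start = re.search(r"^cat", "a cat") is not None

ends_with_nap = re.search(r"nap$", "cat nap") is not None
nap_not_at_end = re.search(r"nap$", "nap time") is not None

# In Python, an unescaped ^ in mid-pattern still means "start of string",
# so this can never match; escaping it matches the character itself.
unescaped_mid = re.search(r"a^b", "a^b") is not None
escaped_mid = re.search(r"a\^b", "a^b") is not None
```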

Parentheses and Capturing

Placing a pattern in parentheses, as in (pattern), does not change whether pattern is matched or not, but causes the match (if it occurs) to be "remembered". Then \1, \2, \3 (and so on, up to \9) can be used for access to the first, second, third (and so on) such "remembered" matches within the regular expression itself. This particular aspect of regular expressions is not consistent across the various programming languages, so it is important to be aware of what language you're using, and the context, so that you know exactly what these "variables" contain on any given occasion.
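
As one concrete instance of the language-specific behaviour mentioned above, here is a sketch that assumes Python's re module. Inside the pattern, \1 refers back to the first remembered match. Outside the pattern, Python makes the remembered matches available through the group() method of the match object and through \1, \2 in a replacement string. Other languages use different names for these.

```python
import re

# \1 inside the pattern: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "it was was late")
doubled_text = doubled.group()     # the whole match
doubled_word = doubled.group(1)    # what the parentheses remembered

# Two groups, read back after the match.
pages = re.search(r"(\d+)-(\d+)", "pages 10-25")
first_page = pages.group(1)
last_page = pages.group(2)

# Remembered matches used in a replacement string.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-25")
```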