Overview

Regular expressions were developed originally to define members of one particular class of "formal languages". They are most frequently used to form "patterns" that describe specific strings or categories of strings. Such patterns, which are themselves strings, can be used to test or manipulate other strings in various ways. The three basic operations in which regular expressions are used are matching (testing whether a string, in its entirety, fits a pattern), searching (finding substrings that fit a pattern), and replacing (substituting some other string for all or part of a string).

Though there are some differences from one programming language to another, particularly in the built-in functions that are used to perform the various regular expression operations, the basic ideas underlying regular expressions are mostly the same across all programming languages.
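
As one illustration of how a particular language exposes matching, searching and replacing, here is a minimal sketch in Python. The choice of Python and its built-in re module is an assumption made for these added examples; the page itself relies on regexpal and, later, C++.

```python
import re

pattern = re.compile(r"\d{4}")  # four consecutive digits

# Matching: does the WHOLE string fit the pattern?
whole_ok = pattern.fullmatch("1984") is not None
whole_bad = pattern.fullmatch("year 1984") is not None

# Searching: is there a substring that fits the pattern?
found = pattern.search("year 1984")
found_text = found.group(0) if found else None

# Replacing: substitute other text for whatever the pattern finds.
replaced = pattern.sub("YYYY", "from 1984 to 2001")

print(whole_ok, whole_bad, found_text, replaced)
```

Other languages offer the same three operations under different function names.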

Some situations for which regular expressions might be found useful include:

This page is not a regular expression tutorial, but it does summarize most of what you need to know, and it suggests a number of exercises that you might like to try with an online regex tool like regexpal. Note, however, that regexpal is a JavaScript tool, so you cannot use it to test regular expression syntax that JavaScript does not support (POSIX bracket expressions, for example).

Advice and Notes

Keep referring back to these items as you read through the rest of this page. Some may not apply or be relevant until you have read and absorbed something that comes later.

  1. Always prefer the use of regular expressions wherever possible over hand-coding a solution from scratch.
  2. Try to ensure that your regular expressions find not only what you want, but only what you want. In this context, be aware that there are several "flavors" of regular expression engine, and there may be subtle differences between them.
  3. Regular expressions are case-sensitive by default.
  4. Regular expressions are eager (the earliest match is preferred). For example, the regex (get|getValue|set|setValue) will match the set in setValue and not setValue itself when tested against setValue.
  5. Regular expressions are also greedy by default, which essentially means that a "quantified repetition" part of a regex will try to match as much as possible before turning it over to the next part of the expression. However, it still "defers" to the need to get an overall match. For example, if our regex is .+\.jpg, the first part (.+) will match all of filename.jpg because .+ is greedy, but then it will "give back" .jpg so that we get an overall match. Note, however, that as little as possible is "given back". For example, if our regex is .*[0-9]+ and our string is Page 266, then the .* matches Page 266, but the final 6 is "given back" to get an overall match. So the end result is that .* matches Page 26 and the [0-9]+ matches only the final 6.
  6. On the other hand, if a quantified expression is made "lazy" by appending a ? to it, then it tries to match as little as possible before turning things over to the next part of the expression. For example, if our regex is .*?[0-9]+, then .*? matches as little as possible before turning it over to [0-9]+. The result is that .*? matches "Page " and [0-9]+ matches the 266.
  7. But be careful with "laziness", because if everything is optional, then "nothing" is a match. For example, if .*?[0-9]*? is our regex and Page 266 our string, then both parts of the regex succeed by matching nothing, so the overall match turns out to be nothing.
  8. Do not escape ordinary characters, and remember that in regular expressions the double quote is just an ordinary character.
  9. The order of characters in a character class does not matter.
  10. Metacharacters inside character classes are already "escaped", except for these: ] - ^ \. However, it also doesn't hurt if you do escape, inside a character set, a metacharacter that doesn't need escaping. Note that you can also use predefined character classes like \w inside a square-bracketed character class.
  11. The underscore (_) is a "word character", but the hyphen (-) is not.
  12. POSIX bracket expressions are not supported in JavaScript, .NET or Python's built-in re module, and Java uses its own \p{Alpha}-style syntax for the same classes.
  13. A POSIX bracket expression must go inside a character class.
  14. Grouping with parentheses can be used for the following:
  15. When constructing a complex regular expression, it is often a good technique to put conceptually distinct parts on separate lines and then, when you are happy with the overall construct, join the lines.
  16. In an alternation, either left or right can match (of course), but left gets precedence.
  17. All anchors refer to a position, not an actual character, and they have zero width. The ^ and $ symbols are virtually universal start and end anchors (respectively), but \A and \Z are also recognized in Java, .NET, Perl, Python and Ruby.
  18. In single-line mode note that:
  19. In multiline mode note that:
  20. Note that a "word boundary" is not an actual character (and in particular it is not a space), but a position that occurs in one of these places:
  21. Backreferences to optional expressions can be very subtle, and how they behave differs from one regex engine to another:
  22. Like anchors and word boundaries, lookaround assertions are also zero-width.
  23. Metacharacters inside character sets do not need to be escaped, except for these four: ] - ^ \
  24. Negative lookahead expressions give us a way to match something that should be rejected.
  25. Some principles for constructing better regexes:
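
Items 4 to 7 above (eagerness, greediness and laziness) can be checked directly. The following is a minimal sketch, assuming Python's re module; the capturing parentheses in the second and third patterns are an addition, used only to show which part of the regex matched what.

```python
import re

# Item 4 (eager): alternatives are tried left to right, so "set" wins.
eager = re.search(r"(get|getValue|set|setValue)", "setValue").group(0)

# Item 5 (greedy): .* takes everything, then gives back a single digit.
greedy = re.search(r"(.*)([0-9]+)", "Page 266").groups()

# Item 6 (lazy): .*? takes as little as it can, leaving all the digits.
lazy = re.search(r"(.*?)([0-9]+)", "Page 266").groups()

# Item 7: when every part is optional and lazy, the empty string is a match.
empty = re.search(r".*?[0-9]*?", "Page 266").group(0)

print(eager)        # set
print(greedy)       # ('Page 26', '6')
print(lazy)         # ('Page ', '266')
print(repr(empty))  # ''
```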

Regular Expression Details

Metacharacters
\ | ( ) [ ] { } ^ $ * + ? . - : ! =
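These are characters that have a special meaning within the context of regular expressions, and which do not "match" themselves in a "pattern matching" operation unless they are "escaped". A character (including the backslash character itself) is "escaped" by placing a backslash (\) in front of it. Note that the last four characters in the list (- : ! =) are special only in particular contexts (inside a character class, or immediately after the (? that opens a special group); everywhere else they match themselves, which is why they also appear among the ordinary characters listed further down. Here's a brief indication of what each metacharacter, or each metacharacter pair, is used for ...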
\ For escaping other characters or itself
| Alternation (the "or" character)
( ) For enclosure, just to achieve clarity, but also for "capturing" subexpressions
(?: ) For enclosing a non-capturing group (?: turns off capturing and backreferences, for efficiency and to preserve space for other captures, for example) Think of it this way: The ? says "give this group a different meaning", while the : says that the meaning is that "the group is non-capturing".
(?= ) For enclosing a positive lookahead assertion
(?! ) For enclosing a negative lookahead assertion
(?<= ) For enclosing a positive lookbehind assertion (not supported by every engine: JavaScript gained it only with ES2018, and many engines that do support it allow only simple expressions, such as those of fixed length)
(?<! ) For enclosing a negative lookbehind assertion (not supported by every engine: JavaScript gained it only with ES2018, and many engines that do support it allow only simple expressions, such as those of fixed length)
[ ] For delimiting a character class
{ } For delimiting a numerical range
^ and $ For marking the beginning (^) or end ($) of a string/line
\A and \Z Also for marking the beginning (\A) and end (\Z) of a string, but never of a line (and much less widely supported than ^ and $)
^ For negating a character class
* + ? For repetition: 0 or more (*), one or more (+), and 0 or 1 (?)
? Makes *?, +?, ?? and {min,max}? "lazy" instead of "greedy" (the default for those quantifiers without the (second) ?)
. For any character except the newline character
- For indicating a range in a character class
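
The effect of escaping a metacharacter can be seen in a short sketch, assuming Python's re module (an assumption of these added examples, not the tool used by this page). The data string is the same one used in the 9.00 exercises further down.

```python
import re

data = "9.00 9500 9-00"

# Unescaped, the period is a metacharacter and matches any character.
unescaped = re.findall(r"9.00", data)

# Escaped, the period matches only a literal period.
escaped = re.findall(r"9\.00", data)

# re.escape() inserts the backslashes when the text is meant literally.
auto = re.escape("9.00")
auto_matches = re.findall(auto, data)

print(unescaped)     # ['9.00', '9500', '9-00']
print(escaped)       # ['9.00']
print(auto_matches)  # ['9.00']
```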
Ordinary Characters (which match themselves)
The letters (both uppercase and lowercase), and the digits:
A B C ... Z a b c ... z 0 1 2 ... 9
The non-metacharacter punctuation characters:
! " # % & ' , - / : ; < = > @ _ ` ~
The blank space character
Character Classes (and their "negations")
[abcd] Any one of the lowercase letters a, b, c or d
[^abcd] Any character except one of the lowercase letters a, b, c or d
[246] Any one of the digits 2, 4 or 6
[^246] Any character except one of the digits 2, 4 or 6
[a-z] Any lowercase letter
[^a-z] Any character except a lowercase letter
[3-7] Any digit from 3 to 7 inclusive
[^3-7] Any character that is not a digit from the range 3 to 7 inclusive
Note that in character classes the dash (-) is effectively a metacharacter unless it appears as the first or last character in the class (where it cannot be indicating a range), and the caret (^) is a metacharacter only when it appears as the first character. And be reminded (again) that predefined character classes like those shown in the following section may also be placed within a square-bracketed character class.
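
The position-dependent behavior of the dash and the caret inside a character class can be sketched as follows, assuming Python's re module. The data string is made up for this illustration.

```python
import re

data = "a-z ^ m"

# Dash between two characters: a range, so a, z and m match but - and ^ do not.
as_range = re.findall(r"[a-z]", data)

# Dash first (or last): an ordinary character, so only a, - and z match.
dash_first = re.findall(r"[-az]", data)

# Caret first: negation (everything except a, - and z).
negated = re.findall(r"[^a\-z]", data)

# Caret anywhere else: an ordinary character.
caret_literal = re.findall(r"[a^z]", data)

print(as_range)       # ['a', 'z', 'm']
print(dash_first)     # ['a', '-', 'z']
print(negated)        # [' ', '^', ' ', 'm']
print(caret_literal)  # ['a', 'z', '^']
```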
Predefined Character Classes
\d A digit (same as [0-9])
\D Not a digit (same as [^0-9])
\w A letter, digit or underscore (same as [a-zA-Z0-9_])
\W Not a letter, digit or underscore (same as [^a-zA-Z0-9_])
\s A whitespace character (roughly the same as [ \t\r\n]; most engines also include the form feed and often the vertical tab)
\S Not a whitespace character (the negation of \s)
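
A quick check of the predefined classes, and of the fact that the underscore is a word character while the hyphen is not, might look like this. It is a sketch assuming Python's re module; note that for text strings Python's \w and \d also accept non-ASCII letters and digits unless the re.ASCII flag is given, which does not matter for the ASCII data used here.

```python
import re

data = "top-notch abc_123"

words = re.findall(r"\w+", data)     # underscore joins, hyphen separates
digits = re.findall(r"\d+", data)
non_words = re.findall(r"\W", data)  # the hyphen and the space

# Predefined classes may also appear inside square brackets.
word_or_hyphen = re.findall(r"[\w-]+", data)

print(words)           # ['top', 'notch', 'abc_123']
print(digits)          # ['123']
print(non_words)       # ['-', ' ']
print(word_or_hyphen)  # ['top-notch', 'abc_123']
```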
POSIX Bracket Expressions
Note: POSIX = Portable Operating System Interface for Unix
[:alpha:] Same as [a-zA-Z]
[:digit:] Same as [0-9]
[:alnum:] Same as [a-zA-Z0-9]
[:lower:] Same as [a-z]
[:upper:] Same as [A-Z]
[:xdigit:] A hexadecimal character (same as [a-fA-F0-9])
[:punct:] A printable character that is not a space, digit or letter
[:space:] Same as \s
[:blank:] Same as a space or a tab
[:print:] A printable character, including whitespace characters
[:graph:] A printable non-whitespace character
[:cntrl:] A (non-printable) control character
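
Where an engine lacks POSIX bracket expressions, the explicit ranges from the list above can be substituted. The sketch below assumes Python's re module, which does not support them; expand_posix and POSIX_RANGES are hypothetical names invented for this illustration, and the plain text substitution is deliberately naive (it would also rewrite a literal [:alpha:] that appears outside a character class).

```python
import re

# Hypothetical lookup table: the explicit ranges come from the list above.
POSIX_RANGES = {
    "[:alpha:]": "a-zA-Z",
    "[:digit:]": "0-9",
    "[:alnum:]": "a-zA-Z0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:xdigit:]": "a-fA-F0-9",
}

def expand_posix(pattern):
    """Replace each listed POSIX bracket expression with its explicit range."""
    for name, ranges in POSIX_RANGES.items():
        pattern = pattern.replace(name, ranges)
    return pattern

# The bracket expression sits inside a character class, as item 13 requires.
expanded = expand_posix("[[:alpha:]]+[[:digit:]]")
matches = re.findall(expanded, "abc9 xyz 12")

print(expanded)  # [a-zA-Z]+[0-9]
print(matches)   # ['abc9']
```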
Alternations
a|b|c
Quantifiers (appended to a pattern)
* (zero or more)
+ (one or more)
? (zero or one)
{n} (exactly n)
{m,n} (any number from m to n, inclusive, assuming m <= n)
{n,} (at least n) (Examples: \d{1,} is the same as \d+, and \d{0,} is the same as \d*)
{0,n} (at most n) (Example: \d{0,1} is the same as \d?)
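
The equivalences between the brace forms and the single-character quantifiers can be confirmed with a short sketch, assuming Python's re module. The data string is made up, and the \b anchors are added only so that each number is considered as a whole.

```python
import re

data = "1 22 333 4444"

# Brace forms that are equivalent to the single-character quantifiers.
same_plus = re.findall(r"\d{1,}", data) == re.findall(r"\d+", data)
same_star = re.findall(r"\d{0,}", data) == re.findall(r"\d*", data)
same_opt = re.findall(r"\d{0,1}", data) == re.findall(r"\d?", data)

exactly_two = re.findall(r"\b\d{2}\b", data)
two_to_three = re.findall(r"\b\d{2,3}\b", data)
at_least_three = re.findall(r"\b\d{3,}\b", data)

print(same_plus, same_star, same_opt)  # True True True
print(exactly_two)                     # ['22']
print(two_to_three)                    # ['22', '333']
print(at_least_three)                  # ['333', '4444']
```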
Anchors
^pattern matches only at the beginning of the string
pattern$ matches only at the end of the string

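Note that in most modern engines ^ and $ keep their anchor meaning wherever they appear outside a character class (as in a|^b, or in a lookahead such as (?=^[0-5\-]+$) from the exercises below), so they must be escaped if a literal caret or dollar sign is wanted. Only some older grammars, such as POSIX basic regular expressions, treat a ^ that is not at the beginning of the pattern, or a $ that is not at the end, as an "ordinary" character that matches itself.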
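
The difference between the default mode and multiline mode, and between ^ $ and \A \Z, can be sketched as follows, assuming Python's re module and a shortened version of the grocery list used in the exercises below.

```python
import re

data = "milk\napple juice\nyogurt"

# Default mode: ^ and $ anchor to the start and end of the whole string.
first_only = re.findall(r"^[a-z]+", data)
last_only = re.findall(r"[a-z]+$", data)

# Multiline mode: ^ and $ also anchor at each line break.
line_starts = re.findall(r"^[a-z]+", data, re.MULTILINE)
line_ends = re.findall(r"[a-z]+$", data, re.MULTILINE)

# \A and \Z still mean the start and end of the string, never of a line.
string_start = re.findall(r"\A[a-z]+", data, re.MULTILINE)
string_end = re.findall(r"[a-z]+\Z", data, re.MULTILINE)

print(first_only, last_only)      # ['milk'] ['yogurt']
print(line_starts)                # ['milk', 'apple', 'yogurt']
print(line_ends)                  # ['milk', 'juice', 'yogurt']
print(string_start, string_end)   # ['milk'] ['yogurt']
```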

The named "boundary" patterns (\b and \B)
\b matches the boundary (i.e., the position, not a character) between a word character (\w) and a non-word character (\W), while \B matches a "non-boundary".
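
Because \b and \B match positions rather than characters, it can help to list the positions themselves. This is a minimal sketch assuming Python's re module; the data strings are made up for the illustration.

```python
import re

data = "a cat"

# Each match is zero-width, so only its position is of interest.
boundary_positions = [m.start() for m in re.finditer(r"\b", data)]
non_boundary_positions = [m.start() for m in re.finditer(r"\B", data)]

# \b on both sides restricts the match to the whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(boundary_positions)      # [0, 1, 2, 5]
print(non_boundary_positions)  # [3, 4]
print(len(whole_words))        # 2
```

Positions 0 and 5 show that the start and end of the string count as boundaries when a word character is next to them.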
Pattern Modifiers
In general, any regular expression engine will provide some way to modify patterns in various ways, such as finding all matches or only the first match, or ignoring case during a match or search. However, this is one of the things that may differ quite radically as you move from one programming language to another.
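
As an example of how one language handles this, Python (assumed here purely for illustration) chooses between "first match" and "all matches" by the function that is called, and switches case sensitivity off with a flag or an inline (?i) marker. Other languages spell these choices differently.

```python
import re

data = "Carnival car carnival"

first = re.search(r"car", data).group(0)            # first match only
every = re.findall(r"car", data)                    # all matches
any_case = re.findall(r"car", data, re.IGNORECASE)  # ignore case via a flag
inline = re.findall(r"(?i)car", data)               # same, via an inline modifier

print(first)     # car
print(every)     # ['car', 'car']
print(any_case)  # ['Car', 'car', 'car']
print(inline)    # ['Car', 'car', 'car']
```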
Parentheses
Placing parentheses around part of a regular expression, as in (pattern), does not change whether pattern is matched or not, but it causes the match (if it occurs) to be "remembered". Then, later on, \1, \2, \3, ... can be used for access to the first, second, third, ... such "remembered" matches within the regular expression itself. Some languages (Perl, and now also C++11, for example) use $1, $2, $3, ... to contain these remembered matches for later access outside the regular expression. However, these variables may also be used to contain other values, so if you want to use them it is important to be aware of what language you're using and the context in which you are using them, so that you know exactly what they contain on any particular occasion.
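
A remembered match can be used both inside the pattern and outside it. The sketch below assumes Python's re module, where the replacement string also uses \1 rather than $1; the regex is the repeated-word pattern from the exercises below.

```python
import re

data = "Paris in the the spring"

# \1 inside the pattern must repeat whatever group 1 just matched.
repeated = re.search(r"\b(\w+)\s+\1\b", data)
remembered = repeated.group(1)

# The same remembered text is available to the replacement string.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", data)

print(remembered)  # the
print(fixed)       # Paris in the spring
```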

Exercises for Regular Expression Familiarization via The JavaScript Tool regexpal

Familiarity with regular expressions can only come with practice, and getting really comfortable can take a great deal of practice. Fortunately there are some tools to help us. The easiest way to try the following exercises is to use a program like regexpal, where you can try examples like those given in the table below. When trying examples like these, or others you may have made up, it's usually a good idea to enter the data first, then the regular expression. This gives you a chance (often) to watch as various things are matched "along the way" until you have entered the full regular expression that you wish to test. This in itself can sometimes reveal subtleties or be a good learning experience in other ways.

Ordinary Characters, the Period Metacharacter, and Escape Characters
regex | data string | Notes
car | car carnival Carnival | Try with Global on and off.
zz | pizzazz | Try with Global on and off.
cat | The cow, camel and cat communicated. | Try with Global on and off. The data string is all on one line.
h.t | hot hat hit heat hate hzt h t h#t h:t h.t | The data string is all on one line. Note regexpal highlighting of the period (.) in the regex.
.a.a.a | banana papaya #a$a@a abacab | Note the last match in particular.
a.a.a. | banana papaya #a$a@a abacab | Compare last match with preceding ones.
9.00 | 9.00 9500 9-00 |
9\.00 | 9.00 9500 9-00 |
h.._export.txt | his_export.txt her_export.txt |
h.._export\.txt | his_export.txt her_export.txt |
resume..txt | resume1.txt resume2.txt resume3_txt.zip | The data string is all on one line.
resume.\.txt | resume1.txt resume2.txt resume3_txt.zip | The data string is all on one line.
a\tb | a b | There's a TAB between a and b.
a\nb | a | The data string is on two lines.
  | b |
c\nd | abc | The data string is on two lines.
  | def |
Character Classes, Negative Character Classes, and Predefined Character Classes
[aeiou] | Bananas Peaches Apples |
gr[ea]y | gray grey |
gr[ea]t | great |
gr[ea][ea]t | great graet greet graat |
[abcdefghijklmnopqrstuvwxyz] | Now we know how to make negative character sets. | Type the regex in one character at a time, then put a ^ at the start. The data string is all on one line.
[^aeiou] | It seems I see the sea I seek. | Try with Global on and off.
see[^mn] | It seems I see the sea I seek. |
h[abc.xyz]t | hat hot h.t | The period is not a metacharacter in this regex.
var[[(][0-9][\])] | var(3) var[4] | Also try [])], [)]] and [)] as last part of regex.
file[0-\_]1 | file01 file-1 file\1 file_1 | Thinks - indicates a range.
file[0\-\_]1 | file01 file-1 file\1 file_1 | Now thinks \ is escaping the underscore (_).
file[0\-\\_]1 | file01 file-1 file\1 file_1 | Now gets all four.
\d\d\d\d | 1984 text |
\w\w\w\w | 1984 text 1_5W |
[\w\-] | blue-green paint |
[\d\s] | 123 456 789 abc |
[^\d\s] | 123 456 789 abc |
[\D\S] | 123 456 789 abc |
Repetition Expressions, Greediness and Laziness
apples* | apple apples applessssss | Also try apples+ and apples? for the regex.
\d\d\d\d* | 1234567890 1234 123 12 | Also try \d\d\d+ for the regex.
colou?r | color colour |
[a-z]+\d[a-z]* | abc9xyz | Also try 9xyz and abc9 for the string.
\w+s | We picked apples. | Also try We picked applessssss. for the string.
\w+_\d{2,4}-\d{2} | report_1997-04 |
  | budget_03-04 |
  | memo_712539-100 |
\d+\w+\d+ | 01_FY_07_report_99.xls | Illustrates regex "greediness".
\d+\w+?\d+ | 01_FY_07_report_99.xls | Illustrates regex "laziness".
".+", ".+" | "IBM", "Samsung", "Apple, Inc." | Illustrates regex "greediness".
".+?", ".+?" | "IBM", "Samsung", "Apple, Inc." | Illustrates regex "laziness".
Grouping and Alternation Metacharacters
abc+ | abcccc |
(abc)+ | abcabcabc |
(in)?dependent | independent dependent |
run(s)? | I run fast. He runs faster. | Same as runs? but clearer.
[A-Z][0-9] | A1B2C3D4E5F6G7H8I9J0 | Also try ([A-Z][0-9]), ([A-Z][0-9])+ and ([A-Z][0-9]){3}.
apple|orange | apple orange appleorange apple|orange | Also try apple\|orange.
abc|def|ghi|jkl | abcdefghijklmnopqrstuvwxyz | Try with Global on and off.
applejuice|sauce | applejuice applesauce | Try with Global on and off.
apple(juice|sauce) | applejuice applesauce |
(peanut|peanutbutter) | peanutbutter | Illustrates regex "eagerness".
peanut(butter)? | peanutbutter | Illustrates regex "greediness".
(\w+|FY\d{4}_report\.xls) | FY2003_report.xls |
xyz|abc|def|ghi|jkl | abcdefghijklmnopqrstuvwxyz | Shows eagerness; turn off Global.
(AA|BB|CC){6} | AABBAACCBB |
(\d\d|[A-Z][A-Z]){3} | 112233 | Also try AABBCC, AA66ZZ, 11AA44.
Anchored Expressions: Start and End Anchors
[A-Z] | Mr. Smith went to Washington. | Also try ^[A-Z], \., \.$ as the regex.
^[A-Z][A-Za-z\-. ]+\.$ | Mr. Smith went to Washington. | Also take out the period in the regex.
^\w+@\w+\.[a-z]{3}$ | me@here.com, you@there.com | Try with and without either or both anchors.
Anchored Expressions: Single-Line and Multiline Modes
[a-z]+ | milk | Try ^ at the beginning, and $ at the end with and without a newline following the last item, and also with the "Multiline anchors" option (in regexpal) on and off.
  | apple juice |
  | sweet peas |
  | yogurt |
  | sweet corn |
  | apple sauce |
  | milkshake |
  | sweet potatoes |
Anchored Expressions: Word Boundaries
\b\w+\b | This is a test. |
\b\w+\b | abc_123 |
\b\w+\b | top-notch |
\BThis | This is a test. |
\B\w+\B | This is a test. |
apples\band\boranges | apples and oranges |
apples\b \band\b \boranges | apples and oranges |
\b[\w']+\b | Shall I compare thee to a summer's day? | Another example of greediness (look carefully at the matches in regexpal).
\b[\w']+?\b | Shall I compare thee to a summer's day? | Another example of laziness (look carefully at the matches in regexpal).
Capture Groups and Backreferences
(apples) to \1 | apples to apples |
(ab)(cd)(ef)\3\2\1 | abcdefefcdab |
<(i|em)>.+?</\1> | <i>Hello</i> |
<(i|em)>.+?</\1> | <em>Hello</em> |
<(i|em)>.+?</\1> | <i>italics</i> <em>emphasis</em> <i>bad</em> <em>bad</i> | Data should all be on one line. Also make the regex greedy by removing the ?.
\b([A-Z][a-z]+)\b\s\b\1son\b | Steve Smith, John Johnson, Eric Erikson, Evan Evanson | Finding names of people whose first name is repeated in their last name.
\b(\w+)\s+\1\b | Paris in the | Finding repeated words.
  | the spring. |
Backreferences with Optional Groups
(A?)B | AB | Matches AB and captures A.
(A?)B | B | Matches B but captures nothing, so captures occur on zero-width matches.
(A?)B\1 | ABA |
(A?)B\1 | B | Matches B, so backreferences become zero-width as well.
(A?)B\1C | ABAC |
(A?)B\1C | BC | Matches BC, so backreferences become zero-width as well.
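
The behavior in these rows, and the engine differences mentioned in item 21 of the advice list, can be sketched as follows. The results shown are those of Python's re module (an assumption of these added examples); other engines, JavaScript among them, may treat a backreference to a group that never took part in the match differently.

```python
import re

# The group (A?) always takes part, even when it captures an empty string.
with_a = re.fullmatch(r"(A?)B\1", "ABA")
without_a = re.fullmatch(r"(A?)B\1", "B")
captured = (with_a.group(1), without_a.group(1))

# With the ? outside the parentheses the group can be skipped altogether.
skipped = re.fullmatch(r"(A)?B", "B").group(1)

# In Python a backreference to a skipped group fails instead of matching "".
python_rejects = re.fullmatch(r"(A)?B\1", "B") is None

print(captured)        # ('A', '')
print(skipped)         # None
print(python_rejects)  # True
```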
Non-Capturing Group Expressions
(oranges) and (apples) to oranges | oranges and apples to oranges | These do capture.
(oranges) and (apples) to \1 | oranges and apples to oranges | The backreference does match.
(oranges) and (apples) to \2 | oranges and apples to oranges | The backreference does not match.
(oranges) and (apples) to \2 | oranges and apples to oranges | Now the backreference matches the second line.
  | oranges and apples to apples |
(?:oranges) and (apples) to \1 | oranges and apples to oranges | Here the backreference \1 also matches the second line because the first group is not captured.
  | oranges and apples to apples |
(?:oranges) and (apples) to \2 | oranges and apples to oranges | Now nothing matches because nothing has been saved in \2.
  | oranges and apples to apples |
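
The renumbering caused by a non-capturing group can be seen by listing the captured groups. This sketch assumes Python's re module, which differs from regexpal in one respect: a backreference to a group number that does not exist is rejected when the pattern is compiled, rather than simply failing to match.

```python
import re

data = "oranges and apples to apples"

# Two capturing groups: apples is group 2.
both = re.search(r"(oranges) and (apples) to \2", data)
both_groups = both.groups()

# One non-capturing and one capturing group: apples is now group 1.
only_one = re.search(r"(?:oranges) and (apples) to \1", data)
one_group = only_one.groups()

# With only one capturing group there is no group 2 to refer to.
try:
    re.compile(r"(?:oranges) and (apples) to \2")
    missing_group_error = False
except re.error:
    missing_group_error = True

print(both_groups)          # ('oranges', 'apples')
print(one_group)            # ('apples',)
print(missing_group_error)  # True
```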
Lookaround Assertions (Lookahead and Lookbehind)
(?=seashore)sea | seashore seaside | A positive lookahead assertion.
sea(?=shore) | seashore seaside | An equivalent positive lookahead assertion.
sea(?:shore) | seashore seaside | A non-capturing expression ... be sure not to confuse what gets matched, what gets captured, and what is simply "asserted".
\b[A-Za-z']+\b, | Verily, verily, I say onto you, | Match all words that are followed immediately by a comma and also include the comma in the match.
  | give, take, and then, if you like, give back |
\b[A-Za-z']+\b(?=,) | Verily, verily, I say onto you, | Now the lookahead assertion says the comma has to be there, but this time it is not matched.
  | give, take, and then, if you like, give back |
\b[A-Za-z']+\b(?:,) | Verily, verily, I say onto you, | Once again, the comma is matched, this time in a non-capturing group.
  | give, take, and then, if you like, give back |
\d{3}-\d{3}-\d{4} | 555-302-4321 555-781-6978 | Both are matched.
^[0-5\-]+$ | 555-302-4321 | This matches.
^[0-5\-]+$ | 23140-5 | This also matches.
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-302-4321 | This matches. (The regex is all on one line.)
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-781-6978 | This does not match. (The regex is all on one line.)
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.)
  | 555-781-6978 |
  | 555-245-1321 |
(?=^[0-5\-]+$)(?=.*4321)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.)
  | 555-781-6978 |
  | 555-245-1321 |
\b(?=\w*ou)[A-Za-z']+\b(?=,) | Verily, verily, I say onto you, |
  | give, take, and then, if you like, give back |
(?=.*\d).{8,15} | base355ball | Match a password that contains 8 to 15 characters and at least one digit.
(?=.*\d)(?=.*[A-Z]).{8,15} | base355Ball | Match a password that contains 8 to 15 characters and at least one digit and one capital letter.
(?!seashore)sea | seashore seaside | A negative lookahead assertion.
sea(?!shore) | seashore seaside | An equivalent negative lookahead assertion.
online(?! training) | online training and online courses | The data string is all on one line.
online(?!.*training) | online video training and online courses and online videos | Finds online as long as it's not followed by training, even if there are other words in between the two. (The data string is all on one line.)
\bblack\b(?! dog) | The black dog followed the black car into the black night. | The data string is all on one line.
\bblack\b(?= dog) | The black dog followed the black car into the black night. | Compare this one with the previous one. (The data string is all on one line.)
(?=^[0-5\-]+$)(?!.*4321)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.) Here we're using a combination of a positive lookahead and a negative lookahead.
  | 555-781-6978 |
  | 555-245-1321 |
\b[A-Za-z']+\b(?![.,]) | Verily, verily, I say onto you, | Find all words not followed by a period or a comma.
  | give, take. And then, if you like, give back. |
\bblack\b(?!.*\bblack\b) | The black dog followed the black car into the black night. | Finds the last occurrence of the word black. (The data string is all on one line.)
(\bblack\b)(?!.*\1) | The black dog followed the black car into the black night. | Also finds the last occurrence of the word black. (The data string is all on one line.)
(?<=base)ball | baseball football | A positive lookbehind assertion. (Not supported in JavaScript engines that predate ES2018.)
ball(?<=baseball) | baseball football | An equivalent positive lookbehind assertion. (Not supported in JavaScript engines that predate ES2018.)
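
A few of the lookaround exercises above can also be run outside regexpal. The following is a minimal sketch assuming Python's re module, which supports lookbehind as long as the expression inside it has a fixed length; the data strings are taken from the rows above.

```python
import re

# Lookahead: the comma must follow, but it is not part of the match.
text = "Verily, verily, I say onto you"
before_comma = re.findall(r"\b[A-Za-z']+\b(?=,)", text)

# Stacked lookaheads: 8 to 15 characters, with a digit and a capital letter.
password = re.compile(r"(?=.*\d)(?=.*[A-Z]).{8,15}")
ok = password.fullmatch("base355Ball") is not None
no_capital = password.fullmatch("base355ball") is not None

# Negative lookahead: only the last occurrence of the word black.
sentence = "The black dog followed the black car into the black night."
last_black = [m.start() for m in re.finditer(r"\bblack\b(?!.*\bblack\b)", sentence)]

# Lookbehind: ball only where the text ending at that point is baseball.
after_base = [m.start() for m in re.finditer(r"ball(?<=baseball)", "baseball football")]

print(before_comma)    # ['Verily', 'verily']
print(ok, no_capital)  # True False
print(last_black == [sentence.rfind("black")])  # True
print(after_base)      # [4]
```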

Regular Expressions in C++

As of C++11, regular expressions are a standard part of the C++ library (header <regex>). The default regular expression grammar is that of ECMAScript, which is the most powerful, but there are five alternative grammars, should the need arise to use one of them: basic, extended, awk, grep and egrep. For the most part, you can simply use the default, which requires no extra effort on your part.

As with regular expressions in any programming language, if you want to use them you will probably want to find strings that "match" a regular expression in their entirety, to search for substrings that match a regular expression, and/or to replace all or part of a string with some other string. C++11 provides functions that perform these actions: regex_match(), regex_search() and regex_replace().

Some sample programs that you can find here illustrate some of the ways you can now work with regular expressions in C++11:

  1. The first sample program, learn_regex1.cpp, illustrates matching, searching and replacing in the simplest possible form of each, when both the regex object and the string object on which it acts consist of just simple "ordinary" characters.
  2. The second sample program, learn_regex2.cpp, shows how we can use an smatch object to get access to more details about whatever match or matches occur during a call to regex_match() or regex_search().
  3. The third sample program, learn_regex3.cpp, uses a regex object that contains something more than just ordinary characters, as well as (again) an smatch object to get access to more details about whatever match or matches occur during a call to regex_match() or regex_search().
In the testers subdirectory under the link mentioned above you will find a number of other sample programs that you should study and use for experimentation.