Regular Expressions
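Regular expressions were originally developed to define the members of one particular class of "formal languages". Today they are most frequently used to form "patterns" that describe specific strings or categories of strings. Such patterns, which are themselves strings, can be used to test or manipulate other strings in various ways. The three basic operations in which regular expressions are used are matching (testing whether an entire string fits a pattern), searching (finding a substring that fits a pattern) and replacing (substituting some other string for all or part of a string that fits a pattern).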
Though there are some differences from one programming language to another, particularly in the built-in functions that are used to perform the various regular expression operations, the basic ideas underlying regular expressions are mostly the same across all programming languages.
Some situations for which regular expressions might be found useful include:
This page is not a regular expression tutorial, but it does give a summary of most of what you need to know, and suggests a number of exercises that you might like to try with an online regex tool like regexpal. Note, however, that regexpal is a JavaScript tool, so you cannot use it to test any regular expression syntax that is not supported by that language (such as POSIX bracket expressions, for example).
Keep referring back to these items as you read through the rest of this page. Some may not apply or be relevant until you have read and absorbed something that comes later.
- The regex (get|getValue|set|setValue) will match the set in setValue, and not setValue itself, when tested against setValue. The alternatives are tried from left to right and the first one that succeeds is used (the exercises further down call this regex "eagerness").
- If our regex is .+\.jpg and our string is filename.jpg, the first part (.+) will match all of filename.jpg because .+ is greedy, but then it will "give back" .jpg so that we get an overall match. Note, however, that as little as possible is "given back". For example, if our regex is .*[0-9]+ and our string is Page 266, then the .* matches Page 266, but the final 6 is "given back" to get an overall match. So the end result is that .* matches Page 26 and the [0-9]+ matches only the final 6.
- If a quantifier has a ? appended to it, then it tries to match as little as possible before turning things over to the next part of the expression. For example, if our regex is .*?[0-9]+, then .*? matches as little as possible before turning it over to [0-9]+. The result is that .*? matches "Page " and [0-9]+ matches the 266.
- If .*?[0-9]*? is our regex and Page 266 our string, then both parts of the regex succeed by matching nothing, so the overall match turns out to be nothing.
- Inside a character set, the only characters that may need escaping are ] - ^ \. However, it also doesn't hurt if you do escape, inside a character set, a metacharacter that doesn't need escaping. Note that you can also use predefined character classes like \w inside a square-bracketed character class.
- The underscore (_) is a "word character", but the hyphen (-) is not.
- The ^ and $ symbols are virtually universal start and end anchors (respectively), but \A and \Z are also recognized in Java, .NET, Perl, Python and Ruby.
- In single-line mode, ^ and $ do not match at line breaks, and \A and \Z (if available) do not match at line breaks.
- In multiline mode, ^ and $ match at the start and end (respectively) of lines, while \A and \Z (if available) still do not match at line breaks.
- A is optional, but the group/capture is not optional: (A?)B matches B and "captures nothing".
- A is not optional, but the group/capture is optional: (A)?B matches B but "does not capture anything".

- .+ is faster than .* and .{5} or .{3,7} are even faster.
- [A-Za-z] is better than .+.
- <[^>]+> is better than <.+>.
- \w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2} is not as good as export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}.

\ | ( ) [ ] { } ^ $ * + ? . - : ! =
To match one of these characters literally, put a backslash (\) in front of it. Here's a brief indication of what each metacharacter, or each metacharacter pair, is used for ...

- \ : For escaping other characters or itself
- | : Alternation (the "or" character)
- ( ) : For enclosure, just to achieve clarity, but also for "capturing" subexpressions
- (?: ) : For enclosing a non-capturing group ((?: turns off capturing and backreferences, for efficiency and to preserve space for other captures, for example). Think of it this way: The ? says "give this group a different meaning", while the : says that the meaning is that "the group is non-capturing".
- (?= ) : For enclosing a positive lookahead assertion
- (?! ) : For enclosing a negative lookahead assertion
- (?<= ) : For enclosing a positive lookbehind assertion (not universally supported; in particular, JavaScript has it only since ECMAScript 2018, and where it is supported it is often limited to simple expressions, such as those of fixed length)
- (?<! ) : For enclosing a negative lookbehind assertion (with the same support limitations as positive lookbehind)
- [ ] : For delimiting a character class
- { } : For delimiting a numerical range
- ^ and $ : For marking the beginning (^) or end ($) of a string/line
- \A and \Z : Also for marking the beginning (\A) and end (\Z) of a string, but never a line (and much less widely supported than ^ and $)
- ^ : For negating a character class
- * + ? : For repetition: 0 or more (*), one or more (+), and 0 or 1 (?)
- ? : Makes *?, +?, ?? and {min,max}? "lazy" instead of "greedy" (the default for those quantifiers without the (second) ?)
- . : For any character except the newline character
- - : For indicating a range in a character class

Ordinary characters, which simply match themselves, include:

A B C ... Z a b c ... z 0 1 2 ... 9
! " # % & ' , - / : ; < = > @ _ ` ~
- [abcd] : Any one of the lowercase letters a, b, c or d
- [^abcd] : Any character except one of the lowercase letters a, b, c or d
- [246] : Any one of the digits 2, 4 or 6
- [^246] : Any character except one of the digits 2, 4 or 6
- [a-z] : Any lowercase letter
- [^a-z] : Any character except a lowercase letter
- [3-7] : Any digit from 3 to 7 inclusive
- [^3-7] : Any character that is not a digit from the range 3 to 7 inclusive

- \d : A digit (same as [0-9])
- \D : Not a digit (same as [^0-9])
- \w : A letter, digit or underscore (same as [a-zA-Z0-9_])
- \W : Not a letter, digit or underscore (same as [^a-zA-Z0-9_])
- \s : A whitespace character (essentially the same as [ \t\r\n], though many flavors also include \f and \v)
- \S : Not a whitespace character (essentially the same as [^ \t\r\n])

- [:alpha:] : Same as [a-zA-Z]
- [:digit:] : Same as [0-9]
- [:alnum:] : Same as [a-zA-Z0-9]
- [:lower:] : Same as [a-z]
- [:upper:] : Same as [A-Z]
- [:xdigit:] : A hexadecimal character (same as [a-fA-F0-9])
- [:punct:] : A printable character that is not a space, digit or letter
- [:space:] : Same as \s
- [:blank:] : Same as a space or a tab
- [:print:] : A printable character, including whitespace characters
- [:graph:] : A printable non-whitespace character
- [:cntrl:] : A (non-printable) control character

- a|b|c : Alternation: matches a, or b, or c
- * : zero or more
- + : one or more
- ? : zero or one
- {n} : exactly n
- {m,n} : any number from m to n, inclusive, assuming m<n
- {n,} : at least n (Example: \d{1,} is the same as \d+)
- {0,n} : at most n (Note: \d{0,} is the same as \d*)

- ^pattern : matches only at the beginning of the string
- pattern$ : matches only at the end of the string

Note that in some flavors (POSIX basic regular expressions, for example), if ^ is anywhere but at the beginning of the pattern, or if $ is anywhere but at the end of the pattern, then these two characters are just "ordinary" characters that match themselves. In most other flavors they remain anchors wherever they appear outside a character class, and must be escaped to be matched literally.
- \b and \B : \b matches the boundary (i.e., the position, not a character) between a word character (\w) and a non-word character (\W), while \B matches a "non-boundary".
- (pattern) : Putting parentheses around a pattern does not change whether pattern is matched or not, but it causes the match (if it occurs) to be "remembered". Then, later on, \1, \2, \3, ... can be used for access to the first, second, third, ... such "remembered" matches within the regular expression itself. Some languages (Perl, and now also C++11, for example) use $1, $2, $3, ... to contain these remembered matches for later access outside the regular expression. However, these variables may also be used to contain other values, so if you want to use them it is important to be aware of what language you're using and the context in which you are using them, so that you know exactly what they contain on any particular occasion.

Familiarity with regular expressions can only come with practice, and getting really comfortable can take a great deal of practice. Fortunately there are some tools to help us. The easiest way to try the following exercises is to use a program like regexpal, where you can try examples like those given in the table below. When trying examples like these, or others you may have made up, it's usually a good idea to enter the data first, then the regular expression. This gives you a chance (often) to watch as various things are matched "along the way" until you have entered the full regular expression that you wish to test. This in itself can sometimes reveal subtleties or be a good learning experience in other ways.
Ordinary Characters, the Period Metacharacter, and Escape Characters | ||
---|---|---|
regex | data string | Notes |
car | car carnival Carnival | Try with Global on and off. |
zz | pizzazz | Try with Global on and off. |
cat | The cow, camel and cat communicated. | Try with Global on and off. The data string is all on one line. |
h.t | hot hat hit heat hate hzt h t h#t h:t h.t | The data string is all on one line. Note regexpal highlighting of the period (.) in the regex. |
.a.a.a | banana papaya #a$a@a abacab | Note the last match in particular. |
a.a.a. | banana papaya #a$a@a abacab | Compare last match with preceding ones. |
9.00 | 9.00 9500 9-00 | |
9\.00 | 9.00 9500 9-00 | |
h.._export.txt | his_export.txt her_export.txt | |
h.._export\.txt | his_export.txt her_export.txt | |
resume..txt | resume1.txt resume2.txt resume3_txt.zip | The data string is all on one line. |
resume.\.txt | resume1.txt resume2.txt resume3_txt.zip | The data string is all on one line. |
a\tb | a b | There's a TAB between a and b. |
a\nb | a | The data string is on two lines. |
c\nd | abc | The data string is on two lines. |
Character Classes, Negative Character Classes, and Predefined Character Classes | ||
[aeiou] | Bananas Peaches Apples | |
gr[ea]y | gray grey | |
gr[ea]t | great | |
gr[ea][ea]t | great graet greet graat | |
[abcdefghijklmnopqrstuvwxyz] | Now we know how to make negative character sets. | Type the regex in one character at a time, then put a ^ at the start. The data string is all on one line. |
[^aeiou] | It seems I see the sea I seek. | Try with Global on and off. |
see[^mn] | It seems I see the sea I seek. | |
h[abc.xyz]t | hat hot h.t | The period is not a metacharacter in this regex. |
var[[(][0-9][\])] | var(3) var[4] | Also try [])], [)]] and [)] as the last part of the regex. |
file[0-\_]1 | file01 file-1 file\1 file_1 | The engine thinks - indicates a range. |
file[0\-\_]1 | file01 file-1 file\1 file_1 | Now it thinks \ is escaping the underscore (_). |
file[0\-\\_]1 | file01 file-1 file\1 file_1 | Now gets all four. |
\d\d\d\d | 1984 text | |
\w\w\w\w | 1984 text 1_5W | |
[\w\-] | blue-green paint | |
[\d\s] | 123 456 789 abc | |
[^\d\s] | 123 456 789 abc | |
[\D\S] | 123 456 789 abc | |
Repetition Expressions, Greediness and Laziness | ||
apples* | apple apples applessssss | Also try apples+ and apples? for the regex. |
\d\d\d\d* | 1234567890 1234 123 12 | Also try \d\d\d+ for the regex. |
colou?r | color colour | |
[a-z]+\d[a-z]* | abc9xyz | Also try 9xyz and abc9 for the string. |
\w+s | We picked apples. | Also try We picked applessssss. for the string. |
\w+_\d{2,4}-\d{2} | report_1997-04 | |
\d+\w+\d+ | 01_FY_07_report_99.xls | Illustrates regex "greediness". |
\d+\w+?\d+ | 01_FY_07_report_99.xls | Illustrates regex "laziness". |
".+", ".+" | "IBM", "Samsung", "Apple, Inc." | Illustrates regex "greediness". |
".+?", ".+?" | "IBM", "Samsung", "Apple, Inc." | Illustrates regex "laziness". |
Grouping and Alternation Metacharacters | ||
abc+ | abcccc | |
(abc)+ | abcabcabc | |
(in)?dependent | independent dependent | |
run(s)? | I run fast. He runs faster. | Same as runs? but clearer. |
[A-Z][0-9] | A1B2C3D4E5F6G7H8I9J0 | Also try ([A-Z][0-9]), ([A-Z][0-9])+ and ([A-Z][0-9]){3}. |
apple|orange | apple orange appleorange apple|orange | Also try apple\|orange. |
abc|def|ghi|jkl | abcdefghijklmnopqrstuvwxyz | Try with Global on and off. |
applejuice|sauce | applejuice applesauce | Try with Global on and off. |
apple(juice|sauce) | applejuice applesauce | |
(peanut|peanutbutter) | peanutbutter | Illustrates regex "eagerness". |
peanut(butter)? | peanutbutter | Illustrates regex "greediness". |
(\w+|FY\d{4}_report\.xls) | FY2003_report.xls | |
xyz|abc|def|ghi|jkl | abcdefghijklmnopqrstuvwxyz | Shows eagerness; turn off Global. |
(AA|BB|CC){6} | AABBAACCBB | |
(\d\d|[A-Z][A-Z]){3} | 112233 | Also try AABBCC, AA66ZZ, 11AA44. |
Anchored Expressions: Start and End Anchors | ||
[A-Z] | Mr. Smith went to Washington. | Also try ^[A-Z], \., \.$ as the regex. |
^[A-Z][A-Za-z\-. ]+\.$ | Mr. Smith went to Washington. | Also take out the period in the regex. |
^\w+@\w+\.[a-z]{3}$ | me@here.com, you@there.com | Try with and without either or both anchors. |
Anchored Expressions: Single-Line and Multiline Modes | ||
[a-z]+ | milk | Try ^ at the beginning, and $ at the end with and without a newline following the last item, and also with the "Multiline anchors" option (in regexpal) on and off. |
Anchored Expressions: Word Boundaries | ||
\b\w+\b | This is a test. | |
\b\w+\b | abc_123 | |
\b\w+\b | top-notch | |
\BThis | This is a test. | |
\B\w+\B | This is a test. | |
apples\band\boranges | apples and oranges | |
apples\b \band\b \boranges | apples and oranges | |
\b[\w']+\b | Shall I compare thee to a summer's day? | Another example of greediness (look carefully at the matches in regexpal). |
\b[\w']+?\b | Shall I compare thee to a summer's day? | Another example of laziness (look carefully at the matches in regexpal). |
Capture Groups and Backreferences | ||
(apples) to \1 | apples to apples | |
(ab)(cd)(ef)\3\2\1 | abcdefefcdab | |
<(i|em)>.+?</\1> | <i>Hello</i> | |
<(i|em)>.+?</\1> | <em>Hello</em> | |
<(i|em)>.+?</\1> | <i>italics</i> <em>emphasis</em> <i>bad</em> <em>bad</i> | Data should all be on one line. Also make the regex greedy by removing the ?. |
\b([A-Z][a-z]+)\b\s\b\1son\b | Steve Smith, John Johnson, Eric Erikson, Evan Evanson | Finding names of people whose first name is repeated in their last name. |
\b(\w+)\s+\1\b | Paris in the | Finding repeated words. |
Backreferences with Optional Groups | ||
(A?)B | AB | Matches AB and captures A. |
(A?)B | B | Matches B but captures nothing, so captures occur on zero-width matches. |
(A?)B\1 | ABA | |
(A?)B\1 | B | Matches B, so backreferences become zero-width as well. |
(A?)B\1C | ABAC | |
(A?)B\1C | BC | Matches BC, so backreferences become zero-width as well. |
Non-Capturing Group Expressions | ||
(oranges) and (apples) to oranges | oranges and apples to oranges | These do capture. |
(oranges) and (apples) to \1 | oranges and apples to oranges | The backreference does match. |
(oranges) and (apples) to \2 | oranges and apples to oranges | The backreference does not match. |
(oranges) and (apples) to \2 | oranges and apples to oranges | Now the backreference matches the second line. |
(?:oranges) and (apples) to \1 | oranges and apples to oranges | Here the backreference \1 also matches the second line because the first group is not captured. |
(?:oranges) and (apples) to \2 | oranges and apples to oranges | Now nothing matches because nothing has been saved in \2. |
Lookaround Assertions (Lookahead and Lookbehind) | ||
(?=seashore)sea | seashore seaside | A positive lookahead assertion. |
sea(?=shore) | seashore seaside | An equivalent positive lookahead assertion. |
sea(?:shore) | seashore seaside | A non-capturing expression ... be sure not to confuse what gets matched, what gets captured, and what is simply "asserted". |
\b[A-Za-z']+\b, | Verily, verily, I say onto you, | Match all words that are followed immediately by a comma and also include the comma in the match. |
\b[A-Za-z']+\b(?=,) | Verily, verily, I say onto you, | Now the lookahead assertion says the comma has to be there, but this time it is not matched. |
\b[A-Za-z']+\b(?:,) | Verily, verily, I say onto you, | Once again, the comma is matched, this time in a non-capturing group. |
\d{3}-\d{3}-\d{4} | 555-302-4321 555-781-6978 | Both are matched. |
^[0-5\-]+$ | 555-302-4321 | This matches. |
^[0-5\-]+$ | 23140-5 | This also matches. |
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-302-4321 | This matches. (The regex is all on one line.) |
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-781-6978 | This does not match. (The regex is all on one line.) |
(?=^[0-5\-]+$)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.) |
(?=^[0-5\-]+$)(?=.*4321)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.) |
\b(?=\w*ou)[A-Za-z']+\b(?=,) | Verily, verily, I say onto you, | |
(?=.*\d).{8,15} | base355ball | Match a password that contains 8 to 15 characters and at least one digit. |
(?=.*\d)(?=.*[A-Z]).{8,15} | base355Ball | Match a password that contains 8 to 15 characters and at least one digit and one capital letter. |
(?!seashore)sea | seashore seaside | A negative lookahead assertion. |
sea(?!shore) | seashore seaside | An equivalent negative lookahead assertion. |
online(?! training) | online training and online courses | The data string is all on one line. |
online(?!.*training) | online video training and online courses and online videos | Finds online as long as it's not followed by training, even if there are other words in between the two. (The data string is all on one line.) |
\bblack\b(?! dog) | The black dog followed the black car into the black night. | The data string is all on one line. |
\bblack\b(?= dog) | The black dog followed the black car into the black night. | Compare this one with the previous one. (The data string is all on one line.) |
(?=^[0-5\-]+$)(?!.*4321)\d{3}-\d{3}-\d{4} | 555-302-4321 | Put Multiline anchors on. (The regex is all on one line.) Here we're using a combination of a positive lookahead and a negative lookahead. |
\b[A-Za-z']+\b(?![.,]) | Verily, verily, I say onto you, | Find all words not followed by a period or a comma. |
\bblack\b(?!.*\bblack\b) | The black dog followed the black car into the black night. | Finds the last occurrence of the word black. (The data string is all on one line.) |
(\bblack\b)(?!.*\1) | The black dog followed the black car into the black night. | Also finds the last occurrence of the word black. (The data string is all on one line.) |
(?<=base)ball | baseball football | A positive lookbehind assertion. (Not supported in JavaScript engines that predate ECMAScript 2018.) |
ball(?<=baseball) | baseball football | An equivalent positive lookbehind assertion. (Not supported in JavaScript engines that predate ECMAScript 2018.) |
As of C++11, regular expressions are a standard part of the C++ library (in the <regex> header). The default regular expression grammar is that of ECMAScript, which is the most powerful, but there are five alternative grammars, should the need arise to use one of them: basic, extended, awk, grep and egrep. For the most part, you can simply use the default, which requires no extra effort on your part.
As with regular expressions in any programming language, if you use them you probably want to find strings that "match" a regular expression in their entirety, to search for substrings that match a regular expression, and/or to replace all or part of a string with some other string. C++11 provides functions that perform these actions:

- regex_match() tries to match the entire string and returns true or false
- regex_search() tries to find a matching substring and returns true or false
- regex_replace() replaces matching substrings (every match, by default) and returns the revised string

Some sample programs that you can find here illustrate some of the ways you can now work with regular expressions in C++11:
- learn_regex1.cpp illustrates matching, searching and replacing in the simplest possible form of each, when both the regex object and the string object on which it acts consist of just simple "ordinary" characters.
- learn_regex2.cpp shows how we can use an smatch object to get access to more details about whatever match or matches occur during a call to regex_match() or regex_search().
- learn_regex3.cpp uses a regex object that contains something more than just ordinary characters, as well as (again) an smatch object to get access to more details about whatever match or matches occur during a call to regex_match() or regex_search().

In the testers subdirectory under the link mentioned above you will find a number of other sample programs that you should study and use for experimentation.