23.7. Basic Syntax of Regular Expressions
The two special symbols: '^'
and
'$'
indicate the start
and the end
of a string respectively, like so:
^The
: matches any string
that starts with The; |
of despair$
: matches a
string that ends in the substring of despair; |
^abc$
: a string that
starts and ends with abc -- that could only be abc itself! |
notice
: a string that
has the text notice in it. |
Without either of the above special character you are allowing the pattern to occur anywhere inside the string.
The symbols '*'
, '+'
,
and '?'
denote the number of times a character
or a sequence of characters may occur. What they mean is: zero or more, one or
more, and zero or one. Here are some examples:
ab*
: matches a string
that has an a
followed by zero or more
b
's (a, ab, abbb, etc.); |
ab+
: same, but there is
at least one b
(ab, abbb, etc.); |
ab?
: there might be a
b
or not; |
a?b+$
: a possible a
followed by one or more b
's ending a string. |
You can also use bounds , which come inside braces and indicate ranges in the number of occurrences:
ab{2}
: matches a string that has an
a
followed by exactly two b
's
(abb); |
ab{2,}
: there are at least two
b
's (abb, abbbb, etc.); |
ab{3,5}
: from three to five
b
's (abbb, abbbb, or abbbbb). |
Note, that you must always specify the first number of a range
(i.e, {0,2}
, not {,2}
).
Also, as you may have noticed, the symbols '*', '+', and '?' have the same effect
as using the bounds {0,}
,
{1,}
, and {0,1}
,
respectively.
Now, to quantify a sequence of characters, put them inside parentheses:
a(bc)*
: matches a string that has
an a
followed by zero or more copies of the sequence bc; |
a(bc){1,5}
: one through five copies of bc. |
There's also the '|' symbol, which works as an OR operator:
hi|hello
: matches a string that has
either hi or hello in it; |
(b|cd)ef
: a string that has either
bef or cdef; |
(a|b)*c
: a string that has a
sequence of alternating a
's and b
's
ending in a c
; |
A period ('.') stands for any single character:
a.[0-9]
: matches a string that has
an a
followed by one character and a digit; |
^.{3}$
: a string with exactly 3 characters. |
Bracket expressions specify which characters are allowed in a single position of a string:
[ab]
: matches a string that has
either an a
or a b
(that's the same as a|b
); |
[a-d]
: a string that has lowercase
letters 'a' through 'd' (that's equal to a|b|c|d
and even [abcd]
); |
^[a-zA-Z]
: a string that starts with a letter; |
[0-9]%
: a string that has a single
digit before a percent sign; |
,[a-zA-Z0-9]$
: a string that ends in
a comma followed by an alphanumeric character. |
You can also list the characters that do NOT want -- just use a '^' as
the first symbol in a bracketed expression (i.e.,
%[^a-zA-Z]%
matches a string with a character
that is not a letter between two percent signs).
Do not forget that bracket expressions are an exception to that
rule--inside them, all special characters, including the backslash ('\'),
lose their special powers (i.e., [*\+?{}.]
matches exactly any of the characters inside the brackets).
To include a literal ']' in the list, make it the first character
(following a possible '^'). To include a literal '-', make it the first or last
character, or the second endpoint of a range.