23.7. Basic Syntax of Regular Expressions
The two special symbols '^' and '$' indicate the start and the end of a string, respectively, like so:
^The : matches any string that starts with The;
of despair$ : matches a string that ends in the substring of despair;
^abc$ : a string that starts and ends with abc -- that could only be abc itself!
notice : a string that has the text notice in it.
Without either of these special characters, the pattern is allowed to occur anywhere inside the string.
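The anchoring behavior can be tried out with a short sketch. It uses Python's `re` module as a stand-in engine, which is an assumption (the text does not name an implementation); the sample strings are made up for illustration.

```python
import re

# re.search scans the whole string, so an unanchored pattern may match anywhere.
starts_with_the = re.search(r"^The", "The end") is not None
ends_with_despair = re.search(r"of despair$", "depths of despair") is not None
only_abc = re.search(r"^abc$", "abc") is not None
not_only_abc = re.search(r"^abc$", "abcabc") is not None
contains_notice = re.search(r"notice", "take notice now") is not None

print(starts_with_the, ends_with_despair, only_abc, not_only_abc, contains_notice)
# prints: True True True False True
```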
The symbols '*', '+', and '?' denote the number of times a character or a sequence of characters may occur. They mean, respectively: zero or more, one or more, and zero or one. Here are some examples:
ab* : matches a string that has an a followed by zero or more b's (a, ab, abbb, etc.);
ab+ : same, but there is at least one b (ab, abbb, etc.);
ab? : there might be a b or not;
a?b+$ : a possible a followed by one or more b's ending a string.
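A minimal sketch of the three quantifiers, again assuming Python's `re` module as the engine. The helper `whole` is hypothetical and exists only for this illustration: it requires the entire string to match, so the effect of each quantifier is visible instead of being hidden by an unanchored search.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

star = [s for s in ["a", "ab", "abbb", "b"] if whole(r"ab*", s)]
plus = [s for s in ["a", "ab", "abbb"] if whole(r"ab+", s)]
optional = [s for s in ["a", "ab", "abb"] if whole(r"ab?", s)]

# a?b+$ : an optional a, then one or more b's, at the end of the string.
tail = re.search(r"a?b+$", "xxabbb").group()

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab']
print(tail)      # abbb
```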
You can also use bounds, which come inside braces and indicate ranges in the number of occurrences:
ab{2} : matches a string that has an a followed by exactly two b's (abb);
ab{2,} : there are at least two b's (abb, abbbb, etc.);
ab{3,5} : from three to five b's (abbb, abbbb, or abbbbb).
Note that you must always specify the first number of a range (i.e., {0,2}, not {,2}). Also, as you may have noticed, the symbols '*', '+', and '?' have the same effect as using the bounds {0,}, {1,}, and {0,1}, respectively.
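The bounds, and their equivalence to '*', '+', and '?', can be checked with a small sketch. Python's `re` module is assumed as the engine here; it happens to be more lenient about a missing lower bound than the syntax described above, so the example always writes the first number explicitly.

```python
import re

candidates = ["ab", "abb", "abbb", "abbbb", "abbbbb", "abbbbbb"]

exactly_two = [s for s in candidates if re.fullmatch(r"ab{2}", s)]
at_least_two = [s for s in candidates if re.fullmatch(r"ab{2,}", s)]
three_to_five = [s for s in candidates if re.fullmatch(r"ab{3,5}", s)]

# Each shorthand symbol should accept exactly the same strings as its bound.
pairs = [(r"ab*", r"ab{0,}"), (r"ab+", r"ab{1,}"), (r"ab?", r"ab{0,1}")]
shorthand_matches_bounds = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(bound, s))
    for short, bound in pairs
    for s in ["a"] + candidates
)

print(exactly_two)               # ['abb']
print(three_to_five)             # ['abbb', 'abbbb', 'abbbbb']
print(shorthand_matches_bounds)  # True
```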
Now, to quantify a sequence of characters, put them inside parentheses:
a(bc)* : matches a string that has an a followed by zero or more copies of the sequence bc;
a(bc){1,5} : one through five copies of bc.
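As a rough illustration of quantifying a parenthesized sequence, the sketch below (assuming Python's `re` module, with made-up test strings) checks which whole strings each pattern accepts.

```python
import re

# The quantifier applies to the whole group (bc), not just to the last character.
zero_or_more = [s for s in ["a", "abc", "abcbc", "ab"]
                if re.fullmatch(r"a(bc)*", s)]

samples = ["a", "abc", "a" + "bc" * 5, "a" + "bc" * 6]
one_to_five = [s for s in samples if re.fullmatch(r"a(bc){1,5}", s)]

print(zero_or_more)  # ['a', 'abc', 'abcbc']
print(one_to_five)   # ['abc', 'abcbcbcbcbc']
```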
There's also the '|' symbol, which works as an OR operator:
hi|hello : matches a string that has either hi or hello in it;
(b|cd)ef : a string that has either bef or cdef;
(a|b)*c : a string that has a (possibly empty) sequence of a's and b's, in any order, ending in a c.
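The OR operator can be sketched as follows, assuming Python's `re` module and illustrative input strings. Note that (a|b)*c accepts any mix of a's and b's before the c, including none at all.

```python
import re

greeting = [s for s in ["oh hi", "hello there", "hey"]
            if re.search(r"hi|hello", s)]

# The parentheses limit the alternation to b or cd; ef must follow either one.
ef_words = [s for s in ["bef", "cdef", "def", "ef"]
            if re.search(r"(b|cd)ef", s)]

mixed = [s for s in ["c", "abc", "bbac", "abca"]
         if re.fullmatch(r"(a|b)*c", s)]

print(greeting)  # ['oh hi', 'hello there']
print(ef_words)  # ['bef', 'cdef']
print(mixed)     # ['c', 'abc', 'bbac']
```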
A period ('.') stands for any single character:
a.[0-9] : matches a string that has an a followed by one character and a digit;
^.{3}$ : a string with exactly 3 characters.
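A short sketch of the period, assuming Python's `re` module (where, by default, the period matches any character except a newline; the sample strings contain none).

```python
import re

# An a, then any one character, then a digit.
dot_digit = re.search(r"a.[0-9]", "xa-7y").group()

# Anchoring both ends forces the string to be exactly three characters long.
three_chars = [s for s in ["ab", "abc", "a1!", "abcd"]
               if re.search(r"^.{3}$", s)]

print(dot_digit)    # a-7
print(three_chars)  # ['abc', 'a1!']
```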
Bracket expressions specify which characters are allowed in a single position of a string:
[ab] : matches a string that has either an a or a b (that's the same as a|b);
[a-d] : a string that has one of the lowercase letters 'a' through 'd' (that's equal to a|b|c|d and even [abcd]);
^[a-zA-Z] : a string that starts with a letter;
[0-9]% : a string that has a single digit before a percent sign;
,[a-zA-Z0-9]$ : a string that ends in a comma followed by an alphanumeric character.
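The bracket expressions above can be exercised with a sketch like this one. Python's `re` module is assumed as the engine, and the input strings are invented for the example.

```python
import re

# Each bracket expression matches exactly one character from its list or range.
either = re.findall(r"[ab]", "cabbage")
range_ad = re.findall(r"[a-d]", "decade")

starts_with_letter = [s for s in ["abc", "9lives", "Zed"]
                      if re.search(r"^[a-zA-Z]", s)]

percent = re.search(r"[0-9]%", "up 25% today").group()
comma_alnum = re.search(r",[a-zA-Z0-9]$", "a,b,c") is not None

print(either)              # ['a', 'b', 'b', 'a']
print(range_ad)            # ['d', 'c', 'a', 'd']
print(starts_with_letter)  # ['abc', 'Zed']
print(percent)             # 5%
print(comma_alnum)         # True
```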
You can also list the characters that you do NOT want -- just use a '^' as the first symbol in a bracket expression (e.g., %[^a-zA-Z]% matches a string with a character that is not a letter between two percent signs).
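A negated bracket expression can be tried out like this (a sketch assuming Python's `re` module; the strings are illustrative only).

```python
import re

# '^' as the first symbol inside the brackets inverts the list.
non_letter = re.search(r"%[^a-zA-Z]%", "rate: %5% off").group()

# A letter between the percent signs is rejected.
letter_rejected = re.search(r"%[^a-zA-Z]%", "%a%") is None

print(non_letter)       # %5%
print(letter_rejected)  # True
```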
Do not forget that bracket expressions are an exception to the rule that special characters must be escaped with a backslash to be taken literally: inside them, all special characters, including the backslash ('\'), lose their special powers (e.g., [*\+?{}.] matches exactly any of the characters inside the brackets). To include a literal ']' in the list, make it the first character (following a possible '^'). To include a literal '-', make it the first or last character, or the second endpoint of a range.
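The sketch below illustrates literal special characters inside brackets. It assumes Python's `re` module, which differs from the behavior described above in one respect: Python still treats the backslash as an escape character inside brackets, so the example leaves the backslash out of the list. The placement rules for ']' and '-' shown here do apply in Python as well.

```python
import re

# Inside brackets, *, +, ?, {, } and . are ordinary characters.
specials = re.findall(r"[*+?{}.]", "a*b+c?d{e}f.g")

# A ']' placed first in the list is taken literally.
bracket_first = re.findall(r"[]a]", "a]b")

# A '-' placed last or first in the list is taken literally.
hyphen_last = re.findall(r"[a-]", "a-b")
hyphen_first = re.findall(r"[-a]", "b-a")

print(specials)       # ['*', '+', '?', '{', '}', '.']
print(bracket_first)  # ['a', ']']
print(hyphen_last)    # ['a', '-']
print(hyphen_first)   # ['-', 'a']
```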