23.7. Basic Syntax of Regular Expressions

Two special symbols, '^' and '$', indicate the start and the end of a string, respectively:

^The : matches any string that starts with The;
of despair$ : matches a string that ends in the substring of despair;
^abc$ : a string that starts and ends with abc -- that could only be abc itself!
notice : a string that has the text notice in it.

Without either of these special characters, the pattern may occur anywhere inside the string, as in the last example.
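
The anchor behavior above can be checked with a short sketch. Python and its re module are assumptions here (the text does not name a language), and the helper name found and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# '^' pins the match to the start of the string.
starts = found(r"^The", "The end")                  # True
starts_late = found(r"^The", "At The end")          # False: "The" is not at the start

# '$' pins the match to the end of the string.
ends = found(r"of despair$", "depths of despair")   # True

# Both anchors together: the whole string must be exactly "abc".
exact = found(r"^abc$", "abc")                      # True
too_long = found(r"^abc$", "abcabc")                # False

# No anchors: the text may occur anywhere in the string.
anywhere = found(r"notice", "take notice now")      # True
```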

The symbols '*', '+', and '?' denote the number of times a character or a sequence of characters may occur. They mean zero or more, one or more, and zero or one, respectively. Here are some examples:

ab* : matches a string that has an a followed by zero or more b's (a, ab, abbb, etc.);
ab+ : same, but there is at least one b (ab, abbb, etc.);
ab? : an a followed by zero or one b (a or ab);
a?b+$ : an optional a followed by one or more b's at the end of the string.
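
To see what each quantifier actually consumes, here is a minimal sketch, assuming Python's re module. The helper first_match and the sample strings are illustrative only, not part of the original text.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

star = first_match(r"ab*", "xabbby")         # "abbb": an a, then as many b's as possible
star_zero = first_match(r"ab*", "xay")       # "a": zero b's is allowed
plus = first_match(r"ab+", "xay")            # None: at least one b is required
optional = first_match(r"ab?", "abb")        # "ab": at most one b is consumed
at_end = first_match(r"a?b+$", "xxabbb")     # "abbb"
not_at_end = first_match(r"a?b+$", "abba")   # None: the string does not end in b
```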

You can also use bounds, which come inside braces and indicate ranges in the number of occurrences:

ab{2} : matches a string that has an a followed by exactly two b's (abb);
ab{2,} : there are at least two b's (abb, abbbb, etc.);
ab{3,5} : from three to five b 's (abbb, abbbb, or abbbbb).

Note that you must always specify the first number of a range (e.g., {0,2}, not {,2}). Also, as you may have noticed, the symbols '*', '+', and '?' have the same effect as the bounds {0,}, {1,}, and {0,1}, respectively.
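
The bounds and their equivalence to '*', '+', and '?' can be sketched as follows, assuming Python's re module. The helper whole (which tests the entire string rather than a substring) and the sample strings are assumptions. Python happens to accept {,2} as well, so the rule about always giving the first number is not exercised here.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

exactly_two = whole(r"ab{2}", "abb")          # True
only_one = whole(r"ab{2}", "ab")              # False
at_least_two = whole(r"ab{2,}", "abbbb")      # True
three_to_five = whole(r"ab{3,5}", "abbbb")    # True
six_bs = whole(r"ab{3,5}", "abbbbbb")         # False: six b's is out of range

# Each shorthand quantifier agrees with its bound form on these samples.
samples = ["a", "ab", "abb", "b", ""]
pairs = [(r"ab*", r"ab{0,}"), (r"ab+", r"ab{1,}"), (r"ab?", r"ab{0,1}")]
same = all(
    whole(short, s) == whole(bound, s)
    for short, bound in pairs
    for s in samples
)  # True
```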

Now, to quantify a sequence of characters, put them inside parentheses:

a(bc)* : matches a string that has an a followed by zero or more copies of the sequence bc;
a(bc){1,5} : one through five copies of bc.
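
A small sketch of quantified groups, assuming Python's re module. The helper whole and the test strings are illustrative. The last line contrasts the grouped form with an ungrouped quantifier, which applies only to the single preceding character.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

zero_copies = whole(r"a(bc)*", "a")               # True
two_copies = whole(r"a(bc)*", "abcbc")            # True
partial = whole(r"a(bc)*", "abcb")                # False: trailing b is not a full "bc"

needs_one = whole(r"a(bc){1,5}", "a")             # False
five = whole(r"a(bc){1,5}", "a" + "bc" * 5)       # True
six = whole(r"a(bc){1,5}", "a" + "bc" * 6)        # False

# Without parentheses, '*' applies to the c alone.
no_group = whole(r"abc*", "abccc")                # True
no_group_seq = whole(r"abc*", "abcbc")            # False
```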

There's also the '|' symbol, which works as an OR operator:

hi|hello : matches a string that has either hi or hello in it;
(b|cd)ef : a string that has either bef or cdef;
(a|b)*c : a string that has any sequence of a's and b's (possibly empty) followed by a c.
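
The alternation examples can be tried out with a sketch like this one, assuming Python's re module. The helpers found and whole and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

def whole(pattern, text):
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

either = found(r"hi|hello", "say hello")   # True
bef = found(r"(b|cd)ef", "bef")            # True
cdef = found(r"(b|cd)ef", "cdef")          # True
cef = found(r"(b|cd)ef", "cef")            # False: neither "bef" nor "cdef" occurs

# (a|b)*c accepts any mix of a's and b's, including none, before the c.
mixed = whole(r"(a|b)*c", "aabbac")        # True
just_c = whole(r"(a|b)*c", "c")            # True
wrong_end = whole(r"(a|b)*c", "abd")       # False
```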

A period ('.') stands for any single character:

a.[0-9] : matches a string that has an a followed by one character and a digit;
^.{3}$ : a string with exactly 3 characters.
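
The two period examples, sketched under the assumption of Python's re module (where, by default, '.' matches any character except a newline). The helper found and the inputs are illustrative.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

with_gap = found(r"a.[0-9]", "ax7")     # True: a, any character, digit
no_gap = found(r"a.[0-9]", "a7")        # False: the '.' consumes the 7, leaving no digit

three = found(r"^.{3}$", "abc")         # True
four = found(r"^.{3}$", "abcd")         # False
with_space = found(r"^.{3}$", "a c")    # True: a space counts as a character
```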

Bracket expressions specify which characters are allowed in a single position of a string:

[ab] : matches a string that has either an a or a b (that's the same as a|b);
[a-d] : a string that has one of the lowercase letters a through d (that's equal to a|b|c|d and even [abcd]);
^[a-zA-Z] : a string that starts with a letter;
[0-9]% : a string that has a single digit before a percent sign;
,[a-zA-Z0-9]$ : a string that ends in a comma followed by an alphanumeric character.

You can also list the characters that you do NOT want: just use a '^' as the first symbol in a bracket expression (e.g., %[^a-zA-Z]% matches a string with a character that is not a letter between two percent signs).
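
The bracket expressions above, including the negated form, can be sketched as follows. Python's re module, the helper found, and the sample strings are assumptions chosen for illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

letter_start = found(r"^[a-zA-Z]", "Hello")           # True
digit_start = found(r"^[a-zA-Z]", "1st")              # False
percent = found(r"[0-9]%", "50% off")                 # True
comma_end = found(r",[a-zA-Z0-9]$", "x,y")            # True
comma_not_end = found(r",[a-zA-Z0-9]$", "x,y.")       # False: '.' follows the y

# A leading '^' inside the brackets negates the list.
negated = found(r"%[^a-zA-Z]%", "%4%")                # True
negated_letter = found(r"%[^a-zA-Z]%", "%a%")         # False

# [a-d] behaves like a|b|c|d on these samples.
same = all(
    found(r"[a-d]", s) == found(r"a|b|c|d", s)
    for s in ["a", "d", "e", "xcx", ""]
)  # True
```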

To match a special character literally outside a bracket expression, escape it with a backslash ('\'). Do not forget that bracket expressions are an exception to that rule: inside them, all special characters, including the backslash ('\'), lose their special powers (e.g., [*\+?{}.] matches any one of the characters inside the brackets). To include a literal ']' in the list, make it the first character (following a possible '^'). To include a literal '-', make it the first or last character, or the second endpoint of a range.
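
A hedged sketch of literal characters inside brackets, assuming Python's re module. One caveat: unlike the syntax described here, Python still treats the backslash as an escape character inside brackets, so the example lists the special characters without one. The placement rules for ']' and '-' behave in Python as described. The sample strings are made up.

```python
import re

# Special characters listed inside brackets are matched literally.
specials = re.findall(r"[*+?{}.]", "a*b+c?")      # ['*', '+', '?']

# A ']' placed first in the list is a literal, not the closing bracket.
close_bracket = re.findall(r"[]x]", "a]xb")       # [']', 'x']

# The same holds after a leading '^': match anything except ']' and 'x'.
not_bracket = re.findall(r"[^]x]", "]xy")         # ['y']

# A '-' placed last in the list is a literal, not a range operator.
hyphen = re.findall(r"[a-]", "b-a")               # ['-', 'a']
```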