Sprog::help::regex_intro.3pm

Language: en

Version: 2005-05-30 (fedora - 01/12/10)

Section: 3 (Library functions)

An Introduction to Perl Regular Expressions

A 'regular expression' is just a fancy name for a pattern.

You're probably already familiar with filename patterns like '"*.doc"'. Perl regular expressions are similar but much more powerful. Here's a brief introduction and some examples to get you started.

Note: 'regular expression' is often abbreviated to 'regex'.

Plain Text

The simplest pattern is just plain text. For example, this pattern ...
   cat
 
 

... will match any line containing a 'c' followed by an 'a' followed by a 't'. So all these lines would match:

   The cat sat on the mat
   He picked up the catcher's mitt
   She scattered rose petals in the wind
 
 

But this line ...

   Cat Doors - $35 each
 
 

... would not match unless you checked the 'ignore case' box.

Note: you don't need to say '"*cat*"' and in fact it would be a mistake to do so. We'll get to stars shortly.
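
If you'd like to try this outside Sprog, here is a minimal sketch in plain Perl that tests the pattern against the sample lines above. It assumes that Sprog's 'ignore case' box behaves like Perl's '/i' modifier; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Sample lines from the text above (single quotes keep '$35' literal)
my @lines = (
    "The cat sat on the mat",
    "He picked up the catcher's mitt",
    "She scattered rose petals in the wind",
    'Cat Doors - $35 each',
);

# Case-sensitive: keeps only the lines containing 'cat'
my @matched = grep { /cat/ } @lines;

# Case-insensitive: the /i modifier also accepts 'Cat'
my @matched_nocase = grep { /cat/i } @lines;

print scalar(@matched), " lines match /cat/\n";
print scalar(@matched_nocase), " lines match /cat/i\n";
```

The first count covers the three lower-case lines; the second also picks up 'Cat Doors'.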

Character Classes

You can use square brackets to specify a list of characters that should match. For example, this pattern ...
   [bcr]at
 
 

... will match any line containing either a 'b', a 'c' or an 'r' followed by an 'a' followed by a 't'. So all these lines would match:

   The cat sat on the mat
   He picked up the bat
   They gazed into the crater
 
 

But these would not:

   The hat sat on the mat
   canned flat bacon rashers
 
 

You can use '-' to specify a range of characters. For example, this pattern will match any of the first ten letters of the (English) alphabet:

   [a-j]
 
 

If the very first character in the square brackets is a '^' then it will match any character that is not in the list. For example:

   [^bc]at
 
 

This will match 'hat' and 'mat' but not 'bat' or 'cat'.
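
The same character classes can be tried in a short Perl script. This is only a sketch; the word list is invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical word list for the illustration
my @words = qw(bat cat rat hat mat flat);

# [bcr]at: a 'b', 'c' or 'r' followed by 'at'
my @in_class = grep { /[bcr]at/ } @words;

# [^bc]at: any character other than 'b' or 'c', followed by 'at'
my @negated = grep { /[^bc]at/ } @words;

print "in class: @in_class\n";   # in class: bat cat rat
print "negated:  @negated\n";    # negated:  rat hat mat flat
```

Note that 'flat' matches the negated class because the 'l' before 'at' is neither 'b' nor 'c'.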

Predefined Character Classes

There are a number of predefined character classes:
   \d   [0-9]         any digit
   \s   [ \t\r\n\f]   any 'whitespace' character
   \w   [a-zA-Z0-9_]  any letter, digit or underscore
 
   \D   [^0-9]        anything except a digit
   \S   [^\s]         anything except whitespace
   \W   [^\w]         anything except a letter, digit or underscore
 
 

So this pattern ...

   \d\dth
 
 

... would match any line containing two digits followed by 'th', such as:

   the 19th hole
 
 

The '.' can be thought of as a class which matches any single character (by default, any character except a newline). So for example '"p.t"' would match 'pet', 'pot', 'p8t' and 'p@t'. It would not match 'pt'.
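
Here is a small Perl sketch of the predefined classes and the '.' in action. The second sample string ('the fifth hole') is an invented counter-example, not one from the text above.

```perl
use strict;
use warnings;

# \d\dth: two digits followed by 'th'
my $ordinal   = "the 19th hole"  =~ /\d\dth/ ? 1 : 0;
my $no_digits = "the fifth hole" =~ /\d\dth/ ? 1 : 0;

# p.t: a 'p', then exactly one character, then a 't'
my @dot_hits = grep { /p.t/ } qw(pet pot p8t p@t pt);

print "19th matches:  $ordinal\n";     # 19th matches:  1
print "fifth matches: $no_digits\n";   # fifth matches: 0
print "p.t matches:   @dot_hits\n";
```

Everything in the list except 'pt' ends up in the last result, since '.' must consume one character.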

Matching Multiple Times

If you wanted to match any line containing 5 consecutive 'W' characters, you could use this:
   WWWWW
 
 

or this:

   W{5}
 
 

If you wanted to match 'bet' and 'beet' but not 'beeet', you could use this:

   be{1,2}t
 
 

Note: the number or numbers in the curly brackets only apply to the character or character class immediately before the '{'. In the above example it will match at least 1 but no more than 2 consecutive 'e's.
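
The counted forms can be checked with a few lines of Perl. This is a sketch only; the test strings 'WWWW' and 'bt' are added here as counter-examples.

```perl
use strict;
use warnings;

# W{5}: five consecutive 'W' characters
my $five_w = "WWWWW" =~ /W{5}/ ? 1 : 0;
my $four_w = "WWWW"  =~ /W{5}/ ? 1 : 0;

# be{1,2}t: the {1,2} applies only to the 'e' immediately before it
my @bets = grep { /be{1,2}t/ } qw(bt bet beet beeet);

print "WWWWW: $five_w, WWWW: $four_w\n";   # WWWWW: 1, WWWW: 0
print "be{1,2}t matches: @bets\n";         # be{1,2}t matches: bet beet
```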

Often you want to include an optional character in a match. To do that, you could match on zero or one occurrences:

   pots{0,1}
 
 

which will match 'pot' or 'pots' (and also 'potscrub'). This is so common it has a shorthand form - just replace '"{0,1}"' with '?' which means exactly the same thing:

   pots?
 
 

There are two other important shorthand forms. '*' means 0 or more so this:

   po*t
 
 

is the same as:

   po{0,}t
 
 

which will match 'pt', 'pot', 'poot', etc.

The other important form is '+' meaning one or more matches:

   po+t
 
 

will match 'pot', 'poot', etc.
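
To compare the three shorthand forms side by side, you could run a sketch like this one. The word lists are made up for the illustration.

```perl
use strict;
use warnings;

# '?' makes the preceding character optional (same as {0,1})
my @optional = grep { /pots?/ } qw(po pot pots potscrub);

# '*' allows zero or more of the preceding character (same as {0,})
my @star = grep { /po*t/ } qw(pt pot poot);

# '+' requires at least one of the preceding character (same as {1,})
my @plus = grep { /po+t/ } qw(pt pot poot);

print "pots? : @optional\n";   # pots? : pot pots potscrub
print "po*t  : @star\n";       # po*t  : pt pot poot
print "po+t  : @plus\n";       # po+t  : pot poot
```

The only difference between the last two results is 'pt', which has no 'o' at all.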

Anchoring a Match

If the very first character in your pattern is '^' then what follows is 'anchored' to the start of the line. For example ...
   ^pot
 
 

... will match these lines:

   potato
   potentially
 
 

but not these:

   a pot-plant
   spotted
 
 

Similarly, '$' can be used to anchor to the end of the line, so ...

   end$
 
 

... will match:

   Max is my friend
 
 

but not:

   My friend is Max
 
 

You can use both '^' and '$' in the same pattern. Here for example is a pattern that matches lines which start with a capital letter and end with an exclamation point:

   ^[A-Z].*!$
 
 

Another useful technique is to use '"\b"' to anchor a match to the beginning or end of a word (think of '"\b"' as matching the boundary between a non-word character such as a space and a word character such as a letter). For example ...

   \bcat
 
 

... will match words that start with 'cat' such as 'cat' or 'catch' but not 'scatter'. And this ...

   \bcat\b
 
 

... will only match 'cat' not 'catch' or 'scat'.
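
The anchors above can be exercised together in a short Perl sketch. The 'Stop right there' strings are invented test data; the other strings come from the examples above.

```perl
use strict;
use warnings;

# ^pot: 'pot' must appear at the very start of the line
my @starts = grep { /^pot/ } ("potato", "potentially", "a pot-plant", "spotted");

# end$: 'end' must appear at the very end of the line
my @ends = grep { /end$/ } ("Max is my friend", "My friend is Max");

# ^[A-Z].*!$: starts with a capital letter and ends with '!'
my @shouts = grep { /^[A-Z].*!$/ }
    ("Stop right there!", "stop right there!", "Stop right there");

# \bcat: 'cat' at the start of a word; \bcat\b: 'cat' as a whole word
my @prefix = grep { /\bcat/ }   qw(cat catch scatter);
my @whole  = grep { /\bcat\b/ } qw(cat catch scat);

print "^pot:    @starts\n";   # ^pot:    potato potentially
print "\\bcat:   @prefix\n";  # \bcat:   cat catch
print "\\bcat\\b: @whole\n";  # \bcat\b: cat
```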

More Information

That's far more information than you can be expected to absorb in one sitting, so we'll stop there. The best way to become proficient with regular expressions is to use them, and Sprog is an ideal tool for trying out different patterns and different data.

This quick introduction has only scratched the surface of what's possible in a Perl regular expression. The official Perl regular expression tutorial is at: <http://perldoc.perl.org/perlretut.html>.

A web search will locate many helpful pages: <http://www.google.com/search?q=perl+regular+expression+tutorial>.

There are also a number of books on the subject.