Regexp::List.3pm

Langue: en

Version: 2004-12-05 (ubuntu - 07/07/09)

Section: 3 (Bibliothèques de fonctions)

NAME

Regexp::List - builds regular expressions out of a list of words

SYNOPSIS

   use Regexp::List;
   my $l  = Regexp::List->new;
   my $re = $l->list2re(qw/foobar fooxar foozap fooza/);
   # $re is now qr/foo(?:[bx]ar|zap?)/
 
 

ABSTRACT

This module offers "list2re" method that turns a list of words into an optimized regular expression which matches all words therein.

The optimized regular expression is much more efficient than a simple-minded '|'-concatenation thereof.

DESCRIPTION

This module use Object-Oriented approach so you can use this module as a base and tweak its features. This module is a base class of Regexp::Optimizer.

EXPORT

Since this is an OO module there is no symbol exported.

METHODS

This module offers methods below;
$l = Regexp::List->new(key=>value, ...)
Constructor. When arguments are fed in key => value, manner, it sets attributes. See "$l->set" for details
$re = $l->list2re(list of words ...)
Does the job. Takes a list of words and turn it into an optimal regular expresson. See ``IMPLEMENTATION'' to find out how it is achieved. If you want to know the underlying black magic even further, see the source.
$l->set(key => value, ...)
Sets attributes. There are many attributes supported but let me mention just a few that you may be interested.
lookahead
Whether you prepend a lookahead assertion or not. Default value is 1. This module is smart enough to omit the assertion when you don't need one.
   $re = $l->list2re(qw/1 2 3 infinity/);
   # qr/(?=[123i])(?:[123]|infinity)/
   $re = $l->set(lookahead=>0)->list2re(qw/1 2 3 infinity/);
   # qr/(?:[123]|infinity)/
 
 
quotemeta
Whether you quote metacharacters or not. Default is 1. If you really need this feature try Regexp::Optimizer instead.
   @list = qw/3 3.14 3.14159265358979/;
   $re = $l->list2re(@list);
   # qr/3(?:\.14(?:159265358979)?)?)/
   $re = $l->set(lookahead=>0)->list2re(@list);
   # qr/3(?:.14(?:159265358979)?)?)/
   # which does match 3.14 but also "11+3=14"
 
 
modifies
Currently it accepts 'i', 'm', 's', and 'x', the same as regular expression modifiers.
   @list = qw/Perl perl BASIC basic/;
   $re = $l->list2re(@list);
   # qr/(?=[BPbp])(?:[Pp]erl|BASIC|basic)/
   $re = $l->set(modifiers => 'i')->list2re(@list);
   # qr/(?=[bp])(?:perl|basic)/i
   print $l->set(modifiers => 'x')->list2re(@list);
   # Try for yourself;  Isn't itcute ?
 
 
$l->expand($re);
Utility methods to expand a regular expression. Handy when you want to check the complex regexes.
$l->unexpand($re);
Utility methods to unexpand a regular expression.

IMPLEMENTATION

This module optimizes the regular expression as follows. Let's see what happens when qw/foobar fooxar foozap fooza/ is fed
trie building (common prefix aggregation)
first the corresponding trie structure is built
        +- bar
   foo -+- xar
        +- za -+- p
               +- ''
 
 
common suffix aggregation
it checks if any leaf node can be optimized for common suffix. In this case we can do so to ``bar'' and ``xar''.
        +- b -+-ar
   foo -+- x -+
        +- za -+- p
               +- ''
 
 
character class conversion
If a branch contains more than two single characters, it turns it into a character class.
   foo -+- [bx] --- ar
        +- za -+-p
               +- ''
 
 
empty leaf to "?"
Empty leaf is converted to a '?' quantifier
   foo -+- [bx] --- ar
        +- za -+-p?
 
 
join all leafs into a group
And the final result is reached.
   foo(?:[bx]ar|zap?)
 
 

BENCHMARKS

This module is faily robust. You can practically use this module to find a regular expression that matches all words in a dictionary. Here is a result by on perl 5.8.0, FreeBSD 4-Stable, Pentium III 800 Mhz with 512 MB RAM.
  # Sat May 31 09:11:06 2003 (     0.000000 s) Reading /usr/share/dict/words
  # Sat May 31 09:11:07 2003 (     0.847797 s) 235881 lines read.
  # Sat May 31 09:11:07 2003 (     0.000000 s) Making regexp.
  # Sat May 31 09:13:09 2003 (   121.596928 s) Done.
  # Sat May 31 09:13:09 2003 (     0.000000 s) Saving to t/words.rx
  # Sat May 31 09:13:09 2003 (     0.000000 s) Reading t/words.rx
  # Sat May 31 09:13:13 2003 (     3.679176 s) Done.
  # Sat May 31 09:13:13 2003 (     0.000000 s) Opening /usr/share/dict/words for comparison.
  # Sat May 31 09:13:13 2003 (     0.255222 s) /usr/share/dict/words:235881 lines found.
  # Sat May 31 09:13:13 2003 (     0.000000 s) Showtime!
  # 235881/235881
  # Sat May 31 10:44:17 2003 (  5464.370409 s) Done.
  # Sat May 31 10:44:17 2003 (  5464.370624 s) 43.167 matches/s
 
 

The result of optimization is obvious as the number of alteration increases. Here is a result of a benchmark which matches randomly picked words against "/usr/share/dict/words".

   ====        2 words
           Rate naive optim
   naive 1.79/s    --  -28%
   optim 2.49/s   39%    --
 
   ====      256 words
         s/iter naive optim
   naive   31.7    --  -81%
   optim   5.95  433%    --
 
 

SEE ALSO

Regexp::Optimizer --- uses this module as its base

"eg/" directory in this package contains example scripts.

Perl standard documents
perltodo, perlre
CPAN Modules
Regexp::Presuf, Text::Trie
Books
Mastering Regular Expressions <http://www.oreilly.com/catalog/regex2/>

AUTHOR

Dan Kogai <dankogai@dan.co.jp> Copyright 2003 by Dan Kogai, All Rights Reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.