KinoSearch1::Analysis::Tokenizer.3pm

Language: en

Version: 2010-10-05 (fedora - 01/12/10)

Section: 3 (Library functions)

NAME

KinoSearch1::Analysis::Tokenizer - customizable tokenizing

SYNOPSIS

     my $whitespace_tokenizer
         = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );
 
     # or...
     my $word_char_tokenizer
         = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );
 
     # or...
     my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;
 
     # then... once you have a tokenizer, put it into a PolyAnalyzer
     my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
         analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );
 
 

DESCRIPTION

Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".

     # before:
     my $string = "three blind mice";
 
     # after:
     @tokens = qw( three blind mice );
 
 

KinoSearch1::Analysis::Tokenizer decides where to break up the text based on the value of "token_re": each match of the pattern becomes one token.

     # before:
     my $string = "Eats, Shoots and Leaves.";
 
     # tokenized by $whitespace_tokenizer
     @tokens = qw( Eats, Shoots and Leaves. );
 
     # tokenized by $word_char_tokenizer
     @tokens = qw( Eats Shoots and Leaves   );
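
The difference between the two patterns can be reproduced with core Perl alone. The following is a minimal sketch that does not load KinoSearch1: tokenize_with() is a hypothetical helper that collects every match of a pattern with a global match, which approximates the behaviour described here but is not necessarily how the module is implemented internally.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch1: return every match of a
# pre-compiled pattern, in order, as the list of tokens.
sub tokenize_with {
    my ( $token_re, $text ) = @_;
    my @tokens = $text =~ /$token_re/g;    # list-context global match
    return @tokens;
}

my $string = "Eats, Shoots and Leaves.";

# Runs of non-whitespace keep the punctuation attached.
my @whitespace_tokens = tokenize_with( qr/\S+/, $string );

# Runs of word characters drop the comma and the full stop.
my @word_char_tokens = tokenize_with( qr/\w+/, $string );

print join( '|', @whitespace_tokens ), "\n";    # Eats,|Shoots|and|Leaves.
print join( '|', @word_char_tokens ),  "\n";    # Eats|Shoots|and|Leaves
```

Which pattern is preferable depends on whether punctuation should stay part of the indexed terms.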
 
 

METHODS

new

     # match "O'Henry" as well as "Henry" and "it's" as well as "it"
     my $token_re = qr/
             \b        # start with a word boundary
             \w+       # Match word chars.
             (?:       # Group, but don't capture...
                '\w+   # ... an apostrophe plus word chars.
             )?        # Matching the apostrophe group is optional.
             \b        # end with a word boundary
         /xsm;
     my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
         token_re => $token_re, # default: what you see above
     );
 
 

Constructor. Takes one hash-style parameter.

* token_re - must be a pre-compiled regular expression matching one token.
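
To see what "pre-compiled" means and how the default pattern treats apostrophes, the pattern can be tried in plain Perl. This is only a sketch under stated assumptions: it copies the default pattern shown above and applies it with an ordinary global match rather than through KinoSearch1, and the sample sentence is made up for illustration.

```perl
use strict;
use warnings;

# The default pattern shown above, compiled with qr//.
my $token_re = qr/
        \b        # start with a word boundary
        \w+       # match word chars
        (?:       # group, but don't capture...
           '\w+   # ... an apostrophe plus word chars
        )?        # the apostrophe group is optional
        \b        # end with a word boundary
    /xsm;

# qr// returns a compiled pattern object; a plain string would not.
print ref($token_re), "\n";    # Regexp

# Illustrative input (an assumption, not taken from the module's docs).
my $string = "O'Henry said it's fine";
my @tokens = $string =~ /$token_re/g;

print join( '|', @tokens ), "\n";    # O'Henry|said|it's|fine
```

Because the apostrophe group is optional and matches at most once, a token can contain only a single embedded apostrophe.
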

COPYRIGHT

Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch1 version 1.00.