Lingua::Stem::EnBroken.3pm

Language: en

Version: 2007-10-23 (debian - 07/07/09)

Section: 3 (Library functions)

NAME

Lingua::Stem::EnBroken - Porter's stemming algorithm for 'generic' English

SYNOPSIS

     use Lingua::Stem::EnBroken;
     my $stems = Lingua::Stem::EnBroken::stem({
         -words      => $word_list_reference,
         -locale     => 'en',
         -exceptions => $exceptions_hash,
     });

DESCRIPTION

This routine MIS-applies the Porter Stemming Algorithm to its parameters, returning the stemmed words. It is an intentionally broken version of Lingua::Stem::En for people needing backwards compatibility with Lingua::Stem 0.30 and Lingua::Stem 0.40. Do not use it if you aren't one of those people.

It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes:

    Purpose:    Implementation of the Porter stemming algorithm documented
                in: Porter, M.F., "An Algorithm For Suffix Stripping,"
                Program 14 (3), July 1980, pp. 130-137.
    Provenance: Written by B. Frakes and C. Cox, 1986.
 
 

I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen). Porter's algorithm still has rough spots (e.g., current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix.

CHANGES

  2003.09.28 -  Documentation fix
 
 
  2000.09.14 -  Forked from the Lingua::Stem::En.pm module to provide
                a backward compatibly broken version for people needing
                consistent behavior with 0.30 and 0.40 more than accurate
                stemming.
 
 

METHODS

stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });
Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words.

Example:

   my $stemmed_words = Lingua::Stem::EnBroken::stem({
       -words      => \@words,
       -locale     => 'en',
       -exceptions => \%exceptions,
   });
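
A more concrete sketch of the same call may help. The word list and the exception entry below are invented for illustration, and the example assumes that -exceptions maps a word to the stem that should be returned for it unchanged. No expected output is shown, because the stems produced by this module deliberately differ from those of a correct Porter stemmer.

```perl
use strict;
use warnings;
use Lingua::Stem::EnBroken;

# Hypothetical input data, chosen only for illustration.
my @words      = qw(running jumps easily mice);
my %exceptions = ( mice => 'mouse' );   # assumption: word => stem to use as-is

my $stemmed_words = Lingua::Stem::EnBroken::stem({
    -words      => \@words,
    -locale     => 'en',
    -exceptions => \%exceptions,
});

# The result is an array reference holding one entry per input word.
for my $i (0 .. $#words) {
    printf "%-10s => %s\n", $words[$i], $stemmed_words->[$i];
}
```

Passing references keeps the call signature identical to the one documented above; only the contents of @words and %exceptions are made up here.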
 
 
stem_caching({ -level => 0|1|2 });
Sets the level of stem caching.

'0' means 'no caching'. This is the default level.

'1' means 'cache per run'. This caches stemming results during a single
    call to 'stem'.

'2' means 'cache indefinitely'. This caches stemming results until
    either the process exits or the 'clear_stem_cache' method is called.

clear_stem_cache;
Clears the cache of stemmed words.
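
The caching calls above could be combined roughly as follows. This is a minimal sketch rather than a recommended setup: the word list is hypothetical, and whether the second call is actually answered from the cache is an internal detail that the example does not try to observe.

```perl
use strict;
use warnings;
use Lingua::Stem::EnBroken;

# Hypothetical word list, used only to have something to stem twice.
my @words = qw(connect connected connecting connection);

# Level 2: keep cached stems across calls to stem().
Lingua::Stem::EnBroken::stem_caching({ -level => 2 });

my $first  = Lingua::Stem::EnBroken::stem({ -words => \@words, -locale => 'en' });
my $second = Lingua::Stem::EnBroken::stem({ -words => \@words, -locale => 'en' });

# Caching is only a speed-up; both calls should yield the same stems.
print "same results\n" if "@$first" eq "@$second";

# Drop everything cached so far, then go back to the default (no caching).
Lingua::Stem::EnBroken::clear_stem_cache();
Lingua::Stem::EnBroken::stem_caching({ -level => 0 });
```

Level 2 trades memory for speed in long-running processes, so clearing the cache explicitly is worth considering when the vocabulary is large.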

NOTES

This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.

SEE ALSO

  Lingua::Stem
 
 

AUTHOR

   Jim Richardson, University of Sydney
   jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html
 
 
   Integration in Lingua::Stem by
   Benjamin Franz, FreeRun Technologies,
   snowhare@nihongo.org or http://www.nihongo.org/snowhare/
 
 
COPYRIGHT

Jim Richardson, University of Sydney
Benjamin Franz, FreeRun Technologies

This code is freely available under the same terms as Perl.

BUGS

TODO