Lingua::Ident.3pm

Language: en

Version: 2006-11-11 (mandriva - 01/05/08)

Section: 3 (Library functions)

NAME

Lingua::Ident -- Statistical language identification

SYNOPSIS

  use Lingua::Ident;
  $i    = new Lingua::Ident("filename 1", ..., "filename n");
  $lang = $i->identify("text to classify");

DESCRIPTION

This module implements a statistical language identifier.

The filename arguments to the constructor must refer to files containing tables of n-gram probabilities for languages, one table per language. These tables can be generated using the trainlid(1) utility program.
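
To give a feel for what classification with n-gram statistics involves, the following is a deliberately simplified sketch. It is not the module's actual algorithm and does not use its table format: it counts character bigrams in two tiny made-up training strings and picks the language under which a text is most probable. The function names train, score and classify, the add-one smoothing and the assumed vocabulary size are all illustrative choices.

```perl
use strict;
use warnings;

# Build a table of character-bigram counts from a training text.
sub train {
    my ($text) = @_;
    my %count;
    my $total = 0;
    for my $i (0 .. length($text) - 2) {
        $count{ substr($text, $i, 2) }++;
        $total++;
    }
    return { count => \%count, total => $total };
}

# Log-probability of $text under $model, with add-one smoothing so that
# unseen bigrams do not yield log(0). The vocabulary size is an assumption
# (all pairs of single-byte characters).
sub score {
    my ($model, $text) = @_;
    my $vocab = 256 * 256;
    my $logp  = 0;
    for my $i (0 .. length($text) - 2) {
        my $seen = $model->{count}{ substr($text, $i, 2) } // 0;
        $logp += log( ($seen + 1) / ($model->{total} + $vocab) );
    }
    return $logp;
}

# Return the label of the model that gives $text the highest score.
sub classify {
    my ($models, $text) = @_;
    my ($best) = sort { score($models->{$b}, $text) <=> score($models->{$a}, $text) }
                 keys %$models;
    return $best;
}

# Toy training data; real tables need far more text.
my %models = (
    'en' => train("the quick brown fox jumps over the lazy dog"),
    'es' => train("el veloz murcielago hindu comia feliz cardillo y kiwi"),
);
print classify(\%models, "the lazy fox"), "\n";
```

With training text this small the result is only meaningful for strings that closely resemble the training data, which is why realistic tables are built from much larger samples.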

RETURN VALUE

The identify() method returns the value specified in the _LANG field of the probabilities table of the language to which the text most likely belongs (see "WARNINGS").

The _LANG value should preferably be a POSIX locale name constructed from an ISO 639 2-letter language code, optionally extended by an ISO 3166 2-letter country code and a character set identifier. Example: de_DE.iso88591.
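
If the tables follow this naming recommendation, the returned value can be split into its parts. The sketch below is one possible way to do that; parse_locale is a hypothetical helper, not part of Lingua::Ident, and the pattern accepts only the form described above (language, optional country, optional character set).

```perl
use strict;
use warnings;

# Split a POSIX-style locale name such as "de_DE.iso88591" into its parts.
# Returns a hash reference, or undef if the string does not match.
sub parse_locale {
    my ($name) = @_;
    return undef unless defined $name
        && $name =~ /\A([a-z]{2})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?\z/;
    return { language => $1, country => $2, charset => $3 };
}

my $parts = parse_locale("de_DE.iso88591");
printf "language=%s country=%s charset=%s\n",
    $parts->{language},
    $parts->{country} // '-',
    $parts->{charset} // '-';
```

Because _LANG is free-form, a table may contain a value that does not match; the helper returns undef in that case so the caller can fall back to using the raw string.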

WARNINGS

Because Lingua::Ident is based on statistics, it cannot be 100% accurate. More precisely, Dunning (see "SEE ALSO") reports that his implementation achieves 92% accuracy with 50K of training text for 20-character strings when discriminating between English and Spanish. This implementation should be as accurate as Dunning's. However, not only the size but also the quality of the training text plays a role.

The current implementation does not use a threshold to determine whether the most probable language has a high enough probability. If you try to classify a text in a language for which there is no probability table, identify() therefore still returns one of the languages it knows, which is necessarily incorrect.
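
As an illustration of what such a threshold could look like, the sketch below rejects the best candidate when its average log-probability per character falls below a floor. Everything in it is an assumption: this page does not document a way to obtain per-language scores from Lingua::Ident, so the score hashes, the function best_or_unknown and the floor of -3.0 are made up for the example.

```perl
use strict;
use warnings;

# Return the best-scoring language, or undef when even the best score,
# averaged over the text length, falls below $min_avg.
sub best_or_unknown {
    my ($scores, $n_chars, $min_avg) = @_;
    return undef unless %$scores && $n_chars > 0;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    return $scores->{$best} / $n_chars >= $min_avg ? $best : undef;
}

# Hypothetical log-probabilities for two 20-character texts.
my %known   = ('en_US' => -52.1, 'es_ES' => -71.8);
my %unknown = ('en_US' => -95.0, 'es_ES' => -92.4);

for my $case (\%known, \%unknown) {
    my $lang  = best_or_unknown($case, 20, -3.0);
    my $label = defined $lang ? $lang : "unknown";
    print "$label\n";
}
```

A suitable floor depends on the training data and would have to be determined empirically, for example from texts whose language is known.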

AUTHOR

Lingua::Ident was developed by Michael Piotrowski <mxp@dynalabs.de>.

LICENSE

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Dunning, Ted (1994). Statistical Identification of Language. Technical report CRL MCCS-94-273. Computing Research Lab, New Mexico State University.