slmbuild

Language: en

Version: 2010-08-16 (ubuntu - 24/10/10)

Section: 1 (User commands)

Contents

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
NOTE
EXAMPLE
AUTHOR
SEE ALSO

NAME

slmbuild - generate language model from idngram file

SYNOPSIS

slmbuild [option]... idngram_file...

DESCRIPTION

slmbuild generates a back-off smoothing language model from a given idngram file. Generally, the idngram_file is created by ids2ngram.

OPTIONS

All the following options are mandatory.

-n, --NMax N

1 for unigram, 2 for bigram, 3 for trigram. Any number outside the range 1..3 is invalid.

-o, --out output-file

Specify the output file name.

-l, --log

Use -log(pr) in the model. By default, pr is used directly.

-w, --wordcount N

Lexicon size, i.e. the number of different words.

-b, --brk id...

Set the ids which should be treated as breakers.

-e, --e id...

Set the ids which should not be put into the LM.

-c, --cut c...

k-grams whose freq <= c[k] are dropped.

-d, --discount method, param...

The k-th -d option specifies the discount method for k-grams. Possible values for method/param are:

       GT,R,dis   : GT (Good-Turing) discount for r <= R, where r is the frequency of an n-gram.
                    Linear discount for those r > R, i.e. r'=r*dis,
                    0 << dis < 1.0, for example 0.999.
       ABS,[dis]  : Absolute discount r'=r-dis. dis is optional;
                    0 << dis < cut[k]+1.0, normally dis < 1.0.
       LIN,[dis]  : Linear discount r'=r*dis. dis is optional;
                    0 < dis < 1.0.
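
The cut-off rule of -c and the ABS and LIN formulas of -d can be sketched in a few lines of Python. This only illustrates the formulas as stated on this page and is not slmbuild's actual implementation. The function names and the sample counts are hypothetical, the automatic calculation of dis is not shown, and the Good-Turing branch is left out because this page does not give its formula.

```python
def apply_cutoff(counts, cutoff):
    """Drop k-grams whose frequency is <= cutoff (the -c rule)."""
    return {ngram: r for ngram, r in counts.items() if r > cutoff}


def absolute_discount(r, dis):
    """ABS: r' = r - dis."""
    return r - dis


def linear_discount(r, dis):
    """LIN: r' = r * dis, with 0 < dis < 1.0."""
    if not 0.0 < dis < 1.0:
        raise ValueError("dis must satisfy 0 < dis < 1.0")
    return r * dis


# Hypothetical bigram counts keyed by word-id tuples.
bigram_counts = {(10, 42): 5, (42, 7): 3, (7, 9): 1}

# With cut-off 3, only bigrams seen more than 3 times survive.
kept = apply_cutoff(bigram_counts, cutoff=3)

# Absolute discount with an explicitly chosen dis of 0.5.
discounted = {ngram: absolute_discount(r, 0.5) for ngram, r in kept.items()}
```

Here the cut-off is applied before discounting, which matches the range given for ABS (dis < cut[k]+1.0 keeps every surviving count positive).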

NOTE

-n must be given before -c and -b. -c must give the right number of cut-offs, and -d must appear exactly N times, specifying the discounts for 1-grams, 2-grams, ..., respectively.
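
These constraints can be checked before launching a long build. The following is a minimal sketch, not part of slmbuild: the function name `check_slmbuild_args` is hypothetical, only the short option forms are handled, each option value is assumed to be a single comma-separated token as in the EXAMPLE section, and "the right number of cut-offs" is assumed to mean N.

```python
def check_slmbuild_args(argv):
    """Return a list of problems found in an slmbuild-style argument list."""
    takes_value = {"-n", "-o", "-w", "-b", "-e", "-c", "-d"}
    problems = []
    n = None
    cutoffs = None
    discount_count = 0
    i = 0
    while i < len(argv):
        opt = argv[i]
        if opt not in takes_value:
            i += 1  # a flag such as -l, or the idngram file name
            continue
        if i + 1 >= len(argv):
            problems.append(f"{opt} is missing its value")
            break
        value = argv[i + 1]
        if opt == "-n":
            n = int(value)
            if not 1 <= n <= 3:
                problems.append("-n must be in the range 1..3")
        elif opt in ("-c", "-b") and n is None:
            problems.append(f"{opt} given before -n")
        if opt == "-c":
            cutoffs = value.split(",")
        elif opt == "-d":
            discount_count += 1
        i += 2
    if n is not None:
        # Assumption: one cut-off per n-gram level, i.e. N values.
        if cutoffs is not None and len(cutoffs) != n:
            problems.append(f"-c gives {len(cutoffs)} cut-offs, expected {n}")
        if discount_count != n:
            problems.append(f"-d appears {discount_count} times, expected {n}")
    return problems
```

An empty list means none of the checks failed; it does not guarantee that slmbuild itself will accept the arguments.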

BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these ids have no meaning when they appear in the middle of an n-gram.

EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams which contain those ids are meaningless.

We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly from the IDNGRAM file, because some low-level information in it is still useful.

EXAMPLE

The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.

At the 1-gram level, use Good-Turing discount with cut-off 0, R=8, dis=0.9995. At the 2-gram level, use Absolute discount with cut-off 3, dis auto-calculated. At the 3-gram level, use Absolute discount with cut-off 2, dis auto-calculated. Word ids 10, 11 and 12 are breakers (sentence, paragraph, paper breakers, etc.). The exclude id is 9. The lexicon contains 200000 words. The resulting language model uses -log(pr).

slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram
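
When the build is driven from a script, the same command line can be assembled from structured settings. The sketch below is only an illustration: the helper `slmbuild_command` and its parameter names are hypothetical, it uses only the options documented on this page, and it builds the argument list without running anything.

```python
import shlex


def slmbuild_command(nmax, out, wordcount, cutoffs, discounts,
                     breakers, excludes, idngram, use_log=True):
    """Assemble an slmbuild argument list from the documented options."""
    argv = ["slmbuild"]
    if use_log:
        argv.append("-l")
    # -n is emitted before -c and -b, as the NOTE section requires.
    argv += ["-n", str(nmax), "-o", out, "-w", str(wordcount)]
    argv += ["-c", ",".join(str(c) for c in cutoffs)]
    for discount in discounts:  # one -d per n-gram level
        argv += ["-d", discount]
    argv += ["-b", ",".join(str(i) for i in breakers)]
    argv += ["-e", ",".join(str(i) for i in excludes)]
    argv.append(idngram)
    return argv


cmd = slmbuild_command(
    nmax=3, out="all.slm", wordcount=200000, cutoffs=[0, 3, 2],
    discounts=["GT,8,0.9995", "ABS", "ABS"],
    breakers=[10, 11, 12], excludes=[9], idngram="all.id3gram",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where slmbuild is installed.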

AUTHOR

Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>.
