estim_pm

Language: en

Version: 111136 (mandriva - 01/05/08)

Section: 1 (User commands)

NAME

estim_pm - Parsimonious Markov model estimation tool.

SYNOPSIS

estim_pm arguments [options]

DESCRIPTION

estim_pm estimates a Parsimonious Markov model and computes the associated statistics. The model is estimated from the input sequence(s). The stationary distribution is also computed. The resulting model can then be used to simulate sequences with the simul_m program.

ARGUMENTS

sequence_file
Either the name of a file containing a set of sequences in FASTA format, or the name of a file containing a list of filenames, each of which contains a set of sequences in FASTA format.
-d --order=INTEGER
Order of the Markov model.

OPTIONS

-p --phase=INTEGER
Number of phases (default = 1).
-a --alphabet=FILENAME
A file describing the alphabet to use (DNA alphabet, default setting).
-A --Alphabet=EXPRESSION
An expression describing the alphabet to use, of the form [number of characters per pattern, less than 10]+[:]+[list of alphabet patterns], for instance 1:AGCT (DNA alphabet, default setting).
--dna
Use DNA alphabet (1:AGCT, default setting).
--protein
Use amino acid alphabet (1:IVLFCMAGTWSYPHEQDNKR).
-o --output=FILENAME
Result file containing the parameters of the estimated Parsimonious Markov model.
--partition=FILENAME
A file describing the partitions of the alphabet to use (all partitions, default setting).
-b FLOAT
Bayesian prior hyperparameter (1./alphabet_size, default setting).
--penality FLOAT
Penalty on the number of leaves (0, default setting).
--oxml FILENAME
Tree-shaped results file in XML format. Available only if the package was built with ./configure --enable-xml.
-l --likelihood=FILENAME
Compute the likelihood under the selected model on the sequences contained in FILENAME or on the sequences whose filenames are listed in FILENAME.
-L --Likelihood
Compute the likelihood under the selected model on the sequences specified by the sequence_file argument.
-b --bic=FILENAME
Compute the BIC under the selected model on the sequences contained in FILENAME or on the sequences whose filenames are listed in FILENAME.
-B --Bic=FILENAME
Compute the BIC under the selected model on the sequences specified by the sequence_file argument.
--all
Compute the total BIC/likelihood for all the given sequences.
-v --version
Display the version number and exit.
-h --help
Print this help and exit.
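
The expression format accepted by -A can be illustrated with a small Python sketch. It is not part of seq++: the function name parse_alphabet_expression is hypothetical, and the sketch assumes that patterns are concatenated without separators and all have the width given before the colon, as the built-in expressions 1:AGCT and 1:IVLFCMAGTWSYPHEQDNKR suggest. The last line derives the default -b hyperparameter (1./alphabet_size) for the DNA alphabet.

```python
def parse_alphabet_expression(expr):
    """Split an expression such as '1:AGCT' into its list of patterns.

    Assumption (not confirmed by the man page): patterns are concatenated
    with no separator and all have the width given before the colon.
    """
    width_text, sep, body = expr.partition(":")
    if sep != ":" or len(width_text) != 1 or width_text not in "123456789":
        raise ValueError("expected WIDTH:PATTERNS with WIDTH in 1..9, got %r" % expr)
    width = int(width_text)
    if not body or len(body) % width != 0:
        raise ValueError("pattern list length is not a multiple of the width")
    # Cut the pattern list into fixed-width chunks.
    patterns = [body[i:i + width] for i in range(0, len(body), width)]
    if len(set(patterns)) != len(patterns):
        raise ValueError("duplicate pattern in alphabet")
    return patterns


dna = parse_alphabet_expression("1:AGCT")
protein = parse_alphabet_expression("1:IVLFCMAGTWSYPHEQDNKR")
print(dna)             # ['A', 'G', 'C', 'T']
print(len(protein))    # 20
print(1.0 / len(dna))  # 0.25, the default -b value for the DNA alphabet
```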

EXAMPLES

Estimate a parsimonious Markov model of order 5 on the list of sequence files contained in file seq.list. The sequences contain tokens of an alphabet described in file sample.alpha. Generate the estimated model in file model.desc.

estim_pm seq.list -d 5 -a sample.alpha -o model.desc

Estimate a parsimonious Markov model of order 3 on the sequences contained in seq.faa. The sequences contain tokens of the amino acid alphabet. prot.part is the partition file (see next section). Generate the XML description of the estimated model in file model.xml.

estim_pm seq.faa -d 3 --partition prot.part --protein --oxml model.xml

PARTITION

A partition of an alphabet is a set of subsets of tokens, i.e. a division of the alphabet into disjoint subsets. There are three possibilities, the last two of which require a configuration file passed with the --partition option:


    * to compute the overall set of possible partitions (automatically generated) given the alphabet (default setting).


    * to compute the overall set of possible partitions (automatically generated) given a synonymous pseudo-alphabet: by declaring synonymous tokens, it is possible to group tokens as a single predictor so that the number of partitions is lower. In this case, a configuration file starting with the keyword "#Synonymous", followed by the lists of synonymous tokens, is required.

Example:

#Synonymous

a t

g c


    * to input a selected set of partitions. In this case, the configuration file starts with "#Partition" on the first line, and each following line represents one partition as a list of token subsets delimited by "|", each subset being composed of tokens of the alphabet separated by spaces.

Example (DNA alphabet):

#Partition

a | g | c | t

a g | c | t

a c t | g
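
A "#Partition" file can be checked before it is passed to estim_pm. The Python sketch below is not part of seq++; the function name parse_partition_file is hypothetical, and skipping blank lines is an assumption about what the real parser tolerates. It verifies that every line splits the whole alphabet into disjoint, non-empty subsets, using the DNA example above.

```python
def parse_partition_file(text, alphabet):
    """Return the partitions listed after a '#Partition' header.

    Each partition is a list of frozensets of tokens. Raises ValueError
    when a line does not split the whole alphabet into disjoint,
    non-empty subsets. Blank lines are skipped (an assumption).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "#Partition":
        raise ValueError("first line must be '#Partition'")
    expected = set(alphabet)
    partitions = []
    for line in lines[1:]:
        parts = [part.split() for part in line.split("|")]
        tokens = [tok for part in parts for tok in part]
        if any(not part for part in parts):
            raise ValueError("empty subset in %r" % line)
        if len(tokens) != len(set(tokens)):
            raise ValueError("token repeated in %r" % line)
        if set(tokens) != expected:
            raise ValueError("%r does not cover the alphabet" % line)
        partitions.append([frozenset(part) for part in parts])
    return partitions


example = """#Partition
a | g | c | t
a g | c | t
a c t | g
"""
partitions = parse_partition_file(example, "agct")
print(len(partitions))               # 3
print([len(p) for p in partitions])  # [4, 3, 2]
```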

Example 2 (protein alphabet):
#Synonymous

A G

V L I

M

P F

W Y

D E

K R H

N Q C

S T

For large alphabets or high orders, the set of possible partitions should be restricted to limit computation time.
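
To give an idea of the sizes involved, the number of partitions of a set of n elements is the Bell number B(n). The Python sketch below (an illustration, not seq++ code; the helper name bell_number is hypothetical) computes it with the Bell triangle for the 4-token DNA alphabet, for the 9 synonymous groups of the protein example above, and for the full 20-token protein alphabet. How estim_pm enumerates partitions internally, and how the cost grows with the order, is not described here.

```python
def bell_number(n):
    """Number of ways to partition a set of n elements (Bell triangle)."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above to the one on its left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


print(bell_number(4))   # 15 partitions of the DNA alphabet
print(bell_number(9))   # 21147 partitions of 9 synonymous groups
print(bell_number(20))  # 51724158235372 partitions of 20 amino acids
```

Declaring synonymous tokens therefore reduces the search space of the protein example by roughly nine orders of magnitude.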

AUTHORS

estim_pm is part of the seq++ package, developed by Vincent Miele <miele@genopole.cnrs.fr>, David Robelin <robelin@genopole.cnrs.fr>, Pierre-Yves Bourguignon <bourguignon@genopole.cnrs.fr>, Gregory Nuel <nuel@genopole.cnrs.fr> and Hugues Richard <richard@genopole.cnrs.fr>.

SEE ALSO

estim_m(1), estim_mtd(1), estim_vlm(1), simul_m(1), dist_m(1)

More information on seq++ is available at <http://stat.genopole.cnrs.fr/seqpp>.