Bio::Tools::Run::Alignment::TCoffee.3pm

Langue: en

Version: 2009-11-04 (ubuntu - 24/10/10)

Section: 3 (Bibliothèques de fonctions)

NAME

Bio::Tools::Run::Alignment::TCoffee - Object for the calculation of a multiple sequence alignment from a set of unaligned sequences or alignments using the TCoffee program

SYNOPSIS

   # Build a tcoffee alignment factory
   @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
   $factory = Bio::Tools::Run::Alignment::TCoffee->new(@params);
 
   # Pass the factory a list of sequences to be aligned.
   $inputfilename = 't/cysprot.fa';
   # $aln is a SimpleAlign object.
   $aln = $factory->align($inputfilename);
 
   # or where @seq_array is an array of Bio::Seq objects
   $seq_array_ref = \@seq_array;
   $aln = $factory->align($seq_array_ref);
 
   # Or one can pass the factory a pair of (sub)alignments
   #to be aligned against each other, e.g.:
 
   # where $aln1 and $aln2 are Bio::SimpleAlign objects.
   $aln = $factory->profile_align($aln1,$aln2);
 
   # Or one can pass the factory an alignment and one or more
   # unaligned sequences to be added to the alignment. For example:
 
   # $seq is a Bio::Seq object.
   $aln = $factory->profile_align($aln1,$seq);
 
   #There are various additional options and input formats available.
   #See the DESCRIPTION section that follows for additional details.
 
 

DESCRIPTION

Note: this DESCRIPTION only documents the (Bio)perl interface to TCoffee.

Helping the module find your executable

You will need to enable TCoffee to find the t_coffee program. This can be done in (at least) three ways:
  1. Make sure the t_coffee executable is in your path so that
     which t_coffee returns a t_coffee executable on your system.
 
  2. Define an environmental variable TCOFFEEDIR which is a dir 
     which contains the 't_coffee' app:
     In bash 
     export TCOFFEEDIR=/home/username/progs/T-COFFEE_distribution_Version_1.37/bin
     In csh/tcsh
     setenv TCOFFEEDIR /home/username/progs/T-COFFEE_distribution_Version_1.37/bin
 
  3. Include a definition of an environmental variable TCOFFEEDIR in
     every script that will use this TCoffee wrapper module.
     BEGIN { $ENV{TCOFFEDIR} = '/home/username/progs/T-COFFEE_distribution_Version_1.37/bin' }
     use Bio::Tools::Run::Alignment::TCoffee;
 
 

If you are running an application on a webserver make sure the webserver environment has the proper PATH set or use the options 2 or 3 to set the variables.

PARAMETERS FOR ALIGNMENT COMPUTATION

There are a number of possible parameters one can pass in TCoffee. One should really read the online manual for the best explanation of all the features. See http://igs-server.cnrs-mrs.fr/~cnotred/Documentation/t_coffee/t_coffee_doc.html

These can be specified as parameters when instantiating a new TCoffee object, or through get/set methods of the same name (lowercase).

IN

  Title       : IN
  Description : (optional) input filename, this is specified when
                align so should not use this directly unless one
                understand TCoffee program very well.
 
 

TYPE

  Title       : TYPE
  Args        : [string] DNA, PROTEIN
  Description : (optional) set the sequence type, guessed automatically
                so should not use this directly
 
 

PARAMETERS

  Title       : PARAMETERS
  Description : (optional) Indicates a file containing extra parameters
 
 

EXTEND

  Title       : EXTEND
  Args        : 0, 1, or positive value
  Default     : 1
  Description : Flag indicating that library extension should be
                carried out when performing multiple alignments, if set
                to 0 then extension is not made, if set to 1 extension
                is made on all pairs in the library.  If extension is
                set to another positive value, the extension is only
                carried out on pairs having a weigth value superior to
                the specified limit.
 
 

DP_NORMALISE

  Title       : DP_NORMALISE
  Args        : 0 or positive value
  Default     : 1000
  Description : When using a value different from 0, this flag sets the
                score of the highest scoring pair to 1000.
 
 

DP_MODE

  Title       : DP_MODE
  Args        : [string] gotoh_pair_wise, myers_miller_pair_wise,
                fasta_pair_wise cfasta_pair_wise
  Default     : cfast_fair_wise
  Description : Indicates the type of dynamic programming used by
                the program
 
     gotoh_pair_wise : implementation of the gotoh algorithm
     (quadratic in memory and time)
 
     myers_miller_pair_wise : implementation of the Myers and Miller
     dynamic programming algorithm ( quadratic in time and linear in
     space). This algorithm is recommended for very long sequences. It
     is about 2 time slower than gotoh. It only accepts tg_mode=1.
 
     fasta_pair_wise: implementation of the fasta algorithm. The
     sequence is hashed, looking for ktuples words. Dynamic programming
     is only carried out on the ndiag best scoring diagonals. This is
     much faster but less accurate than the two previous.
 
     cfasta_pair_wise : c stands for checked. It is the same
     algorithm. The dynamic programming is made on the ndiag best
     diagonals, and then on the 2*ndiags, and so on until the scores
     converge. Complexity will depend on the level of divergence of the
     sequences, but will usually be L*log(L), with an accuracy
     comparable to the two first mode ( this was checked on BaliBase).
 
 

KTUPLE

  Title       : KTUPLE
  Args        : numeric value
  Default     : 1 or 2 (1 for protein, 2 for DNA )
 
  Description : Indicates the ktuple size for cfasta_pair_wise dp_mode
                and fasta_pair_wise. It is set to 1 for proteins, and 2
                for DNA. The alphabet used for protein is not the 20
                letter code, but a mildly degenerated version, where
                some residues are grouped under one letter, based on
                physicochemical properties:
                rk, de, qh, vilm, fy (the other residues are
                not degenerated).
 
 

NDIAGS

  Title       : NDIAGS
  Args        : numeric value
  Default     : 0
  Description : Indicates the number of diagonals used by the
                fasta_pair_wise algorithm. When set to 0,
                n_diag=Log (length of the smallest sequence)
 
 

DIAG_MODE

  Title       : DIAG_MODE
  Args        : numeric value
  Default     : 0
 
 
  Description : Indicates the manner in which diagonals are scored
               during the fasta hashing.
 
               0 indicates that the score of a diagonal is equal to the
               sum of the scores of the exact matches it contains.
 
 
               1 indicates that this score is set equal to the score of
               the best uninterrupted segment
 
               1 can be useful when dealing with fragments of sequences.
 
 

SIM_MATRIX

  Title       : SIM_MATRIX
  Args        : string
  Default     : vasiliky
  Description : Indicates the manner in which the amino acid is being
                degenerated when hashing. All the substitution matrix
                are acceptable. Categories will be defined as sub-group
                of residues all having a positive substitution score
                (they can overlap).
 
                If you wish to keep the non degenerated amino acid
                alphabet, use 'idmat'
 
 

MATRIX

  Title       : MATRIX
  Args        :
  Default     :
  Description : This flag is provided for compatibility with
                ClustalW. Setting matrix = 'blosum' is equivalent to
                -in=Xblosum62mt , -matrix=pam is equivalent to
                in=Xpam250mt . Apart from this, the rules are similar
                to those applying when declaring a matrix with the
                -in=X fl
 
 

GAPOPEN

  Title       : GAPOPEN
  Args        : numeric
  Default     : 0
  Description : Indicates the penalty applied for opening a gap. The
                penalty must be negative. If you provide a positive
                value, it will automatically be turned into a negative
                number. We recommend a value of 10 with pam matrices,
                and a value of 0 when a library is used.
 
 

GAPEXT

  Title       : GAPEXT
  Args        : numeric
  Default     : 0
  Description : Indicates the penalty applied for extending a gap.
 
 

COSMETIC_PENALTY

  Title       : COSMETIC_PENALTY
  Args        : numeric
  Default     : 100
  Description : Indicates the penalty applied for opening a gap. This
                penalty is set to a very low value. It will only have
                an influence on the portions of the alignment that are
                unalignable. It will not make them more correct, but
                only more pleasing to the eye ( i.e. Avoid stretches of
                lonely residues).
 
                The cosmetic penalty is automatically turned off if a
                substitution matrix is used rather than a library.
 
 

TG_MODE

  Title       : TG_MODE
  Args        : 0,1,2
  Default     : 1
  Description : (Terminal Gaps)
                0: indicates that terminal gaps must be panelized with
                   a gapopen and a gapext penalty.
                1: indicates that terminal gaps must be penalized only
                   with a gapext penalty
                2: indicates that terminal gaps must not be penalized.
 
 

WEIGHT

  Title       : WEIGHT
  Args        : sim or sim_<matrix_name or matrix_file> or integer value
  Default     : sim
 
 
  Description : Weight defines the way alignments are weighted when
                turned into a library.
 
                sim indicates that the weight equals the average
                    identity within the match residues.
 
                sim_matrix_name indicates the average identity with two
                    residues regarded as identical when their
                    substitution value is positive. The valid matrices
                    names are in matrices.h (pam250mt) . Matrices not
                    found in this header are considered to be
                    filenames. See the format section for matrices. For
                    instance, -weight=sim_pam250mt indicates that the
                    grouping used for similarity will be the set of
                    classes with positive substitutions. Other groups
                    include
 
                        sim_clustalw_col ( categories of clustalw
                        marked with :)
 
                        sim_clustalw_dot ( categories of clustalw
                        marked with .)
 
 
                Value indicates that all the pairs found in the
                alignments must be given the same weight equal to
                value. This is useful when the alignment one wishes to
                turn into a library must be given a pre-specified score
                (for instance if they come from a structure
                super-imposition program). Value is an integer:
 
                        -weight=1000
 
   Note       : Weight only affects methods that return an alignment to
                T-Coffee, such as ClustalW. On the contrary, the
                version of Lalign we use here returns a library where
                weights have already been applied and are therefore
                insensitive to the -weight flag.
 
 

SEQ_TO_ALIGN

  Title       : SEQ_TO_ALIGN
  Args        : filename
  Default     : no file - align all the sequences
 
  Description : You may not wish to align all the sequences brought in
                by the -in flag. Supplying the seq_to_align flag allows
                for this, the file is simply a list of names in Fasta
                format.
 
                However, note that library extension will be carried out
                on all the sequences.
 
 

PARAMETERS FOR TREE COMPUTATION AND OUTPUT

NEWTREE

  Title       : NEWTREE
  Args        : treefile
  Default     : no file
  Description : Indicates the name of the new tree to compute. The
                default will be <sequence_name>.dnd, or <run_name.dnd>.
                Format is Phylip/Newick tree format
 
 

USETREE

  Title       : USETREE
  Args        : treefile
  Default     : no file specified
  Description : This flag indicates that rather than computing a new
                dendrogram, t_coffee can use a pre-computed one. The
                tree files are in phylips format and compatible with
                ClustalW. In most cases, using a pre-computed tree will
                halve the computation time required by t_coffee. It is
                also possible to use trees output by ClustalW or
                Phylips. Format is Phylips tree format
 
 

TREE_MODE

  Title       : TREE_MODE
  Args        : slow, fast, very_fast
  Default     : very_fast
  Description : This flag indicates the method used for computing the
                dendrogram.
                slow : the chosen dp_mode using the extended library,
                fast : The fasta dp_mode using the extended library.
                very_fast: The fasta dp_mode using pam250mt.
 
 

QUICKTREE

  Title       : QUICKTREE
  Args        :
  Default     :
  Description : This flag is kept for compatibility with ClustalW.
                It indicates that:  -tree_mode=very_fast
 
 

PARAMETERS FOR ALIGNMENT OUTPUT

OUTFILE

  Title       : OUTFILE
  Args        : out_aln file, default, no
  Default     : default ( yourseqfile.aln)
  Description : indicates name of output alignment file
 
 

OUTPUT

  Title       : OUTPUT
  Args        : format1, format2
  Default     : clustalw
  Description : Indicated format for outputting outputfile
                Supported formats are:
 
                clustalw_aln, clustalw: ClustalW format.
                gcg, msf_aln : Msf alignment.
                pir_aln : pir alignment.
                fasta_aln : fasta alignment.
                phylip : Phylip format.
                pir_seq : pir sequences (no gap).
                fasta_seq : fasta sequences (no gap).
     As well as:
                 score_html : causes the output to be a reliability
                              plot in HTML
                 score_pdf : idem in PDF.
                 score_ps : idem in postscript.
 
     More than one format can be indicated:
                 -output=clustalw,gcg, score_html
 
 

CASE

  Title       : CASE
  Args        : upper, lower
  Default     : upper
  Description : triggers choice of the case for output
 
 

CPU

  Title       : CPU
  Args        : value
  Default     : 0
  Description : Indicates the cpu time (micro seconds) that must be
                added to the t_coffee computation time.
 
 

OUT_LIB

  Title       : OUT_LIB
  Args        : name of library, default, no
  Default     : default
  Description : Sets the name of the library output. Default implies
                <run_name>.tc_lib
 
 

OUTORDER

  Title       : OUTORDER
  Args        : input or aligned
  Default     : input
  Description : Sets the name of the library output. Default implies
                <run_name>.tc_lib
 
 

SEQNOS

  Title       : SEQNOS
  Args        : on or off
  Default     : off
  Description : Causes the output alignment to contain residue numbers
                at the end of each line:
 
 

PARAMETERS FOR GENERIC OUTPUT

RUN_NAME

  Title       : RUN_NAME
  Args        : your run name
  Default     :
  Description : This flag causes the prefix <your sequences> to be
                replaced by <your run name> when renaming the default
                files.
 
 

ALIGN

  Title       : ALIGN
  Args        :
  Default     :
  Description : Indicates that the program must produce the
                alignment. This flag is here for compatibility with
                ClustalW
 
 

QUIET

  Title       : QUIET
  Args        : stderr, stdout, or filename, or nothing
  Default     : stderr
  Description : Redirects the standard output to either a file.
               -quiet on its own redirect the output to /dev/null.
 
 

CONVERT

  Title       : CONVERT
  Args        :
  Default     :
  Description : Indicates that the program must not compute the
                alignment but simply convert all the sequences,
                alignments and libraries into the format indicated with
                -output. This flag can also be used if you simply want
                to compute a library ( i.e. You have an alignment and
                you want to turn it into a library).
 
 

FEEDBACK

Mailing Lists

User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
   bioperl-l@bioperl.org                  - General discussion
   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 
 

Support

Please direct usage questions or support issues to the mailing list:

bioperl-l@bioperl.org

rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible.

Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web:
  http://bugzilla.open-bio.org/
 
 

AUTHOR - Jason Stajich, Peter Schattner

Email jason-at-bioperl-dot-org, schattner@alum.mit.edu

APPENDIX

The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _

program_name

  Title   : program_name
  Usage   : $factory->program_name()
  Function: holds the program name
  Returns:  string
  Args    : None
 
 

program_dir

  Title   : program_dir
  Usage   : $factory->program_dir(@params)
  Function: returns the program directory, obtained from ENV variable.
  Returns:  string
  Args    :
 
 

error_string

  Title   : error_string
  Usage   : $obj->error_string($newval)
  Function: Where the output from the last analysus run is stored.
  Returns : value of error_string
  Args    : newvalue (optional)
 
 

version

  Title   : version
  Usage   : exit if $prog->version() < 1.8
  Function: Determine the version number of the program
  Example :
  Returns : float or undef
  Args    : none
 
 

run

  Title   : run
  Usage   : my $output = $application->run(-seq     => $seq,
                                           -profile => $profile,
                                           -type    => 'profile-aln');
  Function: Generic run of an application
  Returns : Bio::SimpleAlign object
  Args    : key-value parameters allowed for TCoffee runs AND
            -type     => profile-aln or alignment for profile alignments or
                         just multiple sequence alignment
            -seq      => either Bio::PrimarySeqI object OR
                         array ref of Bio::PrimarySeqI objects OR
                         filename of sequences to run with
            -profile  => profile to align to, if this is an array ref
                         will specify the first two entries as the two
                         profiles to align to each other
 
 

align

  Title   : align
  Usage   :
         $inputfilename = 't/data/cysprot.fa';
         $aln = $factory->align($inputfilename);
 or
         $seq_array_ref = \@seq_array; 
         # @seq_array is array of Seq objs
         $aln = $factory->align($seq_array_ref);
  Function: Perform a multiple sequence alignment
  Returns : Reference to a SimpleAlign object containing the
            sequence alignment.
  Args    : Name of a file containing a set of unaligned fasta sequences
            or else an array of references to Bio::Seq objects.
 
  Throws an exception if argument is not either a string (eg a
  filename) or a reference to an array of Bio::Seq objects.  If
  argument is string, throws exception if file corresponding to string
  name can not be found. If argument is Bio::Seq array, throws
  exception if less than two sequence objects are in array.
 
 

profile_align

  Title   : profile_align
  Usage   :
  Function: Perform an alignment of 2 (sub)alignments
  Example :
  Returns : Reference to a SimpleAlign object containing the (super)alignment.
  Args    : Names of 2 files containing the subalignments
            or references to 2 Bio::SimpleAlign objects.
  Note    : Needs to be updated to run with newer TCoffee code, which
            allows more than two profile alignments.
 
 

Throws an exception if arguments are not either strings (eg filenames) or references to SimpleAlign objects.

_run

  Title   :  _run
  Usage   :  Internal function, not to be called directly        
  Function:  makes actual system call to tcoffee program
  Example :
  Returns : nothing; tcoffee output is written to a
            temporary file OR specified output file
  Args    : Name of a file containing a set of unaligned fasta sequences
            and hash of parameters to be passed to tcoffee
 
 

_setinput

  Title   :  _setinput
  Usage   :  Internal function, not to be called directly        
  Function:  Create input file for tcoffee program
  Example :
  Returns : name of file containing tcoffee data input AND
            type of file (if known, S for sequence, L for sequence library,
            A for sequence alignment)
  Args    : Seq or Align object reference or input file name
 
 

_setparams

  Title   :  _setparams
  Usage   :  Internal function, not to be called directly        
  Function:  Create parameter inputs for tcoffee program
  Example :
  Returns : parameter string to be passed to tcoffee
            during align or profile_align
  Args    : name of calling object
 
 

aformat

  Title   : aformat
  Usage   : my $alignmentformat = $self->aformat();
  Function: Get/Set alignment format
  Returns : string
  Args    : string
 
 

methods

  Title   : methods
  Usage   : my @methods = $self->methods()
  Function: Get/Set Alignment methods - NOT VALIDATED
  Returns : array of strings
  Args    : arrayref of strings
 
 

Bio::Tools::Run::BaseWrapper methods

no_param_checks

  Title   : no_param_checks
  Usage   : $obj->no_param_checks($newval)
  Function: Boolean flag as to whether or not we should
            trust the sanity checks for parameter values  
  Returns : value of no_param_checks
  Args    : newvalue (optional)
 
 

save_tempfiles

  Title   : save_tempfiles
  Usage   : $obj->save_tempfiles($newval)
  Function: 
  Returns : value of save_tempfiles
  Args    : newvalue (optional)
 
 

outfile_name

  Title   : outfile_name
  Usage   : my $outfile = $tcoffee->outfile_name();
  Function: Get/Set the name of the output file for this run
            (if you wanted to do something special)
  Returns : string
  Args    : [optional] string to set value to
 
 

tempdir

  Title   : tempdir
  Usage   : my $tmpdir = $self->tempdir();
  Function: Retrieve a temporary directory name (which is created)
  Returns : string which is the name of the temporary directory
  Args    : none
 
 

cleanup

  Title   : cleanup
  Usage   : $tcoffee->cleanup();
  Function: Will cleanup the tempdir directory
  Returns : none
  Args    : none
 
 

io

  Title   : io
  Usage   : $obj->io($newval)
  Function:  Gets a L<Bio::Root::IO> object
  Returns : L<Bio::Root::IO>
  Args    : none