illumina2srf

Language: en

Version: 332771 (ubuntu - 24/10/10)

Section: 1 (User commands)

NAME

illumina2srf - Builds an SRF file from an Illumina/Solexa GA run folder.

SYNOPSIS

illumina2srf [options] tile_seq_file ...

DESCRIPTION

illumina2srf converts the Illumina GA-pipeline run folder output into an SRF file. It should be run from the Bustard<version><date> directory. It has a wealth of options, listed below, although many have defaults and may be omitted if the run folder follows the standard directory layout. The arguments, after the options, should be the filenames of the sequence files, e.g. s_8_*_seq.txt. All other filenames are derived from the _seq.txt filenames.

The main structure of an SRF file is as a container, much like zip or tar. The contents however may be split into variable and common components allowing for better compression. For illumina2srf that means that we store trace data in ZTR format with common ZTR chunks (text identifiers such as base-caller name and version, matrix files and compression specifications) in an SRF Data Block Header and variable components (sequence, quality and traces) in ZTR chunks held within an SRF Data Block. Typically we have 10,000 Data Blocks per Data Block Header.

The main decision in producing the SRF file is what data to put in it. By default the program writes the sequence and probability values along with the "processed" trace intensities. In GAPipeline v1.0 and earlier these are in the _seq.txt, _prb.txt and _sig2.txt files held within the main Bustard directory. In addition, the -r option requests storage of the "raw" trace intensities, comprising both the pre-processed intensities and noise estimates from the Firecrest _int.txt and _nse.txt files respectively. To store only raw intensities, skipping processed data, specify the -r -P options. Finally, the -I option can be used to store data from IPAR format files.

Confidence values have varied considerably across pipeline releases. In GAPipeline 1.0 and earlier the _prb.txt files in the Bustard directory contain four quality values per base encoded using a log-odds system: 10*log10(P/(1-P)). In addition there are various calibrated formats in the GERALD directory with one Phred scale value per base. See the -qf, -qr and -qc parameters.
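As a worked illustration of the log-odds encoding, the sketch below converts a log-odds score to the Phred scale. It assumes that the logarithm is base 10, that P is the probability that the base call is correct, and that the Phred value is -10*log10(1-P); eliminating P then gives 10*log10(1 + 10^(score/10)). The function name logodds_to_phred is made up for this example and is not part of illumina2srf.

```shell
# Convert a log-odds quality score to a Phred-scale score.
# Assumption: score = 10*log10(P/(1-P)), P = probability the call is correct.
# awk's log() and exp() are natural-log based, hence the divisions by log(10).
logodds_to_phred() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", 10 * log(1 + exp(q * log(10) / 10)) / log(10) }'
}

logodds_to_phred 0     # equal odds (P = 0.5), prints 3.01
logodds_to_phred 10    # prints 10.41
logodds_to_phred 40    # prints 40.00
```

The two scales diverge for low scores and are practically identical for high ones.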

There are a number of smaller ancillary data files that get stored too. As there is no per-lane or per-run storage mechanism in SRF, these are added to every SRF Data Block Header, of which there may be several per tile. However, the overhead in duplicating this data is not significant given the size of the individual SRF Data Blocks. The ancillary data files stored are .params files (for both Bustard and Firecrest), matrices (specified using -mf and -mr) and phasing XML files (-pf and -pr).

OPTIONS

Trace data-source options

-r, -R
Specifies whether to store (-r) or not to store (-R, the default) "raw" data. This currently comprises the contents of the _int.txt and _nse.txt files in the Firecrest directory.
-p, -P
Specifies whether to store (-p, the default) or not to store (-P) the "processed" data. This is the contents of the _sig2.txt files in the Bustard directory.
-u
Deprecated. Older GAPipeline releases created _sig.txt files holding semi-processed data with compensation for the dye spectral overlap, but before phasing correction steps. The -u argument indicates that the processed data should be taken from these files instead of _sig2.txt.
-I
Reads IPAR files instead of the raw trace data files. These are a different format used by the incremental processing software when the pipeline is run on the instrument control PC itself.

Quality value data-source options

-qf filename
Specifies the filename of the calibrated quality values for the forward read, or for the forward and reverse reads combined if appropriate. filename should be in Illumina's fastq derivative format, with quality values stored as ASCII 64 plus the log-odds score.
-qr filename
If the calibrated fastq files are split into forward and reverse files then filename specifies the reverse sequences. Otherwise we assume they are tacked onto the end of the forward sequences specified in -qf. Like the former file, this should be in Illumina's fastq-like format.
-qc directory
This is an alternative to the -qf and -qr options above and is mutually exclusive with them. This specifies that the calibrated data should come from files named "directory/s_%d_qcal.txt" where "%d" is replaced by the current tile number.
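The "ASCII 64 plus the log-odds score" encoding mentioned under -qf can be illustrated with a small decoder. This is only a sketch of the arithmetic, not code from illumina2srf; the function name decode_qual is hypothetical.

```shell
# Decode one quality character from an Illumina fastq-like file:
# score = ASCII code of the character - 64.
decode_qual() {
    # A leading quote makes printf output the character's numeric code (POSIX).
    code=$(printf '%d' "'$1")
    echo $((code - 64))
}

decode_qual h    # ASCII 104, prints 40
decode_qual @    # ASCII 64, prints 0
```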

Filtering options

-c value
Only store traces that have a "chastity" score >= value. This is mutually exclusive with the -C option.
-C value
Unlike the -c option, traces with a "chastity" score < value are still stored in the SRF file but are marked as bad reads instead. srf2fasta and srf2fastq have options to subsequently filter out bad reads using this flag. This is mutually exclusive with the -c option.
-s N
This skips the first N cycles of a trace (including signal, sequence and quality values) when writing it to an SRF file. The purpose of this is to remove primer bases, but it is not recommended. Instead, the SRF file should use the ZTR region chunk (REGN) to indicate which portion of a trace is valid.
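The difference between the -c and -C options can be sketched on a toy list of reads. The input below (one "name chastity" pair per line), the function names and the "ok"/"bad" labels are all invented for illustration; illumina2srf itself reads the run folder files and records the bad-read flag inside the SRF file.

```shell
# Hypothetical input: one "read_name chastity" pair per line.
reads='r1 0.95
r2 0.40
r3 0.60'

# Analogue of -c: reads scoring below the threshold are dropped entirely.
drop_below() {
    printf '%s\n' "$reads" | awk -v min="$1" '$2 >= min { print $1 }'
}

# Analogue of -C: every read is kept, but low scorers are flagged as bad.
flag_below() {
    printf '%s\n' "$reads" | awk -v min="$1" '{ print $1, ($2 >= min ? "ok" : "bad") }'
}

drop_below 0.6    # prints r1 and r3
flag_below 0.6    # prints "r1 ok", "r2 bad", "r3 ok"
```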

Read naming

Read names are split into two halves, a prefix and a suffix. One common prefix is stored in each and every SRF Data Block Header while the suffix is stored in every Data Block. This combination allows for removal of repetitive data in order to shrink the SRF file size.

-n format
Controls the format used for creating the sequence name suffix. This uses a printf style system of percent expansions that will be replaced with the appropriate data. The percent expansions are:
%%
A literal percent character
%d
Run date (taken from parsing the current working directory)
%m
Machine name (taken from parsing the current working directory)
%r
Run number (taken from parsing the current working directory)
%l
Lane number (%L for hexadecimal encoding)
%t
Tile number (%T for hexadecimal encoding)
%x
X coordinate (%X for hexadecimal encoding)
%y
Y coordinate (%Y for hexadecimal encoding)
%c
Counter; increments by 1 for every sequence in the tile (%C for hexadecimal encoding).

All the above format strings accept an optional numerical value between the percent and the format character. This is used to control the field width. For example, to print the X and Y coordinates to 3 hexadecimal places we could use -n "%3X:%3Y".

The default format is "%x:%y".

-N format
Specifies the format string for encoding the read name prefix. It follows the same formatting rules described for -n above.
The default format is "%m_%r:%l:%t:".
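To make the prefix and suffix split concrete, the sketch below assembles a full read name from the two default formats using the shell's own printf. The machine name, run number, lane, tile and coordinates are hypothetical values. The shell's printf does not understand %m, %r, %l and so on, so %s stands in for them here, and the exact hexadecimal digits illumina2srf emits (letter case, padding) may differ from the shell's.

```shell
# Hypothetical values; illumina2srf parses the first two from the run folder name.
machine=IL4 run=855 lane=4 tile=12 x=123 y=456

# Default -N prefix "%m_%r:%l:%t:", stored once per SRF Data Block Header.
prefix=$(printf '%s_%s:%s:%s:' "$machine" "$run" "$lane" "$tile")

# Default -n suffix "%x:%y", stored in every Data Block.
suffix=$(printf '%s:%s' "$x" "$y")
echo "$prefix$suffix"        # prints IL4_855:4:12:123:456

# Shell analogue of -n "%X:%Y": the same coordinates in hexadecimal.
hex_suffix=$(printf '%X:%X' "$x" "$y")
echo "$prefix$hex_suffix"    # prints IL4_855:4:12:7B:1C8
```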

Ancillary data files

These options govern the extra files stored per tile (or strictly speaking per SRF Data Block Header).

-2 cycle
This specifies the cycle number, counting from 1, of the second read forming a read-pair. It is used for automatic generation of filenames in several of the options below and also for construction of the ZTR region (REGN) chunks.
-mf filename
The filename of the forward matrix file. If a single printf numerical percent rule is used (such as "%d") then it will be replaced by the lane number. When not specified the default filename will be ../Matrix/s_%d_02_matrix.txt.
-mr filename
The filename of the reverse matrix file - only used on paired end runs. If a single printf numerical percent rule is used (such as "%d") then it will be replaced by the lane number. If a second printf percent rule is used then it will be replaced with the cycle number that the paired read starts on. This is equivalent to the cycle number specified in the -2 option plus one. (The plus one comes from using the second cycle per end for matrix calibration.) When -mr is not specified the default filename will be ../Matrix/s_%d_%02d_matrix.txt.
-pf filename
Specifies the filename of the forward-read phasing XML file. As with -mf a printf numerical percent rule will be replaced by the lane number. The default filename format is Phasing/s_%d_01_phasing.xml.
-pr filename
Specifies the filename of the reverse-read phasing XML file. As with -mr the first two printf numerical percent rules will be replaced by the lane number and the cycle number. Unlike -mr, though, the cycle number is the value given to the -2 option as-is instead of plus one. The default filename format is Phasing/s_%d_%02d_phasing.xml.
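The default filename templates above are ordinary printf formats, so their expansion can be previewed in the shell. The sketch below assumes lane 4 and a second read starting at cycle 37 (that is, -2 37); both values and the function names are hypothetical.

```shell
# Forward read: only the lane number is substituted.
forward_matrix()  { printf '../Matrix/s_%d_02_matrix.txt\n' "$1"; }
forward_phasing() { printf 'Phasing/s_%d_01_phasing.xml\n' "$1"; }

# Reverse read: the matrix file uses the -2 cycle plus one,
# the phasing file uses the -2 cycle as-is.
reverse_matrix()  { printf '../Matrix/s_%d_%02d_matrix.txt\n' "$1" $(( $2 + 1 )); }
reverse_phasing() { printf 'Phasing/s_%d_%02d_phasing.xml\n' "$1" "$2"; }

lane=4          # hypothetical lane number
read2_cycle=37  # hypothetical value passed to -2

forward_matrix  "$lane"                   # ../Matrix/s_4_02_matrix.txt
forward_phasing "$lane"                   # Phasing/s_4_01_phasing.xml
reverse_matrix  "$lane" "$read2_cycle"    # ../Matrix/s_4_38_matrix.txt
reverse_phasing "$lane" "$read2_cycle"    # Phasing/s_4_37_phasing.xml
```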

Other options

-o srf_filename
Specifies the output filename to write the SRF data to. Defaults to "traces.srf".
-i
Indicates that an index should be appended to the SRF file. This allows for random access based on the sequence name.
-d
Enable dots-mode. This outputs a full-stop per input tile. Most useful in conjunction with quiet mode. Default is off.
-q
Quiet mode. Do not output commentary on which tile is being processed and the metrics about it. Default off.

EXAMPLES

To store lane 4 from a paired-end run with raw traces, no processed data and calibrated confidence values:

     # From Bustard directory
     illumina2srf -o all.srf -r -P \
            -qf GERALD*/s_4_1_sequence.txt \
            -qr GERALD*/s_4_2_sequence.txt \
            s_4_*_seq.txt
 

To store and index only processed traces with chastity >= 0.6:

     illumina2srf -o s4.srf -i -c 0.6 s_4_*_seq.txt
 

CAVEATS

There are many mutually exclusive options, some of which may be for processing file formats that no longer exist. This is due to the history of the program and the rapidly changing nature of the files being processed. Some future culling of options and file formats can be expected.

Some assumptions are made as to the directory layout and the ability to parse the run folder directory name. There are currently no ways to override some of this information, including run date, run number and GAPipeline program version numbers.

AUTHOR

James Bonfield, Wellcome Trust Sanger Institute