Input and Output formats

MetaTISA Input Format

MetaTISA Output Format

MetaTISA Input Format

CDS Annotation Format

Two input formats are accepted for CDS annotation. The first is the output of MetaGene prediction. The second is our own MED format: a sequence fragment id on a line beginning with ">", followed by the coordinates of all CDSs in that fragment, one CDS per line. The coordinates of a CDS define its nucleotide region; translation of the CDS to an amino acid sequence starts from the first position for the positive strand and from the second position for the negative strand. To help users post-process results from other gene prediction tools, we also provide several tools that convert the output of a gene finder to the MED format (see converting formats).

Example (MED format):

>NC_000913_1
	   2     109   -
	 124     699   -
>NC_000913_2
	   3     305   +
	 539     700   +
>NC_000913_3
	   1     411   -
	 442     699   -
>NC_000913_4
	   2     685   -
>NC_000913_5
	   1     699   +
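
The MED layout shown above can be read with a few lines of code. The following is a minimal sketch, not part of MetaTISA: the function name parse_med and the returned data structure are hypothetical, and the sketch assumes that each CDS line holds exactly three whitespace-separated fields (two coordinates and a strand).

```python
from typing import Dict, List, Tuple

# (first coordinate, second coordinate, strand); hypothetical helper type
CDS = Tuple[int, int, str]


def parse_med(text: str) -> Dict[str, List[CDS]]:
    """Parse MED-format text into {fragment_id: [(first, second, strand), ...]}."""
    annotations: Dict[str, List[CDS]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank or whitespace-only lines
        if line.startswith(">"):
            current = line[1:].strip()
            annotations.setdefault(current, [])
            continue
        if current is None:
            raise ValueError("CDS line found before any '>' fragment id")
        first, second, strand = line.split()
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand symbol: {strand!r}")
        annotations[current].append((int(first), int(second), strand))
    return annotations


example = """>NC_000913_1
   2     109   -
 124     699   -
>NC_000913_2
   3     305   +
 539     700   +
"""
print(parse_med(example))
```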
				   

Sequence Format

The metagenome sequence is in FASTA format. The first line of each record (the line beginning with ">") is recognized as the unique id of the sequence fragment.

Example:

>mgutLn1_U_BL_aaa09a05_b1 Mouse Gut Community PT3 : mgutLn1_U_BL_aaa09a05_b1
AAATCTCGCCCTGTGGTGGATTCCTTTTCCCATTGCCCGATCTTATTTTT
ATCTTCCAAAAATGACGAACTGGACAAGATCCTGGGTCTTTCGGTGGGCG
GGGATGATTACGTGGCAAAGCCGTTCAGCCCGAAGGAGATCGCGTATCGG
GTCAAGGCGCAGCTCCGGCGGGCCGCGTATCAGCAAGACCCGTCGGAGGA
GGAGCTCATAAAAACAGGGGAATTGGAAATTGACGTGGAGGGCTGCAGGG
TCACAAAAGGCGGCAGCCCCATAGAACTGACCGCGCGGGAATTTGAAATC
CTGCGGTATCTGGCGGAAAATCAAGGCCGGGTCATCAGCCGCGAACGCTT
ATATGAAACCATCTGGGGCGAGGACAGCTTCGGGTGCGACAATACGGTCA
TGGTGCATATCCGGCATCTGCGTGAAAAAATAGAGGACGATCCCGCGGCG
CCCCGATACATCATCACGATGAAAGGATTAGGCTATAAGCTGGTGGACCC
TTATGAAGAATAAAAGCGATCTCAATCTGTTTTTTCGTTCGTTCGGCATT
GTCGTGATTGTGATCTTCGCGGCCATTGCAGCGGGGATATGCCTGTTTTA
TTATGTGTTCGCGATTCCGGCGCGGGAGGGACTCAGCCTGGCCTCATGGC
CAGACGTGTATACAGACAATTTTTCCCTTCAGCTTGAAGAAGAACAGGGA
GAGCTTAAAGTAAAAGAATTCGGGATTGAAGATCTGGACCGGTATGGCTT
ATGGCTGCAGGTGATCGATGAAACGGGACAGGAGTTTTTTCACACAATAA
GCCGGAGACCTGTCCCAACAGCTATACGGCCTCGAGCTTTTGGCATTCGG
GTACGAACGTTTA 
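
As an illustration of how the sequence file and the CDS coordinates fit together, the sketch below reads FASTA records and cuts out the nucleotide region of one CDS. It is not MetaTISA code: the names read_fasta and cds_sequence are hypothetical, the whole header line after ">" is kept as the id, and the coordinates are assumed to be 1-based and inclusive. For a CDS on the negative strand the region is reverse-complemented, so the returned string begins at the second coordinate.

```python
from typing import Iterable, Iterator, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is the text after '>'."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cds_sequence(seq: str, first: int, second: int, strand: str) -> str:
    """Return the CDS region read in its coding direction.

    Assumes 1-based, inclusive coordinates (an assumption of this sketch).
    """
    region = seq[first - 1:second]
    if strand == "-":
        # reverse complement: reading now starts at the second coordinate
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

With an open file handle f, list(read_fasta(f)) gives all fragments, and cds_sequence can then be applied to each CDS of a fragment.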
			



MetaTISA Output Format

MED Format

MED format gives a sequence id and the corresponding CDS annotation, as described above (see the MED format example in the input section).

GFF Format

The output in GFF (General Feature Format) follows the specification of the Sanger Institute, with one feature per line:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]  

Example:

##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_45
NC_000913_45   MetaTISA        CDS     3       395     .       -       .
##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255  MetaTISA        CDS     2       298     .       -       .
NC_000913_255  MetaTISA        CDS     320     616     .       -       .
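
For downstream processing, output like the example above can be loaded into simple records. This is a minimal sketch under stated assumptions: parse_gff is a hypothetical helper, lines beginning with "#" are treated as comments, and columns are split on any whitespace because the example does not show whether tabs or spaces are used.

```python
from typing import Dict, List

GFF_COLUMNS = ["seqname", "source", "feature", "start",
               "end", "score", "strand", "frame"]


def parse_gff(text: str) -> List[Dict[str, object]]:
    """Parse GFF feature lines into dictionaries keyed by column name."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '##' header/comment lines
        parts = line.split(None, 8)  # at most 8 mandatory columns + the rest
        if len(parts) < 8:
            raise ValueError(f"expected at least 8 columns: {raw!r}")
        record: Dict[str, object] = dict(zip(GFF_COLUMNS, parts[:8]))
        record["start"] = int(parts[3])
        record["end"] = int(parts[4])
        record["attributes"] = parts[8] if len(parts) > 8 else ""
        records.append(record)
    return records


example = """##gff-version  3
##MetaTISA
##metagenome sequence name: NC_000913_255
NC_000913_255  MetaTISA        CDS     2       298     .       -       .
NC_000913_255  MetaTISA        CDS     320     616     .       -       .
"""
for rec in parse_gff(example):
    print(rec["seqname"], rec["start"], rec["end"], rec["strand"])
```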



Configuration parameters



Sequence region around start codon

These parameters specify the region around the start codon that is used to calculate the position weight matrix:

upstream
Number of nucleotides upstream of the start codon
Default: 50 nucleotides
downstream
Number of nucleotides downstream of the start codon
Default: 15 nucleotides
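
To make the two parameters concrete, the sketch below cuts such a window out of a fragment. It is only an illustration: start_window is a hypothetical name, the start position is taken as the 0-based index of the first base of the start codon, the codon itself is included in the window, and candidates too close to a fragment end are skipped. None of these details is specified here, so treat them as assumptions.

```python
from typing import Optional


def start_window(seq: str, start: int,
                 upstream: int = 50, downstream: int = 15) -> Optional[str]:
    """Return upstream bases + start codon + downstream bases.

    `start` is the 0-based index of the first base of the start codon.
    Returns None when the window would run past either end of the fragment.
    """
    lo = start - upstream
    hi = start + 3 + downstream
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


fragment = "A" * 50 + "ATG" + "C" * 15
window = start_window(fragment, 50)
print(len(window))  # 50 upstream + 3 codon + 15 downstream
```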



Max order of Markov model

When training the parameters, TriTISA uses a cascade of Markov models from lower to higher order. By default, three models are cascaded, namely the 0th, 1st and 2nd order. This option sets the maximum order of the Markov model.

Default: 2
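
The cost of raising the maximum order can be seen from a simple count: a Markov model of order k over the four nucleotides has 4^(k+1) conditional probabilities at each position (4^k preceding contexts times 4 possible bases). The sketch below only enumerates the orders and their sizes; the helper names are hypothetical, and it does not reproduce how the models are combined in the cascade, which is not described here.

```python
from typing import List


def cascade_orders(max_order: int = 2) -> List[int]:
    """Orders used when models are cascaded from 0 up to max_order."""
    return list(range(max_order + 1))


def parameters_per_position(order: int) -> int:
    """Number of conditional probabilities P(base | previous `order` bases)."""
    return 4 ** (order + 1)


for k in cascade_orders(2):
    print(f"order {k}: {parameters_per_position(k)} probabilities per position")
```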




Minimal # of CDS in training

Since our method begins with a 0th order Markov model, the number of parameters to be estimated is as small as 4 at each nucleotide position, plus 3 prior probabilities. However, if the training set of a binned group is too small, we suggest using parameters pre-trained on sequenced genomes for this group. Tests on simulated data from the EcoGene database show that 200 samples are sufficient for a good estimation of the parameters. This is the default value used by MetaTISA to decide whether to self-train the parameters or to use pre-trained ones.

Default: 200
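
The decision described above amounts to a per-group threshold test. The following sketch illustrates it with hypothetical names (choose_parameter_source, bin labels); it assumes that a group with exactly the minimum number of CDSs is still self-trained, which is not stated explicitly.

```python
from typing import Dict


def choose_parameter_source(cds_per_bin: Dict[str, int],
                            min_cds: int = 200) -> Dict[str, str]:
    """Map each binned group to 'self-trained' or 'pre-trained'."""
    return {
        name: "self-trained" if count >= min_cds else "pre-trained"
        for name, count in cds_per_bin.items()
    }


print(choose_parameter_source({"bin_a": 350, "bin_b": 42}))
```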




Binning Method

At this stage, we have implemented the k-mer method for binning. Tests on simulated data showed that the accuracy is slightly higher for k > 9, but a smaller k greatly reduces memory use, computation time and download time.

Default: 9-mer
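
The memory trade-off follows from the number of possible k-mers, which is 4^k (262144 for k = 9, four times more for each extra base). The sketch below shows only the counting step that any k-mer method starts from; kmer_counts is a hypothetical helper, and how MetaTISA turns such counts into a bin assignment is not covered. Skipping k-mers that contain characters other than A, C, G and T is an assumption of this sketch.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def kmer_counts(seq: str, k: int = 9) -> Counter:
    """Count overlapping k-mers in a sequence, ignoring ambiguous bases."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= NUCLEOTIDES:
            counts[kmer] += 1
    return counts


print(4 ** 9)                      # number of possible 9-mers
print(kmer_counts("ACGTACGT", k=4))
```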




Start and Stop Codon

  • Start codons: ATG, CTG, GTG, TTG;
  • Stop codons: TAA, TGA and TAG;
  • TGA is not treated as a stop codon in: Mycoplasma, Acholeplasma, Aster, Onion and Ureaplasma.
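
The codon rules listed above can be written down as a small lookup. This is a sketch with hypothetical names; in particular, matching a group by the exact name string is an assumption, since the way MetaTISA identifies these groups is not described here.

```python
START_CODONS = frozenset({"ATG", "CTG", "GTG", "TTG"})
STOP_CODONS = frozenset({"TAA", "TGA", "TAG"})
# Groups listed above in which TGA is not a stop codon
TGA_NOT_STOP = frozenset({"Mycoplasma", "Acholeplasma", "Aster",
                          "Onion", "Ureaplasma"})


def is_start_codon(codon: str) -> bool:
    """True if the codon is one of the four accepted start codons."""
    return codon.upper() in START_CODONS


def stop_codons_for(group: str) -> frozenset:
    """Stop codons for a group, dropping TGA for the listed exceptions."""
    if group in TGA_NOT_STOP:
        return STOP_CODONS - {"TGA"}
    return STOP_CODONS


print(sorted(stop_codons_for("Mycoplasma")))
print(sorted(stop_codons_for("Escherichia")))
```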




     
     

    ©2008 MetaTISA