Algorithm Description

MetaTISA is designed to post-process the output of an existing metagenomic gene annotation pipeline, with the aim of improving the accuracy of its predicted translation initiation sites (TISs). It takes as input a set of shotgun fragments and a set of CDS annotations, and outputs refined initiation sites.

The method predicts TISs in two steps. First, it uses binning techniques to classify all input fragments. Fragments binned into the same clade are assumed to have closely related phylogenic origins, and hence to share a similar mechanism of translation initiation. Second, a modified version of the recently developed tool TriTISA predicts the TISs for each clade of fragments in an unsupervised, iterative manner.

  • Binning Method

Binning assigns an anonymous sequence fragment to a phylogenic group. Published binning methods include BLAST, k-mer (Sandberg et al., 2001), SOM (Abe et al., 2003), PhyloPythia (McHardy et al., 2007), and TETRA (Teeling et al., 2004), among others. At this stage we have implemented the k-mer method, and we plan to include other methods in the future.

The k-mer method is supervised and employs a naive Bayesian classifier for sequence binning. The training data are the k-mer frequencies compiled for each phylogenic clade. Given a sequence fragment, the method calculates a likelihood score for each clade from the pre-compiled frequencies and assigns the fragment to the clade with the highest score (Sandberg et al., 2001). We prepared the training sets from completely sequenced genomes, choosing one genome per genus to reduce redundancy (the genus list). Each genome represents a phylogenic clade, so fragments are binned at the genus level.
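The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the MetaTISA implementation: the function names (`kmer_counts`, `train_clade`, `bin_fragment`), the add-one pseudocount, the implicit uniform prior over clades, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(km for km in kmers if set(km) <= set(ALPHABET))

def train_clade(genome, k, pseudocount=1.0):
    """Return log-frequencies of all 4**k k-mers for one clade.

    The pseudocount is an assumed smoothing choice so that unseen k-mers
    do not produce log(0).
    """
    counts = kmer_counts(genome, k)
    total = sum(counts.values()) + pseudocount * len(ALPHABET) ** k
    return {
        "".join(p): math.log((counts["".join(p)] + pseudocount) / total)
        for p in product(ALPHABET, repeat=k)
    }

def bin_fragment(fragment, models, k):
    """Assign a fragment to the clade with the highest log-likelihood score."""
    frag = kmer_counts(fragment, k)
    scores = {
        clade: sum(n * logp[km] for km, n in frag.items())
        for clade, logp in models.items()
    }
    return max(scores, key=scores.get), scores

# Toy usage: two made-up "genomes" stand in for genus-level training sets.
models = {
    "genus_A": train_clade("ATATTATAATAT" * 50, k=3),
    "genus_B": train_clade("GCGCCGCGGCGC" * 50, k=3),
}
best, scores = bin_fragment("TATAATATTATA", models, k=3)
```

Working in log space keeps the product of many small k-mer probabilities from underflowing on long fragments.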

  • TIS Prediction

We have recently proposed an unsupervised method for TIS prediction in microbial genomes (Hu et al., 2009). The method classifies all TIS candidates into three categories: true TISs, false TISs upstream of the true TISs (in the noncoding region), and false TISs downstream of the true TISs (in the coding region). The sequence features around the TISs of each category are characterized by a non-homogeneous Markov model. The three models are trained by an iterative self-learning procedure. At each step, the Markov models are combined by a Bayesian methodology, which assigns three posterior probabilities to each candidate TIS: the probability that it is a true TIS (Pt), that it lies in a noncoding region (Pnc), and that it lies in a coding region (Pc). For each gene, TriTISA predicts the candidate with the highest Pt as the TIS. The updated annotation constitutes the training set for the next iteration. To further improve prediction accuracy, TriTISA employs a cascade of Markov models of different orders: it first uses a 0th-order Markov model for the initial refinements, and then moves to higher-order (1st and 2nd) Markov models in the later iterations. Tests on simulated data and experimentally verified data show that TriTISA produces more accurate and robust predictions than other state-of-the-art methods (Hu et al., 2009).
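The Bayesian combination step can be illustrated with a short Python sketch. It assumes that each of the three Markov models has already produced a log-likelihood for the sequence window around a candidate; the text does not give the exact form TriTISA uses, so plain Bayes' rule with category priors is shown here, and the names `posteriors` and `predict_tis` as well as all numbers are hypothetical.

```python
import math

CATEGORIES = ("true", "noncoding", "coding")

def posteriors(loglik, prior):
    """Combine per-category log-likelihoods and priors by Bayes' rule.

    Returns normalized probabilities keyed by category, corresponding
    to Pt ("true"), Pnc ("noncoding") and Pc ("coding").
    """
    joint = {c: loglik[c] + math.log(prior[c]) for c in CATEGORIES}
    m = max(joint.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(v - m) for v in joint.values())
    return {c: math.exp(joint[c] - m) / z for c in CATEGORIES}

def predict_tis(candidates, prior):
    """Pick the candidate position with the highest Pt.

    `candidates` maps a position to the three model log-likelihoods.
    """
    scored = {pos: posteriors(ll, prior) for pos, ll in candidates.items()}
    best = max(scored, key=lambda pos: scored[pos]["true"])
    return best, scored

# Made-up priors and log-likelihoods for three candidate starts of one gene.
prior = {"true": 0.2, "noncoding": 0.3, "coding": 0.5}
candidates = {
    12: {"true": -40.0, "noncoding": -38.0, "coding": -45.0},
    57: {"true": -35.0, "noncoding": -41.0, "coding": -39.0},
    90: {"true": -42.0, "noncoding": -44.0, "coding": -37.0},
}
best, scored = predict_tis(candidates, prior)
```

In an iterative scheme of the kind described, the positions returned by `predict_tis` would become the training annotation for the next round of model estimation.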

Here, we modified the TriTISA algorithm to post-process annotations for binned metagenomic fragments. CDSs from fragments in the same bin are assumed to share similar translation initiation machinery, and the sequence pattern for each set of TISs is assumed to be homogeneous across the clade. These assumptions allow the parameters to be trained in the same way as for a single genome. Before post-processing, each CDS is extended as far as possible toward its 5' end, and the CDSs that are complete at their 5' ends are used for parameter training. With the converged parameters, TriTISA calculates three scores for each candidate TIS: Pt, Pnc and Pc. For CDSs that are complete at their 5' ends, the start codon is predicted as the candidate with the highest Pt score. For CDSs that are incomplete at their 5' ends, we need to estimate whether the start codon is missing; in other words, whether the 5'-most start-codon-like triplet belongs to the coding region. We estimate the distribution of Pc from the training set, which readily yields a threshold for deciding whether a candidate lies in a coding region (at the 95% confidence level). For CDSs that are estimated to contain their start codons, we predict the TIS by the same procedure used for the training CDSs.
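One possible reading of the threshold rule for 5'-incomplete CDSs is sketched below in Python. The text does not specify how the threshold is derived from the Pc distribution, so this example assumes that the threshold is the 95th percentile of Pc among the predicted true TISs of the training CDSs, and that a 5'-most candidate exceeding it is treated as lying inside the coding region (start codon missing). The function names and all values are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling division in integer arithmetic
    return ordered[max(rank, 1) - 1]

def predict_incomplete_cds(candidates, pc_threshold):
    """Return the predicted TIS position, or None if the start is judged missing.

    `candidates` is ordered 5' to 3'; each item is (position, Pt, Pc).
    """
    if not candidates:
        return None
    _, _, pc_first = candidates[0]
    if pc_first > pc_threshold:
        # Assumed rule: the 5'-most candidate looks like coding sequence,
        # so the true start is taken to lie beyond the fragment boundary.
        return None
    # Otherwise proceed as for 5'-complete CDSs: take the highest Pt.
    return max(candidates, key=lambda c: c[1])[0]

# Made-up Pc scores of true TISs from the training CDSs.
training_pc_of_true_tis = [0.02] * 90 + [0.40] * 10
threshold = percentile(training_pc_of_true_tis, 95)

missing = predict_incomplete_cds([(3, 0.10, 0.85), (30, 0.70, 0.20)], threshold)
found = predict_incomplete_cds([(3, 0.20, 0.10), (30, 0.70, 0.20)], threshold)
```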

Figure 1 Program flow chart of MetaTISA

 

Performance Evaluation


Because experimentally verified TISs are lacking in metagenome projects, the only way to reliably evaluate prediction performance is to simulate a metagenome from artificial shotgun sequences of complete microbial genomes. The validity of the k-mer method and of the TriTISA method has been documented previously (Sandberg et al., 2001; Hu et al., 2009). Here we tested their combined effect on TIS prediction for metagenomes, using shotgun sequences simulated from 95 randomly selected genomes plus 5 genomes for which experimentally verified TISs are available (Hu et al., 2009). Two sets of simulations were created with different fragment lengths: L = 700 bps and L = 400 bps. We selected the genes with experimentally verified TISs from the five genomes as benchmarks. Since many of their start codons are absent from the fragments, we measure accuracy by sensitivity, sn = TP/(TP+FN), and specificity, sp = TN/(TN+FP), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. We demonstrate the performance of MetaTISA when used to process the output of the newest version of MetaGene, namely MetaGeneAnnotator or MGA (Noguchi et al., 2008) (Tables 1 and 2). Similar improvements are obtained for Neural Net (Hoff et al., 2008) (data not shown).
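For reference, the two accuracy measures can be computed as in the following Python sketch. The counts are made up for illustration and are not taken from the evaluation.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined when TP + FN = 0")
    return tp / (tp + fn)

def specificity(tn, fp):
    """sp = TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("specificity is undefined when TN + FP = 0")
    return tn / (tn + fp)

# Hypothetical confusion counts for one genome.
sn = sensitivity(tp=90, fn=10)
sp = specificity(tn=995, fp=5)
print(f"sn = {100 * sn:.2f}%, sp = {100 * sp:.2f}%")
```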
 

Table 1. Accuracies calculated according to the RefSeq whole genome annotation

                    Fragment length = 700 bps          Fragment length = 400 bps
Genome              #      MGA sn/sp    MTS sn/sp      #      MGA sn/sp    MTS sn/sp
A. pernix           1309   60.25/98.58  80.29/99.04    980    60.98/98.74  79.45/98.80
Synechocystis sp.   2196   78.08/99.34  78.14/99.18    1608   80.20/99.42  79.11/99.07
E. coli             3195   87.65/99.70  91.27/99.65    2455   86.97/99.71  90.74/99.48
N. pharaonis        1978   83.78/99.21  88.50/99.20    1502   83.55/99.28  87.74/98.91
H. salinarum        1626   79.64/99.34  88.45/99.46    1234   78.18/99.39  88.02/99.30
Weighted average    -      80.12/99.33  86.10/99.35    -      80.24/99.40  85.90/99.17
MGA: MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated over 5 simulation replicates.
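The text does not state how the "Weighted average" row is computed. The Python sketch below checks one plausible reading, namely weighting each genome's accuracy by its count in the "#" column, against the MGA values for 700 bps fragments in Table 1. Under that assumption the row is reproduced for the column checked.

```python
def weighted_average(values, weights):
    """Average of `values` weighted by `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Table 1, fragment length = 700 bps, MGA columns.
counts = [1309, 2196, 3195, 1978, 1626]
mga_sn = [60.25, 78.08, 87.65, 83.78, 79.64]
mga_sp = [98.58, 99.34, 99.70, 99.21, 99.34]

avg_sn = weighted_average(mga_sn, counts)
avg_sp = weighted_average(mga_sp, counts)
print(f"{avg_sn:.2f}/{avg_sp:.2f}")
```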

Figures for the other 95 genomes are included here (fragment length = 700, fragment length = 400), but the accuracies should be interpreted with caution because the RefSeq TIS annotation is not of high quality (Hu et al., 2008).

Table 2. Accuracies calculated according to experimentally verified TISs

                    Fragment length = 700 bps          Fragment length = 400 bps
Genome              #      MGA sn/sp    MTS sn/sp      #      MGA sn/sp    MTS sn/sp
A. pernix           103    64.42/98.71  94.52/99.61    78     60.54/98.70  92.29/99.25
Synechocystis sp.   92     83.53/99.25  81.33/98.93    75     82.20/99.29  82.56/98.85
E. coli             733    89.56/99.76  93.77/99.72    562    87.06/99.75  93.44/99.60
N. pharaonis        248    91.27/99.55  97.04/99.58    184    90.74/99.53  95.86/99.11
H. salinarum        428    85.72/99.54  96.04/99.71    337    82.43/99.51  94.65/99.48
Weighted average    -      86.84/99.57  94.21/99.64    -      84.37/99.56  93.40/99.43
MGA: MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated over 5 simulation replicates.

 

References
  1. Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res., 13:693-702.

  2. Hu, G.-Q., Zheng, X.-B., Ju, L.-N., Zhu, H. and She, Z.S. (2008) Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics, 9:160.

  3. Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1):123-125.

  4. Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9:217.

  5. McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. and Rigoutsos, I. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4:63-72.

  6. Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Research, 15:387-396.

  7. Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34:5623-5630.

  8. Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11:1404-1409.

  9. Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. and Glockner, F.O. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163.

 
 
 

©2008 MetaTISA