 |
| Algorithm Description |
MetaTISA is
designed to post-process the output of an existing metagenomic gene
annotation pipeline, with the aim of improving the accuracy of its
predicted translation initiation sites (TISs). It takes as input a set of
shotgun fragments and a set of CDS annotations, and outputs refined
initiation sites.
The method predicts TISs in two steps.
First, it uses binning techniques to classify all
input fragments. Fragments binned into the same clade are assumed to have
closely related phylogenetic origins, and hence to share a similar
mechanism of translation initiation. Second, a modified version of the
recently developed tool TriTISA predicts the TISs for each clade of
fragments in an unsupervised, iterative manner.
Binning assigns
an anonymous sequence fragment to a phylogenetic group.
Published binning methods include BLAST,
k-mer (Sandberg et al., 2001),
SOM (Abe et al., 2003), PhyloPythia (McHardy
et al., 2007), TETRA (Teeling et al., 2004), and others. At this stage we
have implemented the k-mer method, and we plan to include other methods
in the future.
The k-mer method is
supervised and employs a naive Bayesian classifier for sequence
binning. The training data are the k-mer frequencies compiled for
each phylogenetic clade. Given a sequence
fragment, the k-mer method calculates a likelihood score for each
clade based on the pre-compiled frequencies, and assigns the fragment to
the clade that produces the highest likelihood score (Sandberg
et al., 2001).
We prepared the training sets from completely sequenced genomes,
choosing one genome per genus to reduce redundancy (the genus
list). Each genome represents a
phylogenetic clade, so fragments are binned at the genus level.
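The binning step described above can be illustrated with a minimal sketch of a naive Bayesian k-mer classifier. It is not the MetaTISA implementation: the function names (`kmer_counts`, `train_clade`, `bin_fragment`), the additive pseudocount, and the toy sequences are assumptions made only for illustration.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT letters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def train_clade(genome, k, pseudocount=1.0):
    """Log-frequency of every possible k-mer for one clade (one genome per genus).

    The pseudocount is an assumption of this sketch; it keeps unseen k-mers
    from receiving zero probability.
    """
    counts = kmer_counts(genome, k)
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {km: math.log((counts[km] + pseudocount) / total) for km in all_kmers}


def bin_fragment(fragment, models, k):
    """Assign a fragment to the clade with the highest log-likelihood score."""
    frag_counts = kmer_counts(fragment, k)
    scores = {
        clade: sum(n * logp[km] for km, n in frag_counts.items())
        for clade, logp in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with two made-up "genomes" standing in for two genera.
models = {
    "at_rich": train_clade("ATATATTTAAAT" * 20, k=3),
    "gc_rich": train_clade("GCGCGGCCGCGC" * 20, k=3),
}
best_clade, clade_scores = bin_fragment("GCGGCCGCGCGC", models, k=3)
```

Working in log space avoids numerical underflow when the per-k-mer probabilities of a several-hundred-base fragment are multiplied together.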
We recently proposed an
unsupervised method for TIS prediction in microbial genomes (Hu et al.,
2009). The method classifies all TIS candidates into three categories:
true TISs, false TISs upstream of the true TIS (in the non-coding region),
and false TISs downstream of the true TIS (in the coding region). The
sequence features around the TISs of each category are characterized by a
non-homogeneous Markov model. The three models are trained by an
iterative self-learning procedure. At each step, the Markov models are
combined by a Bayesian methodology, which assigns three
posterior probabilities to each candidate TIS: the probability that it
is a true TIS (Pt), that it lies in the non-coding region (Pnc), and that
it lies in the coding region (Pc). TriTISA predicts the candidate with the
highest Pt score as the TIS of a gene. The updated annotation constitutes
the training set for the next iteration. To further improve
prediction accuracy, TriTISA employs a cascade of Markov models of
different orders: it first uses a 0th-order
Markov model for initial refinement, and then moves to higher-order (1st
and 2nd) Markov models in the later iterations.
Tests on simulated and experimentally verified data show that
TriTISA produces more accurate and robust predictions than previous
state-of-the-art methods (Hu et al., 2009).
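The Bayesian combination of the three models can be sketched as follows. This is a simplified illustration under stated assumptions, not TriTISA's actual code: the category keys (`"t"`, `"nc"`, `"c"`), the explicit prior weights, and the 0th-order position-specific model are all hypothetical choices, and the published method (Hu et al., 2009) may differ in its details.

```python
import math


def window_log_lik(window, position_probs):
    """Log-likelihood of a sequence window under a 0th-order, position-specific
    (non-homogeneous) model; position_probs[i][base] is P(base at position i)."""
    return sum(math.log(position_probs[i][b]) for i, b in enumerate(window))


def posteriors(log_liks, priors):
    """Combine per-category log-likelihoods with priors via Bayes' rule.

    Both arguments are dicts keyed by "t" (true TIS), "nc" (upstream,
    non-coding) and "c" (downstream, coding). Returns Pt, Pnc and Pc
    under the same keys, normalized to sum to 1.
    """
    joint = {c: log_liks[c] + math.log(priors[c]) for c in log_liks}
    m = max(joint.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(v - m) for c, v in joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def pick_tis(candidates, priors):
    """Return the position of the candidate start with the highest Pt.

    candidates is a list of (position, log_liks) pairs for one gene.
    """
    scored = [(posteriors(ll, priors)["t"], pos) for pos, ll in candidates]
    return max(scored)[1]


# Toy usage: two candidate starts for one gene, uniform priors (an assumption).
uniform = {"t": 1 / 3, "nc": 1 / 3, "c": 1 / 3}
candidates = [
    (10, {"t": -5.0, "nc": -3.0, "c": -6.0}),
    (40, {"t": -2.0, "nc": -6.0, "c": -5.0}),
]
predicted_start = pick_tis(candidates, uniform)
```

In an iterative scheme of the kind described above, the predicted starts would be used to re-estimate the three models before the posteriors are recomputed.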
Here, we modified the TriTISA algorithm to post-process annotations of
binned metagenomic fragments. CDSs from
fragments binned into the same clade are assumed to share similar
translation initiation machinery, so that the sequence pattern of each
category of TIS candidates is homogeneous
across the clade. These assumptions allow
the parameters to be trained in the same way as for a single genome. CDSs
are extended as far as possible toward the 5' end before post-processing,
and CDSs that are
complete at their 5'-ends are used for parameter training. With the converged parameters,
TriTISA calculates three scores for
each candidate TIS: Pt, Pnc
and Pc. For CDSs that are complete at their 5'-ends,
the start codon is predicted as the candidate start with the
highest Pt score. For CDSs that are incomplete at their
5'-ends, we need to estimate whether the start codon is missing; in
other words, does the 5'-most start-codon-like triplet belong to the
coding region? We estimate the distribution of Pc from the
training set, which
readily yields a threshold for deciding whether a candidate lies in the
coding region (at a 95% confidence level). For CDSs that are estimated to
contain their start codons, we predict the TIS by the same procedure used
for the training CDSs.
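One way to turn an empirical score distribution into a decision threshold is sketched below. The text does not specify how the threshold is derived, so this sketch makes an explicit assumption: it takes the Pc scores of training candidates known to lie in the coding region and picks the cutoff that 95% of them reach. The function names and this percentile rule are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    rank = (len(xs) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def coding_threshold(pc_coding_training, keep=95.0):
    """Pc cutoff reached by `keep` percent of training coding-region candidates.

    Assumption of this sketch: the input holds Pc scores of training
    candidates located downstream of the true TIS.
    """
    return percentile(pc_coding_training, 100.0 - keep)


def start_codon_missing(pc_of_5prime_most, threshold):
    """True if the 5'-most candidate looks like it lies inside the coding
    region, i.e. the true start codon is probably absent from the fragment."""
    return pc_of_5prime_most >= threshold


# Toy usage with made-up training scores 0.01, 0.02, ..., 1.00.
training_pc = [i / 100 for i in range(1, 101)]
threshold = coding_threshold(training_pc)
```

A CDS whose 5'-most candidate falls below the threshold would then be treated like a 5'-complete CDS and given a predicted start.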
Figure 1. Program flow chart of MetaTISA
|
| Performance Evaluation |
|
Because experimentally verified TISs are lacking in metagenome
projects, the only reliable way to evaluate prediction performance is
to simulate a metagenome from artificial shotgun sequences of
complete microbial genomes. The validity of the k-mer method and of the
TriTISA method has been documented previously (Sandberg et al., 2001; Hu et
al., 2009). Here we tested their combined effect on TIS prediction for
metagenomes, using shotgun sequences simulated from 95 randomly selected
genomes plus 5 genomes for which experimentally verified TISs are available
(Hu et al., 2009). Two sets of simulations were created with different
fragment lengths: L = 700 bps and L = 400 bps. We selected the
genes with experimentally verified TISs from the five genomes as
benchmarks. Since many of their start codons are absent from the
fragments, we measure accuracy by sensitivity, sn = TP/(TP+FN), and
specificity, sp = TN/(TN+FP), where TP, TN, FP, and FN denote the
numbers of true positives, true negatives, false positives, and false
negatives, respectively. We demonstrate the performance of MetaTISA when used
to process the output of the newest version of MetaGene, namely
MetaGeneAnnotator or MGA (Noguchi et al., 2008) (Tables 1 and 2).
Similar improvements are obtained for Neural Net (Hoff et al.,
2008) (data not shown).
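The accuracy measures can be written out as a short sketch. The function names are illustrative. Weighting the per-genome values by the number of genes (the # column) is an assumption, but it does reproduce the weighted-average MGA sensitivity reported in Table 2 for 700 bps fragments.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """sp = TN / (TN + FP)."""
    return tn / (tn + fp)


def weighted_average(values, weights):
    """Average of values weighted by, e.g., the number of genes per genome."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# MGA sensitivities and gene counts from Table 2, fragment length = 700 bps.
mga_sn = [64.42, 83.53, 89.56, 91.27, 85.72]
genes = [103, 92, 733, 248, 428]
avg_sn = weighted_average(mga_sn, genes)  # rounds to 86.84
```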
Table 1. Accuracies calculated according to the
RefSeq whole genome annotation
| | Fragment length = 700 bps | | | Fragment length = 400 bps | | |
| Genomes | # | MGA_sn/sp | MTS_sn/sp | # | MGA_sn/sp | MTS_sn/sp |
| A. pernix | 1309 | 60.25/98.58 | 80.29/99.04 | 980 | 60.98/98.74 | 79.45/98.80 |
| Synechocystis sp. | 2196 | 78.08/99.34 | 78.14/99.18 | 1608 | 80.20/99.42 | 79.11/99.07 |
| E. coli | 3195 | 87.65/99.70 | 91.27/99.65 | 2455 | 86.97/99.71 | 90.74/99.48 |
| N. pharaonis | 1978 | 83.78/99.21 | 88.50/99.20 | 1502 | 83.55/99.28 | 87.74/98.91 |
| H. salinarum | 1626 | 79.64/99.34 | 88.45/99.46 | 1234 | 78.18/99.39 | 88.02/99.30 |
| Weighted average | - | 80.12/99.33 | 86.10/99.35 | - | 80.24/99.40 | 85.90/99.17 |
| MGA: MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated
over 5 simulation replicates.
Figures for the other 95 genomes are included here (Fragment length = 700,
Fragment length = 400), but these accuracies should be interpreted with
caution because the RefSeq annotation of TISs is not of high
quality (Hu et al., 2008). |
Table 2. Accuracies calculated according to
experimentally verified TISs
| | Fragment length = 700 bps | | | Fragment length = 400 bps | | |
| Genomes | # | MGA_sn/sp | MTS_sn/sp | # | MGA_sn/sp | MTS_sn/sp |
| A. pernix | 103 | 64.42/98.71 | 94.52/99.61 | 78 | 60.54/98.70 | 92.29/99.25 |
| Synechocystis sp. | 92 | 83.53/99.25 | 81.33/98.93 | 75 | 82.20/99.29 | 82.56/98.85 |
| E. coli | 733 | 89.56/99.76 | 93.77/99.72 | 562 | 87.06/99.75 | 93.44/99.60 |
| N. pharaonis | 248 | 91.27/99.55 | 97.04/99.58 | 184 | 90.74/99.53 | 95.86/99.11 |
| H. salinarum | 428 | 85.72/99.54 | 96.04/99.71 | 337 | 82.43/99.51 | 94.65/99.48 |
| Weighted average | - | 86.84/99.57 | 94.21/99.64 | - | 84.37/99.56 | 93.40/99.43 |
| MGA:
MetaGeneAnnotator; MTS: MetaTISA. Accuracies are calculated
over 5 simulation replicates. |
|
| References |
- Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res., 13:693-702.
- Hu, G.-Q., Zheng, X.-B., Ju, L.-N., Zhu, H. and She, Z.S. (2008) Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics, 9:160.
- Hu, G.-Q., Zheng, X.-B., Zhu, H. and She, Z.S. (2009) Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics, 25(1):123-125.
- Hoff, K.J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B. and Meinicke, P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9:217.
- McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. and Rigoutsos, I. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4:63-72.
- Noguchi, H., Taniguchi, T. and Itoh, T. (2008) MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Research, 15:387-396.
- Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res., 34:5623-5630.
- Sandberg, R., Winberg, G., Branden, C.-I., Kaske, A., Ernberg, I. and Coster, J. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11:1404-1409.
- Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. and Glockner, F.O. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163.
|
|