Genome Re-annotation of Escherichia coli CFT073

The genome of uropathogenic Escherichia coli strain CFT073, a human pathogen, was sequenced and published in 2002, a landmark for the study of uropathogenic infections (Welch et al. 2002). However, the current RefSeq annotation of this pathogen is partly outdated: some essential genes associated with its virulence are missing or misannotated. We carried out a systematic reannotation that combines automated annotation tools with manual effort to provide a more comprehensive understanding of virulence in the CFT073 genome.

Public DNA sequence databases such as DDBJ, EMBL/EBI and GenBank accept annotation updates only from the original submitters, so third-party annotators are advised to seek alternative ways of making a genome reannotation publicly accessible to the research community (Salzberg 2007). This website is devoted to that purpose. It includes three sections: 1) a brief overview of the reannotation methods, 2) links to browse the reannotation, and 3) links for data download.

Citation: Chengwei Luo, Gang-Qing Hu and Huaiqiu Zhu: Genome reannotation of Escherichia coli CFT073 with new insights into virulence. BMC Genomics 2009, 10:552.


Methods

All open reading frames (ORFs) longer than 60 bp were extracted from the genome sequence of strain CFT073 downloaded from RefSeq. We first searched the ORFs against Swiss-Prot with blastp and against the Conserved Domain Database (CDD) with rps-blast (e-value < 1E-5 and identity > 30%), and then considered the results of four gene finders: EasyGene 1.2, GeneMark.hmm, Glimmer 3.02 and MED 2.0. Specifically, we included only genes co-predicted by at least three of these tools. In addition, to give a more complete picture, the reannotation includes genes with known functions from the original annotation. Comments on gene function are based on the CDD/Swiss-Prot blast results and on the original function annotation, if available.
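The two thresholds described above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline actually used: the `Hit` record and the function names are hypothetical, and the sketch does not attempt to show how the homology evidence and the gene-finder votes were finally combined.

```python
from dataclasses import dataclass

# Cutoffs quoted in the text above.
EVALUE_CUTOFF = 1e-5
IDENTITY_CUTOFF = 30.0   # percent identity
MIN_TOOL_SUPPORT = 3     # "co-predicted by at least three of the tools"

GENE_FINDERS = ("EasyGene 1.2", "GeneMark.hmm", "Glimmer 3.02", "MED 2.0")


@dataclass
class Hit:
    """Hypothetical record for one blastp or rps-blast hit."""
    subject: str
    evalue: float
    identity: float  # percent


def is_significant(hit):
    """True if the hit passes both the e-value and the identity cutoff."""
    return hit.evalue < EVALUE_CUTOFF and hit.identity > IDENTITY_CUTOFF


def has_consensus(predicted_by):
    """True if at least three of the four gene finders predict the gene."""
    supporting = set(predicted_by) & set(GENE_FINDERS)
    return len(supporting) >= MIN_TOOL_SUPPORT
```

For example, `is_significant(Hit("hypothetical_subject", 1e-20, 45.0))` is true, while a gene predicted only by Glimmer 3.02 and MED 2.0 fails `has_consensus`.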

RefSeq's original annotations of tRNAs, rRNAs and cryptic prophages are retained in this reannotation. Notably, most of the small RNAs (sRNAs) known so far in Escherichia coli are missing from the original annotation. To correct this systematic defect, we combined Rfam 9.0 predictions with a literature survey for sRNA annotation.

For gene start annotation, we used the ProTISA pipeline, which provides high-quality annotations of gene starts supported by several kinds of evidence: experiments, conserved domain searches, N-terminal sequence alignments among orthologous genes, and predictions from state-of-the-art tools.


Browser

This section provides links to the reannotation of the coding sequences of the genome. It gives the kind of information found in files archived in the GenBank and RefSeq databases, for instance those with the extension "ptt", including location, strand, length, PID and comments on function. In addition, it provides evidence from blast hits against Swiss-Prot, CDD and VFDB (for virulence factors), the name of the pathogenicity island that the gene belongs to (if available), prediction support from the four gene finders, and the category of the gene start annotated by the ProTISA pipeline.

To speed up browsing, we split the annotations into 21 files of ~250 entries each, sorted by gene location. Please click one of the links to browse (each file is named after the location of the first gene it contains):

190..255, 256700..256930, 528399..528716, 802268..802933, 1059490..1059657, 1295117..1295353, 1487534..1488556, 1748790..1748927, 2002243..2004189, 2252766..2253386, 2552901..2554031, 2834866..2836227, 3099533..3100099, 3379001..3379732, 3628503..3629552, 3877955..3878260, 4139768..4141039, 4393412..4394527, 4664945..4666048, 4941885..4942061, 5201332..5202405.
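The splitting scheme can be sketched as follows. This is a minimal illustration, assuming each entry is a tuple whose first two items are the left and right coordinates of the gene; the function name `split_annotations` is hypothetical and is not part of the site's own tooling.

```python
def split_annotations(entries, chunk_size=250):
    """Sort entries by location and group them into chunks of ~250.

    Each chunk is keyed by the location of its first gene, written as
    "left..right", mirroring the file names listed above.
    """
    ordered = sorted(entries, key=lambda entry: entry[0])
    files = {}
    for i in range(0, len(ordered), chunk_size):
        chunk = ordered[i:i + chunk_size]
        left, right = chunk[0][0], chunk[0][1]
        files[f"{left}..{right}"] = chunk
    return files
```

With this scheme the last file simply holds whatever entries remain, so it may contain fewer than 250.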

The column names of the files are explained below:

  • Location: for the positive strand, the positions of the start codon and the stop codon, separated by ".."; for the negative strand, the positions of the stop codon and the start codon.

  • Strand: positive strand "+", and negative strand "-".

  • Length: length of the gene (in amino acids).

  • PID: the number that uniquely identifies a protein in public databases; not set for new genes.

  • Gene: gene name.

  • Synonym: the ID code, which follows the order of gene locations on the chromosome in the reannotation.

  • Code: all set as "-", following the original annotation.

  • COG: follows the original annotation; not set for new genes.

  • Swiss-Prot hit: the most significant hit to Swiss-Prot, with e-value, identity and a link to that hit in Swiss-Prot.

  • CDD hit: the most significant hit to CDD, with e-value, identity and a link to that hit in CDD.

  • VFDB hit: the most significant hit to VFDB, with e-value, identity and a link to that hit in VFDB.

  • Prediction tools: tools that predict the gene: 1 for EasyGene 1.2, 2 for GeneMark.hmm, 3 for Glimmer 3.02 and 4 for MED 2.0.

  • PAI: name of the pathogenicity island that the gene belongs to.

  • TIS tag: IPT: direct/indirect evidence from experiments; CDC: confirmed by conserved domain search; HSC: confirmed by N-terminal sequence alignments of orthologous genes; Tri: predicted by TriTISA. See the ProTISA database and associated publication for details.

  • Product: comments on gene function.
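As a reading aid, the Location, Strand, Length and Prediction tools columns can be interpreted with a short Python sketch. The function names are hypothetical. It assumes that the Location span includes the stop codon, so the protein length is one codon less than the span, and that the Prediction tools field is a string containing the digits 1 to 4; the exact field format in the files is not specified here.

```python
# Numeric codes for the gene finders, as listed in the column description.
TOOL_CODES = {
    "1": "EasyGene 1.2",
    "2": "GeneMark.hmm",
    "3": "Glimmer 3.02",
    "4": "MED 2.0",
}


def parse_location(location, strand):
    """Return which coordinate is the start-codon side and which the stop side."""
    left, right = (int(part) for part in location.split(".."))
    if strand == "+":
        return {"start_codon": left, "stop_codon": right}
    if strand == "-":
        return {"start_codon": right, "stop_codon": left}
    raise ValueError(f"unexpected strand: {strand!r}")


def protein_length(location):
    """Length in amino acids, assuming the span includes the stop codon."""
    left, right = (int(part) for part in location.split(".."))
    return (right - left + 1) // 3 - 1


def decode_tools(field):
    """Map each digit in the field to a tool name, ignoring other characters."""
    return [TOOL_CODES[char] for char in field if char in TOOL_CODES]
```

For the first entry, `parse_location("190..255", "+")` places the start codon at 190 and the stop codon at 255, and `protein_length("190..255")` gives 21 under the stated assumption.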


Downloads

  1. Whole genome sequence from RefSeq (release 30; fasta format; 5182k).

  2. Original annotation from RefSeq (release 30; fasta format; 414k).

  3. Genome reannotation, as shown in the section "Browser" (plain text; 675k).

  4. Miscellaneous RNA genes in the reannotation (plain text; 1.59k).

  5. RefSeq genes ruled out in the reannotation (plain text; 38.1k).

  6. Newly added genes in the reannotation (plain text; 18k).


References

  1. Salzberg SL: Genome re-annotation: a wiki solution? Genome Biology 2007, 8:102.
  2. Welch RA, Burland V, Plunkett Gr, Redford P, Roesch P, Rasko D, Buckles EL, Liou SR, Boutin A, Hackett J, Stroud D, Mayhew GF, Rose DJ, Zhou S, Schwartz DC, Perna NT, Mobley HLT, Donnenberg MS, Blattner FR: Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 2002, 99(26):17020-17024.

Last update on July, 2009, Copyright(C)2009, All Rights Reserved