Monday, June 6, 2011

Crowdsourcing annotation of the German (monster-) E. coli (STEC O104:H4)?

By Mariam Rizkallah - Open Source Pharmacist
 
(Added by Niyaz Ahmed: Gut Pathogens will consider publishing any such annotation study on a priority basis!)

As soon as the Beijing Genomics Institute (BGI) finished sequencing the E. coli strain responsible for the outbreak in Europe, it made the "draft assembly" available for download. I didn't know the difference between a draft and the elegantly assembled genomes I used to work on while annotating phage capsid proteins, but I did believe that this outbreak is a phage thing, you know, "Cherchez le phage!" I wanted to annotate this E. coli.

I suggested to Dr. Aziz that this "genome sequence" might benefit from RAST (Rapid Annotation using Subsystem Technology, http://rast.nmpdr.org), the tool developed by Dr. Edwards' Lab and ANL for rapid and accurate annotation of bacterial and archaeal genomes. Dr. Aziz replied, "Then what are you waiting for? Upload it to RAST". Then he told me that the draft is in "FASTQ" format, a machine output format that cannot be used as input for RAST (or for Real-Time Metagenomics, RTMg, in particular: http://edwards.sdsu.edu/rtmg/). Dr. Aziz explained that each "run/genome fragment" would be treated as a metagenome, which is why I should use RTMg.

Step1: Convert the sequences from FASTQ to FASTA. I searched the web for a solution and found a one-line Perl program by Robert Schmieder, a PhD student in Dr. Edwards' lab, on his lab blog. I am very grateful, Robert!
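For readers who would rather not use Perl, here is a minimal Python sketch of the same conversion. It is not Robert's one-liner, only an illustration. It assumes the plain four-line FASTQ layout (header, sequence, "+", quality) with unwrapped sequences, and the function name fastq_to_fasta and the file names are made up for this example.

```python
import sys


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy FASTQ records to FASTA, dropping quality scores.

    Assumes each record is exactly four lines (no wrapped sequences).
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: " + header)
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count


# Hypothetical usage: python fastq2fasta.py run1.fastq run1.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fq, open(sys.argv[2], "w") as fa:
        print(fastq_to_fasta(fq, fa), "records converted")
```

The quality scores are simply thrown away here, so any quality-based clean-up has to happen before this conversion.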

Step2: Upload the draft in FASTA format to RTMg as separate "runs". I don't know whether this was the right approach, or whether I should have collected all the "runs" into one big FASTA file and let RTMg process them.
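If someone wants to try the "one big FASTA file" alternative, a sketch like the one below could do the merging. Whether RTMg handles a merged file better than separate runs is exactly what I don't know, so treat this only as a convenience. The function name merge_fasta, the "run|" header prefix, and the file names are all my own assumptions.

```python
def merge_fasta(named_handles, out_handle):
    """Concatenate several FASTA inputs into one output.

    named_handles is an iterable of (run_name, file_handle) pairs.
    Each header is prefixed with its run name so that reads from
    different runs stay distinguishable. Returns the record count.
    """
    total = 0
    for run_name, handle in named_handles:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip empty lines
            if line.startswith(">"):
                out_handle.write(">" + run_name + "|" + line[1:] + "\n")
                total += 1
            else:
                out_handle.write(line + "\n")
    return total


# Hypothetical usage with two converted runs:
# with open("run1.fasta") as a, open("run2.fasta") as b, \
#         open("all_runs.fasta", "w") as out:
#     merge_fasta([("run1", a), ("run2", b)], out)
```

Prefixing the run name keeps the option of splitting the results by run again afterwards.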

Step3: I took the "tab-separated" output of RTMg and saved it to a spreadsheet (I do apologize, but the spreadsheet on Google is not viewable online; please download it to edit it). I saved the results sorted by "function" and by "3-level subsystem".

Step4: ??? I really don't know. I do want to make use of the RTMg data, but I don't know where to start. I guess aligning the 5 runs against each other, as well as the strain against other E. coli strains, would be the answer. The community will take STEC O104:H4 down sooner than "it thought".

I quote Dr. Aziz here:

"RTMg is mainly a tool for metagenomics analysis, and it treats every sequence read as an environmental gene tag (EGT) or simply as an independent fragment. It cannot really be used to explain a genome; however, it gives a really quick idea of what each sequence looks like, of the ratio of different genes, of the most frequent protein-encoding genes in this strain, the density of phage proteins (for example, you can write a script to count the word phage and calculate its percentage of the overall genome), etc.

There are two main things needed before this, though: 1) clean up the sequence fragments; 2) assemble them (using, for example, NEWBLER). To check the sequences and clean them up, try Prinseq and Tagcleaner, then re-run. Another way to benefit from RTMg would be to compare that strain with another one sequenced exactly the same way, which is not relevant here."
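The phage-counting script Dr. Aziz mentions could look roughly like the sketch below. It works on the tab-separated RTMg output, but the column layout is my assumption: I take the function description to be in the second column (index 1) and the file to have no header row, so adjust function_column to match the real file. Note that it reports the percentage of annotated rows, not of genome length.

```python
def phage_fraction(lines, function_column=1, keyword="phage"):
    """Count rows whose function field mentions the keyword.

    lines is any iterable of tab-separated text lines.
    Returns (hits, total_rows, percent_of_rows).
    """
    total = 0
    hits = 0
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= function_column:
            continue  # skip blank or malformed rows
        total += 1
        if keyword.lower() in fields[function_column].lower():
            hits += 1
    percent = 100.0 * hits / total if total else 0.0
    return hits, total, percent


# Hypothetical usage on a downloaded results file:
# with open("rtmg_run1.tsv") as handle:
#     hits, total, percent = phage_fraction(handle)
#     print(hits, "of", total, "rows mention phage:", round(percent, 1), "%")
```

The match is a case-insensitive substring search, so "prophage" and "bacteriophage" are counted as well, which seems to be the intent here.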