The idea for GenomePeek began two years ago, when I was working with Karl Klose, Liz Dinsdale, and Rob Edwards to assemble a P. salmonis genome that was proving particularly difficult, even though we had 9 gigabases of sequence data. To check whether the data came from a single isolated genome, I pulled out all the reads that hit to 16S and assembled them. All of the assembled contigs hit to P. salmonis. It was only later that I found the genome had been shredded by an overactive transposon. We still haven’t solved that problem…
GenomePeek lay dormant for a year, until a student from SDSU’s bacterial sequencing class came looking for help. In the class, the students were isolating a bacterium, then sequencing, assembling, and analyzing its genome. The student wanted to create a recA phylogeny and was having trouble with taxonomic assignment. Their question was: “which RecA gene do I use?” This raised a red flag immediately, since recA is essentially a single-copy housekeeping gene. On inspection, their assembled genome did indeed have two full copies of recA: one that hit to a Vibrio and one that hit to a Photobacterium. At the time, both 16S rRNA genes hit to Vibrio, although they were clearly different from each other and hit to different species (since then, a representative Photobacterium 16S sequence has been added to the NCBI 16S database, and it is now the top hit). It turned out that this bacterium also had two copies of other single-copy housekeeping genes (I checked rpoB, groEL, nifD, gyrB, and fusA). In each case, one copy hit to a Vibrio species while the other was most similar to a Photobacterium species. Suffice it to say, the student was very disappointed after spending a few weeks analyzing the genome and writing up a paper. That experience gave me the idea for a tool where one could submit sequencing data and quickly get back a set of useful housekeeping genes for phylogenetic analysis. I thought such a tool would save everyone the time otherwise wasted on assembling, annotating, and analyzing contaminated data: by quickly checking these genes, we could easily detect whether the original sequencing data was contaminated. I wrote (more…)