(This blog post was written by group #3 as a writing assignment.)
Last week in class we were introduced to genome assembly and validation using A5-miseq, an open-source bacterial genome assembly program. It produces high-quality genome assemblies by automating steps such as data processing, error correction, contig and scaffold assembly, and final verification. A5-miseq is an updated version of the A5 pipeline that is specifically designed for Illumina MiSeq reads.
Students less familiar with the command line got the chance to practice changing directories and entering commands that take input and output files, while those with more experience helped out the others. We downloaded the A5-miseq pipeline and ran our sequence reads through it. While the program worked through its data cleaning, error correction, contig assembly, and scaffolding steps, we discussed what each step was doing to produce the edited contiguous sequences that we would work with next.
After A5-miseq finished assembling the data, it was time for assembly validation. We looked at the output of the pipeline to verify three things. First, we used the A5-miseq summary to check that the sequenced sample produced an acceptable number of contigs (fewer than 200) and a realistic genome size (for example, Vibrio species normally have a genome of around 5 Mb). Too many contigs would mean we had very low-quality sequence data, and a total number of base pairs that was too large could mean that our sample was contaminated. Second, we wanted to make sure that specific genes were present and that there weren't large numbers of duplicate genes, because duplicates would also tell us that our sample might be contaminated with more than one bug. We performed this step using the program PhyloSift, and we even had one of its developers in class to assist us if we had any questions! The third step was to run our 16S rRNA sequence through the Basic Local Alignment Search Tool (BLAST), as we did with our Sanger sequences a few weeks earlier. This step verified that we had sequenced the same bacterium that we had selected based on our 16S Sanger reads.
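To make the first check concrete, here is a minimal Python sketch of how the contig count and total assembly size could be computed from a contigs FASTA file and compared against the thresholds mentioned above. This is not what A5-miseq does internally or what we ran in class; the function names, the 20% size tolerance, and the toy sequences are all assumptions made up for illustration, and the expected size would need to be set for whatever organism you are working with.

```python
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return a mapping of contig name -> sequence length from FASTA lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the contig name.
            parts = line[1:].split()
            name = parts[0] if parts else f"contig_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current contig.
            lengths[name] += len(line)
    return lengths


def check_assembly(
    lengths: Dict[str, int],
    max_contigs: int = 200,          # "fewer than 200 contigs" from the post
    expected_size: int = 5_000_000,  # roughly 5 Mb, the Vibrio example
    tolerance: float = 0.2,          # assumed: allow +/- 20% around expected size
) -> Dict[str, object]:
    """Summarize an assembly and flag whether it passes the two simple checks."""
    n_contigs = len(lengths)
    total_bp = sum(lengths.values())
    return {
        "n_contigs": n_contigs,
        "total_bp": total_bp,
        "contigs_ok": n_contigs < max_contigs,
        "size_ok": abs(total_bp - expected_size) <= tolerance * expected_size,
    }


if __name__ == "__main__":
    # Toy in-memory FASTA; with real data you would pass an open file instead.
    toy_fasta = [">contig_1 example", "ACGTACGT", "ACGT", ">contig_2", "GGGCCC"]
    report = check_assembly(contig_lengths(toy_fasta), expected_size=18)
    print(report)
```

With a real assembly, you could pass an open file handle for the contigs FASTA to `contig_lengths` in place of the toy list. A failed `size_ok` is only a hint to look closer, since a genome that is much larger than expected can point to contamination but does not prove it.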
If everything looked alright with our genomes, we uploaded them to Rapid Annotation using Subsystem Technology (RAST) to be annotated. After a day or two, the RAST results should let us determine the genus and species of each isolate.