Sequencing of PCR-amplified marker regions (e.g. 16S, ITS) to characterize the microbial ecology of samples is a widely used tool in Microbiology of the Built Environment (MoBE) investigations. Because these methods produce large amounts of data, sequences are typically clustered into operational taxonomic units (OTUs) based on sequence similarity to simplify downstream processing. However, investigators too often do not carefully consider how sequence clustering and other bioinformatics pipeline decisions affect study results.
Naomichi Yamamoto (Seoul National University) and I recently published a manuscript entitled "Clustering of fungal community internal transcribed spacer (ITS) sequence data obscures taxonomic diversity" (doi: 10.1111/1462-2920.12390), which investigates the impact of clustering fungal community ITS sequence data on taxonomic coverage (the number of taxa observed). This work demonstrates a small but statistically significant loss in taxonomic coverage when commonly used clustering techniques are applied, and also shows that clustering does not produce a statistically significant improvement in taxonomic assignment. Ambiguous sequences (i.e. sequences with divergent taxonomic assignment) were excluded, alleviating obvious issues with dual nomenclature or pleomorphs.
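To make the idea of lost taxonomic coverage concrete, here is a minimal Python sketch. It is not the method used in the manuscript. It assumes a hypothetical mapping from sequence IDs to taxa and a hypothetical set of clusters, and it assumes each OTU simply inherits the taxon of the first sequence listed in it. All names and data below are invented for illustration.

```python
def taxa_per_sequence(assignments):
    """Taxa observed when every sequence is assigned individually."""
    return set(assignments.values())


def taxa_per_cluster(assignments, clusters):
    """Taxa observed when each OTU inherits the taxon of its representative.

    Assumption: the representative is the first sequence listed in each
    cluster. Real pipelines pick representatives differently.
    """
    return {assignments[members[0]] for members in clusters.values()}


# Invented toy data, not results from the study.
assignments = {
    "s1": "taxon_A",
    "s2": "taxon_A",
    "s3": "taxon_B",  # similar enough to s1 and s2 to share their cluster
    "s4": "taxon_C",
    "s5": "taxon_C",
    "s6": "taxon_D",
}
clusters = {
    "otu1": ["s1", "s2", "s3"],
    "otu2": ["s4", "s5"],
    "otu3": ["s6"],
}

unclustered = taxa_per_sequence(assignments)
clustered = taxa_per_cluster(assignments, clusters)
lost = unclustered - clustered  # taxa hidden behind a cluster representative

print(len(unclustered), len(clustered), sorted(lost))
```

With this toy input, four taxa are observed without clustering and three with clustering, because taxon_B is hidden inside otu1. How much coverage a real pipeline loses depends on the similarity threshold and on how representatives are chosen.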
I am not arguing that sequence clustering prior to taxonomic assignment is inappropriate in all (or even most) study cases; however, investigators should be aware of the trade-offs associated with every decision made in a bioinformatics pipeline to ensure robust study conclusions. This work points to the necessity of carefully considering all choices made in bioinformatics pipelines and not accepting default settings if they conflict with our study goals. The MoBE program represents an interesting case, where many investigators are relatively inexperienced in bioinformatics, yet seek to apply bioinformatics tools to draw study conclusions. To reiterate a consistent theme of the MoBE program: not every investigator needs to be a bioinformatics expert (or a building science expert, or an architecture expert…), but everybody needs to know enough to be dangerous.
Lab Blog: Bibbylab.blogspot.com Twitter: @Bibby_Lab
Kyle – thanks for the post. I added some links (e.g., to your web site, the articles, etc.). Hope that is OK.
I agree with the general point here and note that a key aspect of doing bioinformatic analysis should be publishing and sharing the full workflow and code of what one did in an analysis.
See http://blogs.biomedcentral.com/bmcblog/2013/02/28/version-control-for-scientific-research/ for example.