Well, I have been having many discussions recently about PCR amplification from “negative” controls to which no sample DNA was added. Such amplification is, alas, pretty common, usually due to contamination in some other material added to the PCR reaction. Obviously it would be best to eliminate all DNA contamination from all reagents and all PCRs. But when that does not happen, it is possible to try to detect contamination after the fact. Below I post some papers related to post-sequencing detection of contamination:
- Common Contaminants in Next-Generation Sequencing That Hinder Discovery of Low-Abundance Microbes
- Abundant Human DNA Contamination Identified in Non-Primate Genome Databases
- Fast identification and removal of sequence contamination from genomic and metagenomic datasets
- Mycoplasma contamination in the 1000 Genomes Project
- ContEst: estimating cross-contamination of human samples …
- DeconSeq @ SourceForge.net
- AlienTrimmer: A tool to quickly and accurately trim off …
- Blobology: exploring raw genome data for contaminants, symbionts, and parasites using taxon-annotated GC-coverage plots
Any other suggestions or comments would be welcome.
UPDATE 10:30 AM 7/25 –
Was reminded on Twitter of a new, critically relevant publication on this issue: Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
@pathogenomenick @gregcaporaso crap – can’t believe I left that off
– Jonathan Eisen (@phylogenomics) July 25, 2014
10 thoughts on “Who are the contaminants in your sequencing project?”
SourceTracker is also really useful for this. I use it all the time to, e.g., determine if samples seem to have human skin contamination. We have a QIIME tutorial covering this (though it’s covering an older version of SourceTracker now).
Do you have a database one can include as part of a QIIME workflow that would include sequences from known reagent contaminants? Then you could run Sourcetracker and see if samples seem to have reagent contamination.
We are currently discussing publishing a standard database that could be used for this, but in the meantime I typically use the data from my PNAS 2011 paper, which contains human gut, human skin, human mouth, plus soil and other environmental samples. These give you a good range of potential contaminating environments, and were sequenced with the widely used 515F/806R primers on Illumina.
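To make the idea concrete: source tracking asks what fraction of a sample's reads plausibly derive from each candidate source environment (skin, gut, reagents, etc.). Below is only a naive toy sketch of that idea in Python, not SourceTracker's actual Bayesian/Gibbs method; the taxon names, read counts, and source profiles are all made up for illustration.

```python
# Toy illustration of the idea behind source tracking (NOT SourceTracker's
# algorithm): for each candidate source, compute the fraction of a sample's
# reads that come from taxa observed in that source environment.
# All names and counts here are hypothetical.

def naive_source_fractions(sample, sources):
    """sample: {taxon: read_count}; sources: {source_name: set of taxa}.
    Returns, per source, the fraction of the sample's reads belonging to
    taxa seen in that source. Taxa shared between sources are counted for
    each (a real method apportions them probabilistically instead)."""
    total = sum(sample.values())
    return {
        name: sum(reads for taxon, reads in sample.items() if taxon in taxa) / total
        for name, taxa in sources.items()
    }

sample = {"Propionibacterium": 300, "Staphylococcus": 200, "Bacteroides": 500}
sources = {
    "human_skin": {"Propionibacterium", "Staphylococcus"},
    "human_gut": {"Bacteroides"},
}
print(naive_source_fractions(sample, sources))
# {'human_skin': 0.5, 'human_gut': 0.5}
```

A sample with a large fraction attributable to, say, a skin profile (when skin taxa are not expected) would be flagged for possible contamination; the real SourceTracker does this with a Bayesian mixing model rather than simple overlap fractions.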
Greg – Do you exclude sequences that “look like” contamination, or re-process the sample? If you exclude the sequences, how can you be sure they are contamination, and isn’t this (somewhat) assuming the answer to your sequencing effort?
I am interested in how other groups handle this. We currently run negative PCRs for every reaction and a negative for every DNA extraction batch. If we could do this well post-sequencing, it would save a lot of effort.
I’m also a bit confused as to how you can use a database to screen out contamination. If you’re looking in a new environment, particularly a human-associated one… how do you decide which taxa don’t actually belong in those samples? It seems like you have to have actual wet-lab controls as Kyle described… though it’s not clear what the best way to deal with those is either.
Please comment on the Neanderthal and Denisovan genome projects in light of this issue.
Not sure what you are asking here …
I am in need of advice on how to “correct for contamination”. We are currently including non-template controls during our extraction process as well as our library prep process. My question now is, how do you correct for the “contamination signal”:
1. Do you remove the total number of reads present in non-template controls for specific taxa from all your samples in the run? Or do you calculate an average number of reads sequenced for the extraction non-template controls and for the library-prep non-template controls, and remove that number of reads for the respective taxa from all samples in the run?
2. Would you correct for contamination at the read or OTU level?
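The second option in question 1 (subtracting the mean per-taxon read count observed across non-template controls from every sample) can be sketched as a simple operation on a taxon-by-sample count table. This is only an illustrative Python sketch under that assumption, not an endorsed correction method; the sample names, taxa, and counts are hypothetical.

```python
# Hypothetical sketch: subtract the mean read count observed for each taxon
# across all non-template controls from every sample in the run, flooring
# at zero. All names and counts are made up for illustration.

def subtract_control_counts(samples, controls):
    """samples/controls: {name: {taxon: read_count}}.
    Returns corrected sample counts (mean control count per taxon
    subtracted, never going below zero)."""
    taxa = {t for counts in controls.values() for t in counts}
    n = len(controls)
    # Mean reads per taxon across the non-template controls
    mean_control = {t: sum(c.get(t, 0) for c in controls.values()) / n for t in taxa}

    corrected = {}
    for name, counts in samples.items():
        corrected[name] = {
            taxon: max(0.0, reads - mean_control.get(taxon, 0))
            for taxon, reads in counts.items()
        }
    return corrected

samples = {
    "sampleA": {"Ralstonia": 120, "Bacteroides": 5000},
    "sampleB": {"Ralstonia": 80, "Bacteroides": 3000},
}
controls = {
    "ntc_extraction": {"Ralstonia": 100},
    "ntc_libprep": {"Ralstonia": 60},
}

corrected = subtract_control_counts(samples, controls)
# Ralstonia mean across controls = 80, so sampleA keeps 40 reads, sampleB keeps 0
```

Note that flat subtraction ignores differences in sequencing depth between samples and controls, which is one reason people debate read-level versus OTU-level (presence/absence) correction.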
I’d be very hesitant to rely strictly on any ‘bioinformatic’ solution to removing contaminants. If the data are not trustworthy, it would make me nervous to remove contaminants using SourceTracker (or something equivalent). The key to avoiding problems with contamination is to do good lab work on the front end – otherwise it is garbage in, garbage out.
For example – this makes me very nervous: http://americangut.org/?page_id=277 as the assumption is that there is only a handful of bacteria associated with ‘blooms’ in samples stored improperly and the abundances of other taxa will not be unduly affected.
This paper from 2002 used culture-dependent methods to profile ultrapure water systems. Most of the contaminants they identified were also found in “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses”.