Pat Schloss (pdschloss at gmail.com) offers excellent workshops on the Mothur software for analyzing 16S rRNA data for bacteria. He’s just announced the next one (February 8 to 10, 2016, near the Detroit airport). I had the pleasure of attending the December workshop. A diverse and international group attended, with many folks interested in the human microbiome. The workshop was so helpful that I am sharing my notes in a couple of posts (due to their length). Here are my notes from Day 1. Although these are based on Pat’s lectures, all mistakes are likely my own.
We all should learn to program
R and/or Python
Codecademy
Coursera
MOOCs
Online resources
NCBI
GenBank — BLAST
- Taxonomy: curation is suspect
- Chimeras
- Ambiguous bases
SRA: raw reads
- .sra format
- retrieved by accession number
RDP: Ribosomal Database Project @ Michigan State, rdp.cme.msu.edu
- Pull sequences out of NCBI
- Curate for chimeras
- Build taxonomy: uses Bergey’s taxonomy, down to genus
- Aligner — not a fan
- Bacteria and (some) Archaea
- Bayesian classifier
Greengenes @DOE
- Select for length
- Bacteria >1200 nt
- Archaea >900 nt
- Chimera checking
- Multiple taxonomies (good taxonomies)
- gg, RDP, NCBI, SILVA, Norm Pace’s; to genus, some to species
- Alignment — horrible
SILVA — Bremen
- ARB
- Parc — long and short
- Ref NR — long, non-redundant
- Great alignment
- Taxonomy to genus
- Bacteria, Archaea and Eukaryotes
- Have both SSU and LSU rRNA
Can bring in your own database for use in Mothur, say if you wanted to use recA
Mothur
Implements a lot of other people’s tools
Rewrites others’ code to:
- make it more flexible
- provide a better interface
- speed it up
- parallelize it
“Pipelines”
Three approaches
Two binning-based approaches
— Classification-based
phylotypes
closed reference
— OTU-based (de novo)
Assign sequences to OTUs based on their similarity to each other (see the toy sketch after this list)
Phylogenetic: a third approach, tree-based rather than binning-based
Unifrac
Phylogenetic diversity
Community-wide comparisons, but not as useful for identifying, say, biomarkers or sources of variation
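To make the OTU-based approach concrete, here’s a toy Python sketch of de novo binning: single-linkage clustering of pre-aligned reads at a distance cutoff. This is purely my own illustration with made-up sequences, not Mothur’s actual clustering algorithm (its cluster() command is far more careful).

```python
from itertools import combinations

def distance(a, b):
    """Fraction of mismatched positions between two equal-length, aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def denovo_otus(seqs, cutoff=0.03):
    """Toy single-linkage clustering: merge clusters whenever any cross-pair is within the cutoff."""
    clusters = [{name} for name in seqs]
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        if distance(s1, s2) <= cutoff:
            c1 = next(c for c in clusters if n1 in c)
            c2 = next(c for c in clusters if n2 in c)
            if c1 is not c2:
                c1 |= c2          # merge the two clusters
                clusters.remove(c2)
    return clusters

# Made-up, pre-aligned toy reads (real 16S reads would be ~250 nt)
seqs = {
    "read1": "ACGTACGTACGTACGTACGT",
    "read2": "ACGTACGTACGTACGTACGA",  # 1 mismatch from read1 (5% of 20 nt)
    "read3": "TTGTACGTACGTACGTACCA",  # 4 mismatches from read1 (20%)
}
print(denovo_otus(seqs, cutoff=0.05))  # [{'read1', 'read2'}, {'read3'}]
```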
In Pat’s lab, they use a combination of approaches: e.g., do individuals with different disease states differ in phylogenetic diversity? Classify OTUs based on binning.
Sampling
— consistency (example: day of the menstrual cycle in human microbiome studies)
size of sample
soil cores
time/weather/etc.
What is your question?
— Metadata
Biological/Clinical
Environmental: soil and water parameters
Technical: lot number of DNA extraction kit, who did the DNA extractions
Include as much data as possible
When you upload data to the SRA, MIMARKS defines the minimal information you need to know about every dataset, such as latitude and longitude; provide as much information about your sample as you can
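As a note to self, here’s a minimal sketch of recording sample metadata as a TSV in Python; the field names are my own shorthand for the kinds of fields mentioned above, not the official MIMARKS terms.

```python
import csv

# Field names are illustrative shorthand, not the official MIMARKS vocabulary
fields = ["sample_id", "collection_date", "latitude", "longitude",
          "extraction_kit_lot", "extracted_by", "sequencing_run"]

samples = [
    {"sample_id": "S001", "collection_date": "2015-12-01",
     "latitude": 42.28, "longitude": -83.74,
     "extraction_kit_lot": "LOT-1234", "extracted_by": "tech_A",
     "sequencing_run": "run_01"},
]

with open("sample_metadata.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(samples)
```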
Randomization:
e.g., HMP: samples were collected in Houston and St. Louis; St. Louis samples were extracted and amplified in St. Louis, and likewise for Houston. Sample collection site was the most significant effect, and samples were not randomized
Batch effects: different lots of chips, which sequencing run (randomize samples across all sequencing runs, and include sequencing run in your statistics to make sure it isn’t a factor)
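A quick sketch of what randomizing samples across sequencing runs could look like (site names, sample counts, and run count are all made up):

```python
import random

# Hypothetical samples from two collection sites
samples = [f"Houston_{i}" for i in range(48)] + [f"StLouis_{i}" for i in range(48)]

random.seed(42)          # record the seed so the layout is reproducible
random.shuffle(samples)  # break any link between site and run

# Deal the shuffled samples across two runs so each run mixes both sites
runs = {f"run_{r + 1}": samples[r::2] for r in range(2)}
for run, assigned in sorted(runs.items()):
    n_houston = sum(s.startswith("Houston") for s in assigned)
    print(run, f"{n_houston} Houston / {len(assigned) - n_houston} St. Louis")
```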
Negative controls: a big issue with low-biomass samples; they can help you understand the role of contamination in your sampling design
Extract and Purify DNA
Low biomass
Contamination from DNA extraction kit: more contamination in commercial kits
Be sure to include negative controls for DNA extraction kit
Difficult / ”Precious” tissue: use surrogate samples to test your methods
Difficult bugs / wimpy bugs
There is bias everywhere! Understand the biases inherent in your methods.
Experimental Design
Statistical power: for the microbiome, we don’t have a good sense of this: what test, what you will measure, the effect size we want to detect, how much variation there is. You need a power analysis in an NIH grant.
Say I want to detect a 10% change in Shannon diversity (no idea what that means biologically); this is more of a hypothesis-driven approach
Another approach is to piggyback on a study’s power analysis (for another aim, based not on measures of the microbiome but on other study variables); this is more of a hypothesis-generating approach
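To make the hypothesis-driven version concrete, here’s a rough two-sample power calculation with entirely made-up numbers (baseline Shannon diversity ~3.0 and an assumed SD of 0.5, so a 10% change is a standardized effect of 0.6):

```python
from statsmodels.stats.power import TTestIndPower

# Made-up numbers: baseline Shannon diversity ~3.0, so a 10% change is 0.3;
# assume a standard deviation of 0.5 -> Cohen's d = 0.3 / 0.5 = 0.6
effect_size = 0.3 / 0.5

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,  # standardized difference between groups
    alpha=0.05,               # type I error rate
    power=0.80,               # desired power
)
print(f"~{n_per_group:.0f} samples per group")  # roughly 45 per group
```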
Another way is to base it on costs: total $ / ($/sample) → you can afford x samples
Trade-off: # reads per sample vs. number of samples
barcoding / indexing
20 × 10^6 reads/run
His lab runs 4 × 96-well plates and gets an average of ~50,000 reads per sample
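The back-of-envelope math behind those numbers:

```python
# ~20 million reads per run spread over four 96-well plates of samples
reads_per_run = 20e6
n_samples = 4 * 96                # 384 samples
print(reads_per_run / n_samples)  # ~52,000 reads/sample, close to the ~50,000 quoted
```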
Use SequalPrep plates for normalization http://www.thermofisher.com/order/catalog/product/A1051001?ICID=search-product
(I was intrigued by this and promptly ordered some plates to try.)
Sequence the V4 region (about 252 nt) with 2 × 250 nt paired-end reads (500-cycle kit)
Data from MiSeq
fastq file: forward and reverse, base calls and quality score for each
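For reference, a minimal sketch of reading a fastq file in Python (four lines per record; the file name is hypothetical):

```python
# Minimal fastq reader: each record is four lines
# (header, base calls, "+" separator, quality string; one quality char per base)
def read_fastq(path):
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            bases = fh.readline().rstrip()
            fh.readline()                    # skip the "+" separator line
            quals = fh.readline().rstrip()
            yield header, bases, [ord(c) - 33 for c in quals]  # Phred+33 encoding

# e.g.: for name, seq, quals in read_fastq("run1_R1.fastq"): ...
```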
Illumina flow cell
Single-stranded DNAs are amplified by bridge PCR
Low-diversity amplicon libraries cause problems (phasing), so diversify the barcodes and add about 10% PhiX to break things up
Sequencing Controls
Negative controls: it takes a lot of DNA to see anything on a gel, so you still need to run negative controls through normalization and sequencing
Also include a positive control
Mock community: this is the most important control you can run, because it allows you to calculate your error rate. It’s a collection of DNA for which you know the true sequences. You can take the E. coli genome and sequence its 16S to see what you get out. BEI Resources offers 20 genomic DNAs pooled together in roughly similar 16S rRNA copy number: https://www.beiresources.org/Catalog/otherProducts/HM-782D.aspx
Schloss lab uses one negative and one mock community sample per run
(Seems like a good idea. Does anyone else do this?)
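As I understand it, the error-rate calculation boils down to comparing each read against the known reference sequence it came from. A toy sketch with made-up, pre-aligned sequences (Mothur’s seq.error command does this properly):

```python
# Toy error rate: fraction of mismatched bases across all reads,
# assuming reads are already aligned to the known mock-community reference
def error_rate(reads, reference):
    mismatches = total = 0
    for read in reads:
        mismatches += sum(a != b for a, b in zip(read, reference))
        total += len(read)
    return mismatches / total

reference = "ACGTACGTAC"                            # made-up "true" sequence
reads = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]  # two reads carry one error each
print(f"{error_rate(reads, reference):.2%}")        # 2 of 30 bases wrong -> 6.67%
```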
What to do about the contaminants?
Sequence-level contamination from kit contamination
Expect the relative abundance relationships among contaminants to be similar across samples
Paper on contamination issues with DNA extraction kits:
Salter et al. 2014 http://www.biomedcentral.com/1741-7007/12/87
20 million reads, but they aren’t all unique; maybe only something like 10^5 actually are. Align and classify; what you get down to is a function of the error rate. The error rate inflates the number of unique reads; if you can remove errors, you could get down to 10^4 or 10^3 unique reads
Reads off the sequencer have an error rate of about 1%; you can get that down to about 0.02%
With a 0.5% error rate, nearly every read has an error, which means a lot of artifacts and a lot of inflation; we also want the cleanest data possible just for good science
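A quick sanity check on that claim, assuming a 250 nt read:

```python
# With a 0.5% per-base error rate over a 250 nt read:
p_err, read_len = 0.005, 250
print(p_err * read_len)         # ~1.25 expected errors per read
print((1 - p_err) ** read_len)  # ~0.29 chance a read has no errors at all
```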
Mock community data allows us to do this
We know the true sequence
and we know all possible chimeras, which are not sequencing errors
allows us to know what kinds of errors we have
Goal is to reduce error rate and to increase the number of sequences
Kozich et al. 2013 http://aem.asm.org/content/79/17/5112.short
Building on Greg Caporaso’s work
V4 252 bp
V3-V4 429 bp, less overlap, higher error rate
V4-V5 375 bp, less overlap, higher error rate
Looking at longer regions lets you vary the overlap
Looked at four types of samples:
- Mock community
- Human
- Mouse
- Soil
Error in Rev read > Error in Forward read
Looked at substitutions (e.g., A to G) and insertions/deletions; found no obvious biases in either
Phred units: P = 10^(−Q/10)
Distribution of quality scores: Q 20-40 (Phred units)
Predicted error rate is an inverse log: the higher the Phred score, the better the quality. Phred scores are pretty close to the errors seen in the mock community
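The conversion in code:

```python
# Phred: P = 10^(-Q/10), so every +10 in Q cuts the error probability tenfold
for q in (20, 30, 40):
    print(q, 10 ** (-q / 10))  # Q20 -> 1%, Q30 -> 0.1%, Q40 -> 0.01%
```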
What is the ΔQ that allows you to determine which base should be called?
For example, above ΔQ = 6 the error rate doesn’t change
For V4: keep 75% of the data, error rate of about 0.06%
For V4/5:
ΔQ = 6, error rate = 0.5%, keep 40%
For V3/4:
ΔQ = 6, error rate = 0.5%, keep about 5% of the data
V4 still seems best due to overlapping reads
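My understanding of the ΔQ rule, sketched in Python; this is the idea behind how Mothur’s make.contigs resolves disagreements in the overlap, though the real implementation surely handles more cases:

```python
# Consensus base call for one overlapping position of a read pair
def call_base(fwd_base, fwd_q, rev_base, rev_q, delta_q=6):
    if fwd_base == rev_base:
        return fwd_base                     # reads agree: keep the base
    if abs(fwd_q - rev_q) >= delta_q:
        # reads disagree, but one call is much more confident: trust it
        return fwd_base if fwd_q > rev_q else rev_base
    return "N"                              # too close to call: mark ambiguous

print(call_base("A", 38, "G", 20))  # "A" (forward is far more confident)
print(call_base("A", 25, "G", 22))  # "N" (quality scores too close)
```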
V2 Illumina chemistry
V2 -> 2×250 nt
V3 -> 2×300 nt, craps out at 500 nt
- Longer — better classification
- Primers have bias
Then we installed Mothur and used it to reduce sequencing and PCR errors in our data, following the Mothur online tutorial. The first thing I noticed was how easy it was to install.