Pat Schloss (pdschloss at gmail.com) offers excellent workshops on the Mothur software for analyzing 16S rRNA data for bacteria. He’s just announced the next one (February 8 to 10, 2016, near the Detroit airport). I had the pleasure of attending the December workshop. A diverse and international group attended, with many folks interested in the human microbiome. The workshop was so helpful that I am sharing my notes in a couple of posts (due to their length). Here are my notes from Day 1. Although these are based on Pat’s lectures, all mistakes are likely my own.
We all should learn to program
R and/or Python
Codecademy
Coursera
MOOCs
Online resources
NCBI
GenBank — BLAST
- Taxonomy: curation is suspect
- Chimeras
- Ambiguous bases
SRA: raw reads
- .sra format
- retrieved by accession number
RDP: Ribosomal Database Project @ Michigan State, rdp.cme.msu.edu
- Pull sequences out of NCBI
- Curate for chimeras
- Build taxonomy: uses Bergey’s taxonomy, down to genus
- Aligner — not a fan
- Bacteria and (some) Archaea
- Bayesian classifier
Greengenes @DOE
- Select for length
- Bacteria >1200 nt
- Archaea >900 nt
- Chimera checking
- Multiple taxonomies (good taxonomies)
- gg, RDP, NCBI, SILVA, Norm Pace’s; to genus, some to species
- Alignment — horrible
SILVA — Bremen
- ARB
- Parc — long and short
- Ref NR — long, non-redundant
- Great alignment
- Taxonomy to genus
- Bacteria, Archaea and Eukaryotes
- Have both SSU and LSU rRNA
Can bring in your own database for use in Mothur, say if you wanted to use recA
Mothur
Implements a lot of other people’s tools
Rewrites others’ code to:
- make it more flexible
- provide a better interface
- speed it up
- parallelize it
“Pipelines”
Three approaches
Two binning-based approaches
— Classification-based
phylotypes
closed reference
— OTU-based (de novo)
Assign sequences to OTUs based on their similarity to each other (see the toy sketch after this list)
Phylogenetic: a third approach, tree-based rather than binning-based
Unifrac
Phylogenetic diversity
Community-wide comparisons, but not as useful for identifying, say, biomarkers or sources of variation
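To make the OTU-based approach concrete, here’s a toy Python sketch of de novo binning: single-linkage clustering of pre-aligned reads at a distance cutoff. This is purely my own illustration with made-up sequences, not Mothur’s actual clustering algorithm (its cluster() command is far more careful).

```python
from itertools import combinations

def distance(a, b):
    """Fraction of mismatched positions between two equal-length, aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def denovo_otus(seqs, cutoff=0.03):
    """Toy single-linkage clustering: merge clusters whenever any cross-pair is within the cutoff."""
    clusters = [{name} for name in seqs]
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        if distance(s1, s2) <= cutoff:
            c1 = next(c for c in clusters if n1 in c)
            c2 = next(c for c in clusters if n2 in c)
            if c1 is not c2:
                c1 |= c2          # merge the two clusters
                clusters.remove(c2)
    return clusters

# Made-up, pre-aligned toy reads (real 16S reads would be ~250 nt)
seqs = {
    "read1": "ACGTACGTACGTACGTACGT",
    "read2": "ACGTACGTACGTACGTACGA",  # 1 mismatch from read1 (5% of 20 nt)
    "read3": "TTGTACGTACGTACGTACCA",  # 4 mismatches from read1 (20%)
}
print(denovo_otus(seqs, cutoff=0.05))  # [{'read1', 'read2'}, {'read3'}]
```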
In Pat’s lab, they use a combination of approaches: e.g., do individuals with different disease states differ in phylogenetic diversity? Classify OTUs based on binning.
Sampling
— consistency (example: day of the menstrual cycle in human microbiome studies)
size of sample
soil cores
time/weather/etc.
What is your question?
— Metadata
Biological/Clinical
Environmental: soil and water parameters
Technical: lot number of DNA extraction kit, who did the DNA extractions
Include as much data as possible
When you upload data to the SRA, MIMARKS defines the minimal information you need to know about every dataset, such as latitude and longitude; provide as much information about your sample as you can
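As a note to self, here’s a minimal sketch of recording sample metadata as a TSV in Python; the field names are my own shorthand for the kinds of fields mentioned above, not the official MIMARKS terms.

```python
import csv

# Field names are illustrative shorthand, not the official MIMARKS vocabulary
fields = ["sample_id", "collection_date", "latitude", "longitude",
          "extraction_kit_lot", "extracted_by", "sequencing_run"]

samples = [
    {"sample_id": "S001", "collection_date": "2015-12-01",
     "latitude": 42.28, "longitude": -83.74,
     "extraction_kit_lot": "LOT-1234", "extracted_by": "tech_A",
     "sequencing_run": "run_01"},
]

with open("sample_metadata.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(samples)
```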
Randomization:
e.g., HMP: samples were collected in Houston and St. Louis; St. Louis samples were extracted and amplified in St. Louis, and likewise for Houston. Sample collection site was the most significant effect, and samples were not randomized
Batch effects: different lots of chips, which sequencing run (randomize samples across all sequencing runs, and include sequencing run in your statistics to make sure it isn’t a factor)
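A quick sketch of what randomizing samples across sequencing runs could look like (site names, sample counts, and run count are all made up):

```python
import random

# Hypothetical samples from two collection sites
samples = [f"Houston_{i}" for i in range(48)] + [f"StLouis_{i}" for i in range(48)]

random.seed(42)          # record the seed so the layout is reproducible
random.shuffle(samples)  # break any link between site and run

# Deal the shuffled samples across two runs so each run mixes both sites
runs = {f"run_{r + 1}": samples[r::2] for r in range(2)}
for run, assigned in sorted(runs.items()):
    n_houston = sum(s.startswith("Houston") for s in assigned)
    print(run, f"{n_houston} Houston / {len(assigned) - n_houston} St. Louis")
```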
Negative controls: a big issue with low-biomass samples; they can help you understand the role of contamination in your sampling design
Extract and Purify DNA
Low biomass
Contamination from DNA extraction kit: more contamination in commercial kits
Be sure to include negative controls for DNA extraction kit
Difficult / ”Precious” tissue: use surrogate samples to test your methods
Difficult bugs / wimpy bugs
There is bias everywhere! Understand the biases inherent in your methods.
Experimental Design
Statistical power: for the microbiome, we don’t have a good sense of this: what test, what you will measure, the effect size we want to detect, how much variation there is. You need a power analysis in an NIH grant.
Say I want to detect a 10% change in Shannon diversity (no idea what that means biologically); this is more of a hypothesis-driven approach
Another approach is to piggyback on a study’s power analysis (for another aim, based not on measures of the microbiome but on other study variables); this is more of a hypothesis-generating approach
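To make the hypothesis-driven version concrete, here’s a rough two-sample power calculation with entirely made-up numbers (baseline Shannon diversity ~3.0 and an assumed SD of 0.5, so a 10% change is a standardized effect of 0.6):

```python
from statsmodels.stats.power import TTestIndPower

# Made-up numbers: baseline Shannon diversity ~3.0, so a 10% change is 0.3;
# assume a standard deviation of 0.5 -> Cohen's d = 0.3 / 0.5 = 0.6
effect_size = 0.3 / 0.5

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,  # standardized difference between groups
    alpha=0.05,               # type I error rate
    power=0.80,               # desired power
)
print(f"~{n_per_group:.0f} samples per group")  # roughly 45 per group
```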
Another way is to base it on costs: total $ / ($/sample) → you can afford x samples
Trade-off: # reads per sample vs. number of samples
barcoding / indexing
20 × 10^6 reads/run
His lab runs 4 × 96-well plates and gets an average of ~50,000 reads per sample
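The back-of-envelope math behind those numbers:

```python
# ~20 million reads per run spread over four 96-well plates of samples
reads_per_run = 20e6
n_samples = 4 * 96                # 384 samples
print(reads_per_run / n_samples)  # ~52,000 reads/sample, close to the ~50,000 quoted
```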
Use SequalPrep plates for normalization http://www.thermofisher.com/order/catalog/product/A1051001?ICID=search-product
(I was intrigued by this and promptly ordered some plates to try.)
Sequence the V4 region (about 252 nt) with 2 × 250 nt paired-end reads (500-cycle kit)
Data from MiSeq
fastq file: forward and reverse, base calls and quality score for each
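For reference, a minimal sketch of reading a fastq file in Python (four lines per record; the file name is hypothetical):

```python
# Minimal fastq reader: each record is four lines
# (header, base calls, "+" separator, quality string; one quality char per base)
def read_fastq(path):
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            bases = fh.readline().rstrip()
            fh.readline()                    # skip the "+" separator line
            quals = fh.readline().rstrip()
            yield header, bases, [ord(c) - 33 for c in quals]  # Phred+33 encoding

# e.g.: for name, seq, quals in read_fastq("run1_R1.fastq"): ...
```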
Illumina flow cell
Single-stranded DNAs are amplified by bridge PCR
Low-diversity amplicon libraries cause problems (phasing), so diversify the barcodes and add about 10% PhiX to break things up
Sequencing Controls
Negative controls: it takes a lot of DNA to see anything on a gel, so you still need to run negative controls through normalization and sequencing
Also include a positive control
Mock community: this is the most important control you can run, because it allows you to calculate your error rate. It’s a collection of DNA for which you know the true sequences. You can take the E. coli genome and sequence its 16S to see what you get out. BEI Resources offers 20 genomic DNAs pooled together in roughly similar 16S rRNA copy number: https://www.beiresources.org/Catalog/otherProducts/HM-782D.aspx
Schloss lab uses one negative and one mock community sample per run
(Seems like a good idea. Does anyone else do this?)
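As I understand it, the error-rate calculation boils down to comparing each read against the known reference sequence it came from. A toy sketch with made-up, pre-aligned sequences (Mothur’s seq.error command does this properly):

```python
# Toy error rate: fraction of mismatched bases across all reads,
# assuming reads are already aligned to the known mock-community reference
def error_rate(reads, reference):
    mismatches = total = 0
    for read in reads:
        mismatches += sum(a != b for a, b in zip(read, reference))
        total += len(read)
    return mismatches / total

reference = "ACGTACGTAC"                            # made-up "true" sequence
reads = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]  # two reads carry one error each
print(f"{error_rate(reads, reference):.2%}")        # 2 of 30 bases wrong -> 6.67%
```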
What to do about the contaminants?
Sequence-level contamination from kit contamination
Expect the relative abundance relationships among contaminants to be similar across samples
Paper on contamination issues with DNA extraction kits:
Salter et al. 2014 http://www.biomedcentral.com/1741-7007/12/87
20 million reads, but they aren’t all unique; maybe only something like 10^5 actually are. Align and classify; what you get down to is a function of the error rate. The error rate inflates the number of unique reads; if you can remove errors, you could get down to 10^4 or 10^3 unique reads
Reads off the sequencer have an error rate of about 1%; you can get that down to about 0.02%
With a 0.5% error rate, nearly every read has an error, which means a lot of artifacts and a lot of inflation; we also want the cleanest data possible just for good science
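A quick sanity check on that claim, assuming a 250 nt read:

```python
# With a 0.5% per-base error rate over a 250 nt read:
p_err, read_len = 0.005, 250
print(p_err * read_len)         # ~1.25 expected errors per read
print((1 - p_err) ** read_len)  # ~0.29 chance a read has no errors at all
```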
Mock community data allows us to do this
We know the true sequence
and we know all possible chimeras, which are not sequencing errors
allows us to know what kinds of errors we have
Goal is to reduce error rate and to increase the number of sequences
Kozich et al. 2013 http://aem.asm.org/content/79/17/5112.short
Building on Greg Caporaso’s work
V4 252 bp
V3-V4 429 bp, less overlap, higher error rate
V4-V5 375 bp, less overlap, higher error rate
Looking at longer regions lets you vary the overlap
Looked at four types of samples:
- Mock community
- Human
- Mouse
- Soil
Error in Rev read > Error in Forward read
Looked at substitutions (e.g., A to G) and insertions/deletions; found no obvious biases in either
Phred units: P = 10^(−Q/10)
Distribution of quality scores: Q 20-40 (Phred units)
Predicted error rate is an inverse log: the higher the Phred score, the better the quality. Phred scores are pretty close to the errors seen in the mock community
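The conversion in code:

```python
# Phred: P = 10^(-Q/10), so every +10 in Q cuts the error probability tenfold
for q in (20, 30, 40):
    print(q, 10 ** (-q / 10))  # Q20 -> 1%, Q30 -> 0.1%, Q40 -> 0.01%
```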
What is the ΔQ that allows you to determine which base should be called?
For example, above ΔQ = 6 the error rate doesn’t change
For V4: keep 75% of the data, error rate of about 0.06%
For V4/5:
ΔQ = 6, error rate = 0.5%, keep 40%
For V3/4:
ΔQ = 6, error rate = 0.5%, keep about 5% of the data
V4 still seems best due to overlapping reads
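My understanding of the ΔQ rule, sketched in Python; this is the idea behind how Mothur’s make.contigs resolves disagreements in the overlap, though the real implementation surely handles more cases:

```python
# Consensus base call for one overlapping position of a read pair
def call_base(fwd_base, fwd_q, rev_base, rev_q, delta_q=6):
    if fwd_base == rev_base:
        return fwd_base                     # reads agree: keep the base
    if abs(fwd_q - rev_q) >= delta_q:
        # reads disagree, but one call is much more confident: trust it
        return fwd_base if fwd_q > rev_q else rev_base
    return "N"                              # too close to call: mark ambiguous

print(call_base("A", 38, "G", 20))  # "A" (forward is far more confident)
print(call_base("A", 25, "G", 22))  # "N" (quality scores too close)
```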
V2 Illumina chemistry
V2 -> 2×250 nt
V3 -> 2×300 nt, craps out at 500 nt
- Longer — better classification
- Primers have bias
Then we installed Mothur and used it to reduce sequencing and PCR errors in our data, following the Mothur online tutorial. The first thing I noticed was how easy it was to install.