Here are my notes from day 2 of the Mothur workshop taught by Pat Schloss (pdschloss at gmail.com) in December 2015. For those who are interested in learning to use Mothur for microbiome studies, Pat will be teaching another one in February.
Mothur is better for bacterial characterization than eukaryotes because the sequences are aligned before OTU clustering. This approach doesn’t work as well for 18S sequence data because the databases are so sparse.
It goes without saying that many pathogens (including anthrax and plague) cannot be identified based on sequencing 16S rRNA alone.
Questions that might be interesting for the microbe.net community to discuss include: do we believe in OTU assignments and comparisons of different approaches in OTU clustering, such as open reference, closed reference, furthest neighbor and average neighbor.
Day 2:
We talked about:
alignment
distances
classification
clustering
Alignment
16S has secondary structure
Need to make sure motifs are lined up across all our sequences
want to preserve the positioning of these motifs across sequences
–> positional homology
Pairwise
Needleman-Wunsch
Waterman
BLAST, ESPRIT, USEARCH
PAT—-SCHLOSS
SARAHSCHLOSS
overestimates similarity
doesn’t care about secondary structure
All vs All alignment ~N2
Multiple Sequence Alignment
CLUSTAL and MUSCLE
Pairwise and
Get all PW alignments
Merge PW alignments
PAT-–SCHLOSS
SARAHSCHLOSS
SAM__SCHLOSS
Effort ~N4
Global alignment:
SAM–-SCHLOSS
SARAHSCHLOSS
Local alignment:
SCHLOSS
SCHLOSS
BLAST Basic Local Alignment Search Tool
We need a better approach for 16S data.
Profile alignments
NAST (greengenes) used by Mothur
SINA (SILVA)
Infernal (RDP)
Require reference alignment
reflects secondary structure
Search refs for closest match to the unaligned sequence
Global PW alignment of the unaligned to the reference sequence
mer counting (k=7 used in Mothur)
Example of k=5
ATGCCTTCCTGA
ATGCC
TGCCT
GCCTT
CCTTC
CTTCC
TTCCT
GCCTT
CCTGA
Compare 7mers with References, which ref shares the most kmers?
It seems kind of magical how well this works, like black magic
BLAST was slow and not as good as using kmer counting searching to find reference with closest match to reference sequence
Then use Needleman-Wunsch for alignment
Effort is proportional to the number of sequences that you have
Alignments in (fasta)
RDP- variable – don’t try
gg – variable – look drunk
SILVA- manually curated
50,000 columns wide
16S 1500 bases
16S and 18S
Extract V4 region from the alignment
SEED: 14K
Larger: 100K
Mothur program for this is align.seqs
ITS
vary in length and sequence
can have multiple and different ITS fragments per genome
HUGE intragenomic variation
question of whether there is true homology
Could look at unique sequences
pre.cluster allows you to cluster sequences that are within a base count of each other
16S rRNA also has some intragenomic variation in different taxa
V6 evolves 4 times faster than other regions
99% similar across entire region, 3% cutoff in V6: example from Mitch Sogan study with 464 lots of variation in E. coli
16S rRNA – Why use it?
universal
well-conserved
some variation
well-studied
sequenceable
good taxonomy
not prone to HGT
Why not use 16S?
well conserved
no phenotype
not functional
multiple copies
intragenomic variation
universal (in plants)
Want sequences to overlap some alignment
* space in alignment
screen.seqs
removes sequences
start
end
filter.seqs
removes alignment positions
vertical = trump
trump= . (if this column has a dot in it, it is removed) could also set it to remove columns with a –
. is missing data
– is alignment gap
trump = tidies alignment
vertical = removes columns that are 100% gap
then run unique.seqs again (though more important with 454 data than Illumina data)
pre.cluster
Preclustering step to remove extra sequencing error
makes the assumption that we see later that high abundance readings are more reliable
Sort reads in decreasing order of abundance
Look for rarer reads that are within a threshold of the more abundant sequence
Remove rare sequence, add its counts to the abundant sequence
say differences allowed are 2 (diffs = 2)
100 reads
50 50 (within a base)
20 (within a base)
2 (within a base)
2 (within a base)
Remove the reads that are within 1 base of the 100 and add to 100 and the 20, end up with 224 instead of 100.
What do you use for number of diffs?
based on sequence length and threshold
length is 50 nt
2 diff is 4%
if length is 100
2 diff is 2%
They use 1 diff per 100 nt
For MiSeq 250, use diff=2
to stay under a threshold of 3%
0.06% TO 0.02%
not a huge reduction but still 3 fold reduction in error
Chimeras
PCR artifact, incomplete extension followed by heterologous priming–> chimera
annealing of a PCR fragment to another fragment that then gets completed by Taq on another fragment, ends up half from one and half from another
Use a 5 min extension time in PCR, as extension time increases, chimerism decreases (Brian Haus et al at Broad Institute)
increase DNA increased chimeras
increase PCR cycles increase chimeras
increase sheared DNA increases chimeras
5-25% of reads were chimeras
Looked at HMP data for chimeras
26% found in 2 or more samples
14 chimeras in 20/30 libraries
–>not random
most abundant chimera appeared 30 times across the libraries
What can we do to remove chimeras?
Current tools include:
ChimeraSlayer dev by Brian Hass
uchime dev by Robert Edgar
Perseus dev by Chris Quince
All have different quirks
Some require a database (ChimeraSlayer and uchime)
but the databases are littered with chimeras
best sequences that are chimera free come from cultured bugs
database-independent
For each sample:
Sort by abundance
Trust #1 and #2 most abundant sequences
Ask is #3 a chimera of 1 and 2, if no then keep 3
Continue to #4, work through whole list
FDR Sensitivity (can detect chimera) Specificity (is actually a chimera)
Slayer 2.5 80 94
Uchime 2.6 88 94
Perseus 2.4 87 93
Can run all three in Mothur and compare.
Schloss lab prefers Uchime.
Takes sequences breaks them into left and right and compares them to reference group, if similar to different places on the tree, then chimera
Challenge that we face in detecting chimeras: hard to detect them, factors influencing ability to detect them: length, similarity of parents, location of breakpoint
Seq is in ABC
Chimeric sequence is in B
What to do?
Prefer to just remove B
Classification
BLAST
kmer counting
which ref to base classification on?
k – nearest neighbor
k=? k=1? k=10?
Bayesian
Probability of Genus j given our sequence
G1 G2 G3 G4
kmers that show up multiple times
Bootstrapping
randomly pick some number N of khmers from query
sample with replacement
# taxonomy –> confidence
also a bit of black magic
Require confidence > 80% in their classifications (default in QIIME is 50%)
All these methods can be done in a command called classified.seqs
Contaminants
chloroplasts
mitochondria
Eukarya/Archaea
Rarefaction controls for the same amount of junk per sample
only look at OTUs that show up in a certain number of samples
Gaps
What about gaps? Need to do something with the gaps.
ATGCCATG
AACC–TG
one approach is to ignore gaps 2/6 = 0.33
makes things with gaps look more similar to each other than they really are
count each gap 4 differences 4/8 = 0.50
one gap = 3/7 = .43
for 16S these are all pretty similar
Protein coding sequences
Correction for multiple substitutions
jukes-cantor
Use distances to calculate OTUs
Clustering
3%
set your own threshold
OTUs are not a surrogate for species
reference-free
applies common threshold across sequences
slow/computationally demanding
How do OTUs relate to taxonomy?
Find many OTUs per genus
subgenus level lineages
Don’t find multiple genera per OTU
Where did 3% come from?
Sequencing error
intragenomic variation
OTU clustering
de novo clustering
–hierarchical
–greedy
usearch – fast
vsearch – not great OTU assignment
swarm
classify:
“neighbors”
furthest – everything in OTU is within threshold
nearest – everything in OTU is within threshold of at least one other
average – everything in OTU is on average less than the threshold apart
in same OTU are similar to each other
True Pos x x
True Neg 0 0
False Pos x 0
False Neg O x
Furthest neighbor has no false positives and more false negatives
Nearest neighbor has no false negatives and high number of false positives
Average has some false positives and some false negatives
Matthew’s Correlation Coefficient:
a metric that allows us to take these four parameters and weight them somewhat evenly
average neighbor is the best for MCC
average neighbor is also better than the greedy algorithms
QIIME closed and open reference:
numerous problems with reference-based approaches
search algorithm is a problem
for closed, take a sequence and find the closest reference and assign your sequence to that reference
for open, do closed reference and then with the leftover, they do usearch to cluster that to OTUs
both of these methods suck compared to average neighbors
Furthest neighbor is not a good method, get different numbers of OTUs (based on when you cluster relative to rarefaction?) see more in PeerJ paper
Can get variation in OTU assignments when you run it over and over again
But these slightly different results are all pretty good
Average neighbor is slow but better
Done some things to speed things up
don’t find OTUs that are made up of multiple genera, families
classify.seqs
Take sequences that we’ve classified and then split by taxon and within each taxon, we then cluster
and then put data back together
speeds things up
uses less RAM
can parallelize clustering steps
vsearch is free and will be brought into the next version of Mothur
for last few years average neighbor has been the default in Mothur
can do phylotyping as well as OTUs
Phylotyping
split sequences into OTUs by classification
can use OTUs to classify
classify.otu
then count: for each OTU can count the number of times a sequence is classified as say “T”
use T > 50%
then assign that taxon name to the OTU
majority consensus classification
alternative is to find representative sequence but this has issues
output is *.otu.taxonomy
Today the goal is to go from aligned seqs to OTUs
Can take our list file and out group/count file and generate a shared file
the shared file is a table where the rows are samples and the columns are the OTUs and the values are the counts
also the biom format is needed for picrust and other programs
Mothur can also generate biom format
—
Yesterday we left of with count.seqs
Like this:
Like Loading...
Related