
December Mothur Workshop, part 3

Here are my notes from day 3 of the Mothur Workshop that is taught by Pat Schloss (pdschloss at gmail.com) at a hotel that is conveniently located near the Detroit airport. Of all the bioinformatics workshops I have attended, this group of students was my favorite so far. We had a lot of fun having dinner together at some of the nearby restaurants. Contact Pat if you are interested in attending the next one in February. It’s really helpful and you get a free Mothur sticker!


—Day 3 of the Mothur Workshop—
At the beginning of your study, think about what your question is, what your hypothesis is
On Wednesday we started with fasta files from sequencer
Yesterday we ended up with shared files with OTUs and counts and a consensus taxonomy file
We talked about OTUs and how to define them; in general, a 3% threshold gives you multiple OTUs per genus rather than multiple genera per OTU
The SOP is exactly what the Schloss lab uses, though they may play around with the reference database used
Pat recommends updating the alignment databases when you start using Mothur at home
Common questions addressed by these analyses
Diversity (also called alpha diversity)
– combines evenness and richness
– functional
– variation
– number of things –> richness
– biodiversity
there are a number of papers in the microbiome literature that confuse richness with diversity
–> ignores the “names”
diversity can be the same in different habitats but contain very different organisms
alpha diversity describes a community
     Steve Carpenter’s Science review (about 10 years old) on diversity and stability: 1/3 of studies found a positive relationship, 1/3 negative, 1/3 neutral
     Quality of diversity matters, relates to function
breastfed babies have low diversity
How do we quantify diversity?
Rank abundance plot
x-axis OTUs
y-axis abundance
Shannon index
p_i = relative abundance of OTU_i
H' = –Σ p_i ln(p_i), ~ entropy
increasing H', increasing diversity
What does this really mean?
Commonly used term
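The Shannon calculation above is easy to check by hand; here is a minimal pure-Python sketch (the function name and example counts are illustrative — in practice Mothur reports this via its summary.single calculators):

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over OTU relative abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# A perfectly even community maximizes H' for a given richness:
even = [25, 25, 25, 25]    # H' = ln(4) ≈ 1.386
skewed = [97, 1, 1, 1]     # dominated by one OTU, so H' is much lower
```

Note how H' combines richness and evenness: the two communities have the same richness (4 OTUs), but the skewed one scores far lower.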
Another tool that Pat personally likes is the
Simpson Index
D = Σ p_i²
As D increases, diversity decreases
0 < D ≤ 1
1/D ~ effective number of taxa
If richness (S_T) is similar to 1/D, then you have a very even community
1/D << S_T: low evenness
1/D ~ S_T: high evenness
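The evenness interpretation of 1/D is easy to see with a toy calculation (function name and counts are illustrative):

```python
def simpson(counts):
    """Simpson's index D = sum(p_i^2); 1/D is the effective number of taxa."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

counts = [50, 30, 10, 10]   # 4 OTUs, uneven
D = simpson(counts)         # 0.36
inv_D = 1 / D               # ~2.8, well below the observed richness of 4
```

For a perfectly even 4-OTU community, D = 0.25 and 1/D = 4, matching the richness exactly.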
Non-parametric stats
Chao Index
     Chao1 = S_obs + n1²/(2·n2), i.e., the observed richness plus an estimate of the unobserved richness (n1 = number of singletons, n2 = number of doubletons)
     also ACE and jackknife
in Mothur
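A sketch of the Chao1 estimator from raw OTU counts (function name and example counts are illustrative; the bias-corrected fallback when n2 = 0 is a common convention, not something stated in the notes):

```python
def chao1(counts):
    """Chao1 richness: observed OTUs plus an estimate of unobserved OTUs
    based on singletons (n1) and doubletons (n2)."""
    s_obs = sum(1 for c in counts if c > 0)
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    if n2 == 0:                          # bias-corrected form avoids division by zero
        return s_obs + n1 * (n1 - 1) / 2
    return s_obs + n1 ** 2 / (2 * n2)

# counts [5, 3, 1, 1, 2]: S_obs = 5, n1 = 2, n2 = 1 -> Chao1 = 5 + 4/2 = 7
```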
Parametric approaches
try to fit species abundance data to normal
use lognormal
power series
log series, etc.
These fitted distributions never describe the data well
Not in Mothur yet but will be
Rarefaction curves
collector’s curves
average of the collector’s curves –> rarefaction curve
effort/number of samples
Have rarefied to as low as 500 reads per sample
Don’t put your rarefaction curve in your papers
Measures how well you sample more abundant taxa
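A rarefaction curve is just the average of many collector’s curves, which you can approximate by repeated subsampling. A pure-Python sketch (function name and toy counts are illustrative; Mothur’s rarefaction.single does this on real shared files):

```python
import random

def rarefy_richness(counts, depth, iters=100, seed=1):
    """Average observed richness when subsampling `depth` reads without replacement."""
    rng = random.Random(seed)
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    total = 0
    for _ in range(iters):
        total += len(set(rng.sample(pool, depth)))
    return total / iters

# Richness climbs with sampling effort and flattens as sampling saturates:
counts = [40, 30, 20, 5, 3, 1, 1]   # 100 reads, 7 OTUs
curve = [rarefy_richness(counts, d) for d in (10, 50, 100)]
```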
Sequence coverage = Good’s coverage = 100 x (1-n1/NT)%
   where n1 is singleton, NT is Effort (total number of sequences generated from that sample)
OTU Coverage = 100(1-n1/Sobs)%
% of OTUs observed more than once
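Both coverage formulas above translate directly into code (function names and counts are illustrative):

```python
def goods_coverage(counts):
    """Good's coverage: % of reads drawn from OTUs seen more than once."""
    n1 = sum(1 for c in counts if c == 1)
    n_total = sum(counts)
    return 100 * (1 - n1 / n_total)

def otu_coverage(counts):
    """% of observed OTUs that were seen more than once."""
    n1 = sum(1 for c in counts if c == 1)
    s_obs = sum(1 for c in counts if c > 0)
    return 100 * (1 - n1 / s_obs)

counts = [50, 30, 10, 5, 3, 1, 1]   # 100 reads, 7 OTUs, 2 singletons
# Good's coverage = 100 * (1 - 2/100) = 98%; OTU coverage = 100 * (1 - 2/7) ≈ 71%
```

The contrast shows why sequence coverage is dominated by abundant taxa: 98% of reads are "covered" even though nearly a third of the OTUs are singletons.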
Phylogenetic diversity
Build a tree of your sequences
Sum branch length across all the sequences in your sample
Also sensitive to sampling effort, need to rarefy
Randomly select a number of sequences
– Stats for diversity
– Plot
     – vs time
     – change
– line plot
– strip chart (good if you don’t have a lot of samples)
– box plot (not so good with few samples)
– Rarefaction curves
– Species abundance
– Rank abundance
Describe your community, what’s there
Beta diversity
– Pie chart
– Stacked bar chart
Why these are bad
Pie charts
   –  hard to visually compare areas in a pie chart
   –  also too many taxa for color palette to make pretty or useful
   –  not possible to see variation in data
   – do a strip chart of that population
Beta diversity
     – Uses “names”
     – membership – compares the list of names
     – structure – compares “names” and abundance
Comparing bird lists is a comparison of membership
Structure also uses abundance: e.g., having seen an eagle 200 times
Jaccard Index = SAB/(SA + SB – SAB)
SA , SB = richness of A or B
SAB = richness that is shared
sensitive to incomplete sampling and uneven sampling
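The Jaccard index above uses presence/absence only; a minimal sketch (function name and toy communities are illustrative):

```python
def jaccard_similarity(counts_a, counts_b):
    """Jaccard = S_AB / (S_A + S_B - S_AB), using presence/absence only."""
    a = {otu for otu, c in counts_a.items() if c > 0}
    b = {otu for otu, c in counts_b.items() if c > 0}
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

a = {"otu1": 90, "otu2": 5, "otu3": 5}
b = {"otu2": 50, "otu3": 40, "otu4": 10}
# shared = 2, S_A = S_B = 3 -> Jaccard = 2 / (3 + 3 - 2) = 0.5
```

Notice that the abundances (90 vs 5) play no role; only the lists of names matter, which is exactly why membership measures are sensitive to uneven sampling of rare taxa.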
Other similar metrics include Sorenson
Unweighted UniFrac is a similar kind of measure of membership
has similar limitations
He prefers to randomly draw n sequences from all samples, calculate the Jaccard index, and repeat 100 times to get an average J
The Schloss lab uses this as a measure of beta diversity
Preferred is to use structure based approaches
such as Bray-Curtis, Morisita-Horn, and θYC (Yue & Clayton)
BC = 1 – 2·(Σ min(S_Ai, S_Bi)) / (Σ S_Ai + Σ S_Bi)
Sensitive to sampling effort, uneven samples (number of sequences)
MH = 1 – 2·(Σ (p_Ai · p_Bi)) / (Σ p_Ai² + Σ p_Bi²)
emphasizes more abundant bugs
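Both structure-based metrics above are a few lines of code each (function names and toy vectors are illustrative; note how Morisita-Horn multiplies relative abundances, which is what makes it emphasize the abundant bugs):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity from raw counts (aligned OTU vectors)."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))

def morisita_horn(x, y):
    """Morisita-Horn dissimilarity; products of relative abundances
    weight the abundant OTUs most heavily."""
    px = [a / sum(x) for a in x]
    py = [b / sum(y) for b in y]
    num = 2 * sum(a * b for a, b in zip(px, py))
    den = sum(a * a for a in px) + sum(b * b for b in py)
    return 1 - num / den

x = [10, 10, 0]
y = [0, 10, 10]
# bray_curtis(x, y) = 1 - 2*10/40 = 0.5
```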
Phylogenetic approaches and what they seek to do
Summation of all the branch lengths of a tree that contains all the sequences from a sample
Unifrac tries to say what fraction of the tree is unique to each type of sample
Unifrac distance = Sum (Unique Branch Length)/ Sum(Branch Length)
Does not take into account the abundance of individual reads
Weighted Unifrac is similar but instead of all or nothing, uses relative abundance to weight the branches
Different approach from bin-based methods by using tree as input
What does sharing branch length mean?
Can’t get at function with this, although some hope we can
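The unweighted UniFrac ratio can be illustrated with a toy tree flattened into a list of branches (this representation, the function name, and the branch lengths are all illustrative; a real implementation traverses an actual phylogeny):

```python
def unweighted_unifrac(branches):
    """branches: list of (length, set_of_samples_with_sequences_under_branch).
    UniFrac = branch length unique to a single sample / total branch length."""
    total = sum(length for length, _ in branches)
    unique = sum(length for length, samples in branches if len(samples) == 1)
    return unique / total

# Toy two-sample tree: two deep branches shared by both communities,
# plus one tip branch unique to each sample
branches = [
    (1.0, {"A", "B"}),
    (1.0, {"A", "B"}),
    (0.5, {"A"}),
    (0.5, {"B"}),
]
# unweighted_unifrac(branches) = 1.0 / 3.0
```

Weighted UniFrac would replace the all-or-nothing "unique" test with relative-abundance weights on each branch.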
another way of looking at Beta Diversity
Pat’s paper on evaluating the structure of different microbial communities
Lean and obese people, build a tree to compare them, use ordination
How to tell if these differences are statistically significant?
     – get distance
     – randomize labels
     – recalculate distance
AMOVA (like PERMANOVA)
     – non-parametric analysis of variance
     – example: does average height differ between genders?
     – do the clouds in ordination space have the same center?
HOMOVA
     – tests for homogeneity of variance, analogous to Bartlett’s test
     – is the spread different between the two groups?
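The three steps above (get a distance, randomize the labels, recalculate) are the core of any permutation test. A pure-Python sketch in that spirit, using the gap between mean between-group and mean within-group distance as the test statistic (the statistic, function name, and toy matrix are illustrative, not the exact statistic Mothur's amova uses):

```python
import random
from itertools import combinations

def permutation_test(dist, labels, iters=999, seed=1):
    """Observed statistic and permutation p-value for group separation."""
    rng = random.Random(seed)

    def stat(lab):
        between, within = [], []
        for i, j in combinations(range(len(lab)), 2):
            (between if lab[i] != lab[j] else within).append(dist[i][j])
        return sum(between) / len(between) - sum(within) / len(within)

    observed = stat(labels)
    lab, hits = list(labels), 0
    for _ in range(iters):
        rng.shuffle(lab)                 # randomize the labels
        if stat(lab) >= observed:        # recalculate and compare
            hits += 1
    return observed, (hits + 1) / (iters + 1)

# Two tight, well-separated groups of three samples each:
labels = ["A"] * 3 + ["B"] * 3
dist = [[0 if i == j else (0.1 if labels[i] == labels[j] else 0.9)
         for j in range(6)] for i in range(6)]
obs, p = permutation_test(dist, labels)
```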
In the case of geography, ordination takes a 3D representation and makes it 2D
This introduces a lot of distortion, for example, a flat map of the US
Microbial ecology data are prone to the same kinds of distortion
– Eigenvalue / Vector-based
     data-fitting procedure, try to fit lines through your data
     given a big cloud of points, try to draw a line through the cloud and say that this line accounts for most of the variation in the data
     draw another line that accounts for the next most variation in the data; the lines don’t have to be perpendicular in the original space but are orthogonal
     linear combinations of the data
     Say PC1–3 capture ~30% of the variation in the data
     That’s not great, but maybe we can still see a signal of how the samples separate
     Can look at contributions of different OTUs in ordination space
Principal components analysis: don’t use this (it uses R² as a way of calculating similarity of samples and treats double zeros as similar, but shared absence of an OTU is not evidence of similarity, so it makes samples look more similar than they are)
Principal coordinates (PCoA) works from the distance matrix, so it only considers OTUs that are present
Non-metric multidimensional scaling (NMDS)
     – you provide the number of dimensions that you want, say 2 dimensions
     – will array your points in 2D so the distance between the points is proportional to the distance in the matrix
     – has a metric called stress, which describes how well the ordination reflects the distances from your beta diversity metric
     – a random process: moves the points around until they best reflect the input matrix
     stress: ideally less than 0.1, often it’s 0.2; report it either way, not sure what else you can do
     the positions of the points can shift when you run it multiple times, but the relationships stay the same
     A con is that people outside the field always ask what the axes mean
Can take Bray-Curtis distances, generate an ordination, generate a new distance matrix from the ordination, then calculate the R² between input and output to see how well the ordination reflects the input matrix; the output R² is always higher for NMDS than for PCoA
     About visualizing the data, at best you are looking at a distorted 2D view of the data
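The eigenvector-based ordination described above can be sketched in pure Python: classical PCoA double-centers the squared distance matrix and pulls out the leading axes, done here with power iteration and deflation (the function name and toy matrix are illustrative; real analyses would use Mothur’s pcoa command or a linear-algebra library):

```python
def pcoa(dist, axes=2, iters=500):
    """Classical PCoA on a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Gram matrix B = -1/2 * J * D^2 * J, with J the centering matrix
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * axes for _ in range(n)]
    for k in range(axes):
        v = [1.0 / (i + 1) for i in range(n)]      # arbitrary start vector
        for _ in range(iters):                      # power iteration
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))
        for i in range(n):
            coords[i][k] = v[i] * max(lam, 0.0) ** 0.5
        # deflate so the next pass finds the next-largest axis
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

For a Euclidean distance matrix the embedded points reproduce the input distances exactly, which is the best case; for ecological distances the 2D view is always somewhat distorted.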
Idea of the Enterotype
blood types for your gut
          Could be a soil type, leaf type
          Community types across the body
          A way to reduce the complexity of the data
          two methods that are commonly used:
                PAM: partitioning around the medoid
                     distance based
                DMM: Dirichlet multinomial mixture models
                     uses count data
                Found DMM better than PAM in the Schloss lab’s Nature paper on body-site community types
What are the OTUs that make these communities different?
Is OTU 1 different between groups?
Can do a t-test or ANOVA but non-parametric is better because we have tons of zeroes in our data
     We have inflated number of zeros
     have a large number of OTUs
     problem of multiple comparisons
     0.05 / (# of comparisons) – Bonferroni – is too strict
     Benjamini-Hochberg is another, less strict approach used in Mothur
    Nonparametric approaches:
     Kruskal-Wallis (in Mothur)
     Wilcoxon test (in Mothur)
     LEfSe (the Huttenhower method) (in Mothur)
          non-parametric test
          also does LDA (linear discriminant analysis) for effect size
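The Benjamini-Hochberg adjustment mentioned above is simple to implement: walk the sorted p-values from largest to smallest, scaling each by m/rank and enforcing monotonicity (function name and example p-values are illustrative):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR); less strict than Bonferroni,
    which would multiply every p-value by m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.50]
adj = benjamini_hochberg(pvals)
# Bonferroni would turn 0.039 into 0.195; BH leaves it near 0.05
```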
Screen OTUs
     -minimum average relative abundance, say minimum of 20 sequences per sample or 1%, or a minimum number of samples that the OTU shows up in
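The screening criteria above (minimum mean relative abundance, minimum prevalence) amount to a simple filter over the OTU table; a sketch with illustrative names, thresholds, and data layout:

```python
def screen_otus(table, min_mean_relabund=0.01, min_samples=2):
    """Keep OTUs meeting a minimum mean relative abundance and prevalence.
    `table` maps OTU name -> list of counts, one entry per sample."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(counts[s] for counts in table.values())
              for s in range(n_samples)]
    kept = {}
    for otu, counts in table.items():
        relabund = [c / t for c, t in zip(counts, totals)]
        present_in = sum(1 for c in counts if c > 0)
        if (sum(relabund) / n_samples >= min_mean_relabund
                and present_in >= min_samples):
            kept[otu] = counts
    return kept

table = {"otu1": [50, 60], "otu2": [1, 0], "otu3": [49, 40]}
kept = screen_otus(table)   # otu2 is too rare and too sparse, so it is dropped
```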
Problem with these approaches is that the OTUs are treated independently
We get a lot of people with zeros, a subset of people have a type of bacteria when most don’t e.g. colon cancer
Find Random Forest Models to be useful
     – Classification trees
     – Combine subsets of OTUs
     – Non-linear relationships
     – can be used to detect subsets
     – can combine metadata
     Are there markers that can be used to identify colon cancer or even IAV?
     Output can be quantitative or discrete
     Can make models for how much C. diff is expected at different time points in a mouse study
     Nice Gordon lab study of malnourished children: built a random forest model to predict age in healthy kids and calculated microbiome age in malnourished kids
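To make the random forest idea concrete, here is a deliberately tiny pure-Python stand-in: bootstrapped one-split "stump" trees with a random feature per tree, combined by majority vote (everything here is illustrative; a real analysis would use a proper implementation such as R's randomForest or scikit-learn, which grows full trees):

```python
import random

def train_stump_forest(X, y, n_trees=50, seed=1):
    """Toy forest of depth-1 trees, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap rows
        feat = rng.randrange(p)                       # random OTU feature
        thresh = X[rng.choice(boot)][feat]            # split point from the data
        left = [y[i] for i in boot if X[i][feat] <= thresh]
        right = [y[i] for i in boot if X[i][feat] > thresh]
        vote = lambda ys: max(set(ys), key=ys.count) if ys else y[boot[0]]
        stumps.append((feat, thresh, vote(left), vote(right)))
    return stumps

def predict(stumps, row):
    """Majority vote across all stumps in the forest."""
    votes = [(l if row[f] <= t else r) for f, t, l, r in stumps]
    return max(set(votes), key=votes.count)

# One OTU whose abundance cleanly separates two groups (0 = healthy, 1 = disease):
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]
stumps = train_stump_forest(X, y)
```

The point of the ensemble is that individual stumps are noisy, but combining many of them over bootstrap samples and random features captures non-linear relationships and subsets of OTUs that single-OTU tests miss.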
Started tutorial in SOP with Analysis
