Here are my notes from day 3 of the Mothur Workshop taught by Pat Schloss (pdschloss at gmail.com) at a hotel conveniently located near the Detroit airport. Of all the bioinformatics workshops I have attended, this group of students was my favorite so far. We had a lot of fun having dinner together at some of the nearby restaurants. Contact Pat if you are interested in attending the next one in February. It’s really helpful, and you get a free Mothur sticker!

—Day 3 of the Mothur Workshop—

At the beginning of your study, think about what your **question** is and what your **hypothesis** is

On Wednesday we started with fasta files from the sequencer

Yesterday we ended up with a shared file (OTUs and their counts) and a consensus taxonomy file

We talked about OTUs and how to define them; using a 3% threshold generally gives you multiple OTUs per genus rather than multiple genera per OTU

The SOP is exactly what the Schloss lab uses, though they may play around with the reference database

Pat recommends updating the alignment databases when you start using Mothur at home

Common questions addressed by these analyses

__Diversity (also called alpha diversity)__

– combines evenness and richness

– functional

– variation

– number of things –> richness

– biodiversity

There are a number of papers in the microbiome literature that confuse richness with diversity

–> ignores the “names” (the identities of the taxa)

diversity can be the same in different habitats but contain very different organisms

alpha diversity describes a community

Steve Carpenter has a Science review paper, about 10 years old, on diversity and stability: roughly 1/3 of studies found positive relationships, 1/3 negative, 1/3 neutral

Quality of diversity matters, relates to function

breastfed babies have low diversity

How do we quantify diversity?

Rank abundance plot

x-axis OTUs

y-axis abundance

Shannon index

p_i = relative abundance of OTU i

H’ = −Σ p_i ln(p_i), analogous to entropy

increasing H’ means increasing diversity

What does this really mean?

Commonly used term
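The Shannon calculation above can be sketched in a few lines of Python (the function name is mine, not Mothur’s):

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over the nonzero OTU counts."""
    total = sum(counts)
    rel_abundances = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in rel_abundances)
```

A perfectly even community of S taxa gives H’ = ln(S); a single-taxon community gives 0.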

Another tool that Pat personally likes is the

Simpson Index

D = Σ p_i²

as D increases, diversity decreases

0 < D ≤ 1

1/D ≈ effective # of taxa

If richness is similar to 1/D, then you have a very even community

1/D << S_T (total observed richness): low evenness

1/D ≈ S_T: high evenness
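The Simpson quantities above, sketched in Python (names are mine):

```python
def simpson(counts):
    """Simpson's index D = sum(p_i^2): the probability that two random
    reads come from the same OTU. 1/D is the effective number of taxa."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)
```

For a perfectly even 4-OTU community, D = 0.25, so 1/D = 4, matching the richness.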

Non-parametric stats

Chao Index

Chao1 = S_obs + n1²/(2·n2), where S_obs is the observed richness, the second term estimates the unobserved richness, n1 is the number of singleton OTUs, and n2 is the number of doubleton OTUs

also ACE and jackknife

in Mothur
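The Chao1 estimator can be sketched as follows; the bias-corrected fallback when there are no doubletons is my addition, not from the notes:

```python
def chao1(counts):
    """Chao1 richness estimate: observed richness plus n1^2 / (2 * n2),
    an estimate of unobserved richness from singletons (n1) and doubletons (n2)."""
    s_obs = sum(1 for c in counts if c > 0)
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    if n2 == 0:
        # assumed bias-corrected fallback to avoid dividing by zero
        return s_obs + n1 * (n1 - 1) / 2
    return s_obs + n1 ** 2 / (2 * n2)
```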

Parametric approaches

try to fit species abundance data to normal

use lognormal

power series

log series, etc.

These transformations never fit the data well

Not in Mothur yet but will be

Rarefaction curves

collector’s curves

average of the collector’s curves –> rarefaction curve

x-axis: effort (number of sequences sampled)

Have rarefied to as low as 500 reads per sample

Don’t put your rarefaction curve in your papers

Measures how well you sample more abundant taxa

Sequence coverage = Good’s coverage = 100 × (1 − n1/N_T)%

where n1 is the number of singletons and N_T is the effort (total number of sequences generated from that sample)

OTU Coverage = 100(1-n1/Sobs)%

% of OTUs observed more than once
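Both coverage formulas, sketched in Python (function names are mine):

```python
def goods_coverage(counts):
    """Good's coverage = 100 * (1 - n1 / N_T): the % of reads whose OTU
    was seen more than once (n1 = singleton OTUs, N_T = total reads)."""
    n1 = sum(1 for c in counts if c == 1)
    return 100 * (1 - n1 / sum(counts))

def otu_coverage(counts):
    """OTU coverage = 100 * (1 - n1 / S_obs): % of observed OTUs seen more than once."""
    n1 = sum(1 for c in counts if c == 1)
    s_obs = sum(1 for c in counts if c > 0)
    return 100 * (1 - n1 / s_obs)
```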

__Phylogenetic diversity__

Build a tree of your sequences

Sum branch length across all the sequences in your sample

Also sensitive to sampling effort, need to rarefy

Rarefy

Randomly select a number of sequences
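The random subsampling step can be sketched like this; it illustrates the general idea, not Mothur’s implementation:

```python
import random

def rarefy(counts, depth, seed=1):
    """Subsample a community to a fixed depth by drawing reads without
    replacement; repeat and average to build a rarefaction curve."""
    rng = random.Random(seed)
    # expand counts into a pool of individual reads labeled by OTU index
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    sample = rng.sample(pool, depth)  # depth must not exceed total reads
    rarefied = [0] * len(counts)
    for otu in sample:
        rarefied[otu] += 1
    return rarefied
```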

– Stats for diversity

– Plot

– vs. time

– change

– line plot

– strip chart (good if you don’t have a lot of samples)

– box plot (not so good with few samples)

– Rarefaction curves

– Species abundance

– Rank abundance

Describe your community, what’s there

Beta diversity

– Pie chart

– Stacked bar chart

Why these are bad

Pie charts

– hard to visually compare areas in a pie chart

– also too many taxa for color palette to make pretty or useful

– not possible to see variation in data

– better to do a strip chart of that population

Beta diversity

– Uses “names”

– membership – compares the list of names

– structure – compares “names” and abundance

Comparing bird lists is a comparison of membership

Structure also uses abundance: e.g., you’ve seen an eagle 200 times

Jaccard index = S_AB / (S_A + S_B − S_AB)

S_A, S_B = richness of sample A or sample B

S_AB = richness that is shared

sensitive to incomplete sampling and uneven sampling
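The Jaccard index from presence/absence, sketched in Python:

```python
def jaccard(counts_a, counts_b):
    """Jaccard similarity = S_AB / (S_A + S_B - S_AB), using only
    presence/absence of each OTU (the "names", not the abundances)."""
    present_a = {i for i, c in enumerate(counts_a) if c > 0}
    present_b = {i for i, c in enumerate(counts_b) if c > 0}
    shared = len(present_a & present_b)
    return shared / (len(present_a) + len(present_b) - shared)
```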

Other similar metrics include Sørensen

Unweighted UniFrac is a similar kind of measure of membership

has similar limitations

Pat prefers to randomly draw n sequences from each sample, calculate the Jaccard index, repeat 100 times, and take the average J

The Schloss lab uses this as a measure of beta diversity

Preferred is to use structure based approaches

such as Bray-Curtis, Morisita-Horn, and Theta YC

__Bray-Curtis__

BC = 1 − 2·Σ min(S_Ai, S_Bi) / (Σ S_Ai + Σ S_Bi)

Sensitive to sampling effort, uneven samples (number of sequences)
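A direct transcription of the Bray-Curtis formula (assumes both samples use the same OTU ordering):

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).
    0 for identical samples, 1 for samples sharing no OTUs."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))
```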

__Morisita-Horn__

MH = 1 − 2·Σ(p_Ai · p_Bi) / (Σ p_Ai² + Σ p_Bi²)

emphasizes more abundant bugs
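The Morisita-Horn formula as code; because it works on relative abundances, two samples with proportionally identical communities give a distance of 0:

```python
def morisita_horn(counts_a, counts_b):
    """Morisita-Horn dissimilarity:
    1 - 2 * sum(p_Ai * p_Bi) / (sum(p_Ai^2) + sum(p_Bi^2))."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    p_a = [c / total_a for c in counts_a]
    p_b = [c / total_b for c in counts_b]
    numerator = 2 * sum(x * y for x, y in zip(p_a, p_b))
    denominator = sum(x * x for x in p_a) + sum(y * y for y in p_b)
    return 1 - numerator / denominator
```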

__Phylogenetic approaches__and what they seek to do

Summation of all the branch lengths of a tree that contains all the sequences from a sample

UniFrac tries to say what fraction of the tree is unique to each type of sample

UniFrac distance = Σ(unique branch length) / Σ(total branch length)

Does not take into account the abundance of individual reads

Weighted UniFrac is similar, but instead of all-or-nothing it uses relative abundance to weight the branches

Different approach from bin-based methods by using tree as input

What does sharing branch length mean?

Can’t get at function with this, although some hope we can

another way of looking at Beta Diversity

Pat’s paper on evaluating the structure of different microbial communities

Lean and obese people, build a tree to compare them, use ordination

How to tell if these differences are statistically significant?

parsimony

UniFrac

– get distance

– randomize labels

– recalculate distance
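The randomize-the-labels procedure above generalizes to any statistic; a sketch (function and argument names are mine):

```python
import random

def permutation_p(group_a, group_b, stat, n_perm=1000, seed=1):
    """Label-shuffling significance test: compare the observed statistic
    to its distribution under random relabelings of the samples."""
    rng = random.Random(seed)
    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # re-split the shuffled pool and recalculate the statistic
        if stat(pooled[:n_a], pooled[n_a:]) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

Here `stat` could be, for example, the difference between group mean distances.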

AMOVA (Permanova)

– non-parametric analysis of variance

– example, does average height differ between different genders

– Do clouds in ordination have the same center?

HOMOVA

– tests for homogeneity of variance

– Bartlett’s test

– Is the spread different between the two groups?

**Ordination**

In the case of geography, takes 3D representation and makes it 2D

Introduces a lot of distortions into the data, for example, map of the US

Microbial ecology data is prone to a lot of distortions

– Eigenvalue / Vector-based

data-fitting procedure: try to fit lines through your data

you’ve got a big cloud of points; try to draw a line through the cloud and say that this line accounts for most of the variation in the data

then draw another line, orthogonal to the first, that accounts for the next-largest amount of variation in the data

the axes are linear combinations of the data

Say PC1-3 account for ~30% of the variation in the data

That sucks but maybe we can see signal of how the samples are separating

Can look at contributions of different OTUs in ordination space

Principal components: don’t use this (uses R² as the measure of similarity between samples, which treats double zeros as evidence of similarity; an OTU being absent from both samples doesn’t make them alike, so the zeros make things look more similar than they are)

Principal coordinates (PCoA) instead works only with the OTUs that are present
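A bare-bones classical PCoA with NumPy, via the standard double-centering of the squared distance matrix; this is the textbook construction, not Mothur’s code:

```python
import numpy as np

def pcoa(dist):
    """Classical PCoA: double-center the squared distance matrix and
    take its eigenvectors (scaled by sqrt eigenvalue) as ordination axes."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering: B = -1/2 * J D^2 J, with J the centering matrix
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1]          # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    coords = vecs * np.sqrt(np.clip(vals, 0, None))
    return coords, vals
```

For an exact Euclidean distance matrix, PCoA recovers the original configuration up to rotation and reflection.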

NMDS

– non-metric multidimensional scaling

– you provide the number of dimensions that you want

say 2 dimensions

– will array your points in 2D so the distance between the points is proportional to the distance in the matrix. Has a metric called stress, which describes how well this ordination reflects the distances from your beta-diversity metric. It’s a random process: the points are moved around until they best reflect the input matrix

stress: ideally less than 0.1, often it’s 0.2, should report it, not sure what else you can do

the position of the points can shift when you run it multiple times but the relationships stay the same

A con is that people outside the field always ask what the axes mean

Can take Bray-Curtis distances, generate an ordination, generate a new distance matrix from the ordination, then calculate the R² between input and output: how well does the ordination reflect the input matrix? The output is always higher for NMDS than for PCoA

Biplot

**corr.axes**

About visualizing the data, at best you are looking at a distorted 2D view of the data

**get.communitytype**

Idea of the *enterotype*: like blood types, but for your gut

Could be a soil type, leaf type

Community types across the body

A way to reduce the complexity of the data

two methods that are commonly used:

PAM: partitioning around the medoid

distance based

DMM: Dirichlet multinomial mixture models

uses count data

Found DMM is better than PAM in Schloss paper on body community types in Nature

What are the OTUs that make these communities different?

Is OTU 1 different between groups?

Can do a t-test or ANOVA but non-parametric is better because we have tons of zeroes in our data

We have inflated number of zeros

have a large number of OTUs

problem of multiple comparisons

0.05 / (# of comparisons), i.e. Bonferroni, is too strict

Benjamini-Hochberg is another, less strict approach; used in Mothur
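A sketch of the Benjamini-Hochberg adjustment (this computes adjusted p-values; Mothur’s exact implementation may differ):

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values: p_(i) * m / i for the i-th
    smallest p-value, with a cumulative-minimum pass from the largest
    rank down so the adjusted values stay monotone."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```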

Nonparametric approaches:

Metastats

Kruskal-Wallis (in Mothur)

Wilcoxon test (in Mothur)

LEfSe (Huttenhower method) (in Mothur)

– non-parametric test

– also does LDA (linear discriminant analysis) for effect size

Screen OTUs

-minimum average relative abundance, say minimum of 20 sequences per sample or 1%, or a minimum number of samples that the OTU shows up in

Problem with these approaches is that the OTUs are treated independently

We get a lot of people with zeros, a subset of people have a type of bacteria when most don’t e.g. colon cancer

Pat finds **random forest models** (classification trees) to be useful

– Combine subsets of OTUs

– Non-linear relationships

– can be used to detect subsets

– can combine metadata

Are there markers that can be used to identify colon cancer or even IAV?

Output can be quantitative or discrete

Can make models for how much C. diff is expected at different time points in a mouse study

Nice Gordon lab study of malnourished children: built a random forest model to predict age in healthy kids and calculated microbiome age in malnourished kids
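A toy sketch of the random-forest idea with scikit-learn; the OTU matrix, labels, and values below are entirely made up for illustration:

```python
# Classify samples (e.g. lean vs. obese) from an OTU count matrix
# with a random forest. All data here are hypothetical.
from sklearn.ensemble import RandomForestClassifier

# rows = samples, columns = OTU counts (made-up numbers)
X = [
    [30, 0, 5, 10],
    [25, 1, 8, 12],
    [2, 40, 0, 1],
    [0, 35, 3, 0],
]
y = ["lean", "lean", "obese", "obese"]

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)
# feature_importances_ ranks which OTUs drive the classification
importances = model.feature_importances_
```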

Started the tutorial in the SOP with the **Analysis** section