Metabarcoding
This tutorial is aimed at being a walkthrough of the DADA2 pipeline. It uses the data of the now famous MiSeq SOP by the Mothur authors but analyses the data using DADA2.
DADA2 is a relatively new method to analyse amplicon data which uses exact variants instead of OTUs.
The advantages of the DADA2 method is described in the paper
Before Starting
There are two ways to follow this tutorial: you can copy and paste all the codes blocks below in R directly, or you can download this document in the Rmarkdown format and execute the cells.
Install and Load Packages
First install DADA2 and other necessary packages
source('https://bioconductor.org/biocLite.R')
biocLite('dada2')
biocLite('phyloseq')
biocLite('DECIPHER')
install.packages('ggplot2')
install.packages('phangorn')
Now load the packages and verify you have the correct DADA2 version
library(dada2)
library(ggplot2)
library(phyloseq)
library(phangorn)
library(DECIPHER)
packageVersion('dada2')
Download the Data
You will also need to download the data, as well as the SILVA database
Warning
If you are following the tutorial on the website, the following block of commands has to be executed outside of R. If you run this tutorial with the R notebook, you can simply execute the cell block
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
rm -r __MACOSX/
cd MiSeq_SOP
wget https://zenodo.org/record/824551/files/silva_nr_v128_train_set.fa.gz
wget https://zenodo.org/record/824551/files/silva_species_assignment_v128.fa.gz
cd ..
Back in R, check that you have downloaded the data
path <- 'MiSeq_SOP'
list.files(path)
Filtering and Trimming
First we create two lists with the sorted name of the reads: one for the forward reads, one for the reverse reads
raw_forward <- sort(list.files(path, pattern="_R1_001.fastq",
full.names=TRUE))
raw_reverse <- sort(list.files(path, pattern="_R2_001.fastq",
full.names=TRUE))
# we also need the sample names
sample_names <- sapply(strsplit(basename(raw_forward), "_"),
`[`, # extracts the first element of a subset
1)
then we visualise the quality of our reads
plotQualityProfile(raw_forward[1:2])
plotQualityProfile(raw_reverse[1:2])
Question
What do you think of the read quality?
The forward reads are good quality (although dropping a bit at the end as usual) while the reverse are way worse.
Based on these profiles, we will truncate the forward reads at position 240 and the reverse reads at position 160 where the quality distribution crashes.
Note
in this tutorial we perform the trimming using DADA2's own functions. If you wish to do it outside of DADA2, you can refer to the Quality Control tutorial
Dada2 requires us to define the name of our output files
# place filtered files in filtered/ subdirectory
filtered_path <- file.path(path, "filtered")
filtered_forward <- file.path(filtered_path,
paste0(sample_names, "_R1_trimmed.fastq.gz"))
filtered_reverse <- file.path(filtered_path,
paste0(sample_names, "_R2_trimmed.fastq.gz"))
We’ll use standard filtering parameters: maxN=0
(DADA22 requires no Ns), truncQ=2
, rm.phix=TRUE
and maxEE=2
.
The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which according to the USEARCH authors is a better filter than simply averaging quality scores.
out <- filterAndTrim(raw_forward, filtered_forward, raw_reverse,
filtered_reverse, truncLen=c(240,160), maxN=0,
maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE,
multithread=TRUE)
head(out)
Learn the Error Rates
The DADA2 algorithm depends on a parametric error model and every amplicon dataset has a slightly different error rate.
The learnErrors
of Dada2 learns the error model from the data and will help DADA2 to fits its method to your data
errors_forward <- learnErrors(filtered_forward, multithread=TRUE)
errors_reverse <- learnErrors(filtered_reverse, multithread=TRUE)
then we visualise the estimated error rates
plotErrors(errors_forward, nominalQ=TRUE) +
theme_minimal()
Question
Do you think the error model fits your data correctly?
Dereplication
From the Dada2 documentation:
Dereplication combines all identical sequencing reads into into “unique sequences” with a corresponding “abundance”: the number of reads with that unique sequence. Dereplication substantially reduces computation time by eliminating redundant comparisons.
derep_forward <- derepFastq(filtered_forward, verbose=TRUE)
derep_reverse <- derepFastq(filtered_reverse, verbose=TRUE)
# name the derep-class objects by the sample names
names(derep_forward) <- sample_names
names(derep_reverse) <- sample_names
Sample inference
We are now ready to apply the core sequence-variant inference algorithm to the dereplicated data.
dada_forward <- dada(derep_forward, err=errors_forward, multithread=TRUE)
dada_reverse <- dada(derep_reverse, err=errors_reverse, multithread=TRUE)
# inspect the dada-class object
dada_forward[[1]]
The DADA2 algorithm inferred 128 real sequence variants from the 1979 unique sequences in the first sample.
Merge Paired-end Reads
Now that the reads are trimmed, dereplicated and error-corrected we can merge them together
merged_reads <- mergePairs(dada_forward, derep_forward, dada_reverse,
derep_reverse, verbose=TRUE)
# inspect the merger data.frame from the first sample
head(merged_reads[[1]])
Construct Sequence Table
We can now construct a sequence table of our mouse samples, a higher-resolution version of the OTU table produced by traditional methods.
seq_table <- makeSequenceTable(merged_reads)
dim(seq_table)
# inspect distribution of sequence lengths
table(nchar(getSequences(seq_table)))
Remove Chimeras
The dada
method used earlier removes substitutions and indel errors but chimeras remain.
We remove the chimeras with
seq_table_nochim <- removeBimeraDenovo(seq_table, method='consensus',
multithread=TRUE, verbose=TRUE)
dim(seq_table_nochim)
# which percentage of our reads did we keep?
sum(seq_table_nochim) / sum(seq_table)
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline
get_n <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dada_forward, get_n), sapply(merged_reads, get_n),
rowSums(seq_table), rowSums(seq_table_nochim))
colnames(track) <- c('input', 'filtered', 'denoised', 'merged', 'tabled',
'nonchim')
rownames(track) <- sample_names
head(track)
We kept the majority of our reads!
Assign Taxonomy
Now we assign taxonomy to our sequences using the SILVA database
taxa <- assignTaxonomy(seq_table_nochim,
'MiSeq_SOP/silva_nr_v128_train_set.fa.gz',
multithread=TRUE)
taxa <- addSpecies(taxa, 'MiSeq_SOP/silva_species_assignment_v128.fa.gz')
for inspecting the classification
taxa_print <- taxa # removing sequence rownames for display only
rownames(taxa_print) <- NULL
head(taxa_print)
Phylogenetic Tree
DADA2 is reference-free so we have to build the tree ourselves
We first align our sequences
sequences <- getSequences(seq_table)
names(sequences) <- sequences # this propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(sequences), anchor=NA)
Then we build a neighbour-joining tree then fit a maximum likelihood tree using the neighbour-joining tree as a starting point
phang_align <- phyDat(as(alignment, 'matrix'), type='DNA')
dm <- dist.ml(phang_align)
treeNJ <- NJ(dm) # note, tip order != sequence order
fit = pml(treeNJ, data=phang_align)
## negative edges length changed to 0!
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model='GTR', optInv=TRUE, optGamma=TRUE,
rearrangement = 'stochastic',
control = pml.control(trace = 0))
detach('package:phangorn', unload=TRUE)
Phyloseq
First load the metadata
sample_data <- read.table(
'https://hadrieng.github.io/tutorials/data/16S_metadata.txt',
header=TRUE, row.names="sample_name")
We can now construct a phyloseq object from our output and newly created metadata
physeq <- phyloseq(otu_table(seq_table_nochim, taxa_are_rows=FALSE),
sample_data(sample_data),
tax_table(taxa),
phy_tree(fitGTR$tree))
# remove mock sample
physeq <- prune_samples(sample_names(physeq) != 'Mock', physeq)
physeq
Let's look at the alpha diversity of our samples
plot_richness(physeq, x='day', measures=c('Shannon', 'Fisher'), color='when') +
theme_minimal()
No obvious differences. Let's look at ordination methods (beta diversity)
We can perform an MDS with euclidean distance (mathematically equivalent to a PCA)
ord <- ordinate(physeq, 'MDS', 'euclidean')
plot_ordination(physeq, ord, type='samples', color='when',
title='PCA of the samples from the MiSeq SOP') +
theme_minimal()
now with the Bray-Curtis distance
ord <- ordinate(physeq, 'NMDS', 'bray')
plot_ordination(physeq, ord, type='samples', color='when',
title='PCA of the samples from the MiSeq SOP') +
theme_minimal()
There we can see a clear difference between our samples.
Let us take a look a the distribution of the most abundant families
top20 <- names(sort(taxa_sums(physeq), decreasing=TRUE))[1:20]
physeq_top20 <- transform_sample_counts(physeq, function(OTU) OTU/sum(OTU))
physeq_top20 <- prune_taxa(top20, physeq_top20)
plot_bar(physeq_top20, x='day', fill='Family') +
facet_wrap(~when, scales='free_x') +
theme_minimal()
We can place them in a tree
bacteroidetes <- subset_taxa(physeq, Phylum %in% c('Bacteroidetes'))
plot_tree(bacteroidetes, ladderize='left', size='abundance',
color='when', label.tips='Family')