Metagenome assembly and binning

In this tutorial you'll learn how to inspect and assemble metagenomic data, and how to retrieve draft genomes from assembled metagenomes.

We'll use a mock community of 20 bacteria, presented as if it had been sequenced on an Illumina HiSeq. In reality, the reads were simulated using InSilicoSeq.

The 20 bacteria in the dataset were selected from the Tara Oceans study, which recovered 957 distinct, previously unknown metagenome-assembled genomes (MAGs). The full list is available on figshare.

Getting the Data

mkdir -p ~/data
cd ~/data
curl -O -J -L https://osf.io/th9z6/download
curl -O -J -L https://osf.io/k6vme/download
chmod -w tara_reads_R*

Quality Control

We'll use FastQC to check the quality of our data, and sickle to trim the low-quality parts of the reads. If you need a refresher on how and why to check the quality of sequence data, see the Quality Control and Trimming tutorial.

mkdir -p ~/results
cd ~/results
ln -s ~/data/tara_reads_* .
fastqc tara_reads_*.fastq.gz

Question

What is the average read length? The average quality?

Question

Compared to single genome sequencing, what graphs differ?
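
As a cross-check on the FastQC report, the mean read length can also be computed directly from the FASTQ file. The following is a minimal sketch, not part of the original tutorial: the function name `mean_read_length` is made up for illustration, and it assumes the standard four-line FASTQ record layout.

```shell
# Mean read length of a FASTQ file (gzipped or plain text).
# Assumption: four lines per record, with the sequence on line 2 of each record.
# "gzip -cdf" decompresses gzipped input and passes plain text through unchanged.
mean_read_length() {
    gzip -cdf "$1" | awk '
        NR % 4 == 2 { total += length($0); n++ }
        END {
            if (n > 0) printf "%.1f\n", total / n
            else print "0.0"
        }'
}
```

For example, `mean_read_length tara_reads_R1.fastq.gz` prints a single number that should agree with the sequence length reported in the Basic Statistics module of FastQC.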

Now we'll trim the reads using sickle:

sickle pe -f tara_reads_R1.fastq.gz -r tara_reads_R2.fastq.gz -t sanger \
    -o tara_trimmed_R1.fastq -p tara_trimmed_R2.fastq -s /dev/null

Question

How many reads were trimmed?
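
sickle prints its own summary of kept and discarded records when it finishes. If that output has scrolled away, one way to recover part of the answer is to count the records before and after trimming. The sketch below uses two hypothetical helper names, `count_reads` and `reads_dropped`, and assumes four lines per FASTQ record. Note that it only counts reads that were discarded entirely, not reads that were merely shortened.

```shell
# Number of records in a FASTQ file (gzipped or plain text).
# Assumption: four lines per record.
count_reads() {
    gzip -cdf "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Difference in record count between an untrimmed and a trimmed file.
# $1: file before trimming, $2: file after trimming
reads_dropped() {
    echo $(( $(count_reads "$1") - $(count_reads "$2") ))
}
```

For example, `reads_dropped tara_reads_R1.fastq.gz tara_trimmed_R1.fastq` reports how many forward reads no longer appear in the paired output.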

Assembly

Megahit will be used for the assembly.

megahit -1 tara_trimmed_R1.fastq -2 tara_trimmed_R2.fastq -o tara_assembly

The resulting assembly can be found under tara_assembly/final.contigs.fa.

Question

How many contigs does this assembly contain?
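
Each contig in a FASTA file starts with a `>` header line, so counting headers counts contigs. The sketch below goes one small step further and also reports the total and longest contig length. The function name `contig_stats` is an illustrative choice, not something the tutorial or MEGAHIT provides.

```shell
# Basic statistics for a FASTA file: number of sequences, total length, longest sequence.
# Handles sequences wrapped over several lines.
contig_stats() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; n++; next }   # new record: close the previous one
             { len += length($0); total += length($0) }         # sequence line
        END  {
            if (len > max) max = len                            # close the last record
            printf "contigs\t%d\ntotal_bp\t%d\nlongest_bp\t%d\n", n, total, max
        }' "$1"
}
```

Running `contig_stats tara_assembly/final.contigs.fa` prints three tab-separated lines. The first one answers the question above, and `grep -c '>' tara_assembly/final.contigs.fa` should give the same count.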

Binning

First we need to map the reads back against the assembly to get coverage information:

ln -s tara_assembly/final.contigs.fa .
bowtie2-build final.contigs.fa final.contigs
bowtie2 -x final.contigs -1 tara_reads_R1.fastq.gz -2 tara_reads_R2.fastq.gz | \
    samtools view -bS -o tara_to_sort.bam
samtools sort tara_to_sort.bam -o tara.bam
samtools index tara.bam
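
Before binning, it can be worth checking how much of the data actually mapped back to the contigs, since binning relies on this coverage signal. `samtools flagstat tara.bam` gives a full summary. As a lighter sketch, the hypothetical helper below (`mapped_fraction`, a name chosen here for illustration) reads the four-column output of `samtools idxstats` (reference name, length, mapped read-segments, unmapped read-segments) on standard input and prints the percentage of mapped read-segments.

```shell
# Percentage of mapped read-segments, computed from "samtools idxstats" output on stdin.
# Columns assumed: name <TAB> length <TAB> mapped <TAB> unmapped
mapped_fraction() {
    awk -F '\t' '
        { mapped += $3; unmapped += $4 }
        END {
            total = mapped + unmapped
            if (total > 0) printf "%.2f\n", 100 * mapped / total
            else print "NA"
        }'
}
```

A typical call would be `samtools idxstats tara.bam | mapped_fraction`. This needs the index created by `samtools index` above.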

Then we run MetaBAT:

runMetaBat.sh -m 1500 final.contigs.fa tara.bam
mv final.contigs.fa.metabat-bins1500 metabat

Question

How many bins did we obtain?
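
Each bin is written as a separate FASTA file in the metabat directory, so listing the .fa files there answers the question. The sketch below also prints how many contigs each bin contains. The name `summarise_bins` is invented for this example, and the .fa extension is taken from the `-x fa` option used with checkm later in this tutorial.

```shell
# Print "<bin file> <TAB> <number of contigs>" for every .fa file in a directory,
# followed by the total number of bins.
# The body runs in a subshell, so the helper variables do not leak into your session.
summarise_bins() (
    n=0
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
        n=$((n + 1))
    done
    printf 'bins\t%d\n' "$n"
)
```

Running `summarise_bins metabat` from ~/results lists one line per bin and ends with the total.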

Checking the quality of the bins

The first time you run CheckM, you have to tell it where its reference database lives:

sudo checkm data setRoot ~/.local/data/checkm

checkm lineage_wf -x fa metabat checkm/
checkm bin_qa_plot -x fa checkm metabat plots

Question

Which bins should we keep for downstream analysis?
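
One common way to approach this question is to filter the CheckM results on completeness and contamination. The sketch below assumes you have saved the results as a tab-separated table with a header row containing columns named Completeness and Contamination, for instance with `checkm qa --tab_table -f checkm_qa.tsv checkm/lineage.ms checkm` (check `checkm qa -h` for the options available in your version). The function name `filter_bins`, the file name checkm_qa.tsv, and the default thresholds of 90% completeness and 5% contamination are illustrative assumptions, not recommendations made by this tutorial.

```shell
# List bins that pass completeness/contamination thresholds.
# $1: tab-separated table with a header row (assumed columns: Completeness, Contamination)
# $2: minimum completeness in percent   (illustrative default: 90)
# $3: maximum contamination in percent  (illustrative default: 5)
filter_bins() {
    awk -F '\t' -v min_comp="${2:-90}" -v max_cont="${3:-5}" '
        NR == 1 {
            # Locate the columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) {
                print "Completeness/Contamination column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 {
            print $1 "\t" $comp "\t" $cont
        }' "$1"
}
```

For example, `filter_bins checkm_qa.tsv 50 10` applies looser thresholds than the defaults. Each output line holds the first column of the table (the bin identifier) followed by its completeness and contamination.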

Note

CheckM can plot a lot of metrics. If you have time, check the manual and try to produce different plots.

Warning

If checkm fails at the phylogeny step, it is likely that your VM doesn't have enough RAM. pplacer requires about 35 GB of RAM to place the bins in the tree of life.

In that case, download the precomputed results and run the quality assessment on them:

cd ~/results
curl -O -J -L https://osf.io/xuzhn/download
tar xzf checkm.tar.gz
checkm qa checkm/lineage.ms checkm

Then plot the completeness:

checkm bin_qa_plot -x fa checkm metabat plots

Finally, take a look at plots/bin_qa_plot.png.
