Introduction to Nanopore Sequencing
In this tutorial we will assemble the E. coli genome using a mix of long, error-prone reads from the MinION (Oxford Nanopore) and short reads from a HiSeq instrument (Illumina).
The MinION data used in this tutorial come from a test run by the Loman lab.
The Illumina data were simulated using InSilicoSeq.
Get the Data
First download the nanopore data:
wget http://s3.climb.ac.uk/nanopore/ecoli_allreads.fasta
You will not need the HiSeq data right away, but you can start the download in another window:
curl -O -J -L https://osf.io/pxk7f/download
curl -O -J -L https://osf.io/zax3c/download
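An interrupted download is easy to miss and tends to cause confusing errors later. The following is a minimal sketch, not part of the original tutorial, of a check that each file exists, is non-empty and, for gzip-compressed files, passes `gzip -t`. The helper name `check_download` is hypothetical, and the HiSeq file names in the usage note are the ones used in the mapping step later in this tutorial; adjust them if your downloads are named differently.

```shell
# Hypothetical helper: report whether each downloaded file exists, is
# non-empty and, for .gz files, passes the gzip integrity test.
# Returns a non-zero status if any file fails a check.
check_download() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING  $f"; status=1
        elif [ "${f%.gz}" != "$f" ] && ! gzip -t "$f" 2>/dev/null; then
            echo "CORRUPT  $f"; status=1
        else
            echo "OK       $f"
        fi
    done
    return "$status"
}
```

Once the downloads have finished, it could be called as `check_download ecoli_allreads.fasta ecoli_hiseq_R1.fastq.gz ecoli_hiseq_R2.fastq.gz`.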
Look at the basic statistics of the nanopore reads:
assembly-stats ecoli_allreads.fasta
Question
How many nanopore reads do we have?
Question
How long is the longest read?
Question
What is the average read length?
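If you want to cross-check the numbers reported by assembly-stats, the same three values can be computed with plain awk. This is a minimal sketch under the assumption that the input is a standard FASTA file (header lines start with `>`); the function name `fasta_stats` is hypothetical and not part of any tool used in this tutorial.

```shell
# Hypothetical helper: print the number of reads, the longest read and the
# mean read length of a FASTA file. Sequences wrapped over several lines
# are handled by summing line lengths until the next header.
fasta_stats() {
    awk '
        function record() { n++; total += len; if (len > max) max = len }
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        { len += length($0) }
        END {
            if (seen) record()
            printf "reads\t%d\nlongest\t%d\nmean\t%.1f\n", n, max, (n ? total / n : 0)
        }
    ' "$1"
}
```

Running `fasta_stats ecoli_allreads.fasta` should give values that agree with the assembly-stats output.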
Adapter trimming
The Guppy basecaller, i.e. the program that transforms the raw electrical signal into FASTQ files, already demultiplexes the reads and trims the adapters for us.
Assembly
We assemble a subset of the reads (the first 20,000 lines of the FASTA file) using wtdbg2 (version > 2.3):
head -n 20000 ecoli_allreads.fasta > subset.fasta
wtdbg2 -x ont -i subset.fasta -fo assembly
wtpoa-cns -i assembly.ctg.lay.gz -fo assembly.ctg.fa
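Note that `head -n 20000` counts lines, not reads: it yields 10,000 complete reads only if every record is exactly one header line followed by one sequence line. If a FASTA file wraps sequences over several lines, the last read of the subset may be cut short. A minimal sketch of a record-aware alternative is shown here; the function name `first_n_reads` is hypothetical.

```shell
# Hypothetical helper: print the first N complete FASTA records of a file,
# regardless of how many lines each sequence spans.
# Usage: first_n_reads N reads.fasta > subset.fasta
first_n_reads() {
    awk -v n="$1" '
        /^>/ { if (++count > n) exit }   # stop at the header of record N+1
        { print }
    ' "$2"
}
```

For example, `first_n_reads 10000 ecoli_allreads.fasta > subset.fasta` would write the first 10,000 reads.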
Polishing
Since the assembly likely contains a lot of errors, we correct it with Illumina reads.
First we map the short reads against the assembly:
bowtie2-build assembly.ctg.fa assembly
bowtie2 -x assembly -1 ecoli_hiseq_R1.fastq.gz -2 ecoli_hiseq_R2.fastq.gz | \
samtools view -bS -o assembly_short_reads.bam -
samtools sort assembly_short_reads.bam -o assembly_short_sorted.bam
samtools index assembly_short_sorted.bam
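Before polishing, it can be worth checking how many of the short reads actually aligned to the draft assembly. `samtools flagstat` reports this directly; as an illustration of what it counts, here is a minimal sketch that reads SAM records on standard input and tests the "unmapped" bit (0x4) of the FLAG column. The function name `sam_mapped_fraction` is hypothetical.

```shell
# Hypothetical helper: read SAM records on stdin and report how many have
# the "segment unmapped" flag bit (0x4) unset, i.e. are mapped.
sam_mapped_fraction() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        { total++; if (int($2 / 4) % 2 == 0) mapped++ } # bit 0x4 unset
        END {
            printf "%d of %d records mapped (%.1f%%)\n",
                   mapped, total, (total ? 100 * mapped / total : 0)
        }
    '
}
```

It could be used as `samtools view assembly_short_sorted.bam | sam_mapped_fraction`.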
Then we run the consensus step:
samtools view assembly_short_sorted.bam | wtpoa-cns -t 16 -x sam-sr \
-d assembly.ctg.fa -i - -fo assembly_polished.fasta
This corrects any mismatches in our assembly and writes the improved assembly to assembly_polished.fasta.
For better results we should perform more than one round of polishing.
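Several rounds of polishing can be expressed as a loop in which each round maps the short reads against the output of the previous one. The following is a sketch only, under several assumptions: the function name `polish_rounds` and the output names `polish_roundN.fasta` and `polish_idxN` are hypothetical, and the mapping, sorting and consensus steps are chained into a single pipe (using `samtools sort -O SAM`) instead of writing intermediate BAM files as above.

```shell
# Hypothetical helper: polish DRAFT with paired short reads for ROUNDS rounds.
# Usage: polish_rounds DRAFT R1 R2 ROUNDS
# Writes polish_round1.fasta, polish_round2.fasta, ... (assumed names) and
# prints the name of the final polished assembly.
polish_rounds() {
    draft=$1; r1=$2; r2=$3; rounds=$4
    i=1
    while [ "$i" -le "$rounds" ]; do
        out="polish_round${i}.fasta"
        # Index the current draft and map the short reads against it
        bowtie2-build "$draft" "polish_idx${i}" >/dev/null || return 1
        bowtie2 -x "polish_idx${i}" -1 "$r1" -2 "$r2" |
            samtools sort -O SAM - |
            wtpoa-cns -t 16 -x sam-sr -d "$draft" -i - -fo "$out" || return 1
        draft=$out          # the next round starts from the polished contigs
        i=$((i + 1))
    done
    echo "$draft"
}
```

For two rounds starting from the unpolished contigs, one could run `polish_rounds assembly.ctg.fa ecoli_hiseq_R1.fastq.gz ecoli_hiseq_R2.fastq.gz 2`. Whether additional rounds still change the assembly is best judged by comparing the outputs of successive rounds.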
Compare with the existing assembly and an Illumina-only assembly
an existing assembly
Go to https://www.ncbi.nlm.nih.gov and search for NC_000913.
Download the associated genome in FASTA format and rename it to ecoli_ref.fasta.
nucmer --maxmatch -c 100 -p ecoli assembly_polished.fasta ecoli_ref.fasta
mummerplot --fat --filter --png --large -p ecoli ecoli.delta
Then take a look at ecoli.png.
compare metrics
Note
First you need to assemble the Illumina data.
Then run BUSCO and QUAST on the three assemblies (the polished nanopore assembly, the reference, and the Illumina-only assembly).
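One of the contiguity metrics QUAST reports is N50: the contig length such that contigs of that length or longer contain at least half of all assembled bases. To make the definition concrete, here is a minimal sketch that computes it from a FASTA file with standard tools; the function name `n50` is hypothetical and this is not how QUAST itself is implemented.

```shell
# Hypothetical helper: compute the N50 of the sequences in a FASTA file.
n50() {
    # 1) print one length per sequence, 2) sort longest first,
    # 3) walk down the list until half of the total length is covered
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
        sort -rn |
        awk '{ lens[NR] = $1; total += $1 }
             END {
                 for (i = 1; i <= NR; i++) {
                     run += lens[i]
                     if (run * 2 >= total) { print lens[i]; exit }
                 }
             }'
}
```

For example, `n50 assembly_polished.fasta` and `n50 ecoli_ref.fasta` can be compared side by side with the QUAST report.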
Question
Which assembly would you say is the best?
Annotation
If you have time, train your annotation skills by running prokka on your genome!
prokka --outdir annotation --kingdom Bacteria assembly_polished.fasta
You can open the output to see how it went:
cat annotation/*.txt
Question
Does it fit your expectations? How many genes were you expecting?
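To compare the annotation with your expectations, you can also count features directly in the GFF file that prokka writes to the output directory. In GFF3 the feature type is stored in the third tab-separated column. A minimal sketch, with the hypothetical function name `count_features`:

```shell
# Hypothetical helper: count features of a given type (third GFF column).
# Comment lines are skipped; the sequence lines after ##FASTA contain no
# tabs, so they never match.
# Usage: count_features annotation.gff CDS
count_features() {
    awk -F '\t' -v type="$2" '
        !/^#/ && $3 == type { n++ }
        END { print n + 0 }
    ' "$1"
}
```

Pass it the path of the `.gff` file in the `annotation` directory together with a feature type such as `CDS` or `tRNA`.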