Command-line Blast

Installing blast

You should already have installed blast during the software installation tutorial. If you need to reinstall it, copy and paste the command below:

sudo apt install ncbi-blast+

Getting data

We will download some cow and human proteins from RefSeq


Both files are compressed. Unlike the tar archives we encountered earlier, they are gzip files. To uncompress them:

gzip -d *.gz
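Note that gzip -d replaces each .gz file with its uncompressed version. If you only want to peek inside a compressed file, the -c option writes the decompressed data to standard output and leaves the file untouched. A minimal sketch, using a made-up file named toy_peek.faa rather than the RefSeq data:

```shell
# Made-up FASTA record, compressed so that there is something to peek into
printf '>seq1 hypothetical protein\nMKTAYIAKQR\n' > toy_peek.faa
gzip -f toy_peek.faa    # replaces toy_peek.faa with toy_peek.faa.gz

# -d decompresses, -c sends the result to standard output and keeps the .gz file
first_line=$(gzip -dc toy_peek.faa.gz | head -n 1)
echo "$first_line"
```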

Let us take a look at the human file

head human.1.protein.faa

Both files contain protein sequences in the FASTA format


How many sequences do I have in each file?
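One possible approach, sketched on a made-up file: in the FASTA format every record starts with a single header line beginning with >, so counting those lines counts the sequences. The file name toy_count.faa and its contents are invented for illustration; the same grep command should work on the human and cow files.

```shell
# Made-up FASTA file with two records (not real RefSeq data)
cat > toy_count.faa <<'EOF'
>seq1 hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein B
MSLLTEVETY
EOF

# -c prints the number of matching lines; '^>' matches header lines only
n_seqs=$(grep -c '^>' toy_count.faa)
echo "$n_seqs"
```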

The files are slightly too big for a first blast at the command line, so let's downsize the cow file

head -6 cow.1.protein.faa > cow.small.faa
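Because head counts lines and not sequences, the command above only yields complete records when the line count happens to end on a record boundary. As an alternative sketch, awk can count the header lines and keep the first N whole records. The file toy_cow.faa and its sequences are made up for illustration:

```shell
# Made-up FASTA records with different numbers of sequence lines
cat > toy_cow.faa <<'EOF'
>cow1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>cow2 hypothetical protein
MSLLTEVETY
>cow3 hypothetical protein
MDDIYKAAVE
QLTEEQKNEF
EOF

# Increment n at each header line, and print lines while we are in the first 2 records
awk '/^>/ {n++} n <= 2' toy_cow.faa > toy_cow.small.faa

n_small=$(grep -c '^>' toy_cow.small.faa)
echo "$n_small"
```

Changing the 2 in the awk condition selects a different number of records.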

Our first blast

Now we can blast these two cow sequences against the set of human sequences.

First we need to build a blast database with our human sequences

makeblastdb -in human.1.protein.faa -dbtype prot

The makeblastdb command produced several extra files. These are index files that blast needs in order to search the database.

Now we can run blast

blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt

We can look at the results using less

less cow_vs_human_blast_results.txt

To know about the various options that we can use with blastp:

blastp -help

and for easier reading

blastp -help | less


How could I modify the previous blast command to keep only the hits with an e-value of 1e-5 or lower?

Bigger dataset

Now that we succeeded using a small dataset of two proteins, let's try with a slightly bigger one.

head -199 cow.1.protein.faa > cow.medium.faa


How many protein sequences does cow.medium.faa contain?

We run blast again

blastp -query cow.medium.faa -db human.1.protein.faa \
    -out cow_vs_human_blast_results.tab -evalue 1e-5 \
    -outfmt 6 -max_target_seqs 1


What do -outfmt and -max_target_seqs do?
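As a hint, -outfmt 6 asks for a tab-separated table with one hit per line, and -max_target_seqs 1 limits the report to at most one database sequence per query. With the default settings the table has 12 columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. Such a table is easy to filter with standard tools. The sketch below uses two made-up rows in a file named toy_hits.tab, not real blast output, and keeps the hits with at least 50% identity:

```shell
# Two made-up rows in the 12-column tabular layout (not real results)
printf 'cowA\thumX\t91.5\t200\t17\t0\t1\t200\t1\t200\t1e-120\t380\n' > toy_hits.tab
printf 'cowB\thumY\t35.0\t80\t50\t2\t5\t84\t10\t89\t0.002\t40.1\n' >> toy_hits.tab

# Column 3 is pident and column 11 is evalue
# Print query, subject, percent identity and e-value for hits with pident >= 50
strong_hits=$(awk -F'\t' '$3 >= 50 {print $1, $2, $3, $11}' toy_hits.tab)
echo "$strong_hits"
```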