Installing software

Bioinformatics is a relatively new (it's younger than Erik!) and fast-progressing field, so new software and new versions of existing software are released on a regular basis.

During this course, as well as during your future career as a bioinformatician ( ;-) ), you will often have to install new software on UNIX platforms (i.e. the server you are using at the moment).

Compiled and Interpreted languages

Programming languages, in the bioinformatics world and in general, can be separated into two categories: interpreted languages and compiled languages. With interpreted languages you write scripts and execute them directly (as we saw with the bash scripts during the UNIX lesson). Compiled languages are different: an extra step is required.
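As a minimal sketch of the interpreted workflow, the snippet below writes a tiny bash script and hands it straight to the interpreter, with no build step in between. The file name hello.sh and the temporary directory are made up for illustration.

```shell
# Work in a throwaway directory so nothing is left behind
workdir=$(mktemp -d)

# Write a two-line script (hello.sh is a hypothetical name)
cat > "$workdir/hello.sh" << 'EOF'
#!/bin/bash
echo "Hello from a script"
EOF

# Run it: bash reads the source code directly and executes it
script_output=$(bash "$workdir/hello.sh")
echo "$script_output"

# Clean up
rm -r "$workdir"
```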

Compilation

According to Wikipedia, compilation is the translation of source code into object code by a compiler.

That's right. The extra step required by compiled languages is translating the source code, that is the lines of code the programmer(s) wrote, into a language that your computer understands better, usually binary machine code (1s and 0s).
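To make the extra step concrete, here is a small sketch that writes a C program, compiles it and runs the resulting binary. The file names are made up, and the snippet assumes gcc may not be installed yet (we install it later in this lesson), in which case it simply skips the compilation.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)

# A one-function C program (hello.c is a hypothetical file name)
cat > "$workdir/hello.c" << 'EOF'
#include <stdio.h>

int main(void) {
    printf("Hello from a binary\n");
    return 0;
}
EOF

# The extra step: translate the source code into an executable
if command -v gcc > /dev/null 2>&1 && gcc -o "$workdir/hello" "$workdir/hello.c" 2> /dev/null; then
    compile_status="compiled"
    # Run the binary that the compiler produced
    "$workdir/hello"
else
    # No working gcc on this machine yet
    compile_status="skipped"
    echo "no working gcc found, nothing was compiled"
fi

# Clean up
rm -r "$workdir"
```

Note that the source file hello.c cannot be executed itself: only the file produced by gcc can.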

The big advantage of compiled languages is that the resulting programs usually run much faster than those written in interpreted languages. However, programming in them is usually slower and more difficult. Choosing one for a software project is a trade-off between development time and how much faster your software could run if it were written in a compiled language.

One of the most popular compiled languages is C, in which Linux is mainly written.

Package Managers

All modern Linux distributions come with a package manager, i.e. a tool that automates the installation of software. In most cases the package manager downloads already compiled binaries and installs them on your system. We'll see how it works in a moment.

Let us install our first package!

The package manager for Ubuntu is called APT. As with most package managers, its syntax looks like this:

[package_manager] [action] [package_name]

We'll use apt to install a local version of ncbi-blast, which you've used previously.

First we check whether the package is available:

apt search ncbi-blast

There seem to be two versions of it. The legacy version is probably outdated, so let us investigate the other one:

apt show ncbi-blast+

It seems to be what we are looking for, so we install it with:

apt install ncbi-blast+

Question

Did it work? What could have been wrong?

You should have gotten an error message asking if you are root.
The user root is the most powerful user in a Linux system and usually has extra rights that a regular user does not have.
To install software in the default system location with apt, you need those special permissions.
We can "borrow" those permissions from root by prefixing our command with sudo.

sudo apt install ncbi-blast+
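If you are unsure whether you are currently running as root or as a regular user, a quick check along these lines can help. It is only a sketch that relies on the standard id command; the variable name is arbitrary.

```shell
# id -u prints the numeric user ID; root always has ID 0
if [ "$(id -u)" -eq 0 ]; then
    user_kind="root"
else
    user_kind="regular user"
fi
echo "You are running as: $user_kind"
```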

Now if you execute

blastn -help

it should print the (rather long) help message of the blastn command.
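A lighter way to check that an installation worked is to ask the shell whether it can find the command at all. The helper below is a sketch; is_installed is a made-up function name, not part of apt or BLAST.

```shell
# Return success if the given command can be found, failure otherwise
is_installed() {
    command -v "$1" > /dev/null 2>&1
}

# Check two of the BLAST executables
for tool in blastn blastp; do
    if is_installed "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```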

Question

Why does BLAST have different executables?
What is the difference between blastn and blastp?

Downloading and unpacking

Although most popular software can be installed with your distribution's package manager, sometimes (especially in some fast-growing areas of bioinformatics) the software you want isn't available through a package manager.

We'll install spades, a popular genome assembly tool. Let's imagine it is not available in the apt sources. We'd have to download the source code, install its dependencies and compile it ourselves.

That is quite cumbersome, especially the compilation. Luckily, it is fairly common for developers to make Linux binaries, that is compiled versions of the software, already available for download.

First let us create a directory for all our future installs:

mkdir -p ~/install
cd ~/install

The spades binaries are available on their website, http://cab.spbu.ru/software/spades/

Download them with

wget http://cab.spbu.ru/files/release3.11.1/SPAdes-3.11.1-Linux.tar.gz

and uncompress

tar xvf SPAdes-3.11.1-Linux.tar.gz
cd SPAdes-3.11.1-Linux/bin/

and now if we execute spades.py

./spades.py

we get the help of the spades assembler!
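Before unpacking an archive downloaded from the internet, it can be useful to look at what is inside it. The sketch below builds a small throwaway archive (demo.tar.gz, a made-up name) so that it runs without any download; the same tar tf and tar xf calls should apply to the SPAdes archive.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)

# Build a tiny archive to experiment with
mkdir -p "$workdir/demo/bin"
echo "placeholder" > "$workdir/demo/bin/tool.txt"
tar czf "$workdir/demo.tar.gz" -C "$workdir" demo

# t = list the contents without extracting anything
archive_listing=$(tar tf "$workdir/demo.tar.gz")
echo "$archive_listing"

# x = extract, f = read from the given file, -C = where to unpack
mkdir "$workdir/unpacked"
tar xf "$workdir/demo.tar.gz" -C "$workdir/unpacked"
extracted=$(ls "$workdir/unpacked/demo/bin")
echo "extracted: $extracted"

# Clean up
rm -r "$workdir"
```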

A minor inconvenience is that right now

pwd
# /home/hadrien/install/SPAdes-3.11.1-Linux/bin

we always have to go to this directory to run spades.py, or call the software with its full path. We'd like to be able to execute spades from anywhere, like we do with ls and cd.

In most Linux distributions, the directories whose programs can be executed from anywhere are listed in an environment variable: $PATH

Let us take a look:

echo $PATH
# /home/hadrien/bin:/home/hadrien/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

To make spades.py available from anywhere we have to put it in one of the above locations.
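The output of echo $PATH is hard to read because all directories are glued together with colons. As a small sketch, the commands below print one directory per line and ask the shell where it finds a command we already know, ls.

```shell
# Print each directory of $PATH on its own line
echo "$PATH" | tr ':' '\n'

# Ask the shell where it finds a given command, here ls
ls_location=$(command -v ls)
echo "ls is found at: $ls_location"
```

The shell searches the directories in the order in which they are listed and uses the first match.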

Note

When apt installs software it usually places it in /usr/bin, which requires administration privileges. This is why we needed sudo for installing packages earlier.

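We'll use ~/.local/bin, which is listed in our $PATH and does not require administration privileges:

mkdir -p ~/.local/bin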
mv * ~/.local/bin/

Et voilà! Now you can execute spades.py from anywhere!
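An alternative to moving files into a directory of the $PATH is to add a new directory to the $PATH itself. The sketch below tries this out safely with a temporary directory and a made-up script called mytool; the change only lasts for the current shell session.

```shell
# A throwaway directory standing in for a software directory
demo_bin=$(mktemp -d)

# A tiny executable script (mytool is a hypothetical name)
printf '#!/bin/sh\necho "mytool works"\n' > "$demo_bin/mytool"
chmod +x "$demo_bin/mytool"

# Not found yet: the directory is not listed in $PATH
if command -v mytool > /dev/null 2>&1; then
    found_before="yes"
else
    found_before="no"
fi

# Prepend the directory to $PATH for this shell session only
old_path=$PATH
export PATH="$demo_bin:$PATH"

if command -v mytool > /dev/null 2>&1; then
    found_after="yes"
else
    found_after="no"
fi
echo "found before: $found_before, found after: $found_after"

# Restore the original $PATH and clean up
export PATH="$old_path"
rm -r "$demo_bin"
```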

Installing from source

For some bioinformatics software, binaries are not available. In that case you have to download the source code, and compile it yourself for your system.

This is the case for samtools, for example. samtools is one of the most popular bioinformatics tools and allows you to work with bam and sam files (more about that later).

We'll need a few things to be able to compile samtools, notably make and a C compiler, gcc:

sudo apt install make gcc

samtools also needs some libraries that are not installed by default on an Ubuntu system:

sudo apt install libncurses5-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev

Now we can download and unpack the source code:

cd ~/install
wget https://github.com/samtools/samtools/releases/download/1.6/samtools-1.6.tar.bz2
tar xvf samtools-1.6.tar.bz2
cd samtools-1.6

Compiling software written in C usually follows the same 3 steps:

  1. we run ./configure to configure the compilation options for our machine architecture
  2. we run make to compile the software
  3. we run make install to move the compiled binaries into a location in the $PATH

./configure
make
make install

Warning

Did make install succeed? Why not?

As we saw before, we need sudo to install packages to system locations with apt. make install follows the same principle and by default tries to install software in a system location (typically /usr/local/bin).

We can change that default behavior by passing options to configure, but first we have to clean our installation:

make clean

Then we can run configure, make and make install again:

./configure --prefix=/home/$(whoami)/.local/
make
make install
samtools
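The --prefix value above is built from /home/$(whoami), which assumes that your home directory lives under /home. As a hedged alternative, the sketch below derives the same location from $HOME and checks whether its bin directory is already listed in $PATH. The variable names are only illustrative.

```shell
# $HOME points at your home directory, wherever it lives
prefix="$HOME/.local"
echo "configure could be called with: ./configure --prefix=$prefix"

# make install would place executables in $prefix/bin;
# check whether that directory is already searched by the shell
case ":$PATH:" in
    *":$prefix/bin:"*) prefix_in_path="yes" ;;
    *)                 prefix_in_path="no" ;;
esac
echo "$prefix/bin in PATH: $prefix_in_path"
```

If the answer is no, the installed programs have to be called with their full path until the directory is added to $PATH.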

Question

The bwa source code is available on GitHub, a popular code sharing platform (more on this in the git lesson!). Navigate to https://github.com/lh3/bwa then, in the releases section, copy the link behind bwa-0.7.17.tar.bz2
- Install bwa!

Installing python packages

While compiled languages are faster than interpreted languages, they are usually harder to learn, code in and debug. For these reasons you'll often find bioinformatics packages written in interpreted languages such as python or ruby.

While historically it has been a pain to install software written in interpreted languages, most modern languages now come with their own package managers! For example, python has pip and ruby has gem.

Most of these package managers have similar syntaxes. We will focus on python here since it's one of the most popular languages in bioinformatics.

Note

You will notice the absence of R here. R is mostly used interactively and installing packages in R will be part of the R part of the course.

Your Ubuntu comes with an older version of python. We start by installing a newer one:

cd ~/install
wget https://www.python.org/ftp/python/3.6.4/Python-3.6.4.tar.xz
tar xvf Python-3.6.4.tar.xz
cd Python-3.6.4
./configure --prefix=/home/$(whoami)/.local/
make -j2
make install

Question

What does the make option -j2 do?

which python3
which pip3

We now have a newer python installed.
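To confirm that the freshly installed interpreter, and not the system one, is the one being picked up, a check along these lines can help. It is a sketch that only assumes python3 may or may not be in your $PATH.

```shell
# Report which interpreter is found first in $PATH, if any
if command -v python3 > /dev/null 2>&1; then
    python_version=$(python3 --version 2>&1)
    echo "python3: $(command -v python3) ($python_version)"
    # python3 -m pip runs the pip that belongs to this exact interpreter
    python3 -m pip --version || echo "pip is not available for this python3"
else
    python_version="none"
    echo "python3 not found in PATH"
fi
```

The path printed for python3 should start with your home directory if the interpreter compiled above is the one in use.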

Let us install our first python package

pip3 install multiqc

It should take a while and install multiqc as well as all the necessary dependencies.

To see if multiqc was properly installed:

multiqc -h

Exercises

During the following weeks we'll use a lot of different bioinformatics software to perform a variety of tasks.

Tip

Most software comes with a file named INSTALL or README. Such a file usually contains installation instructions!

Note

Unless indicated otherwise, try with apt first.

Note

Do not hesitate to ask your teacher for help!

Let's install a few: