Installing software
Bioinformatics is a relatively new (it's younger than Erik!) and fast-progressing field. Therefore, new software as well as new versions of existing software are released on a regular basis.
During this course, as well as during your future career as a bioinformatician ( ;-) ), you will quite often have to install new software on UNIX platforms (i.e. the server you are using at the moment).
Compiled and Interpreted languages
Programming languages in the bioinformatics world - and in general - can be separated into two categories: interpreted languages and compiled languages. With interpreted languages you write scripts and execute them directly (as we saw with the bash scripts during the UNIX lesson). With compiled languages it is different: an extra step is required.
Compilation
According to Wikipedia, compilation is the translation of source code into object code by a compiler.
That's right. The extra step required by compiled languages is translating the source code, that is the lines of code the programmer(s) wrote, into a language that your computer understands better, usually binary machine code (1s and 0s).
The big advantage of compiled languages is that programs written in them usually run much faster than those written in interpreted languages. However, programming in them is usually slower and more difficult. Choosing one for a software project is a trade-off between development time and how much faster your software could run if it were written in a compiled language.
One of the most popular compiled languages is the C programming language, in which Linux is mainly written.
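To make the extra step concrete, here is a minimal sketch that writes a tiny C program, compiles it with gcc and runs the resulting binary. The file names hello.c and hello are made up for this illustration, and the sketch assumes gcc may not be installed yet, in which case it simply skips the compilation.

```shell
# Work in a throwaway directory so nothing else is touched
workdir=$(mktemp -d)

# The source code: human-readable text (hello.c is a made-up example file)
cat > "$workdir/hello.c" << 'EOF'
#include <stdio.h>

int main(void) {
    printf("Hello from a compiled program\n");
    return 0;
}
EOF

if command -v gcc > /dev/null 2>&1; then
    # The extra step: translate the source code into an executable binary
    gcc -o "$workdir/hello" "$workdir/hello.c"
    # Run the binary; the compiler is no longer needed at this point
    "$workdir/hello"
else
    echo "gcc is not installed, skipping the compilation"
fi
```

A bash script, by contrast, would be run directly from its text file without the gcc step.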
Package Managers
All modern Linux distributions come with a package manager, i.e. a tool that automates the installation of software. In most cases the package manager downloads already compiled binaries and installs them on your system. We'll see how it works in a moment.
Let us install our first package!
The package manager for Ubuntu is called APT. As with most package managers, its syntax looks like this:
[package_manager] [action] [package_name]
We'll use apt to install a local version of ncbi-blast, which you've used previously.
First we check whether the package is available:
apt search ncbi-blast
There seem to be two versions of it. The legacy version is probably outdated, so let us investigate the other one:
apt show ncbi-blast+
It seems to be what we are looking for, so we install it with:
apt install ncbi-blast+
Question
Did it work? What could have been wrong?
You should have gotten an error message asking if you are root.
The user root is the most powerful user in a Linux system and usually has extra rights that a regular user does not have.
To install software in the default system location with apt, you need special permissions.
We can "borrow" those permissions from root by prefixing our command with sudo:
sudo apt install ncbi-blast+
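If you are ever unsure which user a command runs as, a small check like the following can help. It is only a sketch: the variable name user_kind is made up for this example, and it relies on the convention that root has the numeric user id 0.

```shell
# id -u prints the numeric id of the current user; root is always 0
if [ "$(id -u)" -eq 0 ]; then
    user_kind="root"
else
    user_kind="regular"
fi

# id -un prints the name of the current user
echo "You are running as a $user_kind user ($(id -un))"
```

Running the same lines through sudo (for example saved in a script) should report root instead.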
Now if you execute
blastn -help
it should print the (rather long) help message of the blastn command.
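A quick way to check whether an installation made a command available is to ask the shell where it finds it. The helper name is_installed below is hypothetical, chosen for this sketch; command -v itself is a standard shell builtin.

```shell
# Hypothetical helper: succeeds if the given command can be found
is_installed() {
    command -v "$1" > /dev/null 2>&1
}

for tool in blastn blastp; do
    if is_installed "$tool"; then
        # Print the full path of the executable the shell would run
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```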
Question
Why does blast have different executables?
What is the difference between blastn and blastp?
Downloading and unpacking
Although most popular software can be installed with your distribution's package manager, sometimes (especially in some fast-growing areas of bioinformatics) the software you want isn't available through a package manager.
We'll install spades, a popular genome assembly tool. Let's imagine it is not available in the apt sources. We'd have to:
- download the source code
- compile the software
- move it to the right place on our system
This is quite cumbersome, especially the compilation. Luckily, it is fairly common for developers to make Linux binaries - that is, compiled versions of the software - already available for download.
First let us create a directory for all our future installs:
mkdir -p ~/install
cd ~/install
The spades binaries are available on their website, http://cab.spbu.ru/software/spades/
Download them with
wget http://cab.spbu.ru/files/release3.11.1/SPAdes-3.11.1-Linux.tar.gz
and uncompress
tar xvf SPAdes-3.11.1-Linux.tar.gz
cd SPAdes-3.11.1-Linux/bin/
and now if we execute spades.py
./spades.py
we get the help of the spades assembler!
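Before unpacking an archive you did not create yourself, it can be useful to peek inside it first. The sketch below builds a small throwaway archive (demo-1.0.tar.gz and its content are made up for this example), lists it, and then extracts it into a separate directory.

```shell
workdir=$(mktemp -d)

# Build a small archive to play with (all names here are made up)
mkdir -p "$workdir/demo-1.0/bin" "$workdir/unpacked"
echo 'echo "hello from demo"' > "$workdir/demo-1.0/bin/demo.sh"
tar czf "$workdir/demo-1.0.tar.gz" -C "$workdir" demo-1.0

# t = list the content of the archive without extracting anything
tar tf "$workdir/demo-1.0.tar.gz"

# x = extract, f = read from the given file, -C = extract into that directory
# (adding v, as in xvf, prints each file name while extracting)
tar xf "$workdir/demo-1.0.tar.gz" -C "$workdir/unpacked"
ls "$workdir/unpacked/demo-1.0/bin"
```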
A minor inconvenience is that right now
pwd
# /home/hadrien/install/SPAdes-3.11.1-Linux/bin
we always have to go to this directory to run spades.py, or call the software with its full path.
We'd like to be able to execute spades from anywhere, like we do with ls and cd.
In most Linux distributions, the directories that can contain software executable from anywhere are listed in an environment variable: $PATH
Let us take a look:
echo $PATH
# /home/hadrien/bin:/home/hadrien/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
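The $PATH value is a single line of directories separated by colons, which is hard to read. As a small sketch, the lines below print one directory per line and test whether a given directory is listed. The helper name in_path is hypothetical, made up for this example.

```shell
# Print one directory per line: tr replaces every ':' with a newline
echo "$PATH" | tr ':' '\n'

# Hypothetical helper: succeeds if the directory is listed in $PATH
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_path "$HOME/.local/bin"; then
    echo "$HOME/.local/bin is in the PATH"
else
    echo "$HOME/.local/bin is not in the PATH"
fi
```

The shell searches these directories from left to right, so when two of them contain a program with the same name, the one listed first wins.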
To make spades.py available from anywhere we have to put it in one of the above locations.
Note
When apt installs software it usually places it in /usr/bin, which requires administration privileges.
This is why we needed sudo for installing packages earlier.
Since ~/.local/bin is in our $PATH and belongs to us, we move the content of the current directory there:
mkdir -p ~/.local/bin
mv * ~/.local/bin/
Et voilà! Now you can execute spades.py from anywhere!
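An alternative to moving files, sketched here under the assumption that you would rather leave a tool where it was unpacked, is to add its directory to $PATH instead. The example uses a throwaway directory and a made-up script named hello-tool. The change only lasts for the current shell session; for a permanent change the export line would typically go in a startup file such as ~/.bashrc.

```shell
tooldir=$(mktemp -d)

# A made-up executable standing in for a real tool
printf '#!/bin/sh\necho "hello-tool is running"\n' > "$tooldir/hello-tool"
chmod +x "$tooldir/hello-tool"

# Prepend the directory to the PATH of the current shell session
export PATH="$tooldir:$PATH"

# The tool can now be called by name, from any directory
hello-tool
command -v hello-tool
```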
Installing from source
For some bioinformatics software, binaries are not available. In that case you have to download the source code, and compile it yourself for your system.
This is the case for samtools, for example. samtools is one of the most popular bioinformatics tools and allows you to work with bam and sam files (more about that later).
We'll need a few things to be able to compile samtools, notably make and a C compiler, gcc:
sudo apt install make gcc
samtools also needs some libraries that are not installed by default on an Ubuntu system:
sudo apt install libncurses5-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev
Now we can download and unpack the source code:
cd ~/install
wget https://github.com/samtools/samtools/releases/download/1.6/samtools-1.6.tar.bz2
tar xvf samtools-1.6.tar.bz2
cd samtools-1.6
Compiling software written in C usually follows the same 3 steps:
- we run ./configure to configure the compilation options to our machine architecture
- we run make to compile the software
- we run make install to move the compiled binaries into a location in the $PATH
./configure
make
make install
Warning
Did make install succeed? Why not?
As we saw before, we need sudo to install packages to system locations with apt.
make install follows the same principle and tries by default to install software in a system location (/usr/local/bin).
We can change that default behavior by passing options to configure, but first we have to clean up our previous build:
make clean
Then we can run configure, make and make install again:
./configure --prefix=/home/$(whoami)/.local/
make
make install
samtools
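To see what the --prefix option changes, the sketch below prints where make install would conventionally place files for that prefix and then checks which samtools the shell finds. It assumes that $HOME expands to the same path as /home/$(whoami), which is the case for regular users on most Linux systems; the variable name prefix is made up for this example.

```shell
# $HOME usually expands to the same path as /home/$(whoami)
prefix="$HOME/.local"

# With --prefix, make install conventionally uses these sub-directories
echo "executables:  $prefix/bin"
echo "manual pages: $prefix/share/man"

# Confirm which samtools the shell would run, if any
if command -v samtools > /dev/null 2>&1; then
    command -v samtools
else
    echo "samtools not found in the PATH"
fi
```

Because $prefix/bin belongs to you, make install no longer needs sudo.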
Question
The bwa source code is available on GitHub, a popular code sharing platform (more on this in the git lesson!).
Navigate to https://github.com/lh3/bwa, then under releases copy the link behind bwa-0.7.17.tar.bz2
- Install bwa!
Installing python packages
While programs written in compiled languages are faster than those written in interpreted languages, compiled languages are usually harder to learn, code in and debug. For these reasons you'll often find many bioinformatics packages written in interpreted languages such as python or ruby.
While historically it has been a pain to install software written in interpreted languages, most modern languages now come with their own package managers! For example:
- Python has pip
- Ruby has gem
- Javascript has npm
- ...
Most of these package managers have a similar syntax. We will focus on python here since it's one of the most popular languages in bioinformatics.
Note
You will notice the absence of R here. R is mostly used interactively and installing packages in R will be part of the R part of the course.
Your Ubuntu comes with an old version of python. We start by installing a newer one:
cd ~/install
wget https://www.python.org/ftp/python/3.6.4/Python-3.6.4.tar.xz
tar xvf Python-3.6.4.tar.xz
cd Python-3.6.4
./configure --prefix=/home/$(whoami)/.local/
make -j2
make install
Question
What does the make option -j2 do?
which python3
which pip3
We now have a newer python installed.
Let us install our first python package:
pip3 install multiqc
It should take a while and install multiqc as well as all the necessary dependencies.
To see if multiqc was properly installed:
multiqc -h
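Not every python package provides a command-line program like multiqc does. Another way to check an installation, sketched below, is to ask python3 whether it can import the package. The helper name can_import is hypothetical, and json is used only as an example of a module that ships with python itself.

```shell
# Hypothetical helper: succeeds if python3 can import the given module
can_import() {
    python3 -c "import $1" > /dev/null 2>&1
}

for module in json multiqc; do
    if can_import "$module"; then
        echo "$module: importable"
    else
        echo "$module: not importable"
    fi
done
```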
Exercises
During the following weeks we'll use a lot of different bioinformatics software to perform a variety of tasks.
Tip
Most software comes with a file named INSTALL or README.
Such a file usually contains installation instructions!
Note
Unless indicated otherwise, try with apt first.
Note
Do not hesitate to ask your teacher for help!
Let's install a few: