mpiBLAST (http://www.mpiblast.org) is a freely available, open-source, parallel implementation of NCBI BLAST. It uses explicit MPI communication to spread a search across the distributed resources of a cluster, whereas standard NCBI BLAST can only take advantage of shared-memory multi-processor computers. The primary advantage of mpiBLAST over traditional NCBI BLAST is performance: mpiBLAST can increase performance by several orders of magnitude while still producing results identical to those of NCBI BLAST.
mpiBLAST is already installed on peregrine. This manual describes how any NREL user with login access to peregrine can run it.
To run mpiBLAST on peregrine, you need to create a configuration file, named .ncbirc, in your home directory. A sample configuration file is as follows:
Please modify only the lines starting with the keywords BLASTDB and Shared, to point to your own scratch directory containing the reference database.
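As a rough illustration of what such a file can look like, the sketch below writes a minimal .ncbirc from a small shell function. Only the BLASTDB and Shared keywords come from this page; the section names, the Data, BLASTMAT and Local keys, the function name write_ncbirc, and every path are assumptions based on the general layout used by mpiBLAST, so prefer the values from the site's own sample file wherever they differ.

```shell
#!/bin/bash
# Sketch only: write a minimal mpiBLAST-style .ncbirc.
# Assumptions (not taken from this page): the section names, the Data,
# BLASTMAT and Local keys, and all paths passed in by the caller.
write_ncbirc() {
    local out_file="$1"    # where to write the config, e.g. "$HOME/.ncbirc"
    local db_dir="$2"      # shared directory holding the formatted database
    local data_dir="$3"    # NCBI data directory (site-specific placeholder)
    local local_dir="$4"   # node-local scratch directory (site-specific placeholder)
    cat > "$out_file" <<EOF
[NCBI]
Data=$data_dir
[BLAST]
BLASTDB=$db_dir
BLASTMAT=$data_dir
[mpiBLAST]
Shared=$db_dir
Local=$local_dir
EOF
}

# Example call with hypothetical paths:
# write_ncbirc "$HOME/.ncbirc" "/scratch/$USER/mpiblast/blastdb" \
#     "/path/to/ncbi/data" "/dev/shm/mpiblast"
```

BLASTDB and Shared both receive the same directory here because both are described above as pointing at the scratch directory that contains the reference database.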
Before you can run the “mpiblast” command, you need to create the database against which your query sequences will be searched. This is done by running the “mpiformatdb” command. For example, suppose you want to search a set of protein sequences against a database of all proteins from Populus trichocarpa, available as a downloaded FASTA file (e.g., Ptrichocarpa_210_v3.0.protein) from the latest version of Phytozome. A PBS script to run mpiformatdb on this FASTA file stored in /scratch/$USER would be as follows:
#!/bin/bash -l
#PBS -e /scratch/$USER/mpiformatdb_ptrprot.err
#PBS -o /scratch/$USER/mpiformatdb_ptrprot.log
#PBS -N mpiformatdb_ptrprot
#PBS -l walltime=1:00:00
#PBS -l nodes=1
#PBS -q debug
#PBS -A <allocation_id>
module purge
module load epel
module load openmpi-gcc/1.6.4-4.8.1
module load mpiblast
mpiformatdb -N 32 -i /scratch/anag/Ptrichocarpa_210_v3.0.protein -o T -p T -n /scratch/anag/mpiblast/blastdb
Replace <allocation_id> with the allocation to be charged, and /scratch/anag with your own scratch directory. mpiformatdb usually takes no more than an hour to run.
The mpiformatdb options are summarized in the table below:
|-N||number of fragments into which mpiformatdb will split the input file to build the database|
|-i||Input file for formatting|
|-o||Parse SeqId and create indexes. T (True): parse SeqId and create indexes. F (False): do not parse SeqId and do not create indexes. [T/F] Optional, default = F. Note: if the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format.|
|-p||Type of file. T: protein. F: nucleotide. [T/F] Optional, default = T.|
|-n||Location of formatted database fragments; if this option is not provided, mpiformatdb places the formatted database fragments in the same directory as the input FASTA file.|
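To show how the options in the table combine, here is a small sketch that assembles an mpiformatdb command line for either a protein or a nucleotide FASTA file and prints it for inspection rather than running it. The helper name format_db_cmd and the paths in the example call are hypothetical; only the -N, -i, -o, -p and -n options come from the table above.

```shell
#!/bin/bash
# Sketch only: print the mpiformatdb command for a given input file.
# format_db_cmd is a hypothetical helper, not part of mpiBLAST.
format_db_cmd() {
    local fasta="$1"       # input FASTA file (-i)
    local out_dir="$2"     # location for the formatted fragments (-n)
    local db_type="$3"     # "protein" or "nucleotide" (mapped to -p T or -p F)
    local fragments="$4"   # number of database fragments (-N)
    local p_flag
    case "$db_type" in
        protein)    p_flag=T ;;
        nucleotide) p_flag=F ;;
        *) echo "unknown database type: $db_type" >&2; return 1 ;;
    esac
    # -o T asks for SeqId parsing and index creation, as in the sample script
    echo "mpiformatdb -N $fragments -i $fasta -o T -p $p_flag -n $out_dir"
}

# Example call with hypothetical paths:
# format_db_cmd "/scratch/$USER/genome.fa" "/scratch/$USER/mpiblast/blastdb" nucleotide 32
```

Printing the command first makes it easy to confirm the -p value before pasting the line into a PBS script.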
Once mpiformatdb has finished, check that the formatted database was created before running a search. For a protein database, your blastdb folder should contain files ending with .psq, .psi, .psd, .pni, .pnd, .pin, .phr and .pal; for a nucleotide database, files ending with .nsq, .nsi, .nsd, .nni, .nnd, .nin, .nhr and .nal. In both cases there should be two more files ending with .dbs and .mbf. Also check the mpiformatdb log file to confirm that there are no errors.
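The check described above can be scripted. The following is a minimal sketch, assuming the database directory is passed as the first argument; the function name check_blastdb is hypothetical, and the extension lists are taken directly from the paragraph above.

```shell
#!/bin/bash
# Sketch only: verify that a formatted database directory contains at least
# one file for each expected extension. check_blastdb is a hypothetical helper.
check_blastdb() {
    local db_dir="$1"               # directory given to mpiformatdb via -n
    local db_type="${2:-protein}"   # "protein" (default) or "nucleotide"
    local exts ext missing=0
    if [ "$db_type" = "protein" ]; then
        exts="psq psi psd pni pnd pin phr pal dbs mbf"
    else
        exts="nsq nsi nsd nni nnd nin nhr nal dbs mbf"
    fi
    for ext in $exts; do
        # ls fails when the glob matches nothing
        if ! ls "$db_dir"/*."$ext" > /dev/null 2>&1; then
            echo "missing: no *.$ext file in $db_dir"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with a hypothetical path:
# check_blastdb "/scratch/$USER/mpiblast/blastdb" protein && echo "database looks complete"
```

The function only confirms that the files exist; it does not validate their contents, so the mpiformatdb log should still be reviewed.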
The next step is to run mpiblast with your query sequences against the database you have created. A sample PBS script for running mpiblast is below.
#PBS -e /scratch/anag/mpiblast_blastp_Ath_NDP_Ptr_test.err
#PBS -N mpiblast_blastp_Ath_NDP_Ptr
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=16
#PBS -q debug
#PBS -A CSCDAV
# Create fresh mpiblast directory in /dev/shm
module load epel
module load openmpi-gcc/1.6.4-4.8.1
module load mpiblast
mpirun -np 32 mpiblast --copy-via=mpi -p blastp -d Ptrichocarpa_210_v3.0.protein -e 0.01 -i /scratch/anag/Athaliana_NDP_sugar_enzymes_TAIR10.fasta -m 0 -G 11 -E 1 -M BLOSUM62 -W 3 -v 4 -b 4 -o /scratch/anag/mpiblast_blastp_Ath_NDP_Ptr_test.out
# Clean up /dev/shm scratch on job nodes
rm -rf /dev/shm/*
See http://www.ncbi.nlm.nih.gov/Class/BLAST/blastallopts.txt for all BLAST-specific options.
Note: The stderr file may contain the following message, which is safe to ignore.
find: `/dev/shm/mpiblast': No such file or directory
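In the sample script, the value given to -np (32) matches the request of 2 nodes with 16 processors each. As a hedged sketch, the process count can instead be derived inside the job so the two cannot drift apart. This assumes the scheduler provides a node file through $PBS_NODEFILE with one line per allocated processor slot, which is the usual PBS/Torque behavior but is not confirmed by this page; count_mpi_slots is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: count processor slots listed in a PBS node file.
# Assumes one non-empty line per allocated slot (typical for PBS/Torque).
count_mpi_slots() {
    local nodefile="$1"
    # grep -c . counts the non-empty lines in the file
    grep -c . "$nodefile"
}

# Example use inside a job script (mpiblast options elided on purpose):
# NP=$(count_mpi_slots "$PBS_NODEFILE")
# mpirun -np "$NP" mpiblast ...
```

With nodes=2:ppn=16 and the assumption above, the node file would hold 32 lines, matching the hard-coded value in the sample script.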