PyBlast is a small library using mostly BioPython, designed to process easily local blast requests and global alignment on multiple cores.
The library was written using python 3.6 but should work on lower version of python 3, however I did not test that.
Install first biopython (using conda for example) :
conda install -c anaconda biopython
Using pip
pip install git+https://github.com/jsgounot/PyBlast.git
Or download / clone the github
git clone https://github.com/jsgounot/PyBlast.git
cd PyBlast
python setup.py install --user
A blast command line can be launched using BCLine
. By default only one core is used, unless ncore
is provided (see below for more informations about how multiprocessing is done). If multiple cores are used, the default chunksize (corresponding to the maximum number of sequences in the query file of each process) is set to 200.
from pyblast import BCLine
query = "path/to/your/query"
db = "path/to/your/blastdb"
bcl = BCLine("blastp", query=query, subject=db, word_size=3)
print (bcl.run(ncore=4, chunksize=20))
Most of the process is similar to usual biopython blastcommandline functions. Additionally, PyBlast includes BCLine6
, which automatically process blast table output (outfmt 6
) and returns a pandas dataframe.
from pyblast import BCLine6
query = "path/to/your/query"
db = "path/to/your/blastdb"
bcl = BCLine6("blastp", query=query, subject=db)
print (bcl.run(ncore=4, chunksize=20))
The dataframe includes several columns, such as identity, coverage and percentage of each of these columns compared to the query and subject size.
One can add supplemental outfmt columns :
bcl = BCLine6("blastn", query=query, subject=db, outfmt="evalue")
res = bcl.run(ncore=1, chunksize=100, quiet=False)
However, please be aware that wrong outfmt names in the outfmt mostly lead to treatment errors, since blast will simply ignore wrong outfmt names and that pyblast is not able to check which supplemental outfmt names are correct or not.
Global alignment is provided by the BioPython pairwise2
module. By default, the pairwise2.align.globalxx
function is used. Before using this function, I highly encourage to read the biopython documentation about this module since most of the results while depend of the input function and parameters. Within PyBlast, local alignments can be launched using the GlobalAlignment
object which took as parameter an iterable of SeqRecord
pairs. In this example, we will just make local alignments of every pairs of a single fasta file.
from Bio import SeqIO
from itertools import combinations
from pyblast import GlobalAlignment
records = SeqIO.parse(query, "fasta")
pairs = combinations(records, 2)
gal = GlobalAlignment(pairs)
print (gal.run(ncore=4))
By default, results are only output alignments. You can also use the callback function run_pair_identity
which will automatically determine query and subject similarity of each alignments.
records = SeqIO.parse(query, "fasta")
pairs = combinations(records, 2)
gal = GlobalAlignment(pairs)
callback = GlobalAlignment.run_pair_identity
print (gal.run(ncore=4, callback=callback))
Note that the biopython alignment function and paramaters can be changed using the gfun
, args
and kwargs
attributes.
args = (5, -4, -2, -0.5)
gal = GlobalAlignment(pairs)
callback = GlobalAlignment.run_pair_identity
print (gal.run(* args, ncore=4, callback=callback, gfun="localms", one_alignment_only=True))
PyBlast provides a way to combine (blast) local and (biopython) global alignment into a single object : Global2Local
. The idea is to select only the best local alignments for each query and produce a global alignments for the selected pairs. By default, the BCLine6.get_best
function is used, which took the 10 best local alignments based on the query positive value (qpos
). Both local and global functions can be modified.
from pyblast import Local2Global
l2g = Local2Global(query=query, subject=db)
res = l2g.run("blastp", ncore=4, chunksize=20, gkwargs={"one_alignment_only" : True, "gfun" : None})
print (res)
In some case, merging HSPs from the same query / subject pair can provide a faster solution instead of global alignments. In this case, all HSPs from the same pair are mixed together and the identity and similarity values correspond to all positions which have matched in at least one HSP. Nevertheless, note that it could not provide similar results than global alignments and can leads to wrong similarity values. This method can be used with the BCLineHSPFuse
object, which performs both the blast query and the result analysis in each core.
from pyblast import BCLineHSPFuse
bopt = {"evalue" : "2", "word_size" : "3"}
bc = BCLineHSPFuse("blastp", query=fname, subject=fname, ** bopt)
res = bc.run(ncore=4)
print (res)
Below are explained the signification of columns added by pyblast
. Note that some of them can be linked to either the query (q) or the subject (s). Example : qlen
corresponding to the size of the query sequence.
Column name | Meaning |
---|---|
len | Size of the sequence |
length | Length of the part covered by the HSP |
ide | Percentage of the sequence size (len) covered by identical element (nident) |
cov | Percentage of the sequence size (len) covered by the HSP (length) |
pos | Percentage of the sequence size (len) covered by positive elements (positive) |
gid | Number of identical element from global alignment analysis |
gLEN | Size of the global alignment |
gPID | This division : gid / gLEN |
gXPID | Percentage of the gid divided by either the query or subject length |
gSCORE | Score of the global alignment |
ide_prc | Percentage of identical elements compared to the total length |
pos_prc | Percentage of positive elements compared to the total length |
For multicore (ncore
> 1) Blast analysis, the query fasta file is splitted into several sub fasta files with a predifined sequences number called chunksize
and the results are afterward merged together. This option can be combined with the blast option num_threads
. However, while an increasing number of threads allows a lower memory consomption and increase execution speed, splitting the fasta file and running multiple blast tasks is way faster.
Below is shown the time used to blast 3000 random S. cerevisiae protein sequences against the S. cerevisiae proteom (SGD), full data (50x50) can be found in the blast_time
directory. Note that while num_thread
does not speed up that much execution time, increasing only the number of ncore can lead to memory overflow.
Multicore issue with BCLine6 and MacOS
The multicore library might have some issue depending on your operating system, especially on Mac OS. If you encounter errors when using ncore > 1
, you can add macos_formating=True
in the BCLine6 constructor:
from pyblast import BCLine6
query = "path/to/your/query"
db = "path/to/your/blastdb"
bcl = BCLine6("blastn", query=query, db=db, macos_formating=False)
print (bcl.run(ncore=4, chunksize=500, quiet=False))
RuntimeError:freeze_support()
Please see this or this. You have to enclose your codelines into a main function as such:
from pyblast import BCLine6
if __name__ == "__main__":
query = "path/to/your/query"
db = "path/to/your/blastdb"
bcl = BCLine6("blastn", query=query, db=db, macos_formating=False)
print (bcl.run(ncore=4, chunksize=500, quiet=False))