This repository contains documentation for the NCBI BLAST+ command line applications in a Docker image. We will demonstrate how to use the Docker image to run BLAST analysis on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) using a small basic example and a more advanced production-level example. Some basic knowledge of Unix/Linux commands and BLAST+ is useful in completing this tutorial.
- What Is NCBI BLAST?
- What Is Cloud Computing?
- What Is Docker?
- Google Cloud Platform Setup
- Section 1 - Getting Started Using the BLAST+ Docker Image with A Small Example
- Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image
- Section 3 - Using the BLAST+ Docker Image at Production Scale
- Amazon Web Services Setup
- BLAST Databases
- BLAST Database Metadata
- Additional Resources
- Maintainer
- License
- Appendix
The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.
Introduced in 2009, BLAST+ is an improved version of BLAST command line applications. For a full description of the features and capabilities of BLAST+, please refer to the BLAST Command Line Applications User Manual.
Cloud computing offers potential cost savings by using on-demand, scalable, and elastic computational resources. While a detailed description of various cloud technologies and their benefits is beyond the scope of this repository, the following sections contain the information needed to get started running the BLAST+ Docker image on the Google Cloud Platform (GCP) and Amazon Web Services (AWS).
Docker is a tool to perform operating-system level virtualization using software containers. In containerization technology*, an image is a snapshot of an analytical environment encapsulating application(s) and dependencies. An image, which is essentially a file built from a list of instructions, can be saved and easily shared for others to recreate the exact analytical environment across platforms and operating systems. A container is a runtime instance of an image. By using containerization, users can bypass the often-complicated steps in compiling, configuring, and installing a Unix-based tool like BLAST+. In addition to portability, containerization is a lightweight approach to make analysis more findable, accessible, interoperable, reusable (F.A.I.R.) and, ultimately, reproducible.
*There are many containerization tools and standards, such as Docker and Singularity. We will focus solely on Docker, which is considered the de facto standard by many in the field.
The following sections include instructions to create a Google virtual machine, install Docker, and run BLAST+ commands using the Docker image.
This section provides a quick run-through of a BLAST analysis in the Docker environment on a Google instance. This is intended as an overview for those who just want an understanding of the principles of the solution. If you work with Amazon instances, please go to the Amazon Web Services Setup section of this documentation. The Google Cloud Shell, an interactive shell environment, will be used for this example, which makes it possible to run the following small example without having to perform additional setup, such as creating a billing account or compute instance. More detailed descriptions of analysis steps, alternative commands, and more advanced topics are covered in the later sections of this documentation.
Requirements: A Google account
Input data:
- Query – 1 sequence, 44 nucleotides, file size 0.2 KB
- Databases
- Custom database (nurse shark proteins) – 7 sequences, 922 nucleotides, file size 1.7 KB
- PDB protein database (pdbaa) – 0.2831 GB
First, in a separate browser window or tab, sign in at https://console.cloud.google.com/
Click the Activate Cloud Shell button at the top right corner of the Google Cloud Platform Console.
You now will see your Cloud Shell session window:
The next step is to copy-and-paste the commands below in your Cloud Shell session.
Please note: In GitHub you can use your mouse to copy; however, in the command shell you must use your keyboard. In Windows or Unix/Linux, use the shortcut Control+C to copy and Control+V to paste. On macOS, use Command+C to copy and Command+V to paste.
To scroll in the Cloud Shell, enable the scrollbar in Terminal settings with the wrench icon.
## Time needed to complete this section: <10 minutes
## Step 1. Retrieve sequences
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Step 2. Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
## Step 3. Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins
## Output on screen
## Scroll up to see the entire output
## Type "exit" to leave the Cloud Shell or continue to the next section
At this point, you should see the output on the screen. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.
For larger analyses, it is recommended to use the `-out` flag to save the output to a file. For example, append `-out /blast/results/blastp.out` to the last command in Step 3 above, then view the content of this output file with `more $HOME/results/blastp.out`.
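For reference, here is the Step 3 command with the `-out` flag appended (a sketch; it assumes the directories and custom database created above):

## Run BLAST+ and save the output to a file
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out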
You can also query P01349.fsa against the PDB as shown in the following code block.
## Extend the example to query against the Protein Data Bank
## Time needed to complete this section: <10 minutes
## Confirm query
ls queries/P01349.fsa
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl --source gcp pdbaa
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db pdbaa
## Output on screen
## Scroll up to see the entire output
## Leave the Cloud Shell
exit
You have now completed a simple task and seen how BLAST+ with Docker works. To learn about Docker and BLAST+ at production scale, please proceed to the next section.
In Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image, we will use the same small example from the previous section and discuss alternative approaches, additional useful Docker and BLAST+ commands, and Docker command options and structures. In Section 3, we will demonstrate how to run the BLAST+ Docker image at production scale.
First, you need to set up a Google Cloud Platform (GCP) virtual machine (VM) for analysis.
- A GCP account linked to a billing account
- A GCP VM running Ubuntu 18.04 LTS
1. Creating your GCP account and registering for the free $300 credit program. (If you already have a GCP billing account, you can skip to step 2.)
- First, in a separate browser window or tab, sign in at https://console.cloud.google.com/
- If you need to create one, go to https://cloud.google.com/ and click “Get started for free” to sign up for a trial account.
- If you have multiple Google accounts, sign in using an Incognito Window (Chrome) or Private Window (Safari) or any other private browser window.
GCP is currently offering a $300 credit, which expires 12 months from activation, to incentivize new cloud users. The following steps will show you how to activate this credit. You will be asked for billing information, but GCP will not auto-charge you once the trial ends; you must elect to manually upgrade to a paid account.
- After signing in, click Activate to activate the $300 credit.
- Enter your country, for example, United States, and check the box indicating that you have read and accept the terms of service.
- Under “Account type,” select “Individual.” (This may be pre-selected in your Google account.)
- Enter your name and address.
- Under “How you pay,” select “Automatic payments.” (This may be pre-selected in your Google account.) This indicates that you will pay costs after you have used the service, either when you have reached your billing threshold or every 30 days, whichever comes first.
- Under “Payment method,” select “Add a credit or debit card” and enter your credit card information. You will not be automatically charged once the trial ends; you must elect to upgrade to a paid account before your payment method will be charged.
- Click “Start my free trial” to finish registration. When this process is completed, you should see a GCP welcome screen.
2. Creating a VM instance
- On the GCP welcome screen from the last step, click "Compute Engine" or navigate to the "Compute Engine" section by clicking on the navigation menu with the "hamburger icon" (three horizontal lines) on the top left corner.
- Click on the blue “CREATE INSTANCE” button on the top bar.
- Create an instance with the following parameters (if a parameter is not listed below, keep the default setting):
- Name: keep the default or enter a name
- Region: us-east4 (Northern Virginia)
- For Section 2, change these settings -
- Machine Type: micro (1 shared vCPU), 0.6 GB memory, f1-micro
- Boot Disk: Click "Change," select Ubuntu 18.04 LTS, and click "Select" (the boot disk size defaults to 10 GB).
- For Section 3, change these settings -
- Machine Type: 16 vCPU, 104 GB memory, n1-highmem-16
- Boot Disk: Click "Change" and select Ubuntu 18.04 LTS, change the "Boot disk size" to 200 GB Standard persistent disk, and click "Select."
At this point, you should see a cost estimate for this instance on the right side of your window.
- Click the blue “Create” button. This will create and start the VM.
Please note: Creating a VM in the same region as storage can provide better performance. We recommend creating a VM in the us-east4 region. If you have a job that will take several hours, but less than 24 hours, you can potentially take advantage of preemptible VMs.
Detailed instructions for creating a GCP account and launching a VM can be found here.
Once you have your VM created, you must access it from your local computer. There are many methods to access your VM, depending on the ways in which you would like to use it. On the GCP, the most straightforward way is to SSH from the browser.
You now have a command shell running and you are ready to proceed.
Remember to stop or delete the VM to prevent incurring additional cost.
In this section, we will cover Docker installation, discuss various `docker run` command options, and examine the structure of a Docker command. We will use the same small example from Section 1 and explore alternative approaches to running the BLAST+ Docker image. However, we are now using a real VM instance, which provides greater performance and functionality than the Google Cloud Shell.
Input data
- Query – 1 sequence, 44 nucleotides, file size 0.2 KB
- Database – 7 sequences, 922 nucleotides, file size 1.7 KB
In a production system, Docker must first be installed as an application.
## Run these commands to install Docker and add non-root users to run Docker
## (snap and apt below are two alternative install methods; one is sufficient)
sudo snap install docker
sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker $USER
exit
# exit and SSH back in for changes to take effect
To confirm that Docker is installed correctly, run the command `docker run hello-world`. If it is, you should see the message "Hello from Docker!..." (https://docs.docker.com/samples/library/hello-world/).
This section is optional.
Below is a list of `docker run` command-line options used in this tutorial.

Name, shorthand (if available) | Description |
---|---|
`--rm` | Automatically remove the container when it exits |
`--volume`, `-v` | Bind mount a volume |
`--workdir`, `-w` | Working directory inside the container |
This section is optional.
For this tutorial, it would be useful to understand the structure of a Docker command. The following command consists of three parts.
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
The first part of the command, `docker run --rm ncbi/blast`, is an instruction to run the Docker image `ncbi/blast` and remove the container when the run is completed.
The second part of the command makes the query sequence data accessible in the container. Docker bind mounts use `-v` to mount local directories to directories inside the container, with access permission `rw` (read and write) or `ro` (read only). For instance, assuming your subject sequences are stored in the `$HOME/fasta` directory on the local host, you can use the parameter `-v $HOME/fasta:/blast/fasta:ro` to make that directory accessible inside the container at /blast/fasta as a read-only directory. The `-w /blast/blastdb_custom` flag sets the working directory inside the container.
The third part of the command is the BLAST+ command. In this case, it executes makeblastdb to create BLAST database files.
You can start an interactive bash session for this image by using `docker run -it ncbi/blast /bin/bash`. In the BLAST+ Docker image, the executables are in the directories /blast/bin and /root/edirect, which are added to the `$PATH` environment variable.
For additional documentation on the `docker run` command, please refer to the documentation.
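As a quick sketch of such an interactive session (the commands inside the container are only illustrative):

## Start an interactive session
docker run --rm -it ncbi/blast /bin/bash
## Inside the container, the BLAST+ executables are on $PATH
blastp -version
## Leave the container
exit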
This section is optional.
Docker command | Description |
---|---|
`docker ps -a` | Displays a list of containers |
`docker rm $(docker ps -q -f status=exited)` | Removes all exited containers, if you have at least one exited container |
`docker rm <CONTAINER_ID>` | Removes a container |
`docker images` | Displays a list of images |
`docker rmi <REPOSITORY (IMAGE_NAME)>` | Removes an image |
This section is optional.
With this Docker image you can run BLAST+ in an isolated container, facilitating reproducibility of BLAST results. As a user of this Docker image, you are expected to provide BLAST databases and query sequence(s) to run BLAST as well as a location outside the container to save the results. The following is a list of directories used by BLAST+. You will create them in Step 2.
Directory | Purpose | Notes |
---|---|---|
`$HOME/blastdb` | Stores NCBI-provided BLAST databases | If set to a single, absolute path, the `$BLASTDB` environment variable could be used instead (see Configuring BLAST via environment variables) |
`$HOME/queries` | Stores user-provided query sequence(s) | |
`$HOME/fasta` | Stores user-provided FASTA sequences to create BLAST database(s) | |
`$HOME/results` | Stores BLAST results | Mount with `rw` permissions |
`$HOME/blastdb_custom` | Stores user-provided BLAST databases | |
This section is optional.
The following command displays the version of BLAST+ in the latest Docker image.
docker run --rm ncbi/blast blastn -version
Appending a tag to the image name (`ncbi/blast`) allows you to use a different version of BLAST+ (see the “Supported Tags and Respective Release Notes” section for supported versions).
Different versions of BLAST+ exist in different Docker images. The following command will initiate a download of the BLAST+ version 2.9.0 Docker image.
docker run --rm ncbi/blast:2.9.0 blastn -version
## Display a list of images
docker images
For example, to use the BLAST+ version 2.9.0 Docker image instead of the latest version, replace the first part of the command, `docker run --rm ncbi/blast`, with `docker run --rm ncbi/blast:2.9.0`.
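You can also download an image ahead of time without running a container (a minimal sketch using the standard `docker pull` command):

## Pull a specific version of the BLAST+ image
docker pull ncbi/blast:2.9.0
## Pull the latest image
docker pull ncbi/blast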
This section is optional.
- 2.11.0: release notes
- 2.10.1: release notes
- 2.10.0: release notes
- 2.9.0: release notes
- 2.8.1: release notes
In this example, we will start by fetching query and database sequences and then create a custom BLAST database.
# Start in a directory where you want to perform the analysis
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom
## Retrieve query sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
To verify the newly created BLAST database above, you can run the following command to display the accessions, sequence length, and common name of the sequences in the database.
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
As an alternative, you can also download preformatted BLAST databases from NCBI or the NCBI Google storage bucket.
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp
For a detailed description of `update_blastdb.pl`, please refer to the documentation. By default, `update_blastdb.pl` will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.
This section is optional.
docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi
This section is optional.
The command below mounts the `$HOME/blastdb` path on the local machine as /blast/blastdb in the container, and `blastdbcmd` shows the available BLAST databases at this location.
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl pdbaa
## Display database(s) in $HOME/blastdb
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb -remove_redundant_dbs
You should see the output `/blast/blastdb/pdbaa Protein`.
## For the custom BLAST database used in this example -
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb_custom -remove_redundant_dbs
You should see the output `/blast/blastdb_custom/nurse-shark-proteins Protein`.
When running BLAST in a Docker container, note the mounts specified to the `docker run` command to make the inputs and outputs accessible. In the example below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the `:ro` and `:rw` options, which mount the directories as read-only and read-write, respectively.)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
At this point, you should see the output file `$HOME/results/blastp.out`. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command `more $HOME/results/blastp.out`.
Remember to stop or delete the VM to prevent incurring additional cost. You can do this at the GCP Console as shown below.
One of the promises of cloud computing is scalability. In this section, we will demonstrate how to use the BLAST+ Docker image at production scale on the Google Cloud Platform. We will perform a BLAST analysis similar to the approach described in this publication to compare de novo aligned contigs from bacterial 16S-23S sequencing against the nucleotide collection (nt) database.
To test scalability, we will use inputs of different sizes to estimate the amount of time to download the nucleotide collection database and run BLAST search using the latest version of the BLAST+ Docker image. Expected results are summarized in the following tables.
Input files: 28 samples (multi-FASTA files) containing de novo aligned contigs from the publication.
(Instructions to download and create the input files are described in the code block below.)
Database: Pre-formatted BLAST nucleotide collection database, version 5 (nt): 68.7217 GB (from May 2019)
 | Input file name | File content | File size | Number of sequences | Number of nucleotides | Expected output size |
---|---|---|---|---|---|---|
Analysis 1 | query1.fa | Sample 1 only | 59 KB | 121 | 51,119 | 3.1 GB |
Analysis 2 | query5.fa | Samples 1–5 | 422 KB | 717 | 375,154 | 10.4 GB |
Analysis 3 | query.fa | All 28 samples | 2.322 MB | 3,798 | 2,069,892 | 47.8 GB |
VM type / zone | CPUs | Memory (GB) | Hourly cost* | Download nt (min) | Analysis 1 (min) | Analysis 2 (min) | Analysis 3 (min) | Total cost** |
---|---|---|---|---|---|---|---|---|
n1-standard-8, us-east4-c | 8 | 30 | $0.312 | 9 | 22 | - | - | - |
n1-standard-16, us-east4-c | 16 | 60 | $0.611 | 9 | 14 | 53 | 205 | $2.86 |
n1-highmem-16, us-east4-c | 16 | 104 | $0.767 | 9 | 9 | 30 | 143 | $2.44 |
n1-highmem-16, us-west2-a | 16 | 104 | $0.809 | 11 | 9 | 30 | 147 | $2.60 |
n1-highmem-16, us-west1-b | 16 | 104 | $0.674 | 11 | 9 | 30 | 147 | $2.17 |
BLAST website (blastn) | - | - | - | - | Searches exceed current usage limits | Searches exceed current usage limits | Searches exceed current usage limits | - |
All GCP instances are configured with 200 GB of persistent standard disk.
*Hourly costs were provided by Google Cloud Platform (May 2019) when VMs were created and are subject to change.
**Total costs were estimated using the hourly cost and total time to download nt and run Analysis 1, Analysis 2, and Analysis 3. Estimates are used for comparison only; your costs may vary and are your responsibility to monitor and manage.
Please refer to GCP for more information on machine types, regions and zones, and compute cost.
Please note that running the `blastn` binary without specifying its `-task` parameter invokes the MegaBLAST algorithm.
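If you need the traditional BLASTN algorithm instead of MegaBLAST, you can set the task explicitly; the following is a sketch based on the Analysis 1 command in Step 3 below:

## Hypothetical variant of Analysis 1 forcing the blastn task
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -task blastn -query /blast/queries/query1.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query1.task-blastn.out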
## Install Docker if not already done
## This section assumes using recommended hardware requirements below
## 16 CPUs, 104 GB memory and 200 GB persistent hard disk
## Modify the number of CPUs (-num_threads) in Step 3 if another type of VM is used.
## Step 1. Prepare for analysis
## Create directories
cd ; mkdir -p blastdb queries fasta results blastdb_custom
## Import and process input sequences
sudo apt install unzip
wget https://ndownloader.figshare.com/articles/6865397?private_link=729b346eda670e9daba4 -O fa.zip
unzip fa.zip -d fa
### Create three input query files
### All 28 samples
cat fa/*.fa > query.fa
### Sample 1
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' > query1.fa
### Sample 1 to Sample 5
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_2 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_3 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_4 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_5 (paired) trimmed (paired) assembly.fa' > query5.fa
### Copy query sequences to $HOME/queries folder
cp query* $HOME/queries/.
## Step 2. Display BLAST databases on the GCP
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp
## Download nt (nucleotide collection version 5) database
## This step takes approximately 10 min. The following command runs in the background.
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl --source gcp nt &
## At this point, confirm query/database have been properly provisioned before proceeding
## Check the size of the directory containing the BLAST database
## nt should be around 68 GB (this was in May 2019)
du -sk $HOME/blastdb
## Check for queries, there should be three files - query.fa, query1.fa and query5.fa
ls -al $HOME/queries
## From this point forward, it may be easier if you run these steps in a script.
## Simply copy and paste all the commands below into a file named script.sh
## Then run the script in the background `nohup bash script.sh > script.out &`
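## For example (a sketch; script.sh is only a suggested name):
##   nano script.sh                        # paste the Step 3 commands below into it
##   nohup bash script.sh > script.out &   # run in the background
##   tail -f script.out                    # monitor progress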
## Step 3. Run BLAST
## Run BLAST using query1.fa (Sample 1)
## This command will take approximately 9 minutes to complete.
## Expected output size: 3.1 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query1.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query1.denovo16s.out
## Run BLAST using query5.fa (Samples 1-5)
## This command will take approximately 30 minutes to complete.
## Expected output size: 10.4 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query5.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query5.denovo16s.out
## Run BLAST using query.fa (All 28 samples)
## This command will take approximately 147 minutes to complete.
## Expected output size: 47.8 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query.denovo16s.out
## Stdout and stderr will be in script.out
## BLAST output will be in $HOME/results
You have completed the entire tutorial. At this point, if you do not need the downloaded data for further analysis, please delete the VM to prevent incurring additional cost.
To delete an instance, follow instructions in the section Stop the GCP instance.
For additional information, please refer to Google Cloud Platform's documentation on instance life cycle.
To run these examples you will need an Amazon Web Services (AWS) account. If you do not have one already, you can create an account that provides the ability to explore and try out AWS services free of charge up to specified limits for each service. To get started, visit the Free Tier site; this requires a valid credit card, but the card will not be charged if you stay within the Free Tier limits. When choosing a Free Tier product, be sure it is in the Compute product category.
- An AWS account
- An EC2 VM running Linux, on an instance type of t2.micro
- An SSH client, such as the native Terminal application on macOS, the CMD prompt on Windows 8 or greater, or PuTTY on Windows
These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.
- Log into the AWS console and select the EC2 service.
- Start the instance creation process by selecting Launch Instance (a virtual machine)
- In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
- In the search box enter the value ECS-Optimized Amazon Linux AMI
- Select one of the Free tier eligible AMIs (Amazon ECS-Optimized Amazon Linux AMI), then select Continue
- In Step 2: Choose an Instance Type choose the t2.micro Type; select Next: Review and Launch
- Select Launch
- To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances.
- Select View Instances
With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.
There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.
The detailed instructions for connecting to a Linux VM can be found here.
Specify ec2-user (not root) as the username in your SSH command line or when prompted to log in.
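As a sketch, a standalone SSH connection looks like the following (the key file path and public DNS name are placeholders for your own values):

ssh -i /path/to/my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com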
In this example, we will start by fetching query and database sequences and then create a custom BLAST database.
## Retrieve sequences
## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
To verify the newly created BLAST database above, you can run the following command to display the accessions, sequence length, and common name of the sequences in the database.
## Verify BLAST DB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
When running BLAST in a Docker container, note the mounts (`-v` option) specified to the `docker run` command to make the inputs and outputs accessible. In the example below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the `:ro` and `:rw` options, which mount the directories as read-only and read-write, respectively.)
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
At this point, you should see the output file `$HOME/results/blastp.out`. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command `more $HOME/results/blastp.out`.
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source aws
The expected output is a list of BLAST DBs, including their name, description, size, and last updated date.
For a detailed description of `update_blastdb.pl`, please refer to the documentation. By default, `update_blastdb.pl` will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.
docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi
The expected output is a list of the names of BLAST DBs.
Remember to stop or terminate the VM to prevent incurring additional cost. You can do this from the EC2 Instance list in the AWS Console as shown below.
This example requires a multi-core host, so EC2 compute charges will be incurred by running it. The current rate for the instance type used (t2.large) is $0.093/hr.
These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.
- Log into the AWS console and select the EC2 service.
- Start the instance creation process by selecting Launch Instance (a virtual machine)
- In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
- In the search box enter the value ECS-Optimized Amazon Linux AMI
- Select one of the Free tier eligible AMIs (Amazon ECS-Optimized Amazon Linux AMI), then select Continue
- In Step 2: Choose an Instance Type choose the t2.large Type; select Next: Review and Launch
- Select Launch
- To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances. You can use the same key pair as used in Example 1.
- Select View Instances
With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.
There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.
The detailed instructions for connecting to a Linux VM can be found here.
Specify ec2-user (not root) as the username in your SSH command line or when prompted to log in.
## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
The command below mounts (using the `-v` option) the `$HOME/blastdb` path on the local machine as /blast/blastdb in the container, and `blastdbcmd` shows the available BLAST databases at this location.
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl pdbaa
## Display database(s) in $HOME/blastdb
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb -remove_redundant_dbs
You should see the output `/blast/blastdb/pdbaa Protein`.
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db pdbaa \
-out /blast/results/blastp_pdbaa.out
At this point, you should see the output file `$HOME/results/blastp_pdbaa.out`. To view the content of this output file, use the command `more $HOME/results/blastp_pdbaa.out`.
One way to transfer files between your local computer and a Linux instance is to use the secure copy protocol (SCP).
The section Transferring files to Linux instances from Linux using SCP of the Amazon EC2 User Guide for Linux Instances provides detailed instructions for this process.
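A minimal sketch of such a transfer (the key file, file name, and public DNS name are placeholders for your own values):

## Copy a local file to the instance's home directory
scp -i /path/to/my-key-pair.pem $HOME/myfile.txt ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:~
## Copy a file from the instance back to the local machine
scp -i /path/to/my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:~/myfile.txt $HOME/.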
The NCBI hosts the same databases on AWS, GCP, and the NCBI FTP site. The table below lists the databases current as of November 2022.
It is also possible to obtain the current list with the command:
docker run --rm ncbi/blast update_blastdb.pl --showall pretty
or
update_blastdb.pl --showall pretty # after downloading the BLAST+ package.
As shown above, update_blastdb.pl can also be used to download these databases. It will automatically select the appropriate resource (e.g., GCP if you are within that provider).
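For instance, to download a database and let the script select the closest source automatically (a sketch reusing the mounts from earlier sections):

## Download swissprot, letting update_blastdb.pl pick the source
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl swissprot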
These databases can also be searched with ElasticBLAST on GCP and AWS.
Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.
Name | Type | Title |
---|---|---|
16S_ribosomal_RNA | DNA | 16S ribosomal RNA (Bacteria and Archaea type strains) |
18S_fungal_sequences | DNA | 18S ribosomal RNA sequences (SSU) from Fungi type and reference material |
28S_fungal_sequences | DNA | 28S ribosomal RNA sequences (LSU) from Fungi type and reference material |
Betacoronavirus | DNA | Betacoronavirus |
GCF_000001405.38_top_level | DNA | Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds |
GCF_000001635.26_top_level | DNA | Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds |
ITS_RefSeq_Fungi | DNA | Internal transcribed spacer region (ITS) from Fungi type and reference material |
ITS_eukaryote_sequences | DNA | ITS eukaryote BLAST |
LSU_eukaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Eukaryotes |
LSU_prokaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Prokaryotes |
SSU_eukaryote_rRNA | DNA | Small subunit ribosomal nucleic acid for Eukaryotes |
env_nt | DNA | environmental samples |
nt | DNA | Nucleotide collection (nt) |
patnt | DNA | Nucleotide sequences derived from the Patent division of GenBank |
pdbnt | DNA | PDB nucleotide database |
ref_euk_rep_genomes | DNA | RefSeq Eukaryotic Representative Genome Database |
ref_prok_rep_genomes | DNA | Refseq prokaryote representative genomes (contains refseq assembly) |
ref_viroids_rep_genomes | DNA | Refseq viroids representative genomes |
ref_viruses_rep_genomes | DNA | Refseq viruses representative genomes |
refseq_rna | DNA | NCBI Transcript Reference Sequences |
refseq_select_rna | DNA | RefSeq Select RNA sequences |
tsa_nt | DNA | Transcriptome Shotgun Assembly (TSA) sequences |
env_nr | Protein | Proteins from WGS metagenomic projects |
landmark | Protein | Landmark database for SmartBLAST |
nr | Protein | All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects |
pdbaa | Protein | PDB protein database |
pataa | Protein | Protein sequences derived from the Patent division of GenBank |
refseq_protein | Protein | NCBI Protein Reference Sequences |
refseq_select_prot | Protein | RefSeq Select proteins |
swissprot | Protein | Non-redundant UniProtKB/SwissProt sequences |
tsa_nr | Protein | Transcriptome Shotgun Assembly (TSA) sequences |
cdd | Protein | Conserved Domain Database (CDD) is a collection of well-annotated multiple sequence alignment models represented as position-specific score matrices |
The NCBI provides metadata for the available BLAST databases at AWS, GCP and the NCBI FTP site.
Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.
On AWS and GCP, the metadata file is in a date-dependent subdirectory alongside the databases. To find the latest valid subdirectory, first read `s3://ncbi-blast-databases/latest-dir` (on AWS) or `gs://blast-db/latest-dir` (on GCP). `latest-dir` is a text file with a date stamp (e.g., 2020-09-29-01-05-01) specifying the most recent directory. The proper directory is the AWS or GCP base URI for the BLAST databases (e.g., `s3://ncbi-blast-databases/` for AWS) plus the text in the `latest-dir` file. An example URI on AWS would be `s3://ncbi-blast-databases/2020-09-29-01-05-01`. The GCP URI would be similar.
An excerpt from a metadata file is shown below. Most fields have obvious meanings. The files listed comprise the BLAST database. The `bytes-total` field represents the total BLAST database size in bytes and is intended to specify how much disk space is required.
The example below is from AWS, but the metadata files on GCP have the same format. Databases on the FTP site are in gzipped tarfiles, one per volume of the BLAST database, so those are listed rather than the individual files.
"16S_ribosomal_RNA": {
"version": "1.2",
"dbname": "16S_ribosomal_RNA",
"dbtype": "Nucleotide",
"db-version": 5,
"description": "16S ribosomal RNA (Bacteria and Archaea type strains)",
"number-of-letters": 32435109,
"number-of-sequences": 22311,
"last-updated": "2022-03-07T11:23:00",
"number-of-volumes": 1,
"bytes-total": 14917073,
"bytes-to-cache": 8495841,
"files": [
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ndb",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nog",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nni",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nnd",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nsq",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nin",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ntf",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.not",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nhr",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nos",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nto",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.btd",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.bti"
]
}
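To pull a field out of the metadata programmatically, a tool like jq works; this sketch assumes the full metadata file is a JSON object keyed by database name and has been saved locally under the hypothetical name metadata.json:

## Print the total size in bytes for one database
jq '."16S_ribosomal_RNA"."bytes-total"' metadata.json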
- BLAST:
- Docker:
- Other:
- Common Workflow Language (CWL) is a specification to describe tools and workflows. This GitHub Repository contains sample CWL workflows using containerized BLAST+.
- Google Cloud Platform
- NIH/STRIDES
- GitHub
or email us.
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)
Please refer to the license and copyright information for the software contained in this image.
As with all Docker images, these likely also contain other software which may be under other licenses (such as bash, etc., from the base distribution, along with any direct or indirect dependencies of the primary software being contained).
As with any pre-built image usage, it is the image user's responsibility to ensure that any use of this image complies with any relevant licenses for all software contained within.
Figure 1. Docker and Cloud Computing Concept. Users can access compute resources provided by cloud service providers (CSPs), such as the Google Cloud Platform, using SSH tunneling (1). When you create a VM (2), a hard disk (also called a boot/persistent disk) (3) is attached to that VM. With the right permissions, VMs can also access other storage buckets (4) or other data repositories in the public domain. Once inside a VM with Docker installed, you can run a Docker image (5), such as NCBI's BLAST image. An image can be used to create multiple running instances or containers (6). Each container is in an isolated environment. In order to make data accessible inside the container, you need to use Docker bind mounts (7) described in this tutorial.
A Docker image can be used to create a Singularity image. Please refer to Singularity's documentation for more detail.
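As a minimal sketch (assuming Singularity is installed), the conversion can be as simple as pulling the Docker image from Docker Hub:

## Build a Singularity image file from the Docker image
singularity pull docker://ncbi/blast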
As an alternative to what is described above, you can also run BLAST interactively inside a container.
When to use: This is useful for running a few (e.g., fewer than 5-10) BLAST searches on small BLAST databases where you expect the search to complete in seconds/minutes.
docker run --rm -it \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
/bin/bash
# Once you are inside the container (note the root prompt), run the following BLAST commands.
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
# To view output, run the following command
more /blast/results/blastp.out
# Leave container
exit
In addition, you can run BLAST in detached mode by running a container in the background.
When to use: This is a more practical approach if you have many (e.g., 10 or more) BLAST searches to run or you expect the search to take a long time to execute. In this case it may be better to start the BLAST container in detached mode and execute commands on it.
NOTE: Be sure to mount all required directories, as these need to be specified when the container is started.
# Start a container named 'blast' in detached mode
docker run --rm -dit --name blast \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
sleep infinity
# Check the container is running in the background
docker ps -a
docker ps --filter "status=running"
Once the container is confirmed to be running in detached mode, run the following BLAST command.
docker exec blast blastp -query /blast/queries/P01349.fsa \
-db nurse-shark-proteins -out /blast/results/blastp.out
# View output
more $HOME/results/blastp.out
# stop the container
docker stop blast
If you run into issues with the `docker stop blast` command, reset the VM from the GCP Console or restart the SSH session.
The following example copies the file `$HOME/script.out` from the home directory on a local machine to the home directory of a GCP VM named `instance-1` in the project My First Project using the GCP Cloud SDK.
GCP documentation
First, install the GCP Cloud SDK command-line tools for your operating system.
# First, set up gcloud tools
# From local machine's terminal
gcloud init
# Enter a configuration name
# Select the sign-in email account
# Select a project, for example “my-first-project”
# Select a compute engine zone, for example, “us-east4-c”
# To copy the file $HOME/script.out to the home directory of GCP instance-1
# Instance name can be found in your Google Cloud Console -> Compute Engine -> VM instances
gcloud compute scp $HOME/script.out instance-1:~
# Optional - to transfer the file from the GCP instance to a local machine's home directory
gcloud compute scp instance-1:~/script.out $HOME/.