Official NCBI BLAST+ Docker Image Documentation

This repository contains documentation for the NCBI BLAST+ command line applications in a Docker image. We will demonstrate how to use the Docker image to run BLAST analysis on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) using a small basic example and a more advanced production-level example. Some basic knowledge of Unix/Linux commands and BLAST+ is useful in completing this tutorial.

What Is NCBI BLAST?

The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Introduced in 2009, BLAST+ is an improved version of BLAST command line applications. For a full description of the features and capabilities of BLAST+, please refer to the BLAST Command Line Applications User Manual.

What Is Cloud Computing?

Cloud computing offers potential cost savings by using on-demand, scalable, and elastic computational resources. While a detailed description of cloud technologies and their benefits is beyond the scope of this repository, the following sections contain the information needed to get started running the BLAST+ Docker image on the Google Cloud Platform (GCP) and Amazon Web Services (AWS).

What Is Docker?

Docker is a tool to perform operating-system level virtualization using software containers. In containerization technology*, an image is a snapshot of an analytical environment encapsulating application(s) and dependencies. An image, which is essentially a file built from a list of instructions, can be saved and easily shared for others to recreate the exact analytical environment across platforms and operating systems. A container is a runtime instance of an image. By using containerization, users can bypass the often-complicated steps in compiling, configuring, and installing a Unix-based tool like BLAST+. In addition to portability, containerization is a lightweight approach to make analysis more findable, accessible, interoperable, reusable (F.A.I.R.) and, ultimately, reproducible.

*There are many containerization tools and standards, such as Docker and Singularity. We will focus solely on Docker, which is considered the de facto standard by many in the field.

Google Cloud Platform Setup

The following sections include instructions to create a Google virtual machine, install Docker, and run BLAST+ commands using the Docker image.

Section 1 - Getting Started Using the BLAST+ Docker Image with a Small Example

This section provides a quick run-through of a BLAST analysis in the Docker environment on a Google instance. It is intended as an overview for those who just want to understand the principles of the solution. If you work with Amazon instances, please go to the Amazon Web Services Setup section of this documentation. This example uses the Google Cloud Shell, an interactive shell environment, which makes it possible to run the small example below without additional setup such as creating a billing account or a compute instance. More detailed descriptions of analysis steps, alternative commands, and more advanced topics are covered in the later sections of this documentation.

Requirements: A Google account

Flow of the Task: Task-Flow

Input data:

  • Query – 1 protein sequence, 44 amino acids, file size 0.2 KB
  • Databases
    • 7 protein sequences, 922 amino acids, file size 1.7 KB
    • PDB protein database (pdbaa) 0.2831 GB

First, in a separate browser window or tab, sign in at https://console.cloud.google.com/

Click the Activate Cloud Shell button at the top right corner of the Google Cloud Platform Console.

You will now see your Cloud Shell session window.

The next step is to copy-and-paste the commands below in your Cloud Shell session.

Please note: In GitHub you can use your mouse to copy; however, in the command shell you must use your keyboard. In Windows or Unix/Linux, use the shortcut Control+C to copy and Control+V to paste. On macOS, use Command+C to copy and Command+V to paste.

To scroll in the Cloud Shell, enable the scrollbar in Terminal settings with the wrench icon.

# Time needed to complete this section: <10 minutes

# Step 1. Retrieve sequences
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom

## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
    
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
    > fasta/nurse-shark-proteins.fsa
    
## Step 2. Make BLAST database 
docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
    -v $HOME/fasta:/blast/fasta:ro \
    -w /blast/blastdb_custom \
    ncbi/blast \
    makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5
    
## Step 3. Run BLAST+ 
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins
    
## Output on screen
## Scroll up to see the entire output
## Type "exit" to leave the Cloud Shell or continue to the next section

At this point, you should see the output on the screen. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.

For larger analyses, it is recommended to use the -out flag to save the output to a file. For example, append -out /blast/results/blastp.out to the last command in Step 3 above and view the content of this output file using more $HOME/results/blastp.out.
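
As a minimal sketch, the Step 3 command with -out appended could be wrapped as shown here. The run helper and the DRY_RUN variable are hypothetical conveniences introduced for this illustration and are not part of the image: by default the helper only prints the command so it can be inspected, and setting DRY_RUN=0 executes it.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same mounts as Step 3, with the report written to the results mount.
blastp_to_file() {
    run docker run --rm \
        -v "$HOME/blastdb:/blast/blastdb:ro" \
        -v "$HOME/blastdb_custom:/blast/blastdb_custom:ro" \
        -v "$HOME/queries:/blast/queries:ro" \
        -v "$HOME/results:/blast/results:rw" \
        ncbi/blast \
        blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
        -out /blast/results/blastp.out
}

blastp_to_file
# After a real run (DRY_RUN=0), view the report with:
#   more "$HOME/results/blastp.out"
```

Because /blast/results is mounted read-write from $HOME/results, the file written inside the container appears on the host after the container is removed.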

You can also query P01349.fsa against the PDB as shown in the following code block.

## Extend the example to query against the Protein Data Bank
## Time needed to complete this section: <10 minutes

## Confirm query
ls queries/P01349.fsa

## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
     -v $HOME/blastdb:/blast/blastdb:rw \
     -w /blast/blastdb \
     ncbi/blast \
     update_blastdb.pl --source gcp pdbaa

## Run BLAST+ 
docker run --rm \
     -v $HOME/blastdb:/blast/blastdb:ro \
     -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
     -v $HOME/queries:/blast/queries:ro \
     -v $HOME/results:/blast/results:rw \
     ncbi/blast \
     blastp -query /blast/queries/P01349.fsa -db pdbaa

## Output on screen
## Scroll up to see the entire output
## Leave the Cloud Shell

exit

You have now completed a simple task and seen how BLAST+ with Docker works. To learn about Docker and BLAST+ at production scale, please proceed to the next section.

In Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image, we will use the same small example from the previous section and discuss alternative approaches, additional useful Docker and BLAST+ commands, and Docker command options and structures. In Section 3, we will demonstrate how to run the BLAST+ Docker image at production scale.

First, you need to set up a Google Cloud Platform (GCP) virtual machine (VM) for analysis.

Requirements

  • A GCP account linked to a billing account
  • A GCP VM running Ubuntu 18.04 LTS

Set up your GCP account and create a VM for analysis

1. Create your GCP account and register for the free $300 credit program. (If you already have a GCP billing account, you can skip to step 2.)

  • First, in a separate browser window or tab, sign in at https://console.cloud.google.com/
    • If you need to create one, go to https://cloud.google.com/ and click “Get started for free” to sign up for a trial account.
    • If you have multiple Google accounts, sign in using an Incognito Window (Chrome) or Private Window (Safari) or any other private browser window.

GCP is currently offering a $300 credit, which expires 12 months from activation, to incentivize new cloud users. The following steps will show you how to activate this credit. You will be asked for billing information, but GCP will not auto-charge you once the trial ends; you must elect to manually upgrade to a paid account.

  • After signing in, click Activate to activate the $300 credit.

  • Enter your country, for example, United States, and check the box indicating that you have read and accept the terms of service.

  • Under “Account type,” select “Individual.” (This may be pre-selected in your Google account)

  • Enter your name and address.

  • Under “How you pay,” select “Automatic payments.” (This may be pre-selected in your Google account.) This indicates that you will pay costs after you have used the service, either when you have reached your billing threshold or every 30 days, whichever comes first.

  • Under “Payment method,” select “add a credit or debit card” and enter your credit card information. You will not be automatically charged once the trial ends. You must elect to upgrade to a paid account before your payment method will be charged.

  • Click “Start my free trial” to finish registration. When this process is completed, you should see a GCP welcome screen.

2. Create a Virtual Machine (VM)

  • On the GCP welcome screen from the last step, click "Compute Engine" or navigate to the "Compute Engine" section by clicking on the navigation menu with the "hamburger icon" (three horizontal lines) on the top left corner.

  • Click on the blue “CREATE INSTANCE” button on the top bar.
  • Create an instance with the following parameters (if a parameter is not listed below, keep the default setting):
    • Name: keep the default or enter a name
    • Region: us-east4 (Northern Virginia)
    • For Section 2, change these settings -
      • Machine Type: micro (1 shared vCPU), 0.6 GB memory, f1-micro
      • Boot Disk: Click "Change," select Ubuntu 18.04 LTS, and click "Select" (the default boot disk size is 10 GB).
    • For Section 3, change these settings -
      • Machine Type: 16 vCPU, 104 GB memory, n1-highmem-16
      • Boot Disk: Click "Change" and select Ubuntu 18.04 LTS, change the "Boot disk size" to 200 GB Standard persistent disk, and click "Select."

At this point, you should see a cost estimate for this instance on the right side of your window.

  • Click the blue “Create” button. This will create and start the VM.

Please note: Creating a VM in the same region as storage can provide better performance. We recommend creating a VM in the us-east4 region. If you have a job that will take several hours, but less than 24 hours, you can potentially take advantage of preemptible VMs.

Detailed instructions for creating a GCP account and launching a VM can be found here.

3. Access a GCP VM from a local machine

Once you have your VM created, you must access it from your local computer. There are many methods to access your VM, depending on the ways in which you would like to use it. On the GCP, the most straightforward way is to SSH from the browser.

  • Connect to your new VM instance by clicking the "SSH" button.

You now have a command shell running and you are ready to proceed.

Remember to stop or delete the VM to prevent incurring additional cost.

Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image

In this section, we will cover Docker installation, discuss various docker run command options, and examine the structure of a Docker command. We will use the same small example from Section 1 and explore alternative approaches in running the BLAST+ Docker image. However, we are using a real VM instance, which provides greater performance and functionality than the Google Cloud Shell.

Input data

  • Query – 1 protein sequence, 44 amino acids, file size 0.2 KB
  • Database – 7 protein sequences, 922 amino acids, file size 1.7 KB

Step 1. Install Docker

In a production system, Docker has to be installed as an application.

## Run these commands to install Docker and add non-root users to run Docker
sudo snap install docker
sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker $USER
exit
# exit and SSH back in for changes to take effect

To confirm the correct installation of Docker, run the command docker run hello-world. If Docker is correctly installed, you should see a message beginning with "Hello from Docker!" (see https://docs.docker.com/samples/library/hello-world/).
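
Before running hello-world, it can help to check that the docker client is on the PATH and that the current user is in the docker group, since a missing group membership usually shows up as a permission error. The following is a minimal sketch for a POSIX shell; the function names have_cmd and in_group are illustrative and not part of Docker.

```shell
#!/bin/sh
# Succeed if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeed if the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if have_cmd docker; then
    echo "docker client found: $(command -v docker)"
else
    echo "docker client not found; repeat the installation commands above"
fi

if in_group docker; then
    echo "current user is in the docker group"
else
    echo "current user is not in the docker group; log out and back in after usermod"
fi
```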

Docker run command options

This section is optional.

Below is a list of docker run command line options used in this tutorial.

| Name, short-hand (if available) | Description |
| --- | --- |
| --rm | Automatically remove the container when it exits |
| --volume, -v | Bind mount a volume |
| --workdir, -w | Working directory inside the container |

Docker run command structure

This section is optional.

For this tutorial, it would be useful to understand the structure of a Docker command. The following command consists of three parts.

docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
    -v $HOME/fasta:/blast/fasta:ro \
    -w /blast/blastdb_custom \
    ncbi/blast \
    makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

The first part of the command, docker run --rm together with the image name ncbi/blast, is an instruction to run the Docker image ncbi/blast and remove the container when the run is completed. On the command line, the image name must come after all docker run options (here, the -v and -w lines), because Docker passes everything that follows the image name to the container as the command to execute.

The second part of the command makes the sequence data accessible in the container. Docker bind mounts use -v to mount local directories to directories inside the container and to set the access permission: rw (read and write) or ro (read only). For instance, assuming your subject sequences are stored in the $HOME/fasta directory on the local host, the parameter -v $HOME/fasta:/blast/fasta:ro makes that directory accessible inside the container at /blast/fasta as a read-only directory. The -w /blast/blastdb_custom flag sets the working directory inside the container.

The third part of the command is the BLAST+ command. In this case, it executes makeblastdb to create BLAST database files.
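
To make the pieces easier to see, the sketch below stores each one in its own shell variable and prints the assembled command without executing it. The variable names are illustrative only. Note the ordering: the docker run options (mounts and working directory) come before the image name, and the BLAST+ command comes after it.

```shell
#!/bin/sh
# Docker invocation: run a container and remove it when it exits.
DOCKER_RUN="docker run --rm"

# Bind mounts (host:container:permission) and the working directory.
MOUNTS="-v $HOME/blastdb_custom:/blast/blastdb_custom:rw"
MOUNTS="$MOUNTS -v $HOME/fasta:/blast/fasta:ro"
MOUNTS="$MOUNTS -w /blast/blastdb_custom"

# Image name; everything after it is passed to the container as the command.
IMAGE="ncbi/blast"

# BLAST+ command executed inside the container.
BLAST_CMD="makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot"
BLAST_CMD="$BLAST_CMD -parse_seqids -out nurse-shark-proteins"
BLAST_CMD="$BLAST_CMD -title \"Nurse shark proteins\" -taxid 7801 -blastdb_version 5"

# Print the assembled command for inspection; nothing is executed here.
echo "$DOCKER_RUN $MOUNTS $IMAGE $BLAST_CMD"
```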

You can start an interactive bash session for this image by using docker run -it ncbi/blast /bin/bash. For the BLAST+ Docker image, the executables are in the folders /blast/bin and /root/edirect, both of which are added to the $PATH variable.

For additional documentation on the docker run command, please refer to the Docker documentation.

Useful Docker commands

This section is optional.

| Docker Command | Description |
| --- | --- |
| docker ps -a | Displays a list of containers |
| docker rm $(docker ps -q -f status=exited) | Removes all exited containers, if you have at least 1 exited container |
| docker rm <CONTAINER_ID> | Removes a container |
| docker images | Displays a list of images |
| docker rmi <REPOSITORY (IMAGE_NAME)> | Removes an image |

Using BLAST+ with Docker

This section is optional.

With this Docker image you can run BLAST+ in an isolated container, facilitating reproducibility of BLAST results. As a user of this Docker image, you are expected to provide BLAST databases and query sequence(s) to run BLAST as well as a location outside the container to save the results. The following is a list of directories used by BLAST+. You will create them in Step 2.

| Directory | Purpose | Notes |
| --- | --- | --- |
| $HOME/blastdb | Stores NCBI-provided BLAST databases | If set to a single, absolute path, the $BLASTDB environment variable could be used instead (see Configuring BLAST via environment variables). |
| $HOME/queries | Stores user-provided query sequence(s) | |
| $HOME/fasta | Stores user-provided FASTA sequences to create BLAST database(s) | |
| $HOME/results | Stores BLAST results | Mount with rw permissions |
| $HOME/blastdb_custom | Stores user-provided BLAST databases | |
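
The five directories can be created in one pass. The sketch below is one possible way to do it, using a hypothetical helper named make_blast_dirs; it relies on mkdir -p so that the command can be repeated safely if some of the directories already exist. The demonstration runs in a temporary directory rather than in $HOME.

```shell
#!/bin/sh
# Create the five working directories under a base directory (default: $HOME).
make_blast_dirs() {
    base="${1:-$HOME}"
    for d in blastdb queries fasta results blastdb_custom; do
        mkdir -p "$base/$d" || return 1
    done
}

# Demonstrate in a throwaway directory instead of touching $HOME.
demo_dir=$(mktemp -d)
make_blast_dirs "$demo_dir"
ls "$demo_dir"
rm -rf "$demo_dir"
```

Calling make_blast_dirs with no argument creates the directories under $HOME, which is the layout the mounts in this tutorial expect.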

Versions of BLAST Docker image

This section is optional.

The following command displays the latest BLAST version.
docker run --rm ncbi/blast blastn -version

Appending a tag to the image name (ncbi/blast) allows you to use a different version of BLAST+ (see “Supported Tags and Respective Release Notes” section for supported versions).

Different versions of BLAST+ exist in different Docker images. The following command will initiate download of the BLAST+ version 2.9.0 Docker image.

docker run --rm ncbi/blast:2.9.0 blastn -version
## Display a list of images
docker images

For example, to use the BLAST+ version 2.9.0 Docker image instead of the latest version, replace the first part of the command

docker run --rm ncbi/blast with docker run --rm ncbi/blast:2.9.0
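
In a script, keeping the tag in one place makes it easy to pin or change the BLAST+ version. The sketch below uses a hypothetical helper, blast_image, that builds the image reference from an optional tag; when no tag is given it falls back to latest, which is the tag Docker uses when none is specified. The command is only printed here.

```shell
#!/bin/sh
# Build the image reference from an optional tag; default to "latest".
blast_image() {
    echo "ncbi/blast:${1:-latest}"
}

IMAGE=$(blast_image 2.9.0)
# Print the version-check command for the pinned image.
echo "docker run --rm $IMAGE blastn -version"
```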

Supported tags

This section is optional.

Step 2. Import sequences and create a BLAST database

In this example, we will start by fetching query and database sequences and then create a custom BLAST database.

# Start in a directory where you want to perform the analysis
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom

## Retrieve query sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
    
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
    > fasta/nurse-shark-proteins.fsa
    
## Make BLAST database 
docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
    -v $HOME/fasta:/blast/fasta:ro \
    -w /blast/blastdb_custom \
    ncbi/blast \
    makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

To verify the newly created BLAST database above, you can run the following command to display the accessions, sequence length, and common name of the sequences in the database.

docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    ncbi/blast \
    blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
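
As a complementary check, the number of entries that blastdbcmd lists can be compared with the number of records in the input FASTA file, since each FASTA record starts with a header line beginning with ">". The sketch below uses a hypothetical helper, count_fasta_records, and demonstrates it on made-up placeholder sequences, not on the tutorial data.

```shell
#!/bin/sh
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Placeholder data for illustration only; not sequences from this tutorial.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>seq1 placeholder
MKTAYIAK
>seq2 placeholder
MKVLAAGI
EOF

count_fasta_records "$sample"
rm -f "$sample"
# On the tutorial data:
#   count_fasta_records "$HOME/fasta/nurse-shark-proteins.fsa"
```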

As an alternative, you can also download preformatted BLAST databases from NCBI or the NCBI Google storage bucket.

Show BLAST databases available for download from the Google Cloud bucket

docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp

For a detailed description of update_blastdb.pl, please refer to the documentation. By default, update_blastdb.pl will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.

Show BLAST databases available for download from NCBI

This section is optional.

docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi

Show available BLAST databases on local host

This section is optional.

The command below mounts the $HOME/blastdb path on the local machine as /blast/blastdb on the container, and blastdbcmd shows the available BLAST databases at this location.

## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
     -v $HOME/blastdb:/blast/blastdb:rw \
     -w /blast/blastdb \
     ncbi/blast \
     update_blastdb.pl pdbaa

## Display database(s) in $HOME/blastdb
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    ncbi/blast \
    blastdbcmd -list /blast/blastdb -remove_redundant_dbs

You should see an output /blast/blastdb/pdbaa Protein.

## For the custom BLAST database used in this example -
docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    ncbi/blast \
    blastdbcmd -list /blast/blastdb_custom -remove_redundant_dbs

You should see an output /blast/blastdb_custom/nurse-shark-proteins Protein.

Step 3. Run BLAST

When running BLAST in a Docker container, note the mounts specified to the docker run command to make the input and outputs accessible. In the examples below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the :ro and :rw options, which mount the directories as read-only and read-write respectively.)

docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out

At this point, you should see the output file $HOME/results/blastp.out. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command more $HOME/results/blastp.out.
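
When the BLAST command is part of a script, it can be useful to confirm that the report exists and is not empty before moving on. The following is a minimal sketch; check_report is a hypothetical helper name, and the test -s operator succeeds only for files that exist and have a size greater than zero.

```shell
#!/bin/sh
# Succeed only if the file exists and has a size greater than zero.
check_report() {
    if [ -s "$1" ]; then
        echo "OK: $1"
    else
        echo "Missing or empty: $1" >&2
        return 1
    fi
}

check_report "$HOME/results/blastp.out" || echo "Run the blastp command above first"
```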

Stop the GCP instance

Remember to stop or delete the VM to prevent incurring additional cost. You can do this from the GCP Console.
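
If the gcloud command-line tool is available, the same can be done from a shell. The sketch below only prints the commands; the instance name blast-vm and the zone us-east4-c are hypothetical placeholders and should be replaced with the values shown for your instance in the console.

```shell
#!/bin/sh
# Hypothetical instance name and zone; substitute your own values.
INSTANCE="${INSTANCE:-blast-vm}"
ZONE="${ZONE:-us-east4-c}"

# Print the gcloud commands rather than running them.
stop_command() {
    echo "gcloud compute instances stop $1 --zone $2"
}
delete_command() {
    echo "gcloud compute instances delete $1 --zone $2"
}

stop_command "$INSTANCE" "$ZONE"
delete_command "$INSTANCE" "$ZONE"
```

Stopping keeps the disk, so the data remains available when the VM is started again; deleting removes the instance.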

Section 3 - Using the BLAST+ Docker Image at Production Scale

Background

One of the promises of cloud computing is scalability. In this section, we will demonstrate how to use the BLAST+ Docker image at production scale on the Google Cloud Platform. We will perform a BLAST analysis similar to the approach described in this publication to compare de novo aligned contigs from bacterial 16S-23S sequencing against the nucleotide collection (nt) database.

To test scalability, we will use inputs of different sizes to estimate the amount of time to download the nucleotide collection database and run BLAST search using the latest version of the BLAST+ Docker image. Expected results are summarized in the following tables.

Input files: 28 samples (multi-FASTA files) containing de novo aligned contigs from the publication.
(Instructions to download and create the input files are described in the code block below.)

Database: Pre-formatted BLAST nucleotide collection database, version 5 (nt): 68.7217 GB (from May 2019)

| | Input file name | File content | File size | Number of sequences | Number of nucleotides | Expected output size |
| --- | --- | --- | --- | --- | --- | --- |
| Analysis 1 | query1.fa | only sample 1 | 59 KB | 121 | 51,119 | 3.1 GB |
| Analysis 2 | query5.fa | only samples 1-5 | 422 KB | 717 | 375,154 | 10.4 GB |
| Analysis 3 | query.fa | all 28 samples | 2.322 MB | 3,798 | 2,069,892 | 47.8 GB |

BLAST+ Docker image benchmarks

| VM Type/Zone | CPU | Memory (GB) | Hourly Cost* | Download nt (min) | Analysis 1 (min) | Analysis 2 (min) | Analysis 3 (min) | Total Cost** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n1-standard-8 us-east4c | 8 | 30 | $0.312 | 9 | 22 | - | - | - |
| n1-standard-16 us-east4c | 16 | 60 | $0.611 | 9 | 14 | 53 | 205 | $2.86 |
| n1-highmem-16 us-east4c | 16 | 104 | $0.767 | 9 | 9 | 30 | 143 | $2.44 |
| n1-highmem-16 us-west2a | 16 | 104 | $0.809 | 11 | 9 | 30 | 147 | $2.60 |
| n1-highmem-16 us-west1b | 16 | 104 | $0.674 | 11 | 9 | 30 | 147 | $2.17 |
| BLAST website (blastn) | - | - | - | - | Searches exceed current restrictions on usage | Searches exceed current restrictions on usage | Searches exceed current restrictions on usage | - |

All GCP instances are configured with 200 GB of persistent standard disk.

*Hourly costs were provided by Google Cloud Platform (May 2019) when VMs were created and are subject to change.
**Total costs were estimated using the hourly cost and total time to download nt and run Analysis 1, Analysis 2, and Analysis 3. Estimates are used for comparison only; your costs may vary and are your responsibility to monitor and manage.
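
The estimate is the hourly rate multiplied by the total run time in hours. As a worked sketch, the hypothetical helper total_cost below sums the minutes for the download and the three analyses, converts them to hours, and multiplies by the hourly rate. The two calls use the us-east4c rows of the benchmark table and yield 2.44 and 2.86, matching the listed totals.

```shell
#!/bin/sh
# Usage: total_cost HOURLY_RATE MINUTES...
# Sums the minutes, converts to hours, and multiplies by the hourly rate.
total_cost() {
    rate="$1"
    shift
    echo "$rate" "$@" | awk '{
        minutes = 0
        for (i = 2; i <= NF; i++) minutes += $i
        printf "%.2f\n", $1 * minutes / 60
    }'
}

# n1-highmem-16 in us-east4c: 9 + 9 + 30 + 143 minutes at $0.767 per hour
total_cost 0.767 9 9 30 143
# n1-standard-16 in us-east4c: 9 + 14 + 53 + 205 minutes at $0.611 per hour
total_cost 0.611 9 14 53 205
```

The same function can be used to estimate a run on another machine type by substituting its current hourly rate and your own timings.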

Please refer to GCP for more information on machine types, regions and zones, and compute cost.

Please note that running the blastn binary without specifying its -task parameter invokes the MegaBLAST algorithm.

Commands to run

## Install Docker if not already done
## This section assumes using recommended hardware requirements below
## 16 CPUs, 104 GB memory and 200 GB persistent hard disk

## Modify the number of CPUs (-num_threads) in Step 3 if another type of VM is used.

## Step 1. Prepare for analysis
## Create directories
cd ; mkdir -p blastdb queries fasta results blastdb_custom

## Import and process input sequences
sudo apt install unzip
wget "https://ndownloader.figshare.com/articles/6865397?private_link=729b346eda670e9daba4" -O fa.zip
unzip fa.zip -d fa

### Create three input query files
### All 28 samples
cat fa/*.fa > query.fa

### Sample 1
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' > query1.fa

### Sample 1 to Sample 5
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' \
    fa/'Sample_2 (paired) trimmed (paired) assembly.fa' \
    fa/'Sample_3 (paired) trimmed (paired) assembly.fa' \
    fa/'Sample_4 (paired) trimmed (paired) assembly.fa' \
    fa/'Sample_5 (paired) trimmed (paired) assembly.fa' > query5.fa
    
### Copy query sequences to $HOME/queries folder
cp query* $HOME/queries/.

## Step 2. Display BLAST databases on the GCP
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp

## Download nt (nucleotide collection version 5) database
## This step takes approximately 10 min.  The following command runs in the background.
docker run --rm \
  -v $HOME/blastdb:/blast/blastdb:rw \
  -w /blast/blastdb \
  ncbi/blast \
  update_blastdb.pl --source gcp nt &

## At this point, confirm query/database have been properly provisioned before proceeding

## Check the size of the directory containing the BLAST database
## nt should be around 68 GB    (this was in May 2019)
du -sk $HOME/blastdb

## Check for queries, there should be three files - query.fa, query1.fa and query5.fa
ls -al $HOME/queries

## From this point forward, it may be easier if you run these steps in a script. 
## Simply copy and paste all the commands below into a file named script.sh
## Then run the script in the background `nohup bash script.sh > script.out 2>&1 &`

## Step 3. Run BLAST
## Run BLAST using query1.fa (Sample 1) 
## This command will take approximately 9 minutes to complete.
## Expected output size: 3.1 GB  
docker run --rm \
  -v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
  -v $HOME/queries:/blast/queries:ro \
  -v $HOME/results:/blast/results:rw \
  ncbi/blast \
  blastn -query /blast/queries/query1.fa -db nt -num_threads 16 \
  -out /blast/results/blastn.query1.denovo16s.out

## Run BLAST using query5.fa (Samples 1-5) 
## This command will take approximately 30 minutes to complete.
## Expected output size: 10.4 GB  
docker run --rm \
  -v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
  -v $HOME/queries:/blast/queries:ro \
  -v $HOME/results:/blast/results:rw \
  ncbi/blast \
  blastn -query /blast/queries/query5.fa -db nt -num_threads 16 \
  -out /blast/results/blastn.query5.denovo16s.out

## Run BLAST using query.fa (All 28 samples) 
## This command will take approximately 147 minutes to complete.
## Expected output size: 47.8 GB  
docker run --rm \
  -v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
  -v $HOME/queries:/blast/queries:ro \
  -v $HOME/results:/blast/results:rw \
  ncbi/blast \
  blastn -query /blast/queries/query.fa -db nt -num_threads 16 \
  -out /blast/results/blastn.query.denovo16s.out

## Stdout and stderr will be in script.out
## BLAST output will be in $HOME/results
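
The three blastn runs differ only in the query file name, so script.sh could generate them in a loop and take the thread count from the VM instead of hard-coding 16. The sketch below is one possible arrangement, not the authors' script: blastn_commands is a hypothetical helper that prints one command per query file, and removing the echo would execute the commands instead.

```shell
#!/bin/sh
# Number of CPUs on this machine; falls back to 1 if it cannot be determined.
THREADS=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print one blastn command per query file (query1.fa, query5.fa, query.fa).
blastn_commands() {
    for q in query1 query5 query; do
        echo docker run --rm \
            -v "$HOME/blastdb:/blast/blastdb:ro" \
            -v "$HOME/blastdb_custom:/blast/blastdb_custom:ro" \
            -v "$HOME/queries:/blast/queries:ro" \
            -v "$HOME/results:/blast/results:rw" \
            ncbi/blast \
            blastn -query "/blast/queries/$q.fa" -db nt -num_threads "$THREADS" \
            -out "/blast/results/blastn.$q.denovo16s.out"
    done
}

blastn_commands
```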

You have completed the entire tutorial. At this point, if you do not need the downloaded data for further analysis, please delete the VM to prevent incurring additional cost.

To delete an instance, follow instructions in the section Stop the GCP instance.

For additional information, please refer to Google Cloud Platform's documentation on instance life cycle.

Amazon Web Services Setup

Overview

To run these examples, you'll need an Amazon Web Services (AWS) account. If you don't have one already, you can create an account that lets you explore and try out AWS services free of charge, up to specified limits for each service. To get started, visit the Free Tier site. Signing up requires a valid credit card; however, the card will not be charged if you stay within the Free Tier limits. When choosing a Free Tier product, be sure it's in the Product Category Compute.

Requirements

  • An AWS account
  • An EC2 VM running Linux, on an instance type of t2.micro
  • An SSH client, such as the native Terminal application on macOS, the CMD prompt on Windows 8 or later, or PuTTY on Windows

Example 1: Run BLAST+ Docker on an Amazon EC2 Virtual Machine

Step 1: Create an EC2 Virtual Machine (VM)

These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.

  1. Log into the AWS console and select the EC2 service.
  2. Start the instance creation process by selecting Launch Instance (a virtual machine)
  3. In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
  4. In the search box enter the value ECS-Optimized Amazon Linux AMI
  5. Select one of the Free tier eligible AMIs (Amazon ECS-Optimized Amazon Linux AMI), then select Continue
  6. In Step 2: Choose an Instance Type choose the t2.micro Type; select Next: Review and Launch
  7. Select Launch
  8. To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances.
  9. Select View Instances

Step 2: Establish an SSH session with the EC2 VM

With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.

There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.

The detailed instructions for connecting to a Linux VM can be found here.

Specify ec2-user, instead of root, as the username in your ssh command line or when prompted to log in.
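
For a standalone SSH client, the connection typically comes down to two commands: restricting the permissions of the key file, then connecting as ec2-user. The sketch below only prints them. The key file name my-key-pair.pem and the host name are hypothetical placeholders; use the .pem file you saved earlier and the public DNS name shown for your instance in the AWS Console.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own key file and instance address.
KEY_FILE="${KEY_FILE:-$HOME/my-key-pair.pem}"
EC2_HOST="${EC2_HOST:-ec2-203-0-113-25.compute-1.amazonaws.com}"

# Print the command that makes the private key readable by its owner only.
chmod_command() {
    echo "chmod 400 $1"
}

# Print the ssh command that logs in as ec2-user rather than root.
ssh_command() {
    echo "ssh -i $1 ec2-user@$2"
}

chmod_command "$KEY_FILE"
ssh_command "$KEY_FILE" "$EC2_HOST"
```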

Step 3: Import sequences and create a BLAST database

In this example, we will start by fetching query and database sequences and then create a custom BLAST database.

## Retrieve sequences
## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *

## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa

## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
    > fasta/nurse-shark-proteins.fsa

## Make BLAST database 
docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
    -v $HOME/fasta:/blast/fasta:ro \
    -w /blast/blastdb_custom \
    ncbi/blast \
    makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

To verify the newly created BLAST database above, you can run the following command to display the accessions, sequence length, and common name of the sequences in the database.

## Verify BLAST DB
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    ncbi/blast \
    blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"

Step 4: Run BLAST

When running BLAST in a Docker container, note the mounts (-v option) specified to the docker run command to make the input and outputs accessible. In the examples below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the :ro and :rw options, which mount the directories as read-only and read-write respectively.)

## Run BLAST+ 
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out

At this point, you should see the output file $HOME/results/blastp.out. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command more $HOME/results/blastp.out.
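
The default report is meant to be read by a person. If you add -outfmt 6 to the blastp command, BLAST writes tab-separated output instead, with the E-value in column 11 and the bit score in column 12 by default. The sketch below filters such a file by E-value with awk; the function name and the demo rows are made up for illustration and are not real BLAST results.

```shell
# Print hits from a BLAST tabular (-outfmt 6) file whose E-value is <= a threshold.
# Default -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
#                            qstart qend sstart send evalue bitscore
filter_hits_by_evalue() {
    awk -F '\t' -v max="$2" '($11 + 0) <= (max + 0)' "$1"
}

# Demo with made-up rows (not real BLAST results)
demo_hits="$(mktemp)"
printf 'q1\tsubjA\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t95.0\n' >  "$demo_hits"
printf 'q1\tsubjB\t30.0\t20\t14\t0\t1\t20\t5\t24\t2.5\t15.0\n'  >> "$demo_hits"
filter_hits_by_evalue "$demo_hits" 0.001   # prints only the subjA row
rm -f "$demo_hits"
```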

Step 5: Optional - Show BLAST databases available for download from the NCBI AWS bucket

docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source aws

The expected output is a list of BLAST DBs, including their name, description, size, and last updated date.

For a detailed description of update_blastdb.pl, please refer to the documentation. By default, update_blastdb.pl will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.

Step 6: Optional - Show BLAST databases available for download from NCBI

docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi

The expected output is a list of the names of BLAST DBs.

Step 7: Stop the EC2 VM

Remember to stop or terminate the VM to prevent incurring additional cost. You can do this from the EC2 Instance list in the AWS Console as shown below.

aws-instance-stop-or-terminate

Example 2: Run BLAST+ Docker on an Amazon EC2 Virtual Machine - Protein Data Bank Amino Acid DB

This example requires a multi-core host, so running it will incur EC2 compute charges. The current rate for the instance type used (t2.large) is $0.093/hr.
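
To get a rough idea of the compute charge, multiply the hourly rate by the number of hours the VM is running. The sketch below does this with awk; the 8-hour duration is an assumed value for illustration, and the result ignores storage, data transfer, and any other charges.

```shell
# Estimate on-demand compute cost: hourly rate (USD) times hours of run time.
estimate_cost() {
    awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.3f\n", rate * hours }'
}

estimate_cost 0.093 8   # 8 hours is an assumed duration
```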

Step 1: Create an EC2 Virtual Machine (VM)

These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.

  1. Log into the AWS console and select the EC2 service.
  2. Start the instance creation process by selecting Launch Instance (a virtual machine)
  3. In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
  4. In the search box enter the value ECS-Optimized Amazon Linux AMI
  5. Select one of the Free tier eligible AMIs; Amazon ECS-Optimized Amazon Linux AMI; select Continue
  6. In Step 2: Choose an Instance Type choose the t2.large Type; select Next: Review and Launch
  7. Select Launch
  8. To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances. You can use the same key pair as used in Example 1.
  9. Select View Instances

Step 2: Establish an SSH session with the EC2 VM

With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.

There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.

The detailed instructions for connecting to a Linux VM can be found here.

aws-ssh-connect-t2-large

Specify ec2-user as the username (not root), either in your ssh command line or when prompted to log in.

aws-ssh-connected

Step 2. Retrieve sequences

## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *

## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa

Step 3: Download Protein Data Bank Amino Acid Database (pdbaa)

The first command below mounts (using the -v option) the $HOME/blastdb path on the local machine as /blast/blastdb in the container and downloads the pdbaa database into it with update_blastdb.pl. The second command uses blastdbcmd to show the BLAST databases available at this location.

## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
     -v $HOME/blastdb:/blast/blastdb:rw \
     -w /blast/blastdb \
     ncbi/blast \
     update_blastdb.pl pdbaa

## Display database(s) in $HOME/blastdb
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    ncbi/blast \
    blastdbcmd -list /blast/blastdb -remove_redundant_dbs

You should see the output /blast/blastdb/pdbaa Protein.
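
In a script, this check could be automated by piping the blastdbcmd -list output into a small helper. The sketch below assumes each output line has the form shown above (database path, then molecule type); the function name is made up for illustration.

```shell
# Succeed if the named database appears in `blastdbcmd -list` output on stdin.
# Each input line is expected to look like: <path-to-db> <type>
has_blast_db() {
    awk -v name="$1" '
        { n = split($1, parts, "/"); if (parts[n] == name) found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Demo using the expected output line from this step
printf '/blast/blastdb/pdbaa Protein\n' | has_blast_db pdbaa && echo "pdbaa is present"
```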

Step 4: Run BLAST+

## Run BLAST+ 
docker run --rm \
     -v $HOME/blastdb:/blast/blastdb:ro \
     -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
     -v $HOME/queries:/blast/queries:ro \
     -v $HOME/results:/blast/results:rw \
     ncbi/blast \
     blastp -query /blast/queries/P01349.fsa -db pdbaa \
     -out /blast/results/blastp_pdbaa.out

At this point, you should see the output file $HOME/results/blastp_pdbaa.out. To view the content of this output file, use the command more $HOME/results/blastp_pdbaa.out.

Appendix

Appendix A: Transfer Files to/from an AWS VM

One way to transfer files between your local computer and a Linux instance is to use the secure copy protocol (SCP).

The section Transferring files to Linux instances from Linux using SCP of the Amazon EC2 User Guide for Linux Instances provides detailed instructions for this process.

BLAST Databases

The NCBI hosts the same databases on AWS, GCP, and the NCBI FTP site. The table below has the list of databases current as of November, 2022.

It is also possible to obtain the current list with the command:

docker run --rm ncbi/blast update_blastdb.pl --showall pretty

or

update_blastdb.pl --showall pretty # after downloading the BLAST+ package.

As shown above, update_blastdb.pl can also be used to download these databases. It will automatically select the appropriate resource (e.g., GCP if you are within that provider).

These databases can also be searched with ElasticBLAST on GCP and AWS.

Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.

| Name | Type | Title |
| --- | --- | --- |
| 16S_ribosomal_RNA | DNA | 16S ribosomal RNA (Bacteria and Archaea type strains) |
| 18S_fungal_sequences | DNA | 18S ribosomal RNA sequences (SSU) from Fungi type and reference material |
| 28S_fungal_sequences | DNA | 28S ribosomal RNA sequences (LSU) from Fungi type and reference material |
| Betacoronavirus | DNA | Betacoronavirus |
| GCF_000001405.38_top_level | DNA | Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds |
| GCF_000001635.26_top_level | DNA | Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds |
| ITS_RefSeq_Fungi | DNA | Internal transcribed spacer region (ITS) from Fungi type and reference material |
| ITS_eukaryote_sequences | DNA | ITS eukaryote BLAST |
| LSU_eukaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Eukaryotes |
| LSU_prokaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Prokaryotes |
| SSU_eukaryote_rRNA | DNA | Small subunit ribosomal nucleic acid for Eukaryotes |
| env_nt | DNA | environmental samples |
| nt | DNA | Nucleotide collection (nt) |
| patnt | DNA | Nucleotide sequences derived from the Patent division of GenBank |
| pdbnt | DNA | PDB nucleotide database |
| ref_euk_rep_genomes | DNA | RefSeq Eukaryotic Representative Genome Database |
| ref_prok_rep_genomes | DNA | Refseq prokaryote representative genomes (contains refseq assembly) |
| ref_viroids_rep_genomes | DNA | Refseq viroids representative genomes |
| ref_viruses_rep_genomes | DNA | Refseq viruses representative genomes |
| refseq_rna | DNA | NCBI Transcript Reference Sequences |
| refseq_select_rna | DNA | RefSeq Select RNA sequences |
| tsa_nt | DNA | Transcriptome Shotgun Assembly (TSA) sequences |
| env_nr | Protein | Proteins from WGS metagenomic projects |
| landmark | Protein | Landmark database for SmartBLAST |
| nr | Protein | All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects |
| pdbaa | Protein | PDB protein database |
| pataa | Protein | Protein sequences derived from the Patent division of GenBank |
| refseq_protein | Protein | NCBI Protein Reference Sequences |
| refseq_select_prot | Protein | RefSeq Select proteins |
| swissprot | Protein | Non-redundant UniProtKB/SwissProt sequences |
| tsa_nr | Protein | Transcriptome Shotgun Assembly (TSA) sequences |
| cdd | Protein | Conserved Domain Database (CDD) is a collection of well-annotated multiple sequence alignment models represented as position-specific score matrices |

Database Metadata

The NCBI provides metadata for the available BLAST databases at AWS, GCP and the NCBI FTP site.

Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.

On AWS and GCP, the metadata file is in a date-dependent subdirectory alongside the databases. To find the latest valid subdirectory, first read s3://ncbi-blast-databases/latest-dir (on AWS) or gs://blast-db/latest-dir (on GCP). latest-dir is a text file with a date stamp (e.g., 2020-09-29-01-05-01) specifying the most recent directory. The proper directory is the AWS or GCP base URI for the BLAST databases (e.g., s3://ncbi-blast-databases/ for AWS) plus the text in the latest-dir file. An example URI, in AWS, would be s3://ncbi-blast-databases/2020-09-29-01-05-01. The GCP URI would be similar.
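
The URI construction described above can be sketched as a small shell helper. The function name is made up for illustration. The commented-out aws command is one way the date stamp might be read on an AWS VM, assuming the AWS CLI is installed and has access to the bucket; the demo uses the example stamp from the text instead.

```shell
# Join the base URI and the date stamp read from latest-dir.
# Whitespace (including a trailing newline) is stripped from the stamp,
# and exactly one '/' is placed between the two parts.
latest_db_uri() {
    base="${1%/}"
    stamp="$(printf '%s' "$2" | tr -d '[:space:]')"
    printf '%s/%s\n' "$base" "$stamp"
}

# On an AWS VM the stamp could be read with, for example (assumption:
# the AWS CLI is installed and configured):
#   stamp="$(aws s3 cp s3://ncbi-blast-databases/latest-dir -)"
# Here the example stamp from the text is used instead.
latest_db_uri "s3://ncbi-blast-databases/" "2020-09-29-01-05-01"
```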

An excerpt from a metadata file is shown below. Most fields have obvious meanings. The files field lists the files that comprise the BLAST database. The bytes-total field gives the total size of the BLAST database in bytes and indicates how much disk space is required.

The example below is from AWS, but the metadata files on GCP have the same format. Databases on the FTP site are in gzipped tarfiles, one per volume of the BLAST database, so those are listed rather than the individual files.

"16S_ribosomal_RNA": {
    "version": "1.2",
    "dbname": "16S_ribosomal_RNA",
    "dbtype": "Nucleotide",
    "db-version": 5,
    "description": "16S ribosomal RNA (Bacteria and Archaea type strains)",
    "number-of-letters": 32435109,
    "number-of-sequences": 22311,
    "last-updated": "2022-03-07T11:23:00",
    "number-of-volumes": 1,
    "bytes-total": 14917073,
    "bytes-to-cache": 8495841,
    "files": [
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ndb",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nog",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nni",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nnd",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nsq",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nin",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ntf",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.not",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nhr",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nos",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nto",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.btd",
      "s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.bti"
    ]
  }
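
As a minimal sketch of how bytes-total might be used, the helpers below pull the value out of metadata text and convert it to whole mebibytes, which can then be compared with the free disk space on the VM. The function names are made up for illustration, and the sed expression assumes the pretty-printed layout shown above; a JSON-aware tool would be more robust.

```shell
# Extract the bytes-total value from metadata JSON on stdin.
# Line-oriented sketch: assumes one field per line, as in the excerpt.
bytes_total() {
    sed -n 's/.*"bytes-total": *\([0-9][0-9]*\).*/\1/p'
}

# Convert bytes to whole MiB, rounding up.
bytes_to_mib() {
    echo $(( ($1 + 1048575) / 1048576 ))
}

# Demo using the bytes-total line from the excerpt
size="$(printf '    "bytes-total": 14917073,\n' | bytes_total)"
echo "$size bytes is about $(bytes_to_mib "$size") MiB"
```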

Additional Resources

or email us.

Maintainer

National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)

License

View the license and copyright information for the software contained in this image.

As with all Docker images, these likely also contain other software which may be under other licenses (such as bash, etc., from the base distribution, along with any direct or indirect dependencies of the primary software being contained).

As with any pre-built image usage, it is the image user's responsibility to ensure that any use of this image complies with any relevant licenses for all software contained within.

Appendix

Appendix A. Cloud and Docker Concepts

Cloud-Docker-Simple

Figure 1. Docker and Cloud Computing Concept. Users can access compute resources provided by cloud service providers (CSPs), such as the Google Cloud Platform, using SSH tunneling (1). When you create a VM (2), a hard disk (also called a boot/persistent disk) (3) is attached to that VM. With the right permissions, VMs can also access other storage buckets (4) or other data repositories in the public domain. Once inside a VM with Docker installed, you can run a Docker image (5), such as NCBI's BLAST image. An image can be used to create multiple running instances or containers (6). Each container is in an isolated environment. In order to make data accessible inside the container, you need to use Docker bind mounts (7) described in this tutorial.

A Docker image can be used to create a Singularity image. Please refer to Singularity's documentation for more detail.

Appendix B. Alternative Ways to Run Docker

As an alternative to what is described above, you can also run BLAST interactively inside a container.

Run BLAST+ Docker image interactively

When to use: This is useful for running a few (e.g., fewer than 5-10) BLAST searches on small BLAST databases where you expect the search to complete in seconds/minutes.

docker run --rm -it \
    -v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    /bin/bash

# Once you are inside the container (note the root prompt), run the following BLAST commands.
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out

# To view output, run the following command
more /blast/results/blastp.out

# Leave container
exit

In addition, you can run BLAST in detached mode by running a container in the background.

Run BLAST+ Docker image in detached mode

When to use: This is a more practical approach if you have many (e.g., 10 or more) BLAST searches to run or you expect the search to take a long time to execute. In this case it may be better to start the BLAST container in detached mode and execute commands on it.

NOTE: Be sure to mount all required directories, as these need to be specified when the container is started.

# Start a container named 'blast' in detached mode
docker run --rm -dit --name blast \
    -v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    sleep infinity

# Check the container is running in the background
docker ps -a
docker ps --filter "status=running"

Once the container is confirmed to be running in detached mode, run the following BLAST command.

docker exec blast blastp -query /blast/queries/P01349.fsa \
    -db nurse-shark-proteins -out /blast/results/blastp.out

# View output
more $HOME/results/blastp.out

# stop the container
docker stop blast

If you run into issues with the docker stop blast command, reset the VM from the GCP Console or restart the SSH session.
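
When there are many searches to run, the docker exec call can be wrapped in a loop while the detached container is still running. The sketch below assumes the container is named blast as above and that the query files end in .fsa; the function name and the DRY_RUN switch are made up for illustration. By default it only prints the commands so they can be reviewed first.

```shell
# Run (or, by default, just print) one blastp search per query file using the
# already-running container named 'blast'. Set DRY_RUN=0 to execute for real.
run_all_queries() {
    for q in "$1"/*.fsa; do
        [ -e "$q" ] || continue                 # skip if no files match
        name="$(basename "$q" .fsa)"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "docker exec blast blastp -query /blast/queries/$name.fsa -db nurse-shark-proteins -out /blast/results/$name.out"
        else
            docker exec blast blastp -query "/blast/queries/$name.fsa" \
                -db nurse-shark-proteins -out "/blast/results/$name.out"
        fi
    done
}

# The directory passed in is the host directory mounted at /blast/queries
run_all_queries "$HOME/queries"
```

Each search writes to its own file under $HOME/results, named after the query file.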

Appendix C. Transfer Files to/from a GCP VM

The example below copies the file $HOME/script.out from the home directory on a local machine to the home directory on a GCP VM named instance-1 in the project My First Project, using the GCP Cloud SDK.

GCP documentation

First install GCP Cloud SDK command line tools for your operating system.

# First, set up gcloud tools
# From local machine's terminal

gcloud init

# Enter a configuration name
# Select the sign-in email account
# Select a project, for example “my-first-project”
# Select a compute engine zone, for example, “us-east4-c”

# To copy the file $HOME/script.out to the home directory of GCP instance-1 
# Instance name can be found in your Google Cloud Console -> Compute Engine -> VM instances

gcloud compute scp $HOME/script.out instance-1:~

# Optional - to transfer the file from the GCP instance to a local machine's home directory

gcloud compute scp instance-1:~/script.out $HOME/.