This workflow was created to efficiently process and visualize the data presented in Supplemental Figure 2 of Halfmann and Minor et al. 2022, Evolution of a globally unique SARS-CoV-2 Spike E484T monoclonal antibody escape mutation in a persistently infected, immunocompromised individual. Its purpose is to compare viral evolution within one long-term host with SARS-CoV-2's evolution worldwide, across all hosts. Ultimately, it shows that the virus can accrue a similar number of substitutions within one long-term host as it can across thousands of short-term hosts. In other words, the persistently infecting virus we document in our manuscript recapitulated global SARS-CoV-2 evolution within a single host.
We pull all consensus sequences and metadata used in this figure from GenBank. If needed, the workflow can download these automatically; more on that below.
If you already have Docker and NextFlow installed on your system, simply run the following command in the directory of your choice:
nextflow run dholab/prolonged-infection-suppfig2 -latest
This command automatically pulls the workflow from GitHub and runs it. If you do not have Docker and NextFlow installed, or want to tweak any of the default configurations in the workflow, proceed to the following sections.
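Before running the quick-start command, it can help to confirm that both prerequisites are visible on your PATH. The following is a minimal sketch; check_tool is a hypothetical helper written for this example and is not part of the workflow.

```shell
# Hypothetical helper (not part of the workflow): report whether a
# command is available on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Check the two prerequisites named above. "|| true" keeps a strict
# shell (set -e) from exiting when a tool is missing.
check_tool docker || true
check_tool nextflow || true
```

If either tool is reported as missing, the installation sections below apply.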
To run this workflow, start by running git clone to bring the workflow files into your directory of choice, like so:
git clone https://github.com/dholab/prolonged-infection-suppfig2.git .
Once the workflow bundle is in place, you may need to double-check that the workflow scripts are executable by running the following in the command line:
chmod +x bin/*
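To see whether the chmod step is needed at all, you can list the scripts that are not yet executable. This is a small sketch; list_non_executable is a hypothetical helper, and it assumes you are in the directory that contains bin/.

```shell
# Hypothetical helper: print every regular file in a directory that is
# not yet executable by the current user.
list_non_executable() {
    for f in "$1"/*; do
        if [ -f "$f" ] && [ ! -x "$f" ]; then
            echo "$f"
        fi
    done
}

# Only inspect bin/ if it exists in the current directory.
if [ -d bin ]; then
    list_non_executable bin
fi
```

An empty result means every script in bin/ is already executable.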
You will also need to install the Docker engine if you haven't already. The workflow pulls all the software it needs automatically from Docker Hub, which means you will never need to permanently install that software on your system. To install Docker, simply visit the Docker installation page at https://docs.docker.com/get-docker/.
This workflow uses the NextFlow workflow manager. We recommend you install NextFlow to your system in one of the two following ways:
Option 1, with conda:
- Install the miniconda python distribution, if you haven't already: https://docs.conda.io/en/latest/miniconda.html
- Install the mamba package installation tool in the command line, if not already installed:
conda install -y -c conda-forge mamba
- Install Nextflow to your base environment:
mamba install -c bioconda nextflow
Option 2, with curl:
- Run the following line of code in the directory where you'd like to install NextFlow:
curl -fsSL https://get.nextflow.io | bash
- Add this directory to your $PATH. If on MacOS, a helpful guide can be viewed here.
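For the curl route, adding the install directory to $PATH for the current shell session might look like the sketch below. prepend_to_path and the NEXTFLOW_DIR variable are names made up for this example.

```shell
# Hypothetical helper: prepend a directory to PATH for the current
# shell session, skipping it if it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# Example: expose the directory that holds the downloaded nextflow
# launcher. NEXTFLOW_DIR is an assumed variable name; it defaults to
# the current directory.
prepend_to_path "${NEXTFLOW_DIR:-$PWD}"
```

This change lasts only for the current session. To keep it, add an equivalent export PATH line to your shell startup file (for example ~/.bashrc or ~/.zshrc).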
To double check that the installation was successful, type nextflow -v into the terminal. If it returns something like nextflow version 21.04.0.5552, you are ready to proceed.
With the workflow files cloned, Docker installed, and NextFlow installed, you are ready to run the workflow. To do so, simply change into the workflow directory and run the following in the BASH terminal:
nextflow run prolonged_infection_suppfig2.nf
If the workflow runs partway, but a computer outage or other issue interrupts its progress, no need to start over! Instead, run:
nextflow run prolonged_infection_suppfig2.nf -resume
The workflow's configurations (see below) tell NextFlow to plot the workflow and record run statistics. However, to visualize the workflow itself as a directed acyclic graph (DAG), NextFlow requires the package GraphViz, which is easiest to install via the instructions on GraphViz's website. If you do not install GraphViz, the workflow will still run as expected, but will not produce a DAG png.
The following runtime parameters have been set for the whole workflow in the file nextflow.config:
- subsample_size - the number of SARS-CoV-2 samples to compare with the persistent infection in the final plot. We chose 5000 as our default, but this can be changed to any number as shown in the code block below. NOTE: If you choose a number other than our default, the workflow will need to gather a new subsample from GenBank. To prompt it to do so, simply delete the file include_list.csv from the resources/ directory. This process of pulling data from GenBank generally takes 4 to 6 hours, but may take longer depending on your system.
- min_date - the earliest date to display in the plot; our default is September 1st, 2020. Dates specified here must be in "YYYY-mm-dd" format. Specifying different dates makes it possible to use this workflow for cases other than that described in our manuscript.
- max_date - the latest date to display in the plot; default is February 1st, 2022. Dates specified here must be in "YYYY-mm-dd" format.
- random_sample_seed - a seed that determines which GenBank accessions are pseudo-randomly selected by the script select_subsample.R. Our default value for this seed is 14; changing this will result in a final figure that is not identical to the figure in our manuscript.
- refseq - path to the SARS-CoV-2 reference sequence (GenBank Accession MN908947.3), which is in the workflow subdirectory resources/
- results - path to the default workflow output directory
- results_data_files - path to a subdirectory of results/ where data files like VCFs are placed.
- visuals - path to a subdirectory of results/ where graphics are stored. This is where the final supplementary figure 2 PDF is placed.
These parameters can be altered in the command line with a double-dash flag, like so:
nextflow run prolonged_infection_suppfig2.nf --min_date "2020-07-01" --max_date "2022-08-01" --random_sample_seed 7 --subsample_size 9999
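Because the date parameters must be in "YYYY-mm-dd" format, it may be worth checking them before starting a long run. The sketch below is one way to do that in the shell. is_iso_date and dates_in_order are hypothetical helpers, not part of the workflow, and the workflow may perform its own validation.

```shell
# Hypothetical helper: succeed only if the argument looks like a
# YYYY-mm-dd date. This checks the shape and the month/day ranges,
# not whether the day exists in that month (2022-02-30 passes).
is_iso_date() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
}

# Hypothetical helper: check that the first date is not later than the
# second. ISO dates sort correctly as plain strings, so sort is enough.
dates_in_order() {
    first=$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)
    [ "$first" = "$1" ]
}

min_date="2020-09-01"
max_date="2022-02-01"
if is_iso_date "$min_date" && is_iso_date "$max_date" &&
    dates_in_order "$min_date" "$max_date"; then
    echo "dates OK"
    # nextflow run prolonged_infection_suppfig2.nf --min_date "$min_date" --max_date "$max_date"
else
    echo "check --min_date and --max_date" >&2
fi
```

The nextflow command is left commented out so that the check can be tried on its own.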
- First, in the process PULL_METADATA, the workflow looks for the pre-existing include file specified in the configuration file, nextflow.config. By default, this file is called include_list.csv. If this process does not find a pre-existing include file, it will pull metadata for all complete SARS-CoV-2 genomes in GenBank. To do so, it uses the excellent NCBI datasets command-line interface. This has generally taken 4 to 6 hours on a standard desktop computer.
- In the process REFORMAT_METADATA, the workflow either sees that the include file already exists, as in the process PULL_METADATA, or it converts the GenBank metadata from PULL_METADATA into a convenient tab-separated format. To do so, it uses the handy NCBI dataformat tool.
- In the process SELECT_SUBSAMPLE, the workflow either a) sees that the include file already exists and streams it into the next process, or b) takes the TSV-formatted metadata from REFORMAT_METADATA and generates a new include file with the R script select_subsample.R. Note that the workflow configuration file, nextflow.config, specifies 4 parameters for this script, as described above: subsample_size, min_date, max_date, and random_sample_seed. These settings can all be changed to produce different results with different input data.
- Next, in the process PULL_FASTAs, the workflow splits the include file row by row, and pulls a FASTA file separately for the accession in each row. This makes it possible for multiple FASTAs to be pulled and then pushed to the next step at once.
- Each GenBank FASTA is then aligned to Wuhan-1 in the process SUBSAMPLE_ALIGNMENT. As in PULL_FASTAs, these alignments can take place in parallel to one another, making the workflow more efficient.
- In the process SUBSAMPLE_VARIANT_CALLING, we use the callvariants.sh script from BBTools to identify mutations in each of the GenBank accessions.
- Finally, in the process SUPP_FIGURE_2_PLOTTING, all VCFs from SUBSAMPLE_VARIANT_CALLING are brought together with some pre-existing inputs and plotted with the script SupplementalFigure2_global_roottotip_plot.R. The final plot is placed in results/visuals/, and the associated plotting data are placed in results/ as {run date}.csv. With that, the workflow is finished!
The steps described above are also visualized in the file prolonged_infection_suppfig2_dag.png.
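The idea behind SELECT_SUBSAMPLE, filtering by a date window and then drawing a seeded pseudo-random subsample, can be illustrated with standard shell tools. The sketch below is not a port of select_subsample.R: it assumes a hypothetical two-column, tab-separated input (accession, collection date) with no header row. Because awk and R use different random number generators, it will not select the same accessions as the workflow, even with the same seed.

```shell
# Hypothetical helper, NOT a port of select_subsample.R.
# Usage: subsample_tsv FILE MIN_DATE MAX_DATE SIZE SEED
# Assumed input: tab-separated, column 1 = accession,
# column 2 = collection date in YYYY-mm-dd format, no header row.
subsample_tsv() {
    awk -F '\t' -v min="$2" -v max="$3" -v seed="$5" '
        BEGIN { srand(seed) }
        # ISO dates compare correctly as strings; keep rows inside the
        # window and tag each one with a pseudo-random key.
        ($2 "") >= min && ($2 "") <= max { printf "%.10f\t%s\n", rand(), $0 }
    ' "$1" |
        LC_ALL=C sort |     # order rows by their random key
        head -n "$4" |      # keep the first SIZE rows
        cut -f 2-           # drop the key again
}

# Example call with the default settings described above
# (metadata.tsv is a made-up file name):
# subsample_tsv metadata.tsv 2020-09-01 2022-02-01 5000 14
```

Fixing the seed makes repeated runs on the same input with the same awk implementation return the same rows, which is the property the random_sample_seed parameter provides for the real script.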
Bundled together with the workflow are:
- the SARS-CoV-2 Wuhan-1 sequence from GenBank Accession MN908947.3, which is in resources/ and is called reference.fasta
- the file data/patient_variant_counts.csv, which comes from the FIGURE_2A_PLOTTING process in the main workflow, available at the GitHub repository for this project.
- a list of GenBank accessions to include in the final plot, called include_list.csv. If this file is absent, the workflow will pull the SARS-CoV-2 metadata from GenBank and generate the include list again. Note that the R script in the bin/ directory, select_subsample.R, uses a consistent seed (random_sample_seed, 14 by default).
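If you want the workflow to regenerate the include list, moving the bundled file aside is a cautious alternative to deleting it, since rebuilding it takes hours. The sketch below assumes the workflow looks for the file by its exact name. stash_include_list and the .bak suffix are choices made for this example only.

```shell
# Hypothetical helper: move the bundled include list aside (rather than
# deleting it) so the workflow no longer finds it and regenerates it.
# The ".bak" suffix is an arbitrary choice for this sketch.
stash_include_list() {
    include_file="${1:-resources/include_list.csv}"
    if [ -f "$include_file" ]; then
        mv "$include_file" "$include_file.bak"
        echo "moved $include_file to $include_file.bak"
    else
        echo "no include list at $include_file"
    fi
}
```

Calling stash_include_list with no argument from the workflow directory acts on resources/include_list.csv. Moving the .bak file back restores the bundled list.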
- results/visuals/figsupp2_global_roottotip_plot.pdf - the final plot, which can be edited in Illustrator or other vector-graphic software.
- results/{run date}.csv - a table of all the GenBank accessions used in making the plot, along with when they were collected, how many mutations they have that separate them from Wuhan-1, and what pango lineage they are. This is the final dataset that is used to make Supplemental Figure 2.
- results/data/subsample_seqs.fasta.xz - a compressed FASTA file containing the consensus sequences for all GenBank accessions in the figure.
- results/data/subsample_vcf_files.tar.xz - a compressed bundle of all the VCFs generated to make the figure.
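The compressed outputs can be inspected without unpacking them to disk. The following sketch assumes the xz command-line tool is installed. count_fasta_records is a hypothetical helper, and both checks are skipped if the files are not present.

```shell
# Hypothetical helper: count FASTA records read from standard input by
# counting header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' || true   # grep exits non-zero when the count is 0
}

# Paths as listed above; each check is skipped if the file is absent
# or the xz tool is not installed.
seqs=results/data/subsample_seqs.fasta.xz
vcfs=results/data/subsample_vcf_files.tar.xz
if [ -f "$seqs" ] && command -v xz >/dev/null 2>&1; then
    xz -dc "$seqs" | count_fasta_records
fi
if [ -f "$vcfs" ] && command -v xz >/dev/null 2>&1; then
    # Decompress with xz and let tar list the first few archive members.
    xz -dc "$vcfs" | tar -tf - | head
fi
```

With the default settings, the sequence count would be expected to relate to subsample_size, though the exact number depends on what the workflow retains.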
This workflow was created by Nicholas R. Minor. To report any issues, please visit the GitHub repository for this project.