This GitHub repository analyzes SARS-CoV-2 deep sequencing data recovered from the deleted BioProject PRJNA612766. This analysis corresponds to the work described in this paper.
Specifically:
- this tag corresponds to the initial bioRxiv pre-print
- this tag corresponds to the revised bioRxiv pre-print
- this tag corresponds to the final version accepted by MBE
The analysis is nearly fully automated by the snakemake pipeline included in Snakefile.
The configuration for the analysis is in config.yaml.
Note that the pipeline is somewhat convoluted and performs a variety of steps only tangentially related to the paper corresponding to this study.
The reason is that the study started simply as an effort to validate the analyses in the joint WHO-China report on COVID-19 origins, but then gradually shifted in goal upon the discovery of the deleted data set.
For this reason, there are still some vestigial parts of the code and analysis structure.
The only manual step is to download existing coronavirus sequences from GISAID. This requires a GISAID account, since GISAID data sharing terms prevent distribution of their sequences.
To get these sequences, download both the *.metadata.tsv.xz and *.fasta.xz files for the accessions in data/gisaid_sequences_through_Feb2020/accessions.txt to the subdirectory data/gisaid_sequences_through_Feb2020/, and the same two files for the accessions in data/comparator_genomes_gisaid/accessions.txt to the subdirectory data/comparator_genomes_gisaid/.
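Before building the environment, it can help to confirm that both file types landed in both subdirectories. The following is a minimal sketch, not part of the repository: the function names has_suffix_file and check_gisaid_dir are hypothetical, and the check only looks at file name suffixes, not at which accessions the files contain.

```shell
# Hypothetical helper: return 0 if at least one file in directory $1
# has a name ending with suffix $2.
has_suffix_file() {
    for f in "$1"/*"$2"; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Hypothetical helper: check one download directory for both GISAID
# file types, reporting each missing type on stderr.
check_gisaid_dir() {
    rc=0
    for suffix in .metadata.tsv.xz .fasta.xz; do
        if ! has_suffix_file "$1" "$suffix"; then
            echo "missing *$suffix in $1" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Run from the top of the repository.
for dir in data/gisaid_sequences_through_Feb2020 data/comparator_genomes_gisaid; do
    check_gisaid_dir "$dir" || echo "GISAID download incomplete for $dir" >&2
done
```

If nothing is printed, each subdirectory contains at least one file of each type.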
After downloading these sequences and ensuring you have installed conda, build the main conda environment for the pipeline with:
conda env create -f environment.yml
Then activate the conda environment with:
conda activate SARS-CoV-2_PRJNA612766
You can then run the entire analysis with:
snakemake -j 1 --use-conda
Note that you need the --use-conda flag because one of the rules in Snakefile uses a separate environment as specified in environment_ete3.yml.
The above command will run the snakemake pipeline using just one computing core.
If you want to use more cores, adjust the value passed by -j appropriately.
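As an illustration, the sketch below builds the same command with four cores and, if snakemake is on the PATH, first asks for a dry run with snakemake's -n option so you can preview the jobs without executing them. The core count of 4 and the variable names CORES and RUN_CMD are assumptions chosen for this example, not values from the repository.

```shell
# Illustrative core count (an assumption); set this to what your machine has.
CORES=4

# The same command as above, but with more cores.
RUN_CMD="snakemake -j $CORES --use-conda"

if command -v snakemake >/dev/null 2>&1; then
    # -n requests a dry run: list the jobs without executing them.
    $RUN_CMD -n || echo "dry run failed; check the Snakefile and inputs" >&2
else
    echo "snakemake not found; activate the conda environment first" >&2
fi

# When the dry run looks right, start the real run with this command:
echo "$RUN_CMD"
```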
If you have access to a computing cluster you can distribute the run across the cluster.
For the Fred Hutch computing cluster, that can be done using cluster.yaml by running the pipeline with the commands in run_Hutch_cluster.bash.
The input data needed for the analysis are all available in the ./data/ subdirectory, which contains a README describing the files therein.
The results of running the pipeline are placed in the ./results/ subdirectory. Most of these results are not tracked in this GitHub repo, but some key files are, as described in the Methods of the paper associated with this study.
The code used to process the Excel supplementary table of accessions from project PRJNA612766 to generate the information found in config.yaml is in ./manual_analyses/PRJNA612766/.
The LaTeX source for the paper and its figures is found in the ./paper/ subdirectory.
In late 2024, I received a request from GISAID compliance to remove certain files: they said they had been made aware of files in the repo that violated their data sharing terms. I therefore made the following changes:
On Nov-21-2024, I made commit e9d972519, which removed these files from the current state of the repository.
On Dec-7-2024, I fully removed the files from the git history with the following commands:
git filter-repo --path data/gisaid_sequences_through_Feb2020/1622384383620.metadata.tsv.xz --invert-paths
git filter-repo --path data/comparator_genomes_gisaid/1622468911409.metadata.tsv.xz --invert-paths
git filter-repo --path results/early_sequences/deltadist.csv --invert-paths
I then made a new version of results/early_sequences/deltadist.csv in which I had removed all columns that contained mutations, and committed it along with this update to the README.
I then force-pushed these changes with git push --force.
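One way to confirm that a rewrite like this took effect in a local clone is to ask git whether any commit on any ref still touches the removed paths. This is a sketch under assumptions, not a record of what was actually run: path_in_history is a hypothetical helper name, and the check covers only the refs present in the clone where it is run. The deltadist.csv file is left out because a new version of it was committed afterwards, so it is expected to appear in the history.

```shell
# Hypothetical helper: return 0 if any commit reachable from any ref
# touches path $1, and 1 otherwise (including outside a git repository).
path_in_history() {
    [ -n "$(git log --all --format=%H -- "$1" 2>/dev/null)" ]
}

# Run from the top of the repository.
for path in \
    data/gisaid_sequences_through_Feb2020/1622384383620.metadata.tsv.xz \
    data/comparator_genomes_gisaid/1622468911409.metadata.tsv.xz
do
    if path_in_history "$path"; then
        echo "still in history: $path"
    else
        echo "not in history: $path"
    fi
done
```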