Commit b9bcef6

Merge pull request #118 from sanger-tol/db_install

Database installation instructions

tkchafin authored Oct 21, 2024
2 parents 93bda84 + 61c0328

1 changed file: docs/usage.md (26 additions, 25 deletions)
It is a good idea to put a date suffix for each database location so you know at a glance how recent it is.

#### 1. NCBI taxdump database

Create the database directory, then retrieve and decompress the NCBI taxdump:

```bash
DATE=2024_10
TAXDUMP=/path/to/databases/taxdump_${DATE}
mkdir -p "$TAXDUMP"
curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -xzf - -C "$TAXDUMP"
```
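
The `nodes.dmp` and `names.dmp` files from this archive are used again when building the Diamond database in step 3, so a quick optional check that they unpacked correctly (not part of the original instructions) can save trouble later:

```bash
ls -lh "$TAXDUMP"/nodes.dmp "$TAXDUMP"/names.dmp
```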

#### 2. NCBI nucleotide BLAST database

Create the database directory and move into the directory:

```bash
DATE=2024_10
NT=/path/to/databases/nt_${DATE}
mkdir -p $NT
cd $NT
```

Retrieve the NCBI BLAST nt database (version 5) files and extract them.
`wget` and the use of the FTP protocol are necessary to resolve the wildcard `nt.???.tar.gz`.
We are using the `&&` syntax to ensure that each command completes without error before the next one is run:

```bash
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.???.tar.gz" -P $NT/ &&
Expand All @@ -121,44 +117,49 @@ rm taxdb.tar.gz
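
The elided middle of that block typically unpacks each `nt` volume and fetches the BLAST taxonomy files. Below is a sketch under the assumption of the standard NCBI BLAST v5 layout, not a verbatim copy of the full `docs/usage.md`:

```bash
# Assumed intermediate steps (sketch): unpack each downloaded nt volume,
# then fetch and unpack the BLAST taxonomy files before the final clean-up.
for file in $NT/nt.*.tar.gz; do
  tar -xzf "$file" -C "$NT" && rm "$file"
done &&
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/taxdb.tar.gz" -P "$NT/" &&
tar -xzf "$NT/taxdb.tar.gz" -C "$NT"
```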

#### 3. UniProt reference proteomes database

You need [DIAMOND](https://github.com/bbuchfink/diamond) installed for this step.
The easiest way is probably to install a [pre-compiled release](https://github.com/bbuchfink/diamond/releases).
Make sure you have a recent version of Diamond (2.x or later), otherwise the `--taxonnames` argument may not work.
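
For example, on Linux x86_64 the pre-compiled binary can be fetched roughly like this (the asset name `diamond-linux64.tar.gz` is an assumption; check the releases page for the current file name):

```bash
# Sketch: download and unpack the pre-built DIAMOND binary, then check the version.
wget https://github.com/bbuchfink/diamond/releases/latest/download/diamond-linux64.tar.gz
tar -xzf diamond-linux64.tar.gz   # unpacks a single `diamond` executable
./diamond version                 # should report 2.x or later
```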

Create the database directory and move into the directory:

```bash
DATE=2024_10
UNIPROT=/path/to/databases/uniprot_${DATE}
mkdir -p $UNIPROT
cd $UNIPROT
```

The UniProt `Refseq_Proteomes_YYYY_MM.tar.gz` file is very large (close to 200 GB) and will take a long time to download.
The command below looks complex because it needs to work around the fact that neither `wget` nor `curl` can expand wildcards in the remote file name.

```bash
EBI_URL=ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/
mkdir extract
curl -L $EBI_URL/$(curl -vs $EBI_URL 2>&1 | awk '/tar.gz/ {print $9}') | \
tar -xzf - -C extract

# Create a single fasta file with all the fasta files from each subdirectory:
find extract -type f -name '*.fasta.gz' ! -name '*_DNA.fasta.gz' ! -name '*_additional.fasta.gz' -exec cat '{}' '+' > reference_proteomes.fasta.gz

# create the accession-to-taxid map for all reference proteome sequences:
printf "accession\taccession.version\ttaxid\tgi\n" > reference_proteomes.taxid_map
zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map
find extract -type f -name '*.idmapping.gz' -exec zcat {} + | \
awk 'BEGIN {OFS="\t"; print "accession", "accession.version", "taxid", "gi"} $2=="NCBI_TaxID" {print $1, $1, $3, 0}' > reference_proteomes.taxid_map

# create the taxon aware diamond blast database
diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes $TAXDUMP/nodes.dmp --taxonnames $TAXDUMP/names.dmp -d reference_proteomes.dmnd

# clean up
mv extract/{README,STATS} .
rm -r extract
```
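
As an optional sanity check on the finished database, Diamond can print its metadata (assuming `diamond dbinfo` is available in your version):

```bash
diamond dbinfo -d reference_proteomes.dmnd   # prints database format version and sequence/letter counts
```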

#### 4. BUSCO databases

Create the database directory and move into the directory:

```bash
DATE=2024_10
BUSCO=/path/to/databases/busco_${DATE}
mkdir -p $BUSCO
cd $BUSCO
```
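
The commands for downloading the lineage data are not shown above. One common approach is to mirror the archives from BUSCO's public data server and unpack them in place; this is a sketch under the assumption that `https://busco-data.ezlab.org/v5/data/lineages/` is the intended source:

```bash
# Sketch only: mirror the BUSCO v5 lineage archives, then unpack and remove each one.
wget -r -np -nH --cut-dirs=2 -A "*.tar.gz" https://busco-data.ezlab.org/v5/data/lineages/
for archive in lineages/*.tar.gz; do
  tar -xzf "$archive" -C lineages/ && rm "$archive"
done
```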
