Merge branch 'add/picklist' into add/picklist_selectors
ctb committed Jun 14, 2021
2 parents a88b66d + 031522c commit 5ac4671
Showing 3 changed files with 149 additions and 78 deletions.
97 changes: 24 additions & 73 deletions doc/databases.md
@@ -1,86 +1,37 @@
# Prepared search databases
# Prepared databases

We provide several databases for download. Note that these databases can
be used with both sourmash v3.5 and sourmash v4.0.
## GTDB R06-rs202 - DNA databases

## RefSeq microbial genomes - SBT
All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for `sourmash lca index`) is available [here](https://osf.io/p6z3w/).

These databases are formatted for use with `sourmash search` and
`sourmash gather`. They are calculated with a scaled value of 2000.
For each k-mer size, three databases are available.

Approximately 91,000 microbial genomes (including viral and fungal)
from NCBI RefSeq.
* Zipfile collections can be used for a linear search. The signatures were calculated with a scaled value of 1000, which robustly supports searches for ~10kb or larger matches.
* SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000.
* LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information and can be used with regular search as well as with [the `lca` subcommands for taxonomic analysis](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-subcommands-for-taxonomic-classification). They are indexed with scaled=10,000, which robustly supports searches for 100kb or larger matches.

* [RefSeq k=21, 2018.03.29][0] - 3.3 GB - [manifest](https://osf.io/wamfk/download)
* [RefSeq k=31, 2018.03.29][1] - 3.3 GB - [manifest](https://osf.io/x3aut/download)
* [RefSeq k=51, 2018.03.29][2] - 3.4 GB - [manifest](https://osf.io/zpkau/download)
You can read more about the different database and index types [here](https://sourmash.readthedocs.io/en/latest/command-line.html#indexed-databases).
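
As a rough illustration of why scaled=1000 is paired with ~10kb matches and scaled=10,000 with ~100kb matches: a scaled signature keeps roughly one in every `scaled` distinct k-mers, so a shared region contributes about (length / scaled) hashes. The sketch below is only back-of-the-envelope arithmetic; the helper name `expected_hashes` is hypothetical, it assumes every k-mer in the region is distinct, and it does not reproduce any threshold that sourmash itself applies.

```python
def expected_hashes(match_length_bp: int, scaled: int, ksize: int = 31) -> float:
    """Approximate number of hashes a shared region adds to a scaled sketch.

    Assumes all k-mers in the region are distinct and that roughly
    1 in `scaled` of them is retained.
    """
    n_kmers = max(match_length_bp - ksize + 1, 0)
    return n_kmers / scaled


for scaled, length in [(1000, 10_000), (10_000, 100_000)]:
    approx = expected_hashes(length, scaled)
    print(f"scaled={scaled}: a {length} bp match yields ~{approx:.1f} hashes")
```

Both pairings come out at about ten hashes per match, which is one way to read the "robustly supports" wording: the larger scaled value needs a proportionally longer match to leave a comparable number of shared hashes.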

## Genbank microbial genomes - SBT
Legacy databases are available [here](legacy-databases.md).

These databases are formatted for use with `sourmash search` and
`sourmash gather`.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up.

Approximately 98,000 microbial genomes (including viral and fungal)
from NCBI Genbank.
### GTDB genomic representatives (47.8k genomes)

* [Genbank k=21, 2018.03.29][3] - 3.9 GB - [manifest](https://osf.io/vm5kb/download)
* [Genbank k=31, 2018.03.29][4] - 3.9 GB - [manifest](https://osf.io/p87ec/download)
* [Genbank k=51, 2018.03.29][5] - 3.9 GB - [manifest](https://osf.io/cbxg9/download)
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes.

### Details
| K-mer size | Zipfile collection | SBT | LCA |
| -------- | -------- | -------- | ---- |
| 21 | [download (1.3 GB)](https://osf.io/jp5zh/download) | [download (2.6 GB)](https://osf.io/py92w/download) | [download (114 MB)](https://osf.io/gk2za/download) |
| 31 | [download (1.3 GB)](https://osf.io/nqmau/download) | [download (2.6 GB)](https://osf.io/w4bcm/download) | [download (131 MB)](https://osf.io/ypsjq/download) |
| 51 | [download (1.3 GB)](https://osf.io/px6qd/download) | [download (2.6 GB)](https://osf.io/rv9zp/download) | [download (137 MB)](https://osf.io/297dp/download) |

The individual signatures for the above SBTs were calculated as follows:
### GTDB all genomes (258k genomes)

```
sourmash compute -k 4,5 \
-n 2000 \
--track-abundance \
--name-from-first \
-o {output} \
{input}
These databases contain the complete GTDB collection of 258,406 genomes.

sourmash compute -k 21,31,51 \
--scaled 2000 \
--track-abundance \
--name-from-first \
-o {output} \
{input}
```

See [github.com/dib-lab/sourmash_databases](https://github.com/dib-lab/sourmash_databases) for a Snakemake workflow
to build the databases.

[0]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k21.sbt.zip
[1]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k31.sbt.zip
[2]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k51.sbt.zip

[3]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k21.sbt.zip
[4]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k31.sbt.zip
[5]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k51.sbt.zip

## Genbank LCA Database

These databases are formatted for use with `sourmash lca`; they are
v2 LCA databases and will work with sourmash v2.0a11 and later.
They are calculated with a scaled value of 10000 (1e4).

Approximately 87,000 microbial genomes (including viral and fungal)
from NCBI Genbank.

* [Genbank k=21, 2017.11.07](https://osf.io/d7rv8/download), 109 MB
* [Genbank k=31, 2017.11.07](https://osf.io/4f8n3/download), 120 MB
* [Genbank k=51, 2017.11.07](https://osf.io/nemkw/download), 125 MB

### Details

The above LCA databases were calculated as follows:

```
sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
genbank-k21.lca.json.gz -k 21 --scaled=10000 \
-f --traverse-directory .sbt.genbank-k21 --split-identifiers
```

See
[github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages)
for information on preparing the genbank-genomes-taxonomy file.
| K-mer size | Zipfile collection | SBT | LCA |
| -------- | -------- | -------- | ---- |
| 21 | [download (7.8 GB)](https://osf.io/vgex4/download) | [download (15 GB)](https://osf.io/ar67j/download) | [download (266 MB)](https://osf.io/hm3c4/download) |
| 31 | [download (7.8 GB)](https://osf.io/94mzh/download) | [download (15 GB)](https://osf.io/dmsz8/download) | [download (286 MB)](https://osf.io/9xdg2/download) |
| 51 | [download (7.8 GB)](https://osf.io/x9cdp/download) | [download (15 GB)](https://osf.io/8fc3t/download) | [download (299 MB)](https://osf.io/3cdp6/download) |
112 changes: 112 additions & 0 deletions doc/legacy-databases.md
@@ -0,0 +1,112 @@
# Legacy Databases

Sourmash databases have evolved over time.
We have changed how the database is stored (uncompressed `.zip`) and how we name each signature.
All SBT databases below are in `.sbt.zip` format.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up.
We detail these changes below, and include links to legacy databases.
See [github.com/dib-lab/sourmash_databases](https://github.com/dib-lab/sourmash_databases) for a Snakemake workflow that builds current and legacy databases.

## Sourmash signature names

Earlier versions of sourmash databases were built using individual signatures that were calculated as follows:

```
sourmash compute -k 4,5 \
    -n 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}

sourmash compute -k 21,31,51 \
    --scaled 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}
```

We moved away from this strategy because `--name-from-first` named each signature from the name of the first sequence in the FASTA file.
While the species name of the organism was present in this name, the accession number corresponded to the accession of the first sequence fragment in the file, not the genome assembly.
As such, we revised our strategy so that signatures are named by genome assembly accession and species name.
This requires the `assembly_summary.txt` file to be parsed.
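
A minimal sketch of the revised naming idea, assuming a tab-separated `assembly_summary.txt` with `#`-prefixed header lines, the assembly accession in the first column, and the organism name in the eighth. The sample rows are made-up placeholders and the function name `signature_names` is hypothetical; this is not the code used to build the databases.

```python
import csv
import io

# Made-up rows in the general shape of an NCBI assembly_summary.txt file.
# The column positions (accession = 0, organism name = 7) are an assumption.
SAMPLE = (
    "# assembly_accession\tbioproject\tbiosample\twgs_master\t"
    "refseq_category\ttaxid\tspecies_taxid\torganism_name\n"
    "GCA_000000001.1\tPRJNA0\tSAMN0\t\tna\t1\t1\tExample bacterium A\n"
    "GCA_000000002.1\tPRJNA0\tSAMN0\t\tna\t2\t2\tExample bacterium B\n"
)


def signature_names(handle, accession_col=0, organism_col=7):
    """Map each assembly accession to an 'accession organism-name' string."""
    names = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        accession = row[accession_col]
        names[accession] = f"{accession} {row[organism_col]}"
    return names


names = signature_names(io.StringIO(SAMPLE))
print(names["GCA_000000001.1"])
```

Keying on the assembly accession rather than on the first FASTA record is what ties each name to the whole genome assembly instead of to a single sequence fragment.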

## Sourmash database compression

## Legacy databases

### RefSeq microbial genomes - SBT

These databases are formatted for use with `sourmash search` and
`sourmash gather`. They are calculated with a scaled value of 2000.

Approximately 91,000 microbial genomes (including viral and fungal)
from NCBI RefSeq.

* [RefSeq k=21, 2018.03.29][0] - 3.3 GB - [manifest](https://osf.io/wamfk/download)
* [RefSeq k=31, 2018.03.29][1] - 3.3 GB - [manifest](https://osf.io/x3aut/download)
* [RefSeq k=51, 2018.03.29][2] - 3.4 GB - [manifest](https://osf.io/zpkau/download)

### Genbank microbial genomes - SBT

These databases are formatted for use with `sourmash search` and
`sourmash gather`.

Approximately 98,000 microbial genomes (including viral and fungal)
from NCBI Genbank.

* [Genbank k=21, 2018.03.29][3] - 3.9 GB - [manifest](https://osf.io/vm5kb/download)
* [Genbank k=31, 2018.03.29][4] - 3.9 GB - [manifest](https://osf.io/p87ec/download)
* [Genbank k=51, 2018.03.29][5] - 3.9 GB - [manifest](https://osf.io/cbxg9/download)


[0]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k21.sbt.zip
[1]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k31.sbt.zip
[2]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k51.sbt.zip

[3]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k21.sbt.zip
[4]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k31.sbt.zip
[5]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k51.sbt.zip

### Genbank microbial genomes - LCA

These databases are formatted for use with `sourmash lca`; they are
v2 LCA databases and will work with sourmash v2.0a11 and later.
They are calculated with a scaled value of 10000 (1e4).

Approximately 87,000 microbial genomes (including viral and fungal)
from NCBI Genbank.

* [Genbank k=21, 2017.11.07](https://osf.io/d7rv8/download), 109 MB
* [Genbank k=31, 2017.11.07](https://osf.io/4f8n3/download), 120 MB
* [Genbank k=51, 2017.11.07](https://osf.io/nemkw/download), 125 MB


The above LCA databases were calculated as follows:

```
sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers
```

See
[github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages)
for information on preparing the genbank-genomes-taxonomy file when signatures are generated using `--name-from-first`.

### GTDB databases - SBT

All files below are available [here](https://osf.io/wxf9z/).

Release 89

* [GTDB k=31, release 89](https://osf.io/5mb9k/download)

Release 95

* [GTDB k=21, scaled=1000](https://osf.io/4yhe2/download)
* [GTDB k=31, scaled=1000](https://osf.io/4n3m5/download)
* [GTDB k=51, scaled=1000](https://osf.io/c8wj7/download)


18 changes: 13 additions & 5 deletions src/sourmash/minhash.py
@@ -1,12 +1,23 @@
# -*- coding: UTF-8 -*-
"""
sourmash submodule that provides MinHash class and utility functions.
class MinHash - core MinHash class.
class FrozenMinHash - read-only MinHash class.
"""
from __future__ import unicode_literals, division

import math
__all__ = ['get_minhash_default_seed',
'get_minhash_max_hash',
'hash_murmur',
'MinHash',
'FrozenMinHash']

from collections.abc import Mapping

from . import VERSION
from ._lowlevel import ffi, lib
from .utils import RustObject, rustcall, decode_str
from .utils import RustObject, rustcall
from .exceptions import SourmashError
from deprecation import deprecated

@@ -694,9 +705,6 @@ def add_hash_with_abundance(self, *args, **kwargs):
    def clear(self, *args, **kwargs):
        raise TypeError('FrozenMinHash does not support modification')

    def remove_many(self, *args, **kwargs):
        raise TypeError('FrozenMinHash does not support modification')

    def set_abundances(self, *args, **kwargs):
        raise TypeError('FrozenMinHash does not support modification')
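
The hunk above shows a read-only class rejecting mutation by overriding each mutating method to raise `TypeError`. A self-contained sketch of that pattern follows. `Sketch` and `FrozenSketch` are hypothetical toy classes, not sourmash's `MinHash` or `FrozenMinHash`, and the real classes wrap a Rust object rather than a Python set.

```python
class Sketch:
    """Toy mutable container standing in for a MinHash-like object."""

    def __init__(self, hashes=()):
        self._hashes = set(hashes)

    def add_hash(self, h):
        self._hashes.add(h)

    def clear(self):
        self._hashes.clear()

    def __len__(self):
        return len(self._hashes)


class FrozenSketch(Sketch):
    """Read-only variant: every mutating method raises TypeError."""

    def add_hash(self, *args, **kwargs):
        raise TypeError('FrozenSketch does not support modification')

    def clear(self, *args, **kwargs):
        raise TypeError('FrozenSketch does not support modification')


frozen = FrozenSketch([1, 2, 3])
try:
    frozen.clear()
except TypeError as exc:
    print(exc)  # the contents are left untouched
```

One consequence of this style is that any mutator the frozen subclass does not override remains callable through inheritance, so the set of overrides has to track the set of mutators on the base class.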
