diff --git a/doc/databases.md b/doc/databases.md index d0830635ed..d2b2308473 100644 --- a/doc/databases.md +++ b/doc/databases.md @@ -1,86 +1,37 @@ -# Prepared search databases +# Prepared databases -We provide several databases for download. Note that these databases can -be used with both sourmash v3.5 and sourmash v4.0. +## GTDB R06-rs202 - DNA databases -## RefSeq microbial genomes - SBT +All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for `sourmash lca index`) is available [here](https://osf.io/p6z3w/). -These database are formatted for use with `sourmash search` and -`sourmash gather`. They are calculated with a scaled value of 2000. +For each k-mer size, three databases are available. -Approximately 91,000 microbial genomes (including viral and fungal) -from NCBI RefSeq. +* Zipfile collections can be used for a linear search. The signatures were calculated with a scaled value of 1000, which robustly supports searches for ~10kb or larger matches. +* SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000. +* LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information and can be used with regular search as well as with [the `lca` subcommands for taxonomic analysis](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-subcommands-for-taxonomic-classification). They are indexed with scaled=10,000, which robustly supports searches for 100kb or larger matches. -* [RefSeq k=21, 2018.03.29][0] - 3.3 GB - [manifest](https://osf.io/wamfk/download) -* [RefSeq k=31, 2018.03.29][1] - 3.3 GB - [manifest](https://osf.io/x3aut/download) -* [RefSeq k=51, 2018.03.29][2] - 3.4 GB - [manifest](https://osf.io/zpkau/download) +You can read more about the different database and index types [here](https://sourmash.readthedocs.io/en/latest/command-line.html#indexed-databases). 
-## Genbank microbial genomes - SBT +Legacy databases are available [here](legacy-databases.md) -These database are formatted for use with `sourmash search` and -`sourmash gather`. +Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up. -Approximately 98,000 microbial genomes (including viral and fungal) -from NCBI Genbank. +### GTDB genomic representatives (47.8k genomes) -* [Genbank k=21, 2018.03.29][3] - 3.9 GB - [manifest](https://osf.io/vm5kb/download) -* [Genbank k=31, 2018.03.29][4] - 3.9 GB - [manifest](https://osf.io/p87ec/download) -* [Genbank k=51, 2018.03.29][5] - 3.9 GB - [manifest](https://osf.io/cbxg9/download) +The GTDB genomic representatives are a low-redundancy subset of Genbank genomes. -### Details +| K-mer size | Zipfile collection | SBT | LCA | +| -------- | -------- | -------- | ---- | +| 21 | [download (1.3 GB)](https://osf.io/jp5zh/download) | [download (2.6 GB)](https://osf.io/py92w/download) | [download (114 MB)](https://osf.io/gk2za/download) | +| 31 | [download (1.3 GB)](https://osf.io/nqmau/download) | [download (2.6 GB)](https://osf.io/w4bcm/download) | [download (131 MB)](https://osf.io/ypsjq/download) | +| 51 | [download (1.3 GB)](https://osf.io/px6qd/download) | [download (2.6 GB)](https://osf.io/rv9zp/download) | [download (137 MB)](https://osf.io/297dp/download) | -The individual signatures for the above SBTs were calculated as follows: +### GTDB all genomes (258k genomes) -``` -sourmash compute -k 4,5 \ - -n 2000 \ - --track-abundance \ - --name-from-first \ - -o {output} \ - {input} +These databases contain the complete GTDB collection of 258,406 genomes. -sourmash compute -k 21,31,51 \ - --scaled 2000 \ - --track-abundance \ - --name-from-first \ - -o {output} \ - {input} -``` - -See [github.com/dib-lab/sourmash_databases](https://github.com/dib-lab/sourmash_databases) for a Snakemake workflow -to build the databases. 
- -[0]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k21.sbt.zip -[1]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k31.sbt.zip -[2]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k51.sbt.zip - -[3]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k21.sbt.zip -[4]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k31.sbt.zip -[5]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k51.sbt.zip - -## Genbank LCA Database - -These databases are formatted for use with `sourmash lca`; they are -v2 LCA databases and will work with sourmash v2.0a11 and later. -They are calculated with a scaled value of 10000 (1e5). - -Approximately 87,000 microbial genomes (including viral and fungal) -from NCBI Genbank. - -* [Genbank k=21, 2017.11.07](https://osf.io/d7rv8/download), 109 MB -* [Genbank k=31, 2017.11.07](https://osf.io/4f8n3/download), 120 MB -* [Genbank k=51, 2017.11.07](https://osf.io/nemkw/download), 125 MB - -### Details - -The above LCA databases were calculated as follows: - -``` -sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \ - genbank-k21.lca.json.gz -k 21 --scaled=10000 \ - -f --traverse-directory .sbt.genbank-k21 --split-identifiers -``` - -See -[github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages) -for information on preparing the genbank-genomes-taxonomy file. 
+| K-mer size | Zipfile collection | SBT | LCA | +| -------- | -------- | -------- | ---- | +| 21 | [download (7.8 GB)](https://osf.io/vgex4/download) | [download (15 GB)](https://osf.io/ar67j/download) | [download (266 MB)](https://osf.io/hm3c4/download) | +| 31 | [download (7.8 GB)](https://osf.io/94mzh/download) | [download (15 GB)](https://osf.io/dmsz8/download) | [download (286 MB)](https://osf.io/9xdg2/download) | +| 51 | [download (7.8 GB)](https://osf.io/x9cdp/download) | [download (15 GB)](https://osf.io/8fc3t/download) | [download (299 MB)](https://osf.io/3cdp6/download) | diff --git a/doc/legacy-databases.md b/doc/legacy-databases.md new file mode 100644 index 0000000000..0cfd572599 --- /dev/null +++ b/doc/legacy-databases.md @@ -0,0 +1,112 @@ +# Legacy Databases + +Sourmash databases have evolved over time. +We have changed how the database is stored (uncompressed `.zip`) and how we name each signature. +All SBT databases below are in `.sbt.zip` format. +Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up. +We detail these changes below, and include links to legacy databases. +See [github.com/dib-lab/sourmash_databases](https://github.com/dib-lab/sourmash_databases) for a Snakemake workflow that builds current and legacy databases. + +## Sourmash signature names + +Earlier versions of sourmash databases were built using individual signatures that were calculated as follows: + +``` +sourmash compute -k 4,5 \ + -n 2000 \ + --track-abundance \ + --name-from-first \ + -o {output} \ + {input} + +sourmash compute -k 21,31,51 \ + --scaled 2000 \ + --track-abundance \ + --name-from-first \ + -o {output} \ + {input} +``` + +We moved away from this strategy because `--name-from-first` named each signature from the name of the first sequence in the FASTA file. 
+While the species name of the organism was present in this name, the accession number corresponded to the accession of the first sequence fragment in the file, not the genome assembly. +As such, we revised our strategy so that signatures are named by genome assembly accession and species name. +This requires parsing the `assembly_summary.txt` file. + +## Sourmash database compression + +## Legacy databases + +### RefSeq microbial genomes - SBT + +These databases are formatted for use with `sourmash search` and +`sourmash gather`. They are calculated with a scaled value of 2000. + +Approximately 91,000 microbial genomes (including viral and fungal) +from NCBI RefSeq. + +* [RefSeq k=21, 2018.03.29][0] - 3.3 GB - [manifest](https://osf.io/wamfk/download) +* [RefSeq k=31, 2018.03.29][1] - 3.3 GB - [manifest](https://osf.io/x3aut/download) +* [RefSeq k=51, 2018.03.29][2] - 3.4 GB - [manifest](https://osf.io/zpkau/download) + +### Genbank microbial genomes - SBT + +These databases are formatted for use with `sourmash search` and +`sourmash gather`. + +Approximately 98,000 microbial genomes (including viral and fungal) +from NCBI Genbank. 
+ +* [Genbank k=21, 2018.03.29][3] - 3.9 GB - [manifest](https://osf.io/vm5kb/download) +* [Genbank k=31, 2018.03.29][4] - 3.9 GB - [manifest](https://osf.io/p87ec/download) +* [Genbank k=51, 2018.03.29][5] - 3.9 GB - [manifest](https://osf.io/cbxg9/download) + + +[0]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k21.sbt.zip +[1]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k31.sbt.zip +[2]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k51.sbt.zip + +[3]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k21.sbt.zip +[4]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k31.sbt.zip +[5]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k51.sbt.zip + +### Genbank microbial genomes - LCA + +These databases are formatted for use with `sourmash lca`; they are +v2 LCA databases and will work with sourmash v2.0a11 and later. +They are calculated with a scaled value of 10000 (1e4). + +Approximately 87,000 microbial genomes (including viral and fungal) +from NCBI Genbank. + +* [Genbank k=21, 2017.11.07](https://osf.io/d7rv8/download), 109 MB +* [Genbank k=31, 2017.11.07](https://osf.io/4f8n3/download), 120 MB +* [Genbank k=51, 2017.11.07](https://osf.io/nemkw/download), 125 MB + + +The above LCA databases were calculated as follows: + +``` +sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \ + genbank-k21.lca.json.gz -k 21 --scaled=10000 \ + -f --traverse-directory .sbt.genbank-k21 --split-identifiers +``` + +See +[github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages) +for information on preparing the genbank-genomes-taxonomy file when signatures are generated using `--name-from-first`. + +### GTDB databases - SBT + +All files below are available [here](https://osf.io/wxf9z/). 
+ +Release 89 + +* [GTDB k=31, release 89](https://osf.io/5mb9k/download) + +Release 95 + +* [GTDB k=21, scaled=1000](https://osf.io/4yhe2/download) +* [GTDB k=31, scaled=1000](https://osf.io/4n3m5/download) +* [GTDB k=51, scaled=1000](https://osf.io/c8wj7/download) + + diff --git a/src/sourmash/minhash.py b/src/sourmash/minhash.py index bd7cc7b23a..28768a8a4f 100644 --- a/src/sourmash/minhash.py +++ b/src/sourmash/minhash.py @@ -1,12 +1,23 @@ # -*- coding: UTF-8 -*- +""" +sourmash submodule that provides MinHash class and utility functions. + +class MinHash - core MinHash class. +class FrozenMinHash - read-only MinHash class. +""" from __future__ import unicode_literals, division -import math +__all__ = ['get_minhash_default_seed', + 'get_minhash_max_hash', + 'hash_murmur', + 'MinHash', + 'FrozenMinHash'] + from collections.abc import Mapping from . import VERSION from ._lowlevel import ffi, lib -from .utils import RustObject, rustcall, decode_str +from .utils import RustObject, rustcall from .exceptions import SourmashError from deprecation import deprecated @@ -694,9 +705,6 @@ def add_hash_with_abundance(self, *args, **kwargs): def clear(self, *args, **kwargs): raise TypeError('FrozenMinHash does not support modification') - def remove_many(self, *args, **kwargs): - raise TypeError('FrozenMinHash does not support modification') - def set_abundances(self, *args, **kwargs): raise TypeError('FrozenMinHash does not support modification')
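
The `FrozenMinHash` hunk above shows mutating methods overridden to raise `TypeError('FrozenMinHash does not support modification')`. As a minimal, self-contained sketch of that read-only pattern, the following uses two hypothetical classes, `TinySketch` and `FrozenTinySketch`, plus a hypothetical helper `try_modify`. None of these names come from sourmash, and the code does not reproduce the real `MinHash` API; it only illustrates the override-and-raise idea under those assumptions.

```python
class TinySketch:
    """Hypothetical mutable container of hash values (not the sourmash MinHash API)."""

    def __init__(self, hashes=()):
        self._hashes = set(hashes)

    def add_hash(self, h):
        # Mutator: adds one hash value to the sketch.
        self._hashes.add(h)

    def clear(self):
        # Mutator: removes every hash value.
        self._hashes.clear()

    def hashes(self):
        # Read-only accessor: returns the stored hashes in sorted order.
        return sorted(self._hashes)


class FrozenTinySketch(TinySketch):
    """Read-only variant: each mutator is overridden to raise TypeError."""

    def add_hash(self, *args, **kwargs):
        raise TypeError('FrozenTinySketch does not support modification')

    def clear(self, *args, **kwargs):
        raise TypeError('FrozenTinySketch does not support modification')


def try_modify(sketch):
    """Return True if add_hash succeeded, False if the sketch refused it."""
    try:
        sketch.add_hash(42)
    except TypeError:
        return False
    return True
```

In this sketch, any mutator that the frozen subclass does not override is inherited unchanged and would still modify the object, so the list of overrides has to cover every mutating method of the base class.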