
Releases: DessimozLab/omamer

v2.1.0

25 Nov 11:04

What's Changed

This release contains several performance improvements for classification, focusing on single-thread speed and parallel scaling.

v2.0.5

13 Nov 10:40

Full Changelog: v2.0.4...v2.0.5

v2.0.4

01 Jul 06:56

What's Changed

  • [FIX] freeze numpy dependency to <2 (issue #34)
  • [ADD] experimental support to build omamer databases from orthoxml/fasta files
  • Bump pypa/gh-action-pypi-publish from 1.8.12 to 1.9.0 by @dependabot in #33

Full Changelog: v2.0.3...v2.0.4

v2.0.3

28 Mar 16:32


v2.0.2

10 Nov 19:20
  • Changed the method for hiding taxa in the build process: it now takes a file listing the taxa to hide, one per line.
  • Added checks and improved feedback for the root taxon and the requested taxa to hide.
  • The root taxon now defaults to the root level in speciestree.nwk (previously hard-coded to default to LUCA).

v2.0.1

31 Oct 13:53

What's Changed

  • remove dependency on the filehash library
  • return a better error message when trying to build an omamer database while the build dependencies are not met
  • minor fixes
  • Bump actions/checkout from 3 to 4 by @dependabot in #24

Full Changelog: v2.0.0...v2.0.1

v2.0.0

20 Oct 10:51
  • Major update of the database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
  • Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that, for each family that has a match to a given query, we can compute the probability of matching by chance at least as many k-mers as we observed.
  • UX improvements: more feedback during interactive search runs, whilst maintaining small log files.

Brief overview of major changes to OMAmer

The OMAmer placement algorithm consists of two steps: first placing a query sequence into a protein family (root-level HOG in OMA), then placing it into a subfamily. The original OMAmer publication focused on providing better and faster subfamily-level assignments than closest-sequence methods. Recently, the group has developed OMArk, a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than OMArk requires, which meant that the largest gene families were much smaller and less diverse in k-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the k-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.

To account for this, we developed a scoring mechanism based on the binomial distribution. For each family, we estimate the probability $P_{\textrm{family}}$ of a random k-mer matching it. For each family with matches, we can then compute the $\textrm{Binomial}(N_{\textrm{query}}, P_{\textrm{family}})$ distribution, where the number of draws $N_{\textrm{query}}$ is the number of k-mers in the query sequence. Using the complementary CDF (survival function), we compute the probability of matching by chance at least as many k-mers as we observed, for each family that has a match. Note: the results of this test are computed in negative-log units (natural log) for accuracy.
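The test described above can be illustrated with a minimal sketch. It is not OMAmer's implementation: the function names are hypothetical, it uses only the Python standard library, and it assumes $0 < P_{\textrm{family}} < 1$. It sums the binomial tail in log space and returns the negative natural log of the p-value.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k (assumes 0 < p < 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def neg_log_pvalue(n_matched, n_query, p_family):
    """Return -ln P(X >= n_matched) for X ~ Binomial(n_query, p_family).

    n_matched: k-mers of the query that matched the family (observed count)
    n_query:   total number of k-mers in the query (number of draws)
    p_family:  assumed probability that a random k-mer matches the family
    """
    if n_matched <= 0:
        return 0.0  # P(X >= 0) = 1, so -ln(1) = 0
    # Log-probabilities of every outcome in the upper tail.
    terms = [log_binom_pmf(k, n_query, p_family)
             for k in range(n_matched, n_query + 1)]
    # Log-sum-exp keeps very small tail probabilities from underflowing.
    top = max(terms)
    log_sf = top + math.log(sum(math.exp(t - top) for t in terms))
    return -log_sf


# Hypothetical query: 300 k-mers, 40 of which match a family with p = 0.01.
score = neg_log_pvalue(40, 300, 0.01)
print(score > -math.log(1e-6))  # compares against an alpha of 1e-6 in -ln units
```

Working in negative-log units means a larger score corresponds to a smaller p-value, so a significance threshold $\alpha$ becomes a minimum score of $-\ln \alpha$.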

This p-value is then used to filter the list of families which have an overlap with the query sequence (argument "--family-alpha", default $10^{-6}$), to give us a list of candidate families. Candidates are then ordered by a normalised k-mer count, in the same way as the original algorithm, except that the expected count is now computed using the binomial approximation. Any ties are broken based on the proportion of the query sequence covered by matching k-mers, then by the p-value computed above. By default, only the top family is taken forward. Sub-family placement is as in the original manuscript. Further software optimisation was performed, but did not affect the underlying method. As an example, it is now possible to run using the LUCA database in under 12GB of memory, whereas previously this required in excess of 40GB.
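The filtering and ordering of candidate families can be sketched as follows. This is an illustration under stated assumptions, not OMAmer's code: the `FamilyHit` record, its field names, and `rank_candidates` are hypothetical, the normalised count is taken as a precomputed input because its formula is not given here, and treating the threshold as inclusive is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class FamilyHit:
    """Hypothetical per-family summary for one query (all names are illustrative)."""
    name: str
    neg_log_p: float         # -ln(p-value) from the binomial test
    normalised_count: float  # normalised k-mer count, assumed precomputed
    coverage: float          # proportion of the query covered by matching k-mers


def rank_candidates(hits, family_alpha=1e-6, top_n=1):
    """Filter families by significance, then order the remaining candidates."""
    # p <= alpha is equivalent to -ln(p) >= -ln(alpha) in negative-log units.
    threshold = -math.log(family_alpha)
    candidates = [h for h in hits if h.neg_log_p >= threshold]
    # Primary key: normalised count. Ties: coverage, then p-value
    # (a larger -ln(p) means a smaller, more significant p-value).
    candidates.sort(
        key=lambda h: (h.normalised_count, h.coverage, h.neg_log_p),
        reverse=True,
    )
    return candidates[:top_n]


hits = [
    FamilyHit("A", neg_log_p=20.0, normalised_count=0.8, coverage=0.5),
    FamilyHit("B", neg_log_p=5.0, normalised_count=0.9, coverage=0.9),
    FamilyHit("C", neg_log_p=30.0, normalised_count=0.8, coverage=0.6),
]
print([h.name for h in rank_candidates(hits)])
```

In this made-up example, family B has the highest normalised count but fails the significance filter, and the tie between A and C is resolved by coverage.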

v0.2.6

14 Jun 07:43

What's Changed

  • support for numpy>1.23

Full Changelog: v0.2.5...v0.2.6

v0.2.5

21 Mar 15:29

Small patch release that fixes an issue in the previous version when building a new database from scratch.

Full Changelog: v0.2.4...v0.2.5

v0.2.4

21 Mar 13:29

Improves the runtime of standard scoring by pre-computing statistics and storing them in the database. Older databases should still work, but will not benefit from the performance improvement.