Customized Control of Clustering

Pre-release
@matthewfallan matthewfallan released this 31 Jul 01:57
· 264 commits to main since this release

What's new in 0.20.0

New Features

  • SEISMIC-RNA can now be installed with Conda (Bioconda channel): conda install -c bioconda -c conda-forge seismic-rna
  • In seismic wf, clustering is now enabled using --cluster rather than setting the maximum number of clusters (--max-clusters/-k) to a positive integer.
  • Throughout SEISMIC-RNA, the number of clusters is now called "K" instead of the more confusing "Order" of the clustering. "K" is the general term for the number of clusters (e.g. as in K-means clustering).
  • Clustering can now find an unlimited number of clusters, specified by setting --max-clusters/-k to 0 (the default). In this case, clustering will continue until either the BIC fails to decrease or the number of clusters does not pass filters (see below).
  • If you specify a maximum number of clusters (with -k), then you can now choose to force clustering using every number of clusters up to that maximum (with --try-all-ks) or stop when the latest number of clusters is not better than the previous number (the only option in former versions).
  • You can now set a minimum number of clusters with --min-clusters, which makes clustering start at that number of clusters (e.g. if you set it to 3, then SEISMIC-RNA will start with 3 clusters and never try 1 or 2).
  • You can also choose whether to keep all numbers of clusters you tried (--keep-all-ks) or only the number of clusters that gave the best BIC among those that passed filters (see below). Note that with the latter option, if no clusters pass the filters (which can happen if you use --min-clusters greater than 1), then no clusters will be output, which will cause an error in the table step.
  • Clustering now includes filters to make sure the clusters are valid, rather than simply sorting by BIC.
    • One set of filters removes individual EM runs:
      • --max-pearson-run: upper limit on the Pearson correlation between any two clusters
      • --min-nrmsd-run: lower limit on the normalized RMSD between any two clusters
    • Another set of filters removes each number of clusters (K) for which the runs are not sufficiently consistent:
      • --min-pearson-vs-best: requires at least one suboptimal run to have at least this Pearson correlation vs. the best run for that K.
      • --max-nrmsd-vs-best: requires at least one suboptimal run to have at most this normalized RMSD vs. the best run for that K.
      • --max-loglike-vs-best: requires the best suboptimal run to have at most this difference in log likelihood vs. the best run for that K.
  • Scatter plots now print the correlation on the plot; choose the correlation metric using --metric.
  • ROC plots now print the area under each curve.
  • SEISMIC-RNA can now guess the DATAPATH environment variable when RNAstructure is installed either manually from the Mathews Lab website or with Conda.
  • In the Python API, the Header class now accepts arbitrary numbers of clusters, rather than requiring an unbroken range between a minimum and a maximum number of clusters.
  • Accordingly, table files with clusters can now contain arbitrary numbers of clusters, rather than needing to start with 1 cluster and count up to the maximum number of clusters.
  • New unit tests have been added to verify that the new Header class functions properly, that batch counting and accumulation functions work with averages and clusters, and that the entire workflow runs on simulated data.
  • The GitHub Actions workflow now requires all unit tests to pass; if any test fails, the workflow checks are marked as failing. Previously, it ran the unit tests but did nothing with the results.
  • The GitHub Actions workflow now builds and deploys the documentation automatically each time the source code is updated, removing the need to build the documentation manually and push it to GitHub with every update (or even to keep the built documentation in the GitHub repo).
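
To make the new clustering controls above more concrete, here is a minimal Python sketch of the search over K and of one run-level filter. It is an illustration of the behavior described in these notes, not SEISMIC-RNA's actual code. The names pearson, run_passes, choose_k, and fit are hypothetical; fit(k) is assumed to return the BIC of the best run for K clusters and whether that K passed the filters. Only a --max-pearson-run style check is shown; an NRMSD check would presumably follow the same pattern.

```python
import math
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def run_passes(clusters: Sequence[Sequence[float]], max_pearson: float) -> bool:
    """Reject a run if any two of its clusters are too similar
    (hypothetical stand-in for a --max-pearson-run style filter)."""
    return all(pearson(a, b) <= max_pearson
               for a, b in combinations(clusters, 2))


def choose_k(fit: Callable[[int], Tuple[float, bool]],
             min_clusters: int = 1,
             max_clusters: int = 0,
             try_all_ks: bool = False) -> Tuple[Optional[int], List[int]]:
    """Return (best K or None, list of every K tried).

    max_clusters=0 means no upper limit. Forcing every K is only
    honored when a maximum is given, so the loop always terminates
    as long as the BIC eventually stops decreasing.
    """
    force = try_all_ks and max_clusters > 0
    best_k: Optional[int] = None
    best_bic = math.inf
    tried: List[int] = []
    k = min_clusters
    while max_clusters == 0 or k <= max_clusters:
        bic, passed = fit(k)
        tried.append(k)
        if passed and bic < best_bic:
            # This K passed the filters and lowered the BIC: keep it.
            best_k, best_bic = k, bic
        elif not force:
            # BIC failed to decrease or K failed the filters: stop.
            break
        k += 1
    return best_k, tried
```

In this sketch, starting at a minimum K that fails the filters returns None as the best K, which mirrors the case noted above where no clusters are output.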

Removed Features

  • seismic table no longer generates per-read tables from clustered datasets (i.e. clust-per-read.csv). These tables were of little to no value and were easy to misinterpret: in fact, generating a histogram of the number of mutations per read from them produced wrong results, with no straightforward fix.
  • +addclust and +delclust have been removed because they are less useful with the new cluster filter features, and updating both commands to support those features would be substantially more complicated.
  • The built documentation (seismic-rna/docs) has been removed from the GitHub repository, to reduce its size and to remove the need to manually rebuild the docs each time the documentation source files are updated. Only the documentation source files (seismic-rna/src/userdocs) remain.

Full Changelog: v0.19.2...v0.20.0