Releases: rouskinlab/seismic-rna
Releases · rouskinlab/seismic-rna
Quick on the Draw
What's new in 0.22.1
Draw
- The
draw
module creates vectorized, publication-quality figures, directly from the output of SEISMIC-RNA'sfold
module. - Initially piloted in 0.22.0, drawing can now be run automatically as part of
wf
by including the--draw
flag. - The
draw
command will automatically color bases according to the reactivities used by SEISMIC-RNA to constrain the structure prediction, although this behavior can be disabled with the--no-color
flag. - When multiple structures are predicted, only the one with the best AUROC will be drawn by default, although any structure(s) can be specified with the
--struct-num
flag. - Many thanks to Fabrice Jossinet (@fjossinet) for developing RNArtistCore, the heart the
draw
module. - To install RNArtistCore, first download the latest release and save the
.jar
file in a safe place. Second, set the environmental variableRNARTISTCORE
to the full path to the.jar
file. It is recommended you do this by adding the lineexport RNARTISTCORE="/full/path/to/your/rnartistcore-X.X.X-SNAPSHOT-jar-with-dependencies.jar"
to your .bashrc or .zshrc file. - Installation instructions for RNArtistCore can also be found in a descriptive error message triggered by running the
draw
command when it is not installed.
Bug Fixes
- Fixed incorrect parsing of structure profiles generated from mask tables when using the draw module.
- Implemented safeguards against missing color and table files when drawing structures.
Full Changelog: v0.22.0...v0.22.1
Tour de Workflow
What's new in 0.22.0
Migration
- To convert SEISMIC-RNA outputs from version 0.21 to version 0.22 (which are not compatible with each other), type
seismic migrate out
, whereout
is an output directory. You may give additional output directories as positional arguments.
Align
- The default alignment scores have been made more stringent, from
L,1,0.5
toL,1,0.8
(local mode) and fromL,-1,-0.5
toL,-1,-0.8
(end-to-end mode), to reduce the number of reads with huge numbers of mutations (especially indels) that could previously slip through. - In the align report, references are now sorted from most to fewest reads.
Relate
- Insertions are now marked on either the base immediately 3' of the insertion (default) or immediately 5' (enabled with
--insert5
), rather than on both bases, to make it possible to count insertions using the existing code. - The algorithm that finds ambiguous indels has been rewritten to simplify the code and handle more edge cases. It is consequently slower, but its simpler structure now allows the algorithm to be reimplemented in C (to make it much faster than even the original), which will be done in a subsequent version.
Mask
- The term "section" is renamed to "region" to match the terminology used by most of the field to refer to parts of sequences/genes.
- The mask step runs multiple iterations of filtering reads, then filtering positions, and repeating, to make sure the reads that are ultimately kept meet all the criteria for the positions that are ultimately kept.
- The masking process is repeated until an iteration removes no more reads/positions, no reads/positions remain at all, or an maximum number of iterations is reached (no limit by default, but can be set with
--max-mask-iter
). - Arbitrary reads can be masked out by naming them individually with
--mask-read
, providing a text file of read names to mask out with--mask-read-file
, or both.
Table
- Table files are now written into the relate, mask, and cluster directories instead of into a separate table directory.
- Table files are now created during the relate, mask, and cluster steps; the table command still exists but is needed only for regenerating the table files if they are ever deleted.
- When calculating the data in a table, batches are now processed in parallel, which speeds up tabulation when you have a small number of large samples.
- Specific tables can now be disabled with options, e.g.
--no-mask-pos-table
. - Table files now have more descriptive names, e.g.
mask-position-table
instead ofmask-per-pos
. - The "Unambiguous" column has been renamed to "Informative".
- A bug causing some errors when counting duplicate reads has been fixed.
Graph
- Graphs of data from the mask step have been renamed from
masked
tofiltered
to make it clear that the data are from the reads/positions that remained after filtering, not the ones that were masked out.
Draw
- A new command, draw, has been added to draw RNA structures using RNArtist.
Simulation
- Simulating end coordinates now uses two parameters for variance,
--center-fvar
and--length-fvar
, so that they can be decoupled (e.g. to generate uniform-length reads of variable position).
Workflow
- Commands that are not part of the main workflow all began with a
+
to help distinguish them as auxiliary; with improved documentation, it should now be more clear to distinguish them, so the+
has been removed to make auxiliary commands easier to type. - To reduce the risk of rerunning an earlier step (e.g. mask) and forgetting to rerun a later step (e.g. cluster) based on the new results, an error is raised if the "Time Ended" field of the later step's report file occurs before that of the earlier step. If needed, this check can be disabled with
--no-verify-times
.
Python API
- The logger is now initialized by default when using
import seismicrna
. - Brickle files can be loaded without a checksum by passing
checksum=""
toBrickleIO.load()
. - The
Dataset.load()
class method has been removed;Dataset.__init__()
now accepts a report file and should be used instead. - Class
Qnames
has been renamed toReadNames
. - Arbitrary reads can be masked when tabulating mask and cluster datasets.
- Table classes have been renamed with full names, e.g. from
ClustPosTable
toClusterPositionTable
. - Default fields for Table paths now include
cmd
,table
, andext
.
What's Changed
- 0.22.0 by @matthewfallan in #19
Full Changelog: v0.21.1...v0.22.0
Logging Fixes
What's new in 0.21.1
Bug Fixes
- Changed obsolete "debug" message to "detail" to prevent crash when using
--keep-discontig
- Updated the help text for
--verbose
and--quiet
What's Changed
- 0.21.1 by @matthewfallan in #18
Full Changelog: v0.21.0...v0.21.1
v0.21.0
What's new in 0.21.0
New Features
- Replace Cutadapt and FastQC with fastp for one-step trimming and QC, faster processing, and a single report file.
- Replace the logging framework (using Python's standard library utilities) with a custom logging module that has seven levels (rather than 5) for greater control over verbosity.
Bug Fixes
- Fixed a bug in the jackpotting calculation that caused a systematic bias of roughly 1%.
Documentation
- Added an illustration of what happens during each main step of the workflow.
- Added a note about installing with Mamba.
Miscellaneous
- Changed the graph color palettes for bases (A/C/G/T) and relationships.
- Changed the colors of logging messages to make them easier to read on dark and light backgrounds.
What's Changed
- 0.21.0 by @matthewfallan in #17
Full Changelog: v0.20.1...v0.21.0
Hit the Jackpot
What's new in 0.20.1
New Features
- The clustering step now calculates a "jackpotting quotient" to measure how much bias there is in the dataset. The calculation is expensive and can be toggled with
--jackpot/--no-jackpot
. - The clustering step now outputs the per-run statistics as a CSV file and a graph for every statistic, rather than writing them into the cluster report.
- It also ouputs scatter plots comparing the observed and expected counts for each number of clusters.
- The new command
seismic +datapath
locates and prints the location of the data path for RNAstructure (assuming it has been installed properly), to which theDATAPATH
environment variable should be set. - The
raise_on_error
parameter can now be used to make SEISMIC-RNA raise an Exception instead of logging an error message any timelogger.error()
orlogger.critical()
is called. This is useful for testing the whole workflow, when errors should cause the test to fail. Only available through the Python API. - SEISMIC-RNA now has a new logo!
Bug Fixes
- Fixed a bug where clustering would run by default in
seismic wf
. Now, clustering is enabled with the flag--cluster
. - Fixed a bug causing
seismic align
andseismic wf
to delete BAM files when aligning demultiplexed FASTQ files. - Assume that the folding energy for RNA structures missing an explicit energy is 0 rather than NaN.
- Added
kaleido
to list of dependencies to fix a bug about a missing optional dependency for Plotly. - Fixed a bug where
seismic +sim ends
could produce out-of-bounds 5' or 3' ends.
Full Changelog: v0.20.0...v0.20.1
Customized Control of Clustering
What's new in 0.20.0
New Features
- SEISMIC-RNA can now be installed with Conda (Bioconda channel):
conda install -c bioconda -c conda-forge seismic-rna
- In
seismic wf
, clustering is now enabled using--cluster
rather than setting the maximum number of clusters (--max-clusters/-k
) to a positive integer. - Throughout SEISMIC-RNA, the name for the number of clusters has been renamed from the more confusing "Order" of the clustering to "K", which is a general term for the number of clusters (e.g. as in K-means clustering).
- Clustering can now find an unlimited number of clusters, specified by setting
--max-clusters/-k
to 0 (the default). In this case, clustering will continue until either the BIC fails to decrease or the number of clusters does not pass filters (see below). - If you specify a maximum number of clusters (with
-k
), then you can now choose to force clustering using every number of clusters up to that maximum (with--try-all-ks
) or stop when the latest number of clusters is not better than the previous number (the only option in former versions). - You can now set a minimum number of clusters with
--min-clusters
, which makes clustering start at that number of clusters (e.g. if you set it to 3, then SEISMIC-RNA will start with 3 clusters and never try 1 or 2). - You can also choose whether to keep all numbers of clusters you tried (
--keep-all-ks
) or only the number of clusters that gave the best BIC among those that passed filters (see below). Note that with the latter option, if no clusters pass the filters (which can happen if you use--min-clusters
greater than 1), then no clusters will be output, which will cause an error in the table step. - Clustering now includes filters to make sure the clusters are valid, rather than simply sorting by BIC.
- One set of filters removes individual EM runs:
--max-pearson-run
: upper limit on the Pearson correlation between any two clusters--min-nrmsd-run
: lower limit on the normalized RMSD between any two clusters
- Another set of filters removes each numbers of clusters (K) where the runs are not sufficiently consistent:
--min-pearson-vs-best
: requires at least one suboptimal run to have at least this Pearson correlation vs. the best run for that K.--max-nrmsd-vs-best
: requires at least one suboptimal run to have at most this normalized RMSD vs. the best run for that K.--max-loglike-vs-best
: requires the best suboptimal run to have at most this difference in log likelihood vs. the best run for that K.
- One set of filters removes individual EM runs:
- Scatter plots now print the correlation on the plot; choose the correlation metric using
--metric
. - ROC plots now print the area under each curve.
- SEISMIC-RNA can now guess the
DATAPATH
environment variable when RNAstructure is installed either manually from the Mathews Lab website or with Conda. - In the Python API, the Header class now accepts arbitrary numbers of clusters, rather than requiring an unbroken range between a minimum and a maximum number of clusters.
- Accordingly, table files with clusters can now contain arbitrary numbers of clusters, rather than needing to start with 1 cluster and count up to the maximum number of clusters.
- New unit tests have been added to verify that the new Header class functions properly, that batch counting and accumulation functions work with averages and clusters, and that the entire workflow runs on simulated data.
- The GitHub Actions workflow now enforces all unit tests to finish successfully; if not, the workflow checks are marked as failing. Previously, it would run the unit tests but do nothing with the test results.
- The GitHub Actions workflow now builds and deploys the documentation automatically each time the source code is updated, saving the need to build the documentation manually and push it to GitHub with every update (or even to keep the built documentation in the GitHub repo).
Removed Features
seismic table
no longer generates per-read tables from clustered datasets (i.e.clust-per-read.csv
). This is because these tables had been of little to no value and were easy to misinterpret: in fact, generating a histogram of number of mutations per read produced the wrong results, with no straightforward fix.+addclust
and+delclust
have been removed because they are less useful with the new cluster filter features, while maintaining both of these commands including the new features would be substantially more complicated.- The built documentation (
seismic-rna/docs
) has been removed from the GitHub repository, to reduce its size and to remove the need to manually rebuild the docs each time the documentation source files are updated. Only the documentation source files (seismic-rna/src/userdocs
) remain.
Full Changelog: v0.19.2...v0.20.0
Mixed Messages
What's new in 0.19.2
Bug Fixes
- Fixed a bug where single-end reads would be discarded using
--sep-strands
. - Fixed bugs causing the relate step to fail with a mixture of two-mate and one-mate paired-end reads (possible with Bowtie2 mixed mode,
--bt2-mixed
). - Updated the read counter and the SAM parser to handle mixed mode properly; added unit tests.
- Implemented a more robust method of creating temporary directories.
- Changed the FASTA parser and writer so that they raise errors if any name/sequence is invalid, rather than skipping invalid sequences and returning the rest, to avoid a potential bug where if two references had the same name, it could be unclear which sequence would actually be used (which has become more of a concern with
--sep-strand
because minus-strand references can be created automatically).
Full Changelog: v0.19.1...v0.19.2
Stranded Workflow
What's new in 0.19.1
New Features
seismic wf
now accepts--sep-strands
. This change is possible becauseseismic relate
now also accepts--sep-strands
and, if that option is given, will auto-generate a FASTA file of both strands. Thus, you can also runseismic align
andseismic relate
separately, as long as you pass the same FASTA file and the same--sep-stands
and--minus-label
options to both commands.seismic +splitbam
is a new command that accepts one or more BAM (or CRAM) files and splits each file into one BAM file for each reference, just like the last phase ofseismic align
. It also accepts the--sep-strands
option if you want to split each reference further into plus and minus strands. This feature is useful when you generate or obtain a BAM file outside of SEISMIC-RNA that contains multiple references, and you need to split it into one file for each reference so that you can pass those files intoseismic relate
orseismic wf
.
Removed Features
seismic align
with--sep-strands
no longer generates a FASTA file of both strands because it is no longer necessary due to the first new feature (above).
Bug Fixes
- Fixed bug where an empty directory called
align-xxxxxxxx
would appear in a sample's output directory if thealign
output directory did not yet exist.
Full Changelog: v0.19.0...v0.19.1
Stranded in the Best Way
What's new in 0.19.0
New Features
Strand-aware alignment
- Each BAM file can be split into separate files of reads originating from the plus- versus the minus-strand of the RNA using
--sep-strands
. - The option
--f1r2-plus / --f1r2-minus
controls whether paired-end reads whose mate 1 aligns in the forward orientation and mate 2 in the reverse orientation are considered to originate from the plus or the minus strand (for the Illumina library prep kits that our lab uses, they come from the minus strand, so--f1r2-minus
is the default). Reads where mates 1 and 2 align in the reverse and forward orientations, respectively, are considered to originate from the other strand. - For single-end reads, the behavior is the same as for read 1: in
--f1r2-minus
mode, single-end reads that align in the forward orientation are considered to have come from the minus strand. - The option
--minus-label
controls the label appended to the minus strand of each reference (by default, it is the name of the reference followed by-minus
). - In strand-aware mode,
seismic align
also writes a FASTA file of all reference sequences (including their minus strands) whose BAM files received a sufficient number of reads (controlled by--min-reads
) into the same directory as the BAM files and align report, with the same name as the original FASTA file. - Currently, strand-aware alignment is only available through
seismic align
, notseismic wf
. This limitation arises because separating strands actually generates new reference sequences (namely, the minus strands); if those sequences are missing from the FASTA given to therelate
step, then any BAM files aligned to the minus strands will not be able to be processed. It is straightforward to runseismic align
in strand-aware mode, then use the FASTA file it generates as input forseismic relate
orseismic wf
. However, switching the FASTA file automatically withinseismic wf
will require non-trivial re-engineering of how the pipeline works (or some other hacks).
Bug Fixes
- The mechanism to release files to the output directory now keeps a backup of any existing output files until it is sure that the new files have been written. This setup avoids potentially deleting existing output files but then failing to write the new files, causing data loss.
Full Changelog: v0.18.2...v0.19.0
Hard Limits to Soft Clips
What's new in 0.18.2
Performance Enhancements
- The
samtools sort
by query name step has been replaced bysamtools collate
to increase speed (for details, see https://www.htslib.org/doc/samtools-collate.html).
Bug Fixes
- Name-sorting/collation now occurs at the beginning of
relate
instead of the end ofalign
in order to fix a bug wherein passing an unsorted/uncollated SAM/BAM/CRAM file (i.e. created without or modified outside ofseismic align
) torelate
would cause it to treat paired-end reads as single-end. - The algorithm for finding ambiguous indels has been updated to keep indels from entering soft-clipped regions of the read when both the indel and the soft-clipped region lie within a contiguous stretch of low-quality bases (although this situation is very rare). Before this fix, a read with such a long stretch of low-quality bases including an indel and a soft-clip could have been erroneously assigned indels at positions outside of the region to which it had aligned to the reference, which would have resulted in incorrect (occasionally negative) counts of matches in the
table
step. - The
mask
andtable
steps have been updated to raise errors upon finding negative counts, so that any other such bugs do not go unnoticed. - In
demult
, reports are now named after samples, and an issue with obtaining barcodes from sequences with different barcode coordinates has also been fixed.
What's Changed
- 0.18.2 by @matthewfallan in #16
Full Changelog: v0.18.1...v0.18.2