Releases: COMBINE-lab/salmon
Salmon v1.10.1
This release is a very minor update, intended entirely to address #835 (a problem raised by the Debian Med maintainers, who were running into build problems upstream). This release bumps the included version of the cereal headers in the corresponding pufferfish tag to v1.3.2 and also updates the required version for salmon to match (i.e. cereal v1.3.2). Since the version included in pufferfish in past releases, the cereal library has made 2 patch releases which, nonetheless, were not backwards compatible. This led to problems when mixing cereal v1.3.2 with v1.3.0. This release bumps everything to v1.3.2 to match the latest package on Debian testing. If salmon 1.10.0 is working fine for you, there's no need to update to this release (but obviously no harm in doing so). It adds no new features or bug fixes within salmon itself.
Salmon v1.10.0
Fixes
- This release addresses a bug in deserializing the compact_vector (discovered by @jamshed) that could lead to undefined behavior. In fact, this bug was underlying a relatively rare but longstanding issue with the previous bioconda build of salmon where a segmentation fault could occur during indexing.
- This release addresses #806, where several output counters used 32-bit values and could produce incorrect values if they exceeded the maximum representable 32-bit integer. These counters have been changed to be 64 bits wide. It is worth noting that this was an issue with the values in the output report, but not with the internal representations (i.e. the actual quantifications were not affected).
- This release incorporates PR #817 by @Gaura that addresses an issue in the processing of some sci-seq3 data where a read1 length of 33 or 34 would result in an error despite being a valid length. This resulted in salmon refusing to process this data; this has now been fixed (addresses #813).
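To make the counter issue from #806 concrete, here is a minimal Python sketch of how a fixed-width counter reports a wrong value once the true count no longer fits. It is an illustration only: the notes do not say whether the original counters were signed or unsigned, so unsigned wraparound is assumed, and the helper name and the example count are hypothetical.

```python
# Illustration only: emulate a fixed-width unsigned counter with a bit mask.
# Whether salmon's original counters were signed or unsigned is not stated in
# the notes; unsigned wraparound is assumed here for simplicity.

MASK_32 = (1 << 32) - 1
MASK_64 = (1 << 64) - 1


def add_wrapping(counter: int, increment: int, mask: int) -> int:
    """Add to a counter and keep only the bits that fit in the given width."""
    return (counter + increment) & mask


true_total = 5_000_000_000  # hypothetical count, larger than 2**32 - 1

reported_32 = add_wrapping(0, true_total, MASK_32)
reported_64 = add_wrapping(0, true_total, MASK_64)

print(reported_32)  # 705032704, silently wrong
print(reported_64)  # 5000000000, matches the true total
```

The 32-bit value is simply the true total modulo 2**32, which is why only the reported number is affected while anything computed at full width stays correct.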
Improvements
- Substantial refactoring has been made to parts of the mapping code to clean up redundant code and to make future additions easier.
- Substantial improvements have been made to the CMake files to reduce the need for redundant copies of files and to propagate target properties more faithfully.
- Several dependencies have been updated, including libstadenio and itlib.
Full Changelog: v1.9.0...v1.10.0
Salmon v1.9.0
New features
- Salmon learned the ability to optionally write quality values in output SAM files. If the --writeQualities flag is passed to salmon when mappings are also being written (i.e. with --writeMappings=), then the SAM records for reads will contain the corresponding quality values. Note: You should not pass this flag to salmon if you are providing FASTA rather than FASTQ files as input; those files have no quality values, and so this flag is not compatible with FASTA input. Note: The default behavior remains to not write quality values, as they are not necessary for many downstream applications and they consume considerable extra space in the output. This addresses the feature request in #756; thanks to @A-N-Other for the suggestion.
Fixes
- Addressing #748, raised by @taylorreiter: in single-end mode, all unmapped reads were being reported with the code u, including those mapped to decoys. This release fixes the output so that the proper code, d, is reported for those fragments best mapping to decoys.
Improvements
- When salmon alevin was being run upstream of alevin-fry for generating a RAD file, it was possible for the file to be truncated if there was insufficient disk space for the output. This release of salmon adds a final check of the ofstream after the call to close to determine if the stream is in a bad state. This should lead to better error reporting and proper exit codes if the RAD output of salmon alevin is unexpectedly truncated. Thanks to @allyhawkins for helping to uncover this issue.
- The use of multi-stage builds has greatly reduced the size of the Docker image to ~101MB (from ~1.38G); thanks to @kaczmarj for contributing this improvement.
- Improvements to the documentation have been made and some typos fixed, thanks to @molecules.
Full Changelog: v1.8.0...v1.9.0
Salmon v1.8.0
New features & improvements
Note (June 7, 2022): Updated release tarball to remove a problematic libm that was causing illegal instruction errors on some architectures.
- The index command now optionally accepts a flag -n/--no-clip that will disable homopolymer clipping during reference indexing.
- Addressed an offset miscalculation; this results in further improved specificity in alevin's --sketch mode.
Fixes
- No other particular bug fixes are noted for this release.
Notes
- Legacy and deprecated Intel TBB functionality has now been removed, and salmon (and pufferfish, upon which it depends) have been updated to oneAPI TBB. The current release requires a recent version of the oneAPI TBB library (>= 2021.4.0).
Full Changelog: v1.7.0...v1.8.0
Salmon 1.7.0
New features & improvements
- This release includes a refactoring and optimization of the mapping code in --sketch mode, further increasing speed; output should remain identical.
- This release adds the --splitSeqV1 and --splitSeqV2 flags, which have been in the development version for a while, as simple alternatives to custom geometry when processing SPLiT-seq data for alevin-fry or alevin processing.
Fixes
- No particular bug fixes are noted for this release.
Other changes / enhancements
- Explicitly check for a valid value of k before calling out to the indexer. This leads to a more informative error message and exit if the user passes an unacceptable value of k.
Notes
- The Intel TBB library used internally by salmon (and used as well in TwoPaCo, which is relied upon for compacted reference de Bruijn graph construction) has evolved into oneAPI TBB. Recent releases of this library (2021.1 and forward) make certain backward-incompatible changes and therefore cannot be used to build salmon. We anticipate working toward replacing the deprecated and removed functions with the corresponding oneAPI replacements and idioms, hopefully in the next release of salmon. Therefore, we anticipate that this will be the last, or close to the last, salmon release to use (and be compatible with) the legacy Intel TBB library. Future releases will likely require a newer version of the oneAPI TBB library instead.
Full Changelog: v1.6.0...v1.7.0
Salmon 1.6.0
New features
- This release introduces specific flags for two new single-cell protocols (which can be processed using alevin, or used to produce a RAD file for alevin-fry). These new protocols are special because they mark the initial support within this framework for variable-length barcodes. In the next release, we hope to have an update to our generic barcode, umi, read geometry specification mini-language to expose this feature more generally there, but for the time being, these are implemented as new single-cell protocol flags. The new protocols supported are sci-RNA-seq3 and inDrop v2. These are exposed, respectively, with the --sciseq3 and --indropV2 flags. In addition to the custom geometry specification, the list of geometries / protocols with pre-specified flags has now been added to the documentation.
Fixes
- This release fixes #691, where an extra : was present in the cmd_info.json file in rad and sketch mode where the salmon_version was recorded. Thanks to @allyhawkins for reporting this issue.
- This release fixes a rare corner case in cell barcode rescue (recovering cell barcodes with an N) where, if a barcode could not be properly extracted, a rescue attempt would be made for the previous barcode, which could result in the wrong barcode / umi pairing for that read. Thanks to @Gaura for finding this bug and for the PR to fix it.
Other changes / enhancements
Full Changelog: v1.5.2...v1.6.0
Salmon 1.5.2
This is a minor release and introduces no new features. However, this release addresses the issue raised in #688. Specifically, when run in RAD mode (i.e. with --rad or --sketch), salmon alevin did not output a cmd_info.json or meta_info.json file. While not strictly required for subsequent processing with alevin-fry, having this information can be useful for provenance tracking and bookkeeping. Now, both of these files are properly generated when running salmon alevin in RAD mode.
Salmon 1.5.1
Note: If you downloaded the pre-compiled linux binary from this release page for v1.5.1 before 19:47 UTC on June 14, please check your version with salmon -v. For a short period of time, the executable posted here was actually v1.5.0. Other distribution mechanisms (e.g. bioconda, docker hub, etc.) were not affected by this.
New features (in 1.5.0)
This release introduces an --ont flag that is designed to improve quantification from Oxford Nanopore Technologies (ONT) long reads (both cDNA and direct RNA). The main effect of this flag is twofold:
- First, it enables an alignment error model designed to work with long-read alignments. Until this point, the recommendation when using salmon to quantify aligned long reads had been to disable the error model, since salmon's default error model is designed for short reads and did not work well with long-read alignments. However, the error model enabled with the --ont flag is designed specifically for the alignment characteristics of long reads and should improve the quantification estimates produced for this data by providing a better estimate of the conditional probability of a read arising from a particular transcript given its alignment to that transcript (the testing for this feature has been done mostly using minimap2).
- Second, it disables the length effect in the generative model when computing the conditional probability of observing a fragment given that it arises from a specific transcript. This is because in long-read sequencing, we do not expect to observe (i.e. sequence) multiple fragments from the same molecule, and thus we do not expect the transcript length to directly affect the observed fragment count. A consequence of this change is that the "EffectiveLength" of transcripts is not currently computed and used in the model in this mode, and this field in the output will be populated with a sentinel value of 100.
Other improvements (in 1.5.0)
- When running alevin to generate a RAD file for alevin-fry (specifically when using --sketch mode), the sensitivity of mapping has been improved by allowing for reads that have only highly-repetitive seeds and map to a large number of loci.
- It is no longer necessary to provide a transcript-to-gene map (--tgMap) to the alevin command if alevin is being run with the --rad and/or --sketch flags.
- Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.
- Support for the custom single-cell features (end, barcodeLength, umiLength) simultaneously with the --citeseq command-line flag has been dropped, although they can still be used independently. A user has to either use the --citeseq flag with predefined sets of features (CB: 16, UMI: 10) or use the umi-geometry, bc-geometry, read-geometry flags for a customized extraction of the barcode sequences. Note: in geometry mode, the user has to explicitly provide keepCBFraction 1.0 and a tgMap file, while it's not necessary to provide either in citeseq-based mode.
Bug fixes
- Fix an issue where the size of the representation used for the barcode length and UMI length when writing output to a RAD file was mistakenly linked. As most current protocols use a 32-bit integer for both, most runs are not affected.
- Fix an issue where the barcode and UMI length may not be properly set when using the custom geometry format (addresses #670).
salmon 1.4.0
salmon 1.4.0 : Thanksgiving release 🦃
Bug fixes
- Fixed a very rare bug whereby, on certain operating systems, under certain types of system load, and with specific versions of the C++ standard library, the default random device would fail to produce a pseudorandom seed and would raise an exception. On these systems, "/dev/urandom" is explicitly substituted for the default random device. Unfortunately, it is not possible / easy to make the appropriate source changes at runtime. So, if you are experiencing this issue (which, again, looks to be exceedingly rare), it may be best to compile from source on the machine causing the issue.
salmon-related changes
- salmon should now compile and run on ARM machines. It has been tested on an AWS aarch64 node (running Ubuntu 20.10), but presumably should work on many ARM machines. It is assumed that NEON intrinsics are available. This support for ARM was made immensely easier by SIMDe. Thanks to @mr-c and @BenLangmead for pointing out the SIMDe project, and to @mr-c, @lh3, and the lead developer of SIMDe, @nemequ, who all gave useful advice on the initial expansion to ARM support.
alevin-related changes
Support for RAD file creation and the alevin-fry pipeline
- --rad / --justAlign flag : Salmon/alevin 1.4.0 coincides with the initial release of alevin-fry, a flexible and efficient framework for single-cell quantification. Alevin-fry handles barcode detection and quantification, providing the methods developed as part of alevin, as well as a number of other possibilities. Alevin-fry is computationally efficient, flexible, and very memory efficient, processing single-cell experiments in 2-3GB of memory (see more details in the poster introducing alevin-fry). Moving forward, we plan for alevin-fry to be the primary development platform for new single-cell quantification methods. Nonetheless, alevin-fry currently, and for the foreseeable future, will rely on alevin to perform the actual barcode / umi extraction and mapping of sequencing reads. alevin communicates with alevin-fry via an intermediate binary file called a RAD (Reduced Alignment Data) file. To process data with alevin-fry (documentation available here), you must first map the reads to the reference transcriptome to generate a RAD file. This is done by running alevin as you would normally do, and by additionally passing the flag --rad or --justAlign. This flag will tell alevin to just align the reads and to write the appropriate information to a RAD file in the output directory (with a pre-determined name).
- --sketch / --sketchMode flag : Alevin learned the --sketch / --sketchMode flag. This flag is currently relevant only in RAD mode. In fact, this flag currently implies RAD mode (that is, --sketch is currently the same as --rad --sketch). The --sketch flag is meant to prioritize mapping speed at the potential cost of reduced specificity. It turns off selective-alignment and instead maps the reads using a custom implementation of pseudoalignment [1] with structural constraints (PASC). This consists of executing the k-mer collecting part of a pseudoalignment [1] algorithm to collect potentially compatible targets for a fragment, represented by a series of "hits". The targets are then filtered to ensure that the collected hits are consistent in their orientation, and co-linear in their placement on the fragment and reference (these are the enforced structural constraints). This algorithm is distinct from the seeding step of selective alignment or the quasi-mapping algorithm, and prioritizes speed. For an overview of how --sketch mode affects downstream results, please check out our poster Accurate, efficient, and uncertainty-aware expression quantification of single-cell RNA-seq data.
Other alevin-related changes
- --noWhitelist flag : Alevin learned the --noWhitelist flag. Passing this flag to alevin (in classic mode; this flag has no effect in RAD mode) stops the pipeline after UMI deduplication and quantification. The second-round intelligent whitelisting operation will not be performed.
- generic barcode / umi / read geometry syntax : Alevin learned to support a generic syntax to specify the read sequence that should be used for barcodes, UMIs and the read sequence. The syntax allows one to specify how the pattern corresponding to the barcode, UMI, and read sequence should be pieced together, and the syntax is meant to be intuitive and general. For example, one can specify the 10Xv2 geometry in the following manner using the generic syntax:
--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]
This specifies that the "sequence" read (the biological sequence to be aligned) comes from read 2, and it spans from the first index, 1 (this syntax uses 1-based indexing), until the end of the read. Likewise, the barcode derives from read 1 and occupies positions 1-16, and the UMI comes from read 1 and occupies positions 17-26. The syntax can specify multiple ranges, and they will simply be concatenated together to produce the string. For example, one could specify --bc-geometry 1[1-8,16-23] to designate that the barcode should be taken from the substring in positions 1-8 of read 1 followed by the substring in positions 16-23 of read 1. It is even possible to have the string pieced together across both reads, but that functionality is only available if you are running with --rad or --sketch and preparing a RAD file for alevin-fry. If you are running classic alevin, the barcode must reside on a single read. The robust parsing of the flexible geometry syntax is made possible by the cpp-peglib project.
- Alevin learned the ability to annotate output SAM files with the CB and UR tags. If you write a SAM file by running alevin with --writeMappings, then the resulting SAM file will have CB and UR tags in the alignment records to record the cell barcode and UMI for the fragment.
- A new command-line flag --noWhitelist is added to explicitly disable the 'intelligent-whitelist' by alevin. It helps with a still-unresolved issue on HPC systems running old CentOS, where alevin fails to gain access to virtual memory.
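The 1-based, inclusive ranges of the generic geometry syntax described above can be sketched in a few lines of Python. This is not alevin's parser (the notes say that one is built on cpp-peglib); the function names and the example reads are hypothetical, and only the single-read forms shown in the notes are handled.

```python
import re

# Hypothetical helper names; this is not salmon's parser (which the notes say
# is built on cpp-peglib). It handles only the forms shown in the notes:
# "<read>[<start>-<stop>,...]" where <stop> may be the literal "end".
GEOMETRY_RE = re.compile(r"^([12])\[([^\]]+)\]$")


def parse_geometry(spec):
    """Return (read_number, [(start, stop_or_None), ...]) with 1-based bounds."""
    match = GEOMETRY_RE.match(spec)
    if match is None:
        raise ValueError(f"unsupported geometry: {spec!r}")
    read_number = int(match.group(1))
    ranges = []
    for part in match.group(2).split(","):
        start_text, stop_text = part.split("-")
        stop = None if stop_text == "end" else int(stop_text)
        ranges.append((int(start_text), stop))
    return read_number, ranges


def extract(spec, read1, read2):
    """Concatenate the substrings that a single-read geometry selects."""
    read_number, ranges = parse_geometry(spec)
    sequence = read1 if read_number == 1 else read2
    # 1-based inclusive [start, stop] maps to the Python slice [start-1:stop].
    return "".join(sequence[start - 1:stop] for start, stop in ranges)


read1 = "AAAACCCCGGGGTTTTACGTACGTAC"  # 16 barcode bases + 10 UMI bases
read2 = "TTGCATTGCA"

print(extract("1[1-16]", read1, read2))       # AAAACCCCGGGGTTTT
print(extract("1[17-26]", read1, read2))      # ACGTACGTAC
print(extract("2[1-end]", read1, read2))      # TTGCATTGCA
print(extract("1[1-8,16-23]", read1, read2))  # AAAACCCCTACGTACG
```

The last call shows the concatenation rule: positions 1-8 and 16-23 of read 1 are joined in the order given.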
References
[1] Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-527.
salmon 1.3.0
salmon 1.3.0 Release notes
- Happy 4th of July ( 🇺🇸 🎆 )
Bug fixes & improvements
🎁 Improvements
- Fragments that best-map to decoys are now written in the output SAM file if the --writeMappings option is provided. In order to make filtering of decoy and non-decoy alignments easier, all alignments now include a tag in their SAM record. Alignments to a valid (non-decoy) target are tagged with XT:A:T, and those to decoys are tagged with XT:A:D. This allows easy filtering of decoy mappings. The conditions for a decoy mapping to be written to the file are as follows:
  1. There is no valid mapping to a non-decoy target. That is, all mappings to valid (non-decoy) targets must have alignment score < decoyThreshold * bestDecoyScore.
  2. Only best-scoring decoy alignments are written to file. Thus, if there are sub-optimal decoy alignments that are still better than alignments to valid targets, they will not appear in the output SAM file.
  3. If decoy alignments are written (condition 1 is satisfied), then all equally-best decoy alignments are written to file (i.e. a decoy fragment can still multi-map).
- In the SAM file produced with the --writeMappings option, the header lines now include tags to designate each reference sequence as being a decoy or not. Sequence lines (@SQ lines) that correspond to valid targets contain the tag DS:T, while those corresponding to decoys contain the tag DS:D. Note: In alignment-based mode, salmon will not process SAM/BAM files with decoy entries (to avoid usage errors, since decoy alignment is not intended for quantification). So if, for some reason, you are using a salmon-generated SAM file containing decoy sequences and alignment records, you must remove them before quantifying using alignment-based mode (i.e. removing all headers with DS:D and all alignment records with XT:A:D). Details about how to perform that transformation can be found here.
- This release enables some considerable improvements to speed in the case of aligning poor quality reads. Specifically, this is enabled by upstream changes in pufferfish implemented by @mohsenzakeri. Now, the aligner can exit early if it becomes clear at any point during alignment that a valid score cannot be obtained. This reduces the computation used to evaluate poor alignments that will not pass subsequent filtering (addresses #527 and #537).
- Homopolymer seeds are now skipped during mapping and alignment. In pathological datasets, these seeds could cause unnecessarily slow mapping without any improvement to the actual mapping rate (i.e. they could generate many poor mappings that would fail alignment). This change can speed up mapping in such datasets (addresses #527 and #537).
- Three new filtering flags have been added to improve both sensitivity and speed. They determine how mappings are filtered at different stages. The previous behavior (that of salmon v1.0.0 through 1.2.1) can be obtained by setting --preMergeChainSubThresh 1.0, --postMergeChainSubThresh x, --orphanChainSubThresh x, where x is (1.0 - --consensusSlack); by default this corresponds to x = 0.65.
  - --preMergeChainSubThresh : The threshold of sub-optimal chains, compared to the best chain on a given target, that will be retained and passed to the next phase of mapping. Specifically, if the best chain for a read (or read-end in paired-end mode) to target t has score X_t, then all chains for this read with score >= X_t * preMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Its default value is 0.75 for paired-end data and 1.0 for single-end data.
  - --postMergeChainSubThresh : The threshold of sub-optimal chain pairs, compared to the best chain pair on a given target, that will be retained and passed to the next phase of mapping. This is different from preMergeChainSubThresh, because this is applied to pairs of chains (from the ends of paired-end reads) after merging (i.e. after checking concordancy constraints etc.). Specifically, if the best chain pair to target t has score X_t, then all chain pairs for this read pair with score >= X_t * postMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. The default value for this parameter is 0.9. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
  - --orphanChainSubThresh : This threshold sets a global sub-optimality threshold for chains corresponding to orphan mappings. That is, if the merging procedure results in no concordant mappings, then only orphan mappings with a chain score >= orphanChainSubThresh * bestChainScore will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Unlike the --preMergeChainSubThresh and --postMergeChainSubThresh options, this threshold is global with respect to all orphan chains (not simply per-target). From that perspective, you can view it as overriding the value of --consensusSlack in the case of orphan mappings. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
- The default --mismatchSeedSkip was changed from 5 to 3.
- Updated the required LibGFF dependency to v2.0.0. If you already have this installed on your system, you can pass a hint to its location to cmake using -DLIB_GFF_PATH or -DGFF_ROOT.
- Add the "CellRanger" standard tags, CB:Z and UR:Z, to the alignment records reported by alevin if the user passes the --writeMappings flag when running alevin.
- Moved from (deprecated) tbb::atomic<double> to std::atomic<double> throughout the codebase, including accounting for the lack of a compare_and_swap method on the latter.
- Changed the default gap-open penalty to 6 (from 4). This makes any gap less preferred compared to a mismatch. Note: How to properly set the default scoring scheme, as well as how to set an ideal alignment quality threshold (i.e. what is the lowest quality alignment one should allow), is not a straightforward question. This change in default accords with our belief that gaps should be penalized more in typical data. However, the ideal settings for such parameters are certainly worthy of more in-depth study, and we are looking into both empirical and theoretical mechanisms for determining how these parameters can best be set. To obtain the old (pre 1.3.0) scoring scheme, simply pass --go 4 on the command line. You can also experiment with even more stringent gap penalties by increasing --go for gap open (current default 6) and --ge for gap extend (current default 2).
- Changed warning message color from yellow to magenta to make it readable on both light and dark backgrounds (addresses #541).
- Emojis in release notes 😃.
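The decoy tags described in the improvements above (DS:D on @SQ header lines, XT:A:D on alignment records) lend themselves to a simple line filter. The following Python sketch is one possible way to drop decoy entries from a plain-text SAM stream; it is not the procedure linked from the notes, the function name is hypothetical, and the example records are made up.

```python
# A sketch, not the procedure linked from the notes: it assumes plain-text SAM
# and treats each tag as a tab-separated field. The function name is
# hypothetical.

def strip_decoys(sam_lines):
    """Yield SAM lines with decoy @SQ headers and decoy alignments removed."""
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            # Header line: drop reference sequences marked as decoys.
            if fields[0] == "@SQ" and "DS:D" in fields[1:]:
                continue
        elif "XT:A:D" in fields[11:]:
            # Alignment record: optional tags start at the 12th column.
            continue
        yield line


example = [
    "@HD\tVN:1.0\tSO:unknown\n",
    "@SQ\tSN:tx1\tLN:1000\tDS:T\n",
    "@SQ\tSN:decoy1\tLN:5000\tDS:D\n",
    "r1\t0\ttx1\t10\t1\t4M\t*\t0\t0\tACGT\t*\tXT:A:T\n",
    "r2\t0\tdecoy1\t20\t1\t4M\t*\t0\t0\tACGT\t*\tXT:A:D\n",
]

kept = list(strip_decoys(example))
print("".join(kept), end="")
```

Matching whole tab-separated fields, rather than searching for the substring anywhere in the line, avoids accidentally dropping a record whose read name or sequence happens to contain the same characters.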
🐛 Bug fixes
- Improved selective-alignment speed in a pathological case involving isolated homopolymer MEM chains. Thanks to @red-plant for raising the issue (with reproducible data) in #527.
- Custom barcode lengths for the --citeseq mode were disabled. This has been fixed in #531, and the --citeseq single-cell protocol can be used along with the --end --barcodeLength --umiLength triplet. Thanks @rfarouni for reporting this.
- The variance estimates reported by the --numCellBootstraps option in alevin were not corrected for bias. They are now corrected to report unbiased estimates by multiplying the variance matrix by n/(n-1).
- Fixed a linking order issue that could, on rare custom compiles of salmon, cause memory to be allocated by TBB and freed by jemalloc (resulting in a segfault). Thanks to @mathog and davidtgoldblatt for helping to track down and resolve this one!
- Fixed an error (regression) that could cause an overhanging read in a read pair to be improperly not marked as a dovetail (when it is). This could result in assignment preference for transcripts where the dovetailing read overhangs the transcript start.
- Fixed a bug that could occur in certain cases of between-MEM alignment where too high an alignment score could be attributed to a mapping. This could occur when there were overlapping MEMs in the chain on the reference (a bit uncommon), and when the size of the overlap was different on the read and reference. This bug has been fixed by properly adjusting the score in all cases.
- The dynamic and asynchronous update of the fragment length distribution could cause fluctuations in fragment-level conditional probabilities within the set of alignments for a given fragment. For duplicate transcripts this could lead to an unexpected result where sequence-duplicate transcripts could be inferred to have unequal abundance. The current release addresses this behavior by employing a fragment length distribution cache to ensure there are no fluctuations in conditional fragment length probabilities among the set of alignments for a given fragment. Note: This behavior is expected only to have affected atypical salmon usage, as duplicate transcripts are collapsed / discarded by default during indexing.
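The n/(n-1) factor mentioned in the --numCellBootstraps fix above is the usual correction that turns a variance computed with a divisor of n into the unbiased sample variance. A minimal Python sketch for a single value, assuming made-up bootstrap estimates and using only the standard library rather than anything from alevin:

```python
# Illustration only: the bootstrap values are made up, and this uses the
# standard library rather than anything from alevin.
from statistics import pvariance, variance

bootstrap_counts = [10.0, 12.0, 9.0, 13.0, 11.0]  # hypothetical estimates
n = len(bootstrap_counts)

biased = pvariance(bootstrap_counts)        # divides by n
corrected = biased * n / (n - 1)            # rescale by n/(n-1)

print(biased)                               # 2.0
print(corrected)                            # 2.5
print(variance(bootstrap_counts))           # 2.5 (divides by n - 1 directly)
```

With few bootstrap replicates the difference is noticeable (25% here for n = 5), and it shrinks as n grows.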