salmon v0.14.0
Salmon 0.14.0 release notes
In addition to the changes and enhancements listed below, this release of salmon implements the decoy-aware selective-alignment strategy described in the manuscript Alignment and mapping methodology influence transcript abundance estimation. For reasons explored in depth in the manuscript, we recommend making use of this decoy-aware selective alignment strategy when not providing pre-aligned reads to salmon. Because of the changes required to implement this indexing strategy, salmon v0.14.0 is not compatible with the indices of previous versions, and so you must re-build the index for this version of salmon (which must be done anyway, if one is adding decoy sequence).
Adding decoy sequence to the salmon index.
Adding decoy sequence to the salmon index is simple, but salmon is specific about the manner in which the sequence is added. To ease this process, we have created a script that allows the automated creation of a decoy-enhanced transcriptome from a genome FASTA, transcriptome FASTA, and annotation GTF file. The script, as well as detailed instructions on how to run it and use its output, is provided in the SalmonTools repository.
Note: Because making effective use of the decoy sequence requires having accurate mapping scores, the decoys are only used when salmon is run with selective alignment (i.e. with the flags --validateMappings, --mimicBT2 or --mimicStrictBT2).
Detailed description of decoy requirements
It is not necessary to use the script we provide to extract decoy sequences, and if you'd like to add your own decoys to the file you wish to index, the process is fairly straightforward. All records for decoy sequence must come at the end of the FASTA file being indexed, and you must provide a file with all of the names (one name per line) of the records that should be treated as decoys (they need not be in the same order as in the FASTA file). Consider that you have the files txome.fa and decoys.fa, where decoys.fa contains the decoy sequences you want to add to your index. Also, assume that decoys.txt is the file that will contain the names of the decoy records. You can create valid input files as follows:
$ grep "^>" decoys.fa | cut -d ">" -f2 > decoys.txt
$ cat txome.fa decoys.fa > txome_combined.fa
Now, you can build the decoy-aware salmon index using the command:
$ salmon index -t txome_combined.fa -d decoys.txt -i combined_index
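Since salmon is specific about how decoys are supplied, it may help to sanity-check the combined FASTA before indexing. The following is a minimal sketch and not part of salmon or SalmonTools: check_decoys_last is a hypothetical helper name, and it assumes that each name in decoys.txt is the full header text after ">" (which is what the grep/cut command above produces). It checks that every listed decoy occurs in the FASTA and that no non-decoy record follows a decoy record.

```shell
#!/usr/bin/env bash
# check_decoys_last FASTA DECOY_NAMES
# Hypothetical helper (not provided by salmon). Returns 0 when:
#   * every name listed in DECOY_NAMES appears as a record in FASTA, and
#   * all decoy records come after the last non-decoy record.
# Assumption: names are compared against the full header text after ">".
check_decoys_last() {
  awk '
    # First file: remember each non-empty decoy name.
    NR == FNR { if ($0 != "") decoy[$0] = 1; next }
    # Second file: inspect FASTA header lines only.
    /^>/ {
      name = substr($0, 2)
      if (name in decoy) {
        seen[name] = 1
        in_decoys = 1
      } else if (in_decoys) {
        print "non-decoy record after decoys: " name
        bad = 1
      }
    }
    END {
      for (d in decoy) {
        if (!(d in seen)) { print "decoy missing from FASTA: " d; bad = 1 }
      }
      exit bad + 0
    }
  ' "$2" "$1"
}
```

With the files from the example above, this could be run as check_decoys_last txome_combined.fa decoys.txt before building the index; a non-zero exit status indicates that the file layout should be revisited.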
Changes to default behavior and new behavior
- Dovetailing mappings and alignments are considered discordant and discarded by default; this is the same behavior that Bowtie2 adopts by default. This is a change from the older behavior of salmon, where dovetailing mappings were considered concordant and counted by default. If you wish to consider dovetailing mappings as concordant (the previous behavior), you can do so by passing the --allowDovetail flag to salmon quant. Exotic library types (e.g. MU, MSF, MSR) are no longer supported. If you need support for such a library type, please submit a feature request describing the use-case.
- The version check information is now written to stderr rather than stdout. This enables directly redirecting the SAM output when using the -z/--writeMappings flag with the implicit argument that writes that output to stdout. NOTE: If you are having difficulty using the -z/--writeMappings flag to write output to a file (e.g. using -z <file.sam> or --writeMappings <file.sam>), try using -z=<file.sam> or --writeMappings=<file.sam> instead; this appears to be an issue with Boost's argument parsing library for flags that have implicit as well as default values.
- Salmon now automatically detects, during indexing, if it believes that the transcriptome being indexed is in GENCODE format and the --gencode flag has not been passed. In this case, it issues a warning, since we generally recommend using this flag when indexing GENCODE transcriptomes (to avoid the very long transcript names in the output). This implements feature request 366; thanks @alexvpickering.
- The default setting for --numPreAuxModelSamples has been lowered from 1,000,000 to 5,000. This simply means that the basic models (and crucially the read alignment error model) will start being applied much earlier in the online algorithm. This has very little effect on samples with a decent number of fragments, but can considerably improve estimates (especially in alignment-based mode) for samples with only a small number of fragments.
- The definition of --consensusSlack has changed. Instead of being an absolute number, it is now a fractional value (between 0 and 1) that describes the fraction of "hits" (i.e. suffix array intervals) that a mapping may miss and still be considered valid for chaining.
Improvements and new flags for bulk mode
When writing out mappings in conjunction with
The flags below are either new, or only present since v0.13.0 and are therefore highlighted again below for completeness:
- --mimicBT2: This flag is a "meta-flag" that sets the parameters related to mapping and selective alignment to mimic alignment using Bowtie2 (with the flags --no-discordant and --no-mixed), but using the default scoring scheme and allowing both mismatches and indels in alignments.
- --mimicStrictBT2: This flag is a "meta-flag" that sets the parameters related to mapping and selective alignment to mimic alignment using Bowtie2 (with the flags suggested by RSEM), but using the default scoring scheme and allowing both mismatches and indels in alignments. These settings essentially disallow indels in the resulting alignments.
In addition to these "meta-flags", a few other flags have been introduced that can alter the behavior of mapping:
- --recoverOrphans: This flag (which should only be used in conjunction with selective alignment) performs orphan "rescue" for reads. That is, if mappings are discovered for only one end of a fragment, or if the mappings for the ends of the fragment don't fall on the same transcript, then this flag will cause salmon to look upstream or downstream of the discovered mapping (anchor) for a match for the opposite end of the given fragment. This is done by performing "infix" alignment within the maximum fragment length upstream or downstream of the anchor mapping using edlib.
- --hardFilter: This flag (which should only be used with selective alignment) turns off soft filtering and range-factorized equivalence classes, and removes all but the equally highest scoring mappings from the equivalence class label for each fragment. While we recommend using soft filtering (the default) for quantification, this flag can produce easier-to-understand equivalence classes if that is the primary object of study.
- --skipQuant: Related to the above, this flag will stop execution before the actual quantification algorithm is run.
- --bandwidth: This flag (which is only meaningful in conjunction with selective alignment) sets the bandwidth parameter of the relevant calls to ksw2's alignment function. This determines how wide an area around the diagonal in the DP matrix should be calculated.
- --maxMMPExtension: This flag (which should only be used with selective alignment) limits the length that a mappable prefix of a fragment may be extended before another search along the fragment is started. Smaller values for this flag can improve the sensitivity of mapping, but could increase run time.
Through broad benchmarking across many samples, we have worked to considerably improve the selective-alignment algorithm and its sensitivity. We note that selective alignment will likely be turned on by default in future releases, and we strongly encourage all users to make use of this feature and report their experiences with it.
Along with the default selective alignment (enabled via --validateMappings), there are two "meta" flags that enable selective alignment parameters meant to mimic configurations in which users might be interested.
New information available in meta_info.json
- The following fields have been added to meta_info.json:
  - num_valid_targets: The number of non-decoy targets in the index used for mapping.
  - num_decoy_targets: The number of decoy targets in the index used for mapping (only meaningful in mapping-based mode).
  - num_decoy_fragments: The number of fragments that were discarded from quantification because they best-aligned to a decoy target rather than a valid transcript.
  - num_dovetail_fragments: The number of fragments that have only dovetailing mappings. If the --allowDovetail flag was passed, these are counted toward quantification; otherwise they are discarded (but this number is still reported). This field only has a meaningful value in quasi-mapping mode (with or without selective alignment).
  - num_fragments_filtered_vm: The number of fragments that had a mapping to the transcriptome, but which were discarded because none of the mappings for the fragments exceeded the minimum selective alignment score. This field only has a meaningful value in conjunction with selective alignment (otherwise it is 0).
  - num_alignments_below_threshold_for_mapped_fragments_vm: The number of mappings discarded because they failed to reach the minimum selective alignment score, but for which the corresponding fragment had at least a single valid mapping. This field only has a meaningful value in conjunction with selective alignment (otherwise it is 0).
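To inspect these fields after a run, something like the following could be used. This is only a sketch under stated assumptions: meta_field and report_new_meta_fields are hypothetical helper names, the path to meta_info.json is supplied by the caller, and the values are assumed to be written as plain non-negative integers. A dedicated JSON parser would be more robust than the sed pattern used here.

```shell
#!/usr/bin/env bash
# meta_field FILE KEY
# Hypothetical helper: print the integer value stored under KEY in FILE.
# Assumption: the value is a plain non-negative integer ("key": 123).
meta_field() {
  sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1" | head -n 1
}

# report_new_meta_fields FILE
# Print one "name<TAB>value" line for each field added in this release;
# fields that cannot be found are reported as NA.
report_new_meta_fields() {
  local field value
  for field in num_valid_targets num_decoy_targets num_decoy_fragments \
               num_dovetail_fragments num_fragments_filtered_vm \
               num_alignments_below_threshold_for_mapped_fragments_vm; do
    value=$(meta_field "$1" "$field")
    printf '%s\t%s\n' "$field" "${value:-NA}"
  done
}
```

Calling report_new_meta_fields with the path to a run's meta_info.json prints a small tab-separated table, which makes it easy to spot, for example, a sample in which an unexpectedly large number of fragments were assigned to decoys.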
Improvements in single cell mode
- Alevin supports decoy genomic alignments. NOTE: If you have an index built with a previous version of salmon, you will have to rebuild it for the 0.14 release.
- The data of the file filtered_cb_frequency.txt, along with other features, will be dumped in the file featureDump.txt by default, i.e. you don't need the --dumpFeature flag to get CB-level features, except for raw_cb_frequency.txt. The features in the features file are as follows:
  - Cellular Barcode (CB) Sequence
  - Number of sequence corrected reads assigned to the CB
  - Number of mapped reads assigned to the CB
  - Number of deduplicated reads assigned to the CB
  - Mapping rate i.e. #mapped reads / #sequence corrected reads
  - Deduplication rate i.e. 1 - (#deduplicated reads / #mapped reads)
  - Mean / Max of the expressed gene quantification estimates.
  - Number of expressed genes.
  - Number of genes with the count estimates more than the mean.
  - Average number of reads deduplicated in each arborescence
- The command line flag --dumpUmiGraph, along with the per-cell UMI graphs, also dumps the frequency of the number of reads used for deduplicating an arborescence. It is added to the last column of featureDump.txt as tab-separated #Reads:#arborescence pairs.
- The binary output format of alevin, quants_mat.gz, has been changed to a sparse single-precision format. In practice, we saw the file size reduced to as little as half the size of the original file.
- A new command line flag, --dumpMtx, has been added to dump the quants in matrix-market-exchange (mtx) sparse format.
- When errors are encountered in different stages of the alevin pipeline, instead of the default error code of 1, alevin will report one of the following four categories of error codes for automated debugging:
  - 1: Error while mapping reads and/or generic errors.
  - 64: Error in knee estimation / Cellular Barcode sequence correction.
  - 74: Error while deduplicating UMI and/or EM optimization.
  - 84: Error while intelligent whitelisting.
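An automated pipeline could translate the error codes listed above into readable messages along the following lines. This is a minimal sketch: describe_alevin_exit is a hypothetical helper name, and treating 0 as success is the usual shell convention rather than something stated in these notes.

```shell
#!/usr/bin/env bash
# describe_alevin_exit CODE
# Hypothetical helper: map an alevin exit status to the pipeline stage
# named in the release notes.
describe_alevin_exit() {
  case "$1" in
    0)  echo "success" ;;  # assumed: conventional success status
    1)  echo "mapping reads or generic error" ;;
    64) echo "knee estimation or cellular barcode sequence correction" ;;
    74) echo "UMI deduplication or EM optimization" ;;
    84) echo "intelligent whitelisting" ;;
    *)  echo "unrecognized exit code: $1" ;;
  esac
}
```

A wrapper script might run alevin with its usual arguments, capture $? immediately afterwards, and pass it to describe_alevin_exit to decide which stage's logs to inspect first.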
Bug fixes, deprecations and removals
- A bug in the quantmerge command (issue 356) that could cause the output of quantmerge to be truncated was fixed (the bug was first introduced in v0.13.0).
- Added missing explicit initialization for a variable that could affect the initialization condition of the optimization; thanks @come-raczy.
- The following developer (hidden) flags have been deprecated:
  - --dumpUmitoolsMap (permanently disabled)
  - --noSoftMap (Always assumed True)
  - --dumpBarcodeMap (permanently disabled)
  - --noBarcode (Always assumed False)
- The following user flags have been deprecated:
  - --debug (Always assumed True)
  - --useCorrelation (permanently disabled)
  - --dumpCsvCounts (swapped in favor of mtx with the flag --dumpMtx)