Alevin: incorrect indices in MTX output #431

pinin4fjords · 2019-10-01T14:15:01Z

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

Alevin

Describe the bug

I have an Alevin run which indicates 17136 genes in the MTX output:

%%MatrixMarket	matrix	coordinate	real	general
667389	17136	13000897
1	4	0.000000
1	5	0.000000
1	9	0.000000
1	11	0.000000
1	17	0.000000
...

quants_mat_cols.txt also has 17136 rows.

However the maximum value in the second column of the MTX is out range at 17144, so the matrix is invalid. There are thousands of matrix cells like this- not just one dodgy one.

To Reproduce
Steps and data to reproduce the behavior:

Version 0.14.1
Bioconda
Gallus Gallus cDNA from Ensembl, file Gallus_gallus.GRCg6a.cdna.all.98.fa.gz
These files from ENA, analysed together:

SAMN11526602	SRR8985046	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/006/SRR8985046/SRR8985046_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/006/SRR8985046/SRR8985046_2.fastq.gz
SAMN11526602	SRR8985045	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/005/SRR8985045/SRR8985045_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/005/SRR8985045/SRR8985045_2.fastq.gz
SAMN11526602	SRR8985044	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/004/SRR8985044/SRR8985044_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/004/SRR8985044/SRR8985044_2.fastq.gz
SAMN11526602	SRR8985043	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/003/SRR8985043/SRR8985043_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/003/SRR8985043/SRR8985043_2.fastq.gz
SAMN11526602	SRR8985042	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/002/SRR8985042/SRR8985042_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/002/SRR8985042/SRR8985042_2.fastq.gz
SAMN11526602	SRR8985041	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/001/SRR8985041/SRR8985041_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/001/SRR8985041/SRR8985041_2.fastq.gz
SAMN11526602	SRR8985040	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/000/SRR8985040/SRR8985040_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/000/SRR8985040/SRR8985040_2.fastq.gz
SAMN11526602	SRR8985039	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/009/SRR8985039/SRR8985039_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/009/SRR8985039/SRR8985039_2.fastq.gz

(renamed to cdna*.fastq.gz and barcodes.fastq.gz internal to workflow)

Which which program options were used?

# Do a pre-run to derive a starting whitelist, see https://github.com/COMBINE-lab/salmon/issues/362

salmon alevin -l ISR -1 $(ls barcodes*.fastq.gz | tr '\n' ' ') -2 $(ls cdna*.fastq.gz | tr '\n' ' ')         --chromium -i salmon_index -p 12 -o SAMN11526602_pre --tgMap transcript_to_gene.txt --dumpFeatures --noQuant

# Derive a relaxed whitelist, removing only the most obvious junk 

if [ $? -eq 0 ]; then 
    awk '{ if ($2 > 10) { print $1} }' SAMN11526602_pre/alevin/raw_cb_frequency.txt > pre_whitelist.txt
fi

# Supply the whitelist to the main Alevin run

salmon alevin -l ISR -1 $(ls barcodes*.fastq.gz | tr '\n' ' ') -2 $(ls cdna*.fastq.gz | tr '\n' ' ')         --chromium -i salmon_index -p 12 -o SAMN11526602_tmp --tgMap transcript_to_gene.txt --whitelist pre_whitelist.txt         --forceCells $(cat pre_whitelist.txt | wc -l | tr -d '\n') --dumpMtx

Expected behavior
MTX files without references to out-of-range columns.

Screenshots
If applicable, add screenshots or terminal output to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. Ubuntu Linux, OSX]
Version [ If you are on OSX, the output of sw_vers. If you are on linux the output of uname -a and lsb_release -a]

Linux hx-noah-63-15.ebi.ac.uk 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 GNU/Linux

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.4 (Maipo)
Release:	7.4
Codename:	Maipo

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

pinin4fjords · 2019-10-01T14:31:36Z

Apologies- not quite true that I use the above reference. I actually process the above cDNAs and remove any (if any) that don't have mappings in the associated GTF (from which I derived the transcript-to-gene mappings). Here's the R code:

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library(Biostrings))
suppressPackageStartupMessages(library(rtracklayer))

cl <- commandArgs(trailingOnly = TRUE)

cdna_file <- cl[1]
gtf_file <- cl[2]
transcript_to_gene_file <- cl[3]
filtered_cdna_file <- cl[4]

# Derive a transcript-to-gene mapping from the GTF

annotation <- elementMetadata(import(gtf_file))
annotation <- subset(annotation, ! (is.na(transcript_id) | is.na(gene_id)))

if ('transcript_version' %in% colnames(annotation)){
  mapping <- cbind(paste(annotation$transcript_id, annotation$transcript_version, sep='.'), annotation$gene_id)
}else{
  mapping <- annotation[,c('transcript_id', 'gene_id')]
}

# Read the cDNA

cdna <- readDNAStringSet(cdna_file)
cdna_transcript_names <- unlist(lapply(names(cdna), function(x) unlist(strsplit(x, ' '))[1]  ))

# Filter out cDNAs without matching transcript entries in the GTF

cdna <- cdna[which(cdna_transcript_names %in% mapping[,1])]

# Write transcript-to-gene mapping and filtered cDNA file

write.table(unique(mapping), file= transcript_to_gene_file, row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
writeXStringSet(x = cdna, filepath = filtered_cdna_file, compress = 'gzip')

pinin4fjords · 2019-10-01T14:40:39Z

Already fixed @k3yavi ? Is there any chance of a 0.14.2 release containing the fix?

k3yavi · 2019-10-01T14:43:41Z

HI @pinin4fjords ,

Thanks for reporting the this. There was bug associated with binary format to mtx format conversion we fixed in the upcoming release. The problem was associated with the last index of the matrix which can be off by max 8 indices because we were using uint_8 for storing the bit vectors.
My apologies for the trouble, there are two ways to solve this issue:
1.) Rerun the pipeline with the latest beta release of 0.99, unfortunately this is not in conda yet, but we plan to release it soon with awesome new updates like consuming both genome and transcriptome for read mapping.
2.) If you wan't to avoid rerunning the full pipeline, try this. Try cloning this repo and do a cargo install --release for the code in src-rs folder. Note: You might have to install Rust for this, it's just one liner install. Once compiled the EDS code, you can just do the following to generate the correct mtx file.

./target/release/eds convert -i <Path to output/alevin/quants_mat.gz> --mtx -c <num_cells> -f <num_genes>

Please let me know if it works out for you.

pinin4fjords · 2019-10-01T14:47:14Z

Thanks @k3yavi - do I need to assume that all of our existing matrices could be corrupted?

We've built Alevin into our production processes, so I'm loath to hack my way to a solution. When is the 0.99 release due?

pinin4fjords · 2019-10-01T14:57:21Z

A back port of this fix to a released 0.14.2 would be the most awesome solution ;-)

k3yavi · 2019-10-01T14:59:36Z

Hi @pinin4fjords ,
The error is very subtle, It happens when the total number of genes are exactly a multiple of 8, which in your case it is. It'd be great that before moving forward, if you can quickly verify if 0.99 actually fixes the issue or is it something else. Unfortunately, there is a chance that the mtx format output is corrupted for the last 8 genes (in the order they present in the quants_mat_cols.txt), however, the baseline output of alevin quants_mat.gz would still be correct and already be in the output folder of your pipeline. The EDS repo I forwarded you just convert the binary output into mtx format correctly.

Due to these subtle issues making into production environment, we changed the release cycle to be like a beta to stable type. Currently 0.99 is a beta release, which has been out for some time. If I have to guess, we plan to release 1.0 (the stable release) in approximately 2 weeks.

k3yavi · 2019-10-01T15:03:12Z

A back port of this fix to a released 0.14.2 would be the most awesome solution ;-)

Let me take this to the salmon team and get back to you. My apologies again for the stupid bug.

pinin4fjords · 2019-10-01T15:05:15Z

Don't apologise, it happens. Just looking for a fix I can apply neatly without a major software change.

pinin4fjords · 2019-10-01T15:08:27Z

... but will try to test the 0.99 too

k3yavi · 2019-10-01T15:11:36Z

Sounds good, let me know if 0.99 fixes it for you. If it does, I'll check if we can do something about 14.2.

pinin4fjords · 2019-10-03T07:47:29Z

Hi @k3yavi - confirmed, fixed in 0.99!

pinin4fjords · 2019-10-04T10:37:19Z

Could you let me know if a minor release is on the cards (or not) @k3yavi ? Would be nice if at least one Bioconda-available release had the fix.

k3yavi · 2019-10-04T11:33:01Z

Glad to know, that 0.99 fixes it.
I have a meeting today with the salmon team, I'd definitely let you know before eod.

k3yavi · 2019-10-04T18:19:20Z

Hi @pinin4fjords ,

I am gonna make the hot-fix release for the 0.14.2 now.
We plan to release v1.0.0 mid October, so with the v0.14.2 I can make just the bioconda and github release from the master branch to fix the mtx issue. Just wanted to give you heads up, I am making 14.2 from master, so the only difference from 14.1 would be the hotfix i.e. there would be no Genome+txome capability as that's planned with 1.0.

I'll keep you updated once it's available through bioconda.

pinin4fjords · 2019-10-04T21:24:53Z

Thanks very much @k3yavi that's great

k3yavi · 2019-10-04T22:34:59Z

Ok, pushed to bioconda, should be available in a couple of hours. Once it's available I'll make the official release too on the github. It'd be great if you can quickly test the new release for the bug once it's available. Thanks again !

k3yavi · 2019-10-04T22:59:11Z

All right, v0.14.2 is online now.

pinin4fjords · 2019-10-08T07:59:09Z

Thanks very much @k3yavi ! 0.14.2 works in my example too

k3yavi self-assigned this Oct 1, 2019

k3yavi added alevin issue is primarily related to alevin bug labels Oct 1, 2019

k3yavi added the fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master label Oct 1, 2019

pinin4fjords mentioned this issue Oct 29, 2019

Update Alevin to 0.14.2 ebi-gene-expression-group/scxa-droplet-quantification-workflow#4

Merged

k3yavi closed this as completed Oct 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alevin: incorrect indices in MTX output #431

Alevin: incorrect indices in MTX output #431

pinin4fjords commented Oct 1, 2019 •

edited

Loading

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019 •

edited

Loading

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019

k3yavi commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019

pinin4fjords commented Oct 3, 2019

pinin4fjords commented Oct 4, 2019

k3yavi commented Oct 4, 2019

k3yavi commented Oct 4, 2019

pinin4fjords commented Oct 4, 2019

k3yavi commented Oct 4, 2019

k3yavi commented Oct 4, 2019

pinin4fjords commented Oct 8, 2019

Alevin: incorrect indices in MTX output #431

Alevin: incorrect indices in MTX output #431

Comments

pinin4fjords commented Oct 1, 2019 • edited Loading

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019 • edited Loading

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019

k3yavi commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

pinin4fjords commented Oct 1, 2019

k3yavi commented Oct 1, 2019

pinin4fjords commented Oct 3, 2019

pinin4fjords commented Oct 4, 2019

k3yavi commented Oct 4, 2019

k3yavi commented Oct 4, 2019

pinin4fjords commented Oct 4, 2019

k3yavi commented Oct 4, 2019

k3yavi commented Oct 4, 2019

pinin4fjords commented Oct 8, 2019

pinin4fjords commented Oct 1, 2019 •

edited

Loading

k3yavi commented Oct 1, 2019 •

edited

Loading