Alevin didn't find barcode in frequency table #253
Comments
Hi @habilzare, if I may ask a quick question, how did we get the file? I agree that we can work on a more informative error message so that users can resolve this by themselves.
Thinking more about it, we can actually throw away a whitelisted CB with 0 frequency, with a warning. Thanks again @habilzare for pointing this out.
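Until such a fix is available, the same idea can be applied on the user side before running alevin. The following is a minimal Python sketch, not alevin's implementation: the function names are hypothetical, and it assumes that the cell barcode is the first 16 bases of read 1 (the 10x Chromium v2 layout).

```python
import warnings
from collections import Counter

# Assumption: 10x Chromium v2 layout, cell barcode = first 16 bases of read 1.
CB_LENGTH = 16


def barcode_frequencies(read1_sequences, cb_length=CB_LENGTH):
    """Count how often each barcode prefix occurs in the read-1 sequences."""
    return Counter(
        seq[:cb_length] for seq in read1_sequences if len(seq) >= cb_length
    )


def filter_whitelist(whitelist, frequencies):
    """Split the whitelist into observed barcodes and barcodes with 0 frequency."""
    kept = [cb for cb in whitelist if frequencies.get(cb, 0) > 0]
    dropped = [cb for cb in whitelist if frequencies.get(cb, 0) == 0]
    if dropped:
        # Warn instead of failing, as proposed in the comment above.
        warnings.warn(
            f"dropping {len(dropped)} whitelisted barcode(s) with 0 frequency"
        )
    return kept, dropped


# Toy input: three copies of the first read shown in this issue.
reads = ["CTGATCCTCGAGAGCACTCACAGAGTTTTTTGTTTT"] * 3
whitelist = ["CTGATCCTCGAGAGCA", "AAAAAAAAAAAAAAAA"]
kept, dropped = filter_whitelist(whitelist, barcode_frequencies(reads))
```

Writing `kept` back to a text file, one barcode per line, would give a whitelist containing only barcodes that occur in the reads.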
Hi @k3yavi ,
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
Logs will be written to alevin_output/logs
salmon (single-cell-based) v0.10.2
[ program ] => salmon
[ command ] => alevin
[ libType ] => { ISR }
[ mates1 ] => { SRR6327122_1.fastq.gz }
[ mates2 ] => { SRR6327122_2.fastq.gz }
[ chromium ] => { }
[ index ] => { index }
[ threads ] => { 2 }
[ output ] => { alevin_output }
[ tgMap ] => { transposon_sequence_set.fa.tsv }
[ whitelist ] => { barcode_seq_5K.txt }
[ dumpCsvCounts ] => { }
[2018-07-19 22:53:27.714] [alevinLog] [info] Processing barcodes files (if Present)
processed 87 Million barcodes
[2018-07-19 22:55:37.299] [alevinLog] [info] Done barcode density calculation.
[2018-07-19 22:55:38.385] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify
[2018-07-19 22:55:38.385] [alevinLog] [info] parsing read library format
[2018-07-19 23:03:35.740] [jointLog] [info] Computed 150 rich equivalence classes for further processing
[2018-07-19 23:03:35.741] [jointLog] [info] Mapping rate = 0.469385%
[2018-07-19 23:03:35.741] [jointLog] [info] finished quantifyLibrary()
Analyzed 5238 cells (100% of all).
Hi @habilzare, However, I reckon that alevin hasn't completed successfully. It looks like one of the whitelisted CBs ended up with no reads at all after deduplication. We usually assume this should not happen, since a whitelisted CB should have at least some counts after deduplicating UMIs; in your case it can happen because the mapping rate is surprisingly low.
Slight correction on the above statement "It looks like one of the whitelisted CB ended up having no read at all after
The low rate of mapping is expected because in this study I am interested only in the transposons.
Yep, that should work too, although there are a couple of minor things. First, you might want to use 10 reads, since that's the lower bound. Second, those reads should map too; I guess you can copy a 98-length sequence from a transposon's region. Lastly, the UMI: if the UMI sequence overlaps with a UMI sequence already present, it can potentially affect the deduplication of cells having more than 10 counts. You might want to choose a disjoint UMI sequence that is not in your dataset, and reduce the count by 1 in the count matrix, since if all the newly added UMIs are the same and map to the same gene, they will be deduplicated to 1.
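The workaround described above can be sketched in Python. This is only an illustration under stated assumptions: the function names, read names, and constant quality string are hypothetical, and read 1 is assumed to be a 16 bp barcode followed by a 10 bp UMI (the 10x Chromium v2 layout).

```python
from itertools import product


def pick_disjoint_umi(observed_umis, length=10):
    """Return the first UMI (in ACGT product order) absent from observed_umis."""
    observed = set(observed_umis)
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if candidate not in observed:
            return candidate
    raise ValueError("every possible UMI is already present")


def make_dummy_records(barcode, umi, target_seq, n_reads=10, read2_len=98,
                       prefix="dummy"):
    """Build n_reads paired FASTQ records for one cell barcode.

    Read 1 is barcode + UMI; read 2 is the first read2_len bases of
    target_seq, so that the dummy reads map to the reference.
    """
    if len(target_seq) < read2_len:
        raise ValueError("target sequence is shorter than the requested read")
    read1 = barcode + umi
    read2 = target_seq[:read2_len]
    r1_records, r2_records = [], []
    for i in range(1, n_reads + 1):
        name = f"@{prefix}.{barcode}.{i}"
        # Hypothetical constant quality string; one 'J' per base.
        r1_records.append(f"{name}\n{read1}\n+\n{'J' * len(read1)}\n")
        r2_records.append(f"{name}\n{read2}\n+\n{'J' * len(read2)}\n")
    return r1_records, r2_records


# Toy input: a barcode from this issue and a made-up 120 bp target sequence.
umi = pick_disjoint_umi({"CTCACAGAGT", "AAAAAAAAAA"})
r1, r2 = make_dummy_records("CTGATCCTCGAGAGCA", umi, "ACGT" * 30)
```

Because all ten dummy reads share one UMI and one target, they should collapse to a single count after deduplication, which is why the comment suggests reducing the affected cell's count by 1 afterwards.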
I wrote the make.dummy.fastq() function in R, which addressed the issue as planned above. For some of the samples, I got the "umi indexing of jellyfish failing" error, which I realized from this report must be related to reads shorter than 26 bp. I used Trimmomatic to exclude those short reads, and finally Alevin produced the count matrix in CSV format for me. Thank you Salmon team! Minor: The output CSV file has an extra comma at the end of each line, which I ignored in my post-processing step.
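The two post-hoc fixes mentioned above can be illustrated with a small Python sketch. It is not Trimmomatic and not part of Salmon: the function names are hypothetical, and the 26-base threshold assumes that read 1 carries a 16 bp barcode plus a 10 bp UMI.

```python
import csv
import io

# Assumption: 16 bp barcode + 10 bp UMI (10x Chromium v2) in read 1.
MIN_READ1_LEN = 26


def keep_pairs_with_full_read1(pairs, min_len=MIN_READ1_LEN):
    """Yield (read1_seq, read2_seq) pairs whose read 1 has at least min_len bases.

    Both mates are dropped together so the two files stay in sync.
    """
    for read1_seq, read2_seq in pairs:
        if len(read1_seq) >= min_len:
            yield read1_seq, read2_seq


def parse_counts_csv(text):
    """Parse CSV text, dropping the empty last field left by a trailing comma."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row and row[-1] == "":
            row = row[:-1]
        rows.append(row)
    return rows


# Toy input: one full-length read 1 from this issue and one truncated read.
pairs = [
    ("CTGATCCTCGAGAGCACTCACAGAGTTTTTTGTTTT", "ACGT"),
    ("CTGATCC", "ACGT"),
]
kept_pairs = list(keep_pairs_with_full_read1(pairs))
rows = parse_counts_csv("1,0,3,\n0,2,0,\n")
```

Here `kept_pairs` retains only the first pair, and each row in `rows` has three fields instead of four.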
@habilzare Glad to hear that it worked out, and thanks for the fast and useful modification of the default alevin pipeline. We will work on addressing the suggestions made in the next release of Salmon.
Hi, I have the exact same issue: it is quite likely that some barcodes will be empty, and I don't obtain any results now.
Hi @patrickvdb, thanks for pointing this out.
The solution to the first problem is pretty straightforward, and since it was the subject of this issue, I thought closing this one and opening a new one for the second problem would be the better thing to do.
Just a heads up: issue #266 has been added, and the solution is currently available in the source build from the develop branch. We will merge this into master with the next planned release of Salmon, v0.11.3. Thanks again for the useful feedback and comments.
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
alevin
Describe the bug
I am trying to use Alevin to quantify a single cell RNA-Seq 10x Genomics CHROMIUM dataset. I am using Salmon 0.10.2, and it does not produce the count matrix. The output folder contains nothing but the log files. Specifically, I get the following error:
[2018-07-19 16:27:46.916] [alevinLog] [error] Barcode not found in frequency table
The full messages are below.
To Reproduce
Steps and data to reproduce the behavior:
I explain how to download the data and reproduce the issue in the following.
Specifically, please provide at least the following information:
Which version of salmon was used?
0.10.2
How was salmon installed (compiled, downloaded executable, through bioconda)?
conda config --append channels conda-forge
conda config --append channels bioconda
conda install salmon=0.10.2
Which reference (e.g. transcriptome) was used?
I am interested only in the transposons. Therefore, I am using the "canonical DNA sequences of the transposable elements from species in the genus Drosophila", which are available from the Bergman Lab. Specifically, I use this fasta file. The first 3 lines are:
Which read files were used?
The SRR6327122 run. The fastq files can be downloaded in a couple of hours using the SRA Toolkit, e.g.,
fastq-dump --split-files --gzip --outdir ./ SRR6327122
The first few lines of the fastq files are as follows:
$ zcat SRR6327122_1.fastq.gz |head -n 4
@SRR6327122.1 1 length=36
CTGATCCTCGAGAGCACTCACAGAGTTTTTTGTTTT
+SRR6327122.1 1 length=36
AAFFFJJJJJJJJJJJJJJJJJJJJJ7-----A<-<
And
$ zcat SRR6327122_2.fastq.gz |head -n 4
@SRR6327122.1 1 length=88
CGGCCACAAGATCGCCTTTTTATCCCTCGCCCAGAGCACCCCCCGACCCCACATCCCCTGCTTCACGGCCCCCCTCGCGGCCTACCCG
+SRR6327122.1 1 length=88
--7-<7----7---77----77A-7--7-A7-7---7-A-A7<F-777-77-A---A<A----77--77------------7------
Which program options were used?
One can download the data and results tarball
bcNotFound-2018-07-19.tar.gz.
First, I indexed the reference using:
salmon index -k 31 -t transposon_sequence_set.fa -i index
Then, I ran Alevin using the following command:
salmon alevin -l ISR -1 SRR6327122_1.fastq.gz -2 SRR6327122_2.fastq.gz --chromium -i index -p 2 -o alevin_output --tgMap transposon_sequence_set.fa.tsv --whitelist cell_barcode_seq.txt --dumpCsvCounts
Expected behavior
A clear and concise description of what you expected to happen.
The count matrix should be saved when Alevin is done.
Screenshots
The full output messages are:
Version Info: ### A newer version of Salmon is available. ####
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
Logs will be written to alevin_output/logs
salmon (single-cell-based) v0.10.2
[ program ] => salmon
[ command ] => alevin
[ libType ] => { ISR }
[ mates1 ] => { SRR6327122_1.fastq.gz }
[ mates2 ] => { SRR6327122_2.fastq.gz }
[ chromium ] => { }
[ index ] => { index }
[ threads ] => { 2 }
[ output ] => { alevin_output }
[ tgMap ] => { transposon_sequence_set.fa.tsv }
[ whitelist ] => { cell_barcode_seq.txt }
[ dumpCsvCounts ] => { }
[2018-07-19 18:24:03.053] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2018-07-19 18:24:03.059] [alevinLog] [info] Processing barcodes files (if Present)
processed 87 Million barcodes
[2018-07-19 18:26:13.307] [alevinLog] [info] Done barcode density calculation.
[2018-07-19 18:26:13.307] [alevinLog] [info] # Barcodes Used: 86885223 / 87959276.
[2018-07-19 18:26:13.334] [alevinLog] [info] Done importing white-list Barcodes
[2018-07-19 18:26:13.334] [alevinLog] [info] Total 54879 white-listed Barcodes
[2018-07-19 18:26:18.285] [alevinLog] [info] Done populating Z matrix
[2018-07-19 18:26:18.300] [alevinLog] [info] Done indexing Barcodes
[2018-07-19 18:26:18.301] [alevinLog] [info] Total Unique barcodes found: 978816
[2018-07-19 18:26:18.301] [alevinLog] [info] Used Barcodes except Whitelist: 26208
[2018-07-19 18:26:18.504] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify
[2018-07-19 18:26:18.505] [alevinLog] [info] parsing read library format
[2018-07-19 18:26:18.632] [stderrLog] [info] Loading Suffix Array
[2018-07-19 18:26:18.641] [stderrLog] [info] Loading Transcript Info
[2018-07-19 18:26:18.647] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-07-19 18:26:18.648] [stderrLog] [info] There were 179 set bits in the bit array
[2018-07-19 18:26:18.648] [stderrLog] [info] Computing transcript lengths
[2018-07-19 18:26:18.648] [stderrLog] [info] Waiting to finish loading hash
[2018-07-19 18:26:18.720] [stderrLog] [info] Done loading index
[2018-07-19 18:26:18.506] [jointLog] [info] There is 1 library.
[2018-07-19 18:26:18.629] [jointLog] [info] Loading Quasi index
[2018-07-19 18:26:18.631] [jointLog] [info] Loading 32-bit quasi index
[2018-07-19 18:26:18.720] [jointLog] [info] done
[2018-07-19 18:26:18.720] [jointLog] [info] Index contained 179 targets
[2018-07-19 18:26:18.728] [alevinLog] [error] Barcode not found in frequency table
Desktop (please complete the following information):
$ uname -a
Linux login1 3.0.101-0.47.86.1.11753.0.PTF-default #1 SMP Wed Oct 19 14:11:00 UTC 2016 (56c73f1) x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 11 (x86_64)
Release: 11
Codename: n/a
Additional context
I included a 10K subset of reads in the tarball, which leads to the same behavior by Alevin.