Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alevin didn't find barcode in frequency table #253

Closed
habilzare opened this issue Jul 19, 2018 · 13 comments
Closed

Alevin didn't find barcode in frequency table #253

habilzare opened this issue Jul 19, 2018 · 13 comments
Labels
alevin issue is primarily related to alevin feature request

Comments

@habilzare
Copy link

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
alevin

Describe the bug
I am trying to use Alevin to quantify a single cell RNA-Seq 10x Genomics CHROMIUM dataset. I am using Salmon 0.10.2, and it does not produce the count matrix. The output folder contains nothing but the log files. Specifically, I get the following error:
[2018-07-19 16:27:46.916] [alevinLog] [error] Barcode not found in frequency table
The full messages are bellow.

To Reproduce
Steps and data to reproduce the behavior:
I explain how to download the data and reproduce the issue in the following.

Specifically, please provide at least the following information:

  • Which version of salmon was used?
    0.10.2

  • How was salmon installed (compiled, downloaded executable, through bioconda)?
    conda config --append channels conda-forge
    conda config --append channels bioconda
    conda install salmon=0.10.2

  • Which reference (e.g. transcriptome) was used?
    I am interested only in the transposons. Therefore, I am using the "canonical DNA sequences of the transposable elements from species in the genus Drosophila", which are available from the Bergman Lab. Specifically, I use this fasta file. The first 3 lines are:

FBte0000104
GTGACATATCCATAAGTCCCTAAGACTTAAGCATATGCCTACATACTAATACACTTACAA
CACATACACCCCAATACAACATACACTACTCCGGATGTACCCAACAGATACCAGATAAGA
In another study, I have successfully used Salmon to quantify transposon expression from bulk RNA-Seq data with a mapping rate of 2%, which is enough for the kind of analysis that I am interested in.

  • Which read files were used?
    The SRR6327122 run. The fastq files can be downloaded in a couple of hours using the SRA Toolkit, e.g.,
    fastq-dump --split-files --gzip --outdir ./ SRR6327122
    The first few lines of the fastq files are as follows:
    $ zcat SRR6327122_1.fastq.gz |head -n 4
    @SRR6327122.1 1 length=36
    CTGATCCTCGAGAGCACTCACAGAGTTTTTTGTTTT
    +SRR6327122.1 1 length=36
    AAFFFJJJJJJJJJJJJJJJJJJJJJ7-----A<-<
    And
    $ zcat SRR6327122_2.fastq.gz |head -n 4
    @SRR6327122.1 1 length=88
    CGGCCACAAGATCGCCTTTTTATCCCTCGCCCAGAGCACCCCCCGACCCCACATCCCCTGCTTCACGGCCCCCCTCGCGGCCTACCCG
    +SRR6327122.1 1 length=88
    --7-<7----7---77----77A-7--7-A7-7---7-A-A7<F-777-77-A---A<A----77--77------------7------

  • Which which program options were used?
    One can download the data and results tarball
    bcNotFound-2018-07-19.tar.gz.
    First, I indexed the reference using:
    salmon index -k 31 -t transposon_sequence_set.fa -i index
    Then, I ran Alevin using the following command:
    salmon alevin -l ISR -1 SRR6327122_1.fastq.gz -2 SRR6327122_2.fastq.gz --chromium -i index -p 2 -o alevin_output --tgMap transposon_sequence_set.fa.tsv --whitelist cell_barcode_seq.txt --dumpCsvCounts

Expected behavior
A clear and concise description of what you expected to happen.
The count matrix be saved when Alevin is done.

Screenshots
The full output messages are:
Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Logs will be written to alevin_output/logs

salmon (single-cell-based) v0.10.2

[ program ] => salmon

[ command ] => alevin

[ libType ] => { ISR }

[ mates1 ] => { SRR6327122_1.fastq.gz }

[ mates2 ] => { SRR6327122_2.fastq.gz }

[ chromium ] => { }

[ index ] => { index }

[ threads ] => { 2 }

[ output ] => { alevin_output }

[ tgMap ] => { transposon_sequence_set.fa.tsv }

[ whitelist ] => { cell_barcode_seq.txt }

[ dumpCsvCounts ] => { }

[2018-07-19 18:24:03.053] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2018-07-19 18:24:03.059] [alevinLog] [info] Processing barcodes files (if Present)

processed 87 Million barcodes

[2018-07-19 18:26:13.307] [alevinLog] [info] Done barcode density calculation.
[2018-07-19 18:26:13.307] [alevinLog] [info] # Barcodes Used: 86885223 / 87959276.
[2018-07-19 18:26:13.334] [alevinLog] [info] Done importing white-list Barcodes
[2018-07-19 18:26:13.334] [alevinLog] [info] Total 54879 white-listed Barcodes
[2018-07-19 18:26:18.285] [alevinLog] [info] Done populating Z matrix
[2018-07-19 18:26:18.300] [alevinLog] [info] Done indexing Barcodes
[2018-07-19 18:26:18.301] [alevinLog] [info] Total Unique barcodes found: 978816
[2018-07-19 18:26:18.301] [alevinLog] [info] Used Barcodes except Whitelist: 26208
[2018-07-19 18:26:18.504] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-07-19 18:26:18.505] [alevinLog] [info] parsing read library format
[2018-07-19 18:26:18.632] [stderrLog] [info] Loading Suffix Array
[2018-07-19 18:26:18.641] [stderrLog] [info] Loading Transcript Info
[2018-07-19 18:26:18.647] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-07-19 18:26:18.648] [stderrLog] [info] There were 179 set bits in the bit array
[2018-07-19 18:26:18.648] [stderrLog] [info] Computing transcript lengths
[2018-07-19 18:26:18.648] [stderrLog] [info] Waiting to finish loading hash
[2018-07-19 18:26:18.720] [stderrLog] [info] Done loading index
[2018-07-19 18:26:18.506] [jointLog] [info] There is 1 library.
[2018-07-19 18:26:18.629] [jointLog] [info] Loading Quasi index
[2018-07-19 18:26:18.631] [jointLog] [info] Loading 32-bit quasi index
[2018-07-19 18:26:18.720] [jointLog] [info] done
[2018-07-19 18:26:18.720] [jointLog] [info] Index contained 179 targets
[2018-07-19 18:26:18.728] [alevinLog] [error] Barcode not found in frequency table

Desktop (please complete the following information):

  • OS: Linux
  • Version:
    $ uname -a Linux login1 3.0.101-0.47.86.1.11753.0.PTF-default #1 SMP Wed Oct 19 14:11:00 UTC 2016 (56c73f1) x86_64 x86_64 x86_64 GNU/Linux
    $ lsb_release -a LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch Distributor ID: SUSE LINUX Description: SUSE Linux Enterprise Server 11 (x86_64) Release: 11 Codename: n/a

Additional context
I included a 10K subset of reads in the tarball, which leads to the same behavior by Alevin.

@k3yavi k3yavi added the alevin issue is primarily related to alevin label Jul 20, 2018
@k3yavi
Copy link
Member

k3yavi commented Jul 20, 2018

Hi @habilzare ,
Thanks a lot for using alevin and sharing a subset of the dataset which can reproduce the error effortlessly, it really saves us a lot of time !

If I may ask a quick question, how did we get the file cell_barcode_seq.txt for whitelist ?
It looks like the file contains ~50k whitelisted CB, and only a very tiny fraction of them were actually found in the read/seq Fastq file.

I agree that we can work on putting a more informative error so that user can resolve it by themselves.
The current assumption for alevin for whitelist CB is -- if provided with external whitelist then user is sure of the presence of the CB w/ good enough frequency.

@k3yavi
Copy link
Member

k3yavi commented Jul 20, 2018

Thinking more about it, we can actually throw away a whitelisted CB with 0 frequency w/ a warning .
Once we discuss this w/ the alevin team, I'd happy to add this filter to the alevin pipeline .

Thanks again @habilzare for pointing this out .

@habilzare
Copy link
Author

Hi @k3yavi ,
Thanks for looking at this isssue. The CB whitelist corresponds to a larger study, which includes the cells in these fastq files and many more. I'll rerun with only the 2K-4K cells that are related to this specific sample, and then I send you and update.

@habilzare
Copy link
Author

habilzare commented Jul 20, 2018

Hi @k3yavi ,
Using the 5238 barcodes that are specific to this experiment, and also removing the quotes from the transcript to gene map file bcNotFound-2018-07-19b.tar.gz, this time Alevin finished with no error. However, I did not get a count matrix in csv format. Also, the quants_mat_cols.txt file is missing, and I do not know how to read the binary quants.mat file.

salmon alevin -l ISR -1 SRR6327122_1.fastq.gz -2 SRR6327122_2.fastq.gz --chromium -i index -p 2 -o alevin_output --tgMap transposon_sequence_set.fa.tsv --whitelist barcode_seq_5K.txt --dumpCsvCounts
Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Logs will be written to alevin_output/logs
[2018-07-19 22:53:27.709] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.

salmon (single-cell-based) v0.10.2

[ program ] => salmon

[ command ] => alevin

[ libType ] => { ISR }

[ mates1 ] => { SRR6327122_1.fastq.gz }

[ mates2 ] => { SRR6327122_2.fastq.gz }

[ chromium ] => { }

[ index ] => { index }

[ threads ] => { 2 }

[ output ] => { alevin_output }

[ tgMap ] => { transposon_sequence_set.fa.tsv }

[ whitelist ] => { barcode_seq_5K.txt }

[ dumpCsvCounts ] => { }

[2018-07-19 22:53:27.714] [alevinLog] [info] Processing barcodes files (if Present)

processed 87 Million barcodes

[2018-07-19 22:55:37.299] [alevinLog] [info] Done barcode density calculation.
[2018-07-19 22:55:37.299] [alevinLog] [info] # Barcodes Used: 86885223 / 87959276.
[2018-07-19 22:55:37.303] [alevinLog] [info] Done importing white-list Barcodes
[2018-07-19 22:55:37.303] [alevinLog] [info] Total 5238 white-listed Barcodes
[2018-07-19 22:55:37.675] [alevinLog] [info] Done populating Z matrix
[2018-07-19 22:55:37.683] [alevinLog] [info] Done indexing Barcodes
[2018-07-19 22:55:37.683] [alevinLog] [info] Total Unique barcodes found: 978816
[2018-07-19 22:55:37.683] [alevinLog] [info] Used Barcodes except Whitelist: 20705
[2018-07-19 22:55:38.386] [jointLog] [info] There is 1 library.
[2018-07-19 22:55:38.493] [jointLog] [info] Loading Quasi index
[2018-07-19 22:55:38.494] [jointLog] [info] Loading 32-bit quasi index
[2018-07-19 22:55:38.549] [jointLog] [info] done
[2018-07-19 22:55:38.549] [jointLog] [info] Index contained 179 targets

[2018-07-19 22:55:38.385] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-07-19 22:55:38.385] [alevinLog] [info] parsing read library format
[2018-07-19 22:55:38.495] [stderrLog] [info] Loading Suffix Array
[2018-07-19 22:55:38.498] [stderrLog] [info] Loading Transcript Info
[2018-07-19 22:55:38.499] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-07-19 22:55:38.500] [stderrLog] [info] There were 179 set bits in the bit array
[2018-07-19 22:55:38.501] [stderrLog] [info] Computing transcript lengths
[2018-07-19 22:55:38.501] [stderrLog] [info] Waiting to finish loading hash
processed 87 Million fragmentserrLog] [info] Done loading index
hits: 468892, hits per frag: 0.00535907

[2018-07-19 23:03:35.740] [jointLog] [info] Computed 150 rich equivalence classes for further processing
[2018-07-19 23:03:35.740] [jointLog] [info] Counted 412868 total reads in the equivalence classes
[2018-07-19 23:03:35.741] [jointLog] [warning] Only 412868 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2018-07-19 23:03:35.741] [jointLog] [info] Mapping rate = 0.469385%

[2018-07-19 23:03:35.741] [jointLog] [info] finished quantifyLibrary()
[2018-07-19 23:03:35.755] [alevinLog] [info] Starting optimizer

Analyzed 5238 cells (100% of all).
Skipped Barcodes are from High Confidence Region
$ls -ltrha alevin_output/alevin/
total 256K
drwxrwx--- 6 zare G-816158 4.0K Jul 19 22:36 ..
-rw-rw---- 1 zare G-816158 960 Jul 19 23:03 alevin.log
drwxrwx--- 2 zare G-816158 4.0K Jul 19 23:03 .
-rw-rw---- 1 zare G-816158 81K Jul 19 23:03 quants_mat_rows.txt
-rw-rw---- 1 zare G-816158 160K Jul 19 23:03 quants_mat.gz

@k3yavi
Copy link
Member

k3yavi commented Jul 20, 2018

Hi @habilzare ,
re: importing binary format: this tutorial might help more of how to read the binary output. Specifically, if you are familiar with python then this function can be used for importing the binary.

However, I reckon that alevin hasn't successfully completed. It looks like one of the whitelisted CB ended up having no read at all after deduplicating. We usually assume it should not happen since a whitelisted CB should have at least some count after deduplicating UMIs, which in your case can happen since the mapping rate looks like is surprisingly low 0.469385% (is this expected ?).

Skipped Barcodes are from High Confidence Region is actually an error, again we need more informative and elegant exit from the pipeline once Alevin fails.

@k3yavi
Copy link
Member

k3yavi commented Jul 20, 2018

Slight Correction on the above statement "It looks like one of the whitelisted CB ended up having no read at all after deduplicating mapping." We might have to tweak a bit in the current version of Alevin for use cases like yours where we don't wan't pipeline to fail if the whitelisted CB are either w/ less frequency / mapping rate / deduplication rate.

@habilzare
Copy link
Author

The low rate of mapping is expected because in this study I am interested only in the transposons.
Thanks @k3yavi for requesting the feature. I need this analysis to be done very urgently. Until you add the feature, I am going to implement the following workaround:
I will create a pair of dummy fastq files that have exactly 1 read for each barcode in my whitelist. I will add these to my current fastq file when running Alevin. After I get the count matrix, I will subtract 1 from each entry.

@k3yavi
Copy link
Member

k3yavi commented Jul 20, 2018

Yep that should work too, although a couple of minor things, you might wanna use 10 reads since that's the lower bound. Secondly, those reads should be mapped too, I guess copy the 98 length sequence from the transposons's region. Lastly the UMI, if the UMI sequence overlap w/ already present UMI sequence then it can potentially effect the deduplication of cell having more than 10 counts, you might wanna chose a disjoint UMI sequence not in your dataset and reduce the count by 1 in the count matrix since if all newly added UMI are same and get mapped to same gene then it will be deduplicated to 1.

@habilzare
Copy link
Author

I wrote the 'make.dummy.fastq() function in R, which addressed the issue as planned in above.

For some of the samples, I got the "umi indexing of jellyfish failing" error, which I realized from this report must be related to shorter than 26 bps reads. I used trimommatic to exclude those short reads, and finally Alevin produced the count matrix in CSV format for me. Thank you Salmon team!

Minor: The output CSV file has an extra comma at the end of each line, which I ignored in my post-processing step.

@k3yavi
Copy link
Member

k3yavi commented Jul 25, 2018

@habilzare Glad to hear that it worked out and thanks for fast and useful moderation from default alevin pipeline. We will work on correcting the suggestions made in the next release of Salmon.

@k3yavi k3yavi closed this as completed Jul 25, 2018
@patrickvdb
Copy link
Contributor

Hi, I have the exact same issue, in that it is quite likely that some barcodes will be empty and I don't obtain any results now.
It seems that you (@k3yavi) closed the feature request when you closed this issue. was that your intention? I'll try to get the work around working, but it would be nice if there is a flag we could set in the future to ignore this error when running salmon.

@k3yavi
Copy link
Member

k3yavi commented Aug 3, 2018

Hi @patrickvdb , Thanks for pointing this out.
My impression was there were two issues here:

  1. The presence of whitelist CB w/ 0 frequency. (Subject of this issue).
  2. The presence of whitelist CB w/ 0 mapping reads.

The solution of the first problem is pretty straight forward and since it was the subject of the issue I thought closing this one and opening a new w/ the second problem would be better thing to do.
We are in the process of fixing these two issues and my plan was to open and close a new issue w/ the fix but you are right, either this issue should not have been closed or a new should have been created. We'll update this soon, thanks for bringing this to our attention.

@k3yavi
Copy link
Member

k3yavi commented Aug 12, 2018

Just a heads up, issue #266 has been added and the solution is currently available in the source build from the develop branch. We will include this to master with the next planned release of Salmon v0.11.3. Thanks again for the useful feedbacks and comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alevin issue is primarily related to alevin feature request
Projects
None yet
Development

No branches or pull requests

3 participants