long running time with space separated list of files as input #329

Closed
rbenel opened this issue Dec 11, 2018 · 17 comments
Labels
alevin issue is primarily related to alevin fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master

Comments

@rbenel

rbenel commented Dec 11, 2018

alevin (single-cell mode)

I am trying to run alevin using a space-separated list of 20 files as input. The fastq files we received from sequencing were split arbitrarily to keep each one at about 200 MB, but they are all from the same sample and I wish to treat them as one library. No error is produced, but the run has been going for ~15 hours and the log files are blank. As a side note, running each "pair" on its own works just fine.

v0.12.1
compiled from source
OS - Ubuntu Linux, x86_64 x86_64 x86_64 GNU/Linux

Alevin is supposed to be able to run with multiple read files, as specified here: https://github.com/COMBINE-lab/salmon/blob/master/doc/source/salmon.rst#providing-multiple-read-files-to-salmon

Logs will be written to /alevin_outputSingleLibrary/quantSC/logs
### alevin (dscRNA-seq quantification) v0.12.1
### [ program ] => salmon 
### [ command ] => alevin 
### [ index ] => { AlevinIndex/ }
### [ libType ] => { ISR }
### [ mates1 ] => { 12_CTTGTA_L001_R1_001.fastq.gz 12_CTTGTA_L001_R1_002.fastq.gz 12_CTTGTA_L001_R1_003.fastq.gz 12_CTTGTA_L001_R1_004.fastq.gz 12_CTTGTA_L001_R1_005.fastq.gz 12_CTTGTA_L001_R1_006.fastq.gz 12_CTTGTA_L001_R1_007.fastq.gz 12_CTTGTA_L001_R1_008.fastq.gz 12_CTTGTA_L001_R1_009.fastq.gz 12_CTTGTA_L001_R1_010.fastq.gz 12_CTTGTA_L002_R1_001.fastq.gz 12_CTTGTA_L002_R1_002.fastq.gz 12_CTTGTA_L002_R1_003.fastq.gz 12_CTTGTA_L002_R1_004.fastq.gz 12_CTTGTA_L002_R1_005.fastq.gz 12_CTTGTA_L002_R1_006.fastq.gz 12_CTTGTA_L002_R1_007.fastq.gz 12_CTTGTA_L002_R1_008.fastq.gz 12_CTTGTA_L002_R1_009.fastq.gz 12_CTTGTA_L002_R1_010.fastq.gz }
### [ mates2 ] => { 12_CTTGTA_L001_R2_001.fastq.gz 12_CTTGTA_L001_R2_002.fastq.gz 12_CTTGTA_L001_R2_003.fastq.gz 12_CTTGTA_L001_R2_004.fastq.gz 12_CTTGTA_L001_R2_005.fastq.gz 12_CTTGTA_L001_R2_006.fastq.gz 12_CTTGTA_L001_R2_007.fastq.gz 12_CTTGTA_L001_R2_008.fastq.gz 12_CTTGTA_L001_R2_009.fastq.gz 12_CTTGTA_L001_R2_010.fastq.gz 12_CTTGTA_L002_R2_001.fastq.gz 12_CTTGTA_L002_R2_002.fastq.gz 12_CTTGTA_L002_R2_003.fastq.gz 12_CTTGTA_L002_R2_004.fastq.gz 12_CTTGTA_L002_R2_005.fastq.gz 12_CTTGTA_L002_R2_006.fastq.gz 12_CTTGTA_L002_R2_007.fastq.gz 12_CTTGTA_L002_R2_008.fastq.gz 12_CTTGTA_L002_R2_009.fastq.gz 12_CTTGTA_L002_R2_010.fastq.gz }
### [ threads ] => { 8 }
### [ celseq2 ] => { }
### [ numCellBootstraps ] => { 100 }
### [ dumpCsvCounts ] => { }
### [ output ] => { alevin_outputSingleLibrary/quantSC }
### [ tgMap ] => { gencode.primary_assembly.v29.tsv }
### [ whitelist ] => { my_barcode.tsv }


[2018-12-11 10:23:56.120] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2018-12-11 10:23:56.120] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2018-12-11 10:23:56.120] [jointLog] [info] Usage of --validateMappings implies use of range factorization. rangeFactorizationBins is being set to 4
[2018-12-11 10:23:56.120] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 1. Setting consensusSlack to 1.
[2018-12-11 10:23:56.120] [jointLog] [info] Using default value of 0.8 for minScoreFraction in Alevin
[2018-12-11 10:23:56.126] [alevinLog] [info] Processing barcodes files (if Present) 

 
processed 32 Million barcodes

any help or advice would be appreciated :)

@k3yavi
Member

k3yavi commented Dec 11, 2018

Hey @rbenel, are you interested in using Alevin's bootstrap estimates? If not, removing the 'numCellBootstraps' flag will dramatically improve the running time.

@rbenel
Author

rbenel commented Dec 11, 2018

Hey @k3yavi, bootstrapping really improved my population studies so I figured I would try it with sc, but I haven't even seen the run get there when I use the multiple files... after processed X Million barcodes there are no more logs produced on-screen..

@k3yavi
Member

k3yavi commented Dec 11, 2018

Oh yes, you are right, it never started mapping, let alone quantification. I'll take a look in a bit. Thanks for raising the issue.

@rbenel
Author

rbenel commented Dec 11, 2018

Correct,

Thanks @k3yavi :)

@k3yavi
Member

k3yavi commented Dec 11, 2018

Hi @rbenel , I just tested it on a couple of datasets we have, it seems to work fine.
Can you check if you can replicate the issue with two pairs and if possible forward some data to replicate the issue?

@k3yavi k3yavi self-assigned this Dec 11, 2018
@k3yavi k3yavi added the alevin issue is primarily related to alevin label Dec 11, 2018
@rbenel
Author

rbenel commented Dec 11, 2018

Hi @k3yavi,
So there is no issue when I manually add 2 pairs. Is there a maximum number of input files?
Below is the simple bash script I use to organize the data and find the correct files:

#!/bin/bash
#this script calls alevin for multiple library pairs of files 

#where salmon located
salmon="/usr/local/bin/salmon"

index="path/to/gencode_annot/AlevinIndex/"
echo ${index}

#output folder path
output_folder="path/to/alevin_outputTest"
echo ${output_folder}

#where the raw files are
samples_folder="path/to/Raw_data/Sample_cells/"

cd ${samples_folder}

sample1=$(ls *R1*.fastq.gz -p | grep -v / | tr '\n' ' ') #this gives us a space-separated list of all the R1 files
#this is from the alevin tutorial: "Often, a single library may be split into multiple FASTA/Q files. Also, sometimes one may wish to quantify multiple
#replicates or samples together, treating them as if they are one library"
echo "Value of sample1:"
echo ${sample1}

echo "Value of sample2:"
sample2=${sample1//R1/R2} #switch R1 to R2 for ALL (//) occurrences to get the matching second mates
echo ${sample2}

tgMap="path/to/gencode.primary_assembly.v29.tsv"
#this is a transcript --> gene map tsv file. Can create this using tximport

whitelist="path/to/my_barcode.tsv"
#a list of true barcodes
salmonCommand="${salmon} alevin -i $index -lISR -1 ${sample1} -2 ${sample2} -p 8 --celseq2 --dumpCsvCounts -o ${output_folder}/quantSC --tgMap ${tgMap} --whitelist ${whitelist}"
#--numCellBootstraps 100
#numCellBootstraps arg -- generates a mean and variance for cell x
#dumpCsvCounts - dumps cell v. transcripts count matrix in csv format
echo ${salmonCommand}
if ${salmonCommand}
then
    touch ${output_folder}/quantSC_complete.txt
else
    touch ${output_folder}/quantSC_failed_to_complete.txt
fi
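
A possible variant of the file-collection step in the script above, offered only as a sketch: it gathers the R1 files into a bash array, derives each R2 name from the file name, and checks that every mate exists before assembling the command. The helper name `build_alevin_cmd` and the dry-run printing are illustrative assumptions; the alevin options are the ones already used in the script.

```shell
#!/bin/bash
# Sketch: assemble the alevin command from paired FASTQ files using arrays.
# build_alevin_cmd is a hypothetical helper; it only prints the command.
build_alevin_cmd() {
    local samples_folder=$1 index=$2 output_folder=$3 tgMap=$4 whitelist=$5
    local -a mates1=() mates2=()
    local r1 r2 base

    # Glob expansion is sorted, so R1 and R2 end up in the same order.
    for r1 in "${samples_folder}"/*_R1_*.fastq.gz; do
        [ -e "${r1}" ] || continue        # unmatched glob stays literal; skip it
        base=${r1##*/}                    # file name without the directory
        r2="${samples_folder}/${base/_R1_/_R2_}"   # first _R1_ in the name only
        if [ ! -e "${r2}" ]; then
            echo "missing mate for ${r1}" >&2
            return 1
        fi
        mates1+=("${r1}")
        mates2+=("${r2}")
    done

    if [ "${#mates1[@]}" -eq 0 ]; then
        echo "no R1 files found in ${samples_folder}" >&2
        return 1
    fi

    # Dry run: print the command rather than executing it.
    echo salmon alevin -i "${index}" -l ISR \
        -1 "${mates1[@]}" -2 "${mates2[@]}" \
        -p 8 --celseq2 --dumpCsvCounts \
        -o "${output_folder}/quantSC" \
        --tgMap "${tgMap}" --whitelist "${whitelist}"
}
```

Matching on `_R1_` rather than `R1` avoids rewriting a sample name that happens to contain `R1`, and the existence check catches a missing mate before the run starts.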

@k3yavi
Member

k3yavi commented Dec 11, 2018

The bash script looks good to me, and I am not aware of any hard limit on the number of input files. I just tested with 24 files as input and it seems to work. It is hard to tell what's wrong without being able to replicate the issue.

@rbenel
Author

rbenel commented Dec 11, 2018

Happy to hear the bash script looks good.
How can I help replicate the issue?

@rob-p
Collaborator

rob-p commented Dec 11, 2018

Does each individual file fail? Is there a particular file pair that fails to run (e.g. an ill-formed file)?

@rbenel
Author

rbenel commented Dec 11, 2018

I have already run them all (successfully) separately as pairs, but for downstream analysis I need them to be a single library, so I thought it would be simpler to run them as multiple input files..?

@k3yavi
Member

k3yavi commented Dec 11, 2018

Hey @rbenel,
If you can replicate the issue on a subset of files and if possible forward the data then I can take a look. I am not really sure what's going wrong here.

@rbenel
Author

rbenel commented Dec 11, 2018

I have narrowed down the issue, the first 8 files are fine, adding a 9th reproduces the issue.
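
This kind of narrowing can be automated by running a command on the first N files for growing N, under a time limit, and reporting the first N that fails or times out. The sketch below assumes GNU coreutils `timeout` is installed; `first_failing_prefix` is a hypothetical helper, and the command passed to it must be an executable (for example a small wrapper script around the real run), not a shell function.

```shell
#!/bin/bash
# Sketch: find the smallest N for which a command fails or exceeds a time
# limit when given the first N files.
# Usage: first_failing_prefix LIMIT CMD FILE...
#   LIMIT  duration understood by timeout(1), e.g. 30m
#   CMD    executable that is invoked as: CMD FILE_1 ... FILE_N
first_failing_prefix() {
    local limit=$1 cmd=$2
    shift 2
    local -a files=("$@")
    local n

    for ((n = 1; n <= ${#files[@]}; n++)); do
        # timeout returns non-zero if CMD fails or is killed at the limit.
        if ! timeout "${limit}" "${cmd}" "${files[@]:0:n}" >/dev/null 2>&1; then
            echo "${n}"
            return 0
        fi
    done
    return 1    # every prefix succeeded within the limit
}
```

The limit has to be comfortably longer than a healthy run on the full set of files, otherwise a slow but working run is reported as a failure.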

Will send you a link with a subset of the files

@k3yavi k3yavi added the fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master label Dec 12, 2018
@k3yavi
Member

k3yavi commented Dec 12, 2018

Hi @rbenel ,
Thanks a lot for bringing this to our attention and forwarding the relevant data to replicate the issue.
Apparently it was an extremely complicated and rare corner case, but thanks to @rob-p we were able to resolve it.

Since you are compiling Alevin from source, our latest commit (c3eeec9) on the develop branch should solve your issue. We will merge the fix into master in the next release or as a hot-fix at some later point.

@rbenel
Author

rbenel commented Dec 12, 2018

Hi @k3yavi,
My local repository now contains the latest commit, and the run proceeds past the processed X Million barcodes stage. However, I have been stuck at Analyzed 95 cells (100% of all) for the past few hours. I am sorry this is giving you guys such issues :(

I also tried this on cat *R1*.fq.gz of the files, and had the same issue.
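
For reference, concatenating gzip files directly yields a valid multi-member gzip file, so the splits can be merged without decompressing them. A minimal sketch, with a hypothetical helper name `merge_gz`:

```shell
#!/bin/bash
# Sketch: merge gzip-compressed FASTQ splits into a single file.
# A concatenation of gzip files is itself a valid gzip file.
# merge_gz is a hypothetical helper: merge_gz OUTPUT INPUT...
merge_gz() {
    local out=$1
    shift
    cat -- "$@" > "${out}"
}
```

The inputs must be given in the same order for R1 and R2 so that mates stay aligned; a sorted glob such as `*_R1_*.fastq.gz` and its `_R2_` counterpart gives matching orders when the file names differ only in that token.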
alevin.log
salmon_quant.log

Logs will be written to path/to/alevin_outputSingleLibrary/quantSC/logs
Check for upgrades manually at https://combine-lab.github.io/salmon
[2018-12-12 15:07:42.022] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
### alevin (dscRNA-seq quantification) v0.12.1
### [ program ] => salmon 
### [ command ] => alevin 
### [ index ] => { /path/to/gencode_annot/AlevinIndex/ }
### [ libType ] => { ISR }
### [ mates1 ] => { 12_CTTGTA_L001_R1_001.fastq.gz 12_CTTGTA_L001_R1_002.fastq.gz 12_CTTGTA_L001_R1_003.fastq.gz 12_CTTGTA_L001_R1_004.fastq.gz 12_CTTGTA_L001_R1_005.fastq.gz 12_CTTGTA_L001_R1_006.fastq.gz 12_CTTGTA_L001_R1_007.fastq.gz 12_CTTGTA_L001_R1_008.fastq.gz 12_CTTGTA_L001_R1_009.fastq.gz 12_CTTGTA_L001_R1_010.fastq.gz 12_CTTGTA_L002_R1_001.fastq.gz 12_CTTGTA_L002_R1_002.fastq.gz 12_CTTGTA_L002_R1_003.fastq.gz 12_CTTGTA_L002_R1_004.fastq.gz 12_CTTGTA_L002_R1_005.fastq.gz 12_CTTGTA_L002_R1_006.fastq.gz 12_CTTGTA_L002_R1_007.fastq.gz 12_CTTGTA_L002_R1_008.fastq.gz 12_CTTGTA_L002_R1_009.fastq.gz 12_CTTGTA_L002_R1_010.fastq.gz }
### [ mates2 ] => { 12_CTTGTA_L001_R2_001.fastq.gz 12_CTTGTA_L001_R2_002.fastq.gz 12_CTTGTA_L001_R2_003.fastq.gz 12_CTTGTA_L001_R2_004.fastq.gz 12_CTTGTA_L001_R2_005.fastq.gz 12_CTTGTA_L001_R2_006.fastq.gz 12_CTTGTA_L001_R2_007.fastq.gz 12_CTTGTA_L001_R2_008.fastq.gz 12_CTTGTA_L001_R2_009.fastq.gz 12_CTTGTA_L001_R2_010.fastq.gz 12_CTTGTA_L002_R2_001.fastq.gz 12_CTTGTA_L002_R2_002.fastq.gz 12_CTTGTA_L002_R2_003.fastq.gz 12_CTTGTA_L002_R2_004.fastq.gz 12_CTTGTA_L002_R2_005.fastq.gz 12_CTTGTA_L002_R2_006.fastq.gz 12_CTTGTA_L002_R2_007.fastq.gz 12_CTTGTA_L002_R2_008.fastq.gz 12_CTTGTA_L002_R2_009.fastq.gz 12_CTTGTA_L002_R2_010.fastq.gz }
### [ threads ] => { 8 }
### [ celseq2 ] => { }
### [ dumpCsvCounts ] => { }
### [ output ] => { /path/to/alevin_outputSingleLibrary/quantSC }
### [ tgMap ] => { /path/to/gencode_annot/gencode.primary_assembly.v29.tsv }
### [ whitelist ] => { /path/to/salmon/my_barcode.tsv }


[2018-12-12 15:07:42.022] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2018-12-12 15:07:42.022] [jointLog] [info] Usage of --validateMappings implies use of range factorization. rangeFactorizationBins is being set to 4
[2018-12-12 15:07:42.022] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 1. Setting consensusSlack to 1.
[2018-12-12 15:07:42.022] [jointLog] [info] Using default value of 0.8 for minScoreFraction in Alevin
[2018-12-12 15:07:42.028] [alevinLog] [info] Processing barcodes files (if Present) 

 
processed 74 Million barcodes

[2018-12-12 15:08:51.135] [alevinLog] [info] Done barcode density calculation.
[2018-12-12 15:08:51.135] [alevinLog] [info] # Barcodes Used: 74376522 / 74376522.
[2018-12-12 15:08:51.141] [alevinLog] [info] Done importing white-list Barcodes
[2018-12-12 15:08:51.141] [alevinLog] [warning] Skipping 1 Barcodes with 0 reads
 Assuming this is the required behavior.
[2018-12-12 15:08:51.141] [alevinLog] [info] Total 95 white-listed Barcodes
[2018-12-12 15:08:51.144] [alevinLog] [info] Done populating Z matrix
[2018-12-12 15:08:51.146] [alevinLog] [info] Done indexing Barcodes
[2018-12-12 15:08:51.146] [alevinLog] [info] Total Unique barcodes found: 4096
[2018-12-12 15:08:51.146] [alevinLog] [info] Used Barcodes except Whitelist: 1864
[2018-12-12 15:08:51.272] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-12-12 15:08:51.272] [alevinLog] [info] parsing read library format
[2018-12-12 15:08:51.375] [stderrLog] [info] Loading Suffix Array 
[2018-12-12 15:08:51.272] [jointLog] [info] There is 1 library.
[2018-12-12 15:08:51.375] [jointLog] [info] Loading Quasi index
[2018-12-12 15:08:51.375] [jointLog] [info] Loading 32-bit quasi index
[2018-12-12 15:09:10.216] [stderrLog] [info] Loading Transcript Info 
[2018-12-12 15:09:15.719] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-12-12 15:09:16.330] [stderrLog] [info] There were 205,870 set bits in the bit array
[2018-12-12 15:09:16.343] [stderrLog] [info] Computing transcript lengths
[2018-12-12 15:09:16.343] [stderrLog] [info] Waiting to finish loading hash
[2018-12-12 15:09:21.460] [stderrLog] [info] Done loading index
[2018-12-12 15:09:21.460] [jointLog] [info] done
[2018-12-12 15:09:21.460] [jointLog] [info] Index contained 205,870 targets




processed 0 Million fragments
processed 1 Million fragments
processed 1 Million fragments
..............
processed 74 Million fragments
hits: 111594303, hits per frag:  1.50848
[2018-12-12 15:12:07.666] [jointLog] [info] Thread saw mini-batch with a maximum of 5.34% zero probability fragments
[2018-12-12 15:12:07.677] [jointLog] [info] Thread saw mini-batch with a maximum of 5.48% zero probability fragments




[2018-12-12 15:12:07.721] [jointLog] [info] Computed 173,365 rich equivalence classes for further processing
[2018-12-12 15:12:07.721] [jointLog] [info] Counted 27,831,508 total reads in the equivalence classes 
[2018-12-12 15:12:07.721] [jointLog] [warning] Found 31347 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2018-12-12 15:12:07.721] [jointLog] [info] Mapping rate = 37.4197%

[2018-12-12 15:12:07.721] [jointLog] [info] finished quantifyLibrary()
[2018-12-12 15:12:07.904] [alevinLog] [info] Starting optimizer


Analyzed 7 cells (7% of all).
Analyzed 8 cells (8% of all).
Analyzed 9 cells (9% of all).
Analyzed 10 cells (11% of all).
Analyzed 11 cells (12% of all).
Analyzed 12 cells (13% of all).
Analyzed 13 cells (14% of all).
Analyzed 14 cells (15% of all).
Analyzed 15 cells (16% of all).
Analyzed 16 cells (17% of all).
Analyzed 17 cells (18% of all).
Analyzed 18 cells (19% of all).
Analyzed 19 cells (20% of all).
Analyzed 20 cells (21% of all).
Analyzed 21 cells (22% of all).
Analyzed 22 cells (23% of all).
Analyzed 23 cells (24% of all).
Analyzed 24 cells (25% of all).
Analyzed 25 cells (26% of all).
Analyzed 26 cells (27% of all).
Analyzed 27 cells (28% of all).
Analyzed 28 cells (29% of all).
Analyzed 29 cells (31% of all).
Analyzed 30 cells (32% of all).
Analyzed 31 cells (33% of all).
Analyzed 32 cells (34% of all).
Analyzed 33 cells (35% of all).
Analyzed 34 cells (36% of all).
Analyzed 35 cells (37% of all).
Analyzed 36 cells (38% of all).
Analyzed 37 cells (39% of all).
Analyzed 38 cells (40% of all).
Analyzed 39 cells (41% of all).
Analyzed 40 cells (42% of all).
Analyzed 41 cells (43% of all).
Analyzed 42 cells (44% of all).
Analyzed 43 cells (45% of all).
Analyzed 44 cells (46% of all).
Analyzed 45 cells (47% of all).
Analyzed 46 cells (48% of all).
Analyzed 47 cells (49% of all).
Analyzed 48 cells (51% of all).
Analyzed 49 cells (52% of all).
Analyzed 50 cells (53% of all).
Analyzed 51 cells (54% of all).
Analyzed 52 cells (55% of all).
Analyzed 53 cells (56% of all).
Analyzed 54 cells (57% of all).
Analyzed 55 cells (58% of all).
Analyzed 56 cells (59% of all).
Analyzed 57 cells (60% of all).
Analyzed 58 cells (61% of all).
Analyzed 59 cells (62% of all).
Analyzed 60 cells (63% of all).
Analyzed 61 cells (64% of all).
Analyzed 62 cells (65% of all).
Analyzed 63 cells (66% of all).
Analyzed 64 cells (67% of all).
Analyzed 65 cells (68% of all).
Analyzed 66 cells (69% of all).
Analyzed 67 cells (71% of all).
Analyzed 68 cells (72% of all).
Analyzed 69 cells (73% of all).
Analyzed 70 cells (74% of all).
Analyzed 71 cells (75% of all).
Analyzed 72 cells (76% of all).
Analyzed 73 cells (77% of all).
Analyzed 74 cells (78% of all).
Analyzed 75 cells (79% of all).
Analyzed 76 cells (80% of all).
Analyzed 77 cells (81% of all).
Analyzed 78 cells (82% of all).
Analyzed 79 cells (83% of all).
Analyzed 80 cells (84% of all).
Analyzed 81 cells (85% of all).
Analyzed 82 cells (86% of all).
Analyzed 83 cells (87% of all).
Analyzed 84 cells (88% of all).
Analyzed 85 cells (89% of all).
Analyzed 86 cells (91% of all).
Analyzed 87 cells (92% of all).
Analyzed 88 cells (93% of all).
Analyzed 89 cells (94% of all).
Analyzed 90 cells (95% of all).
Analyzed 91 cells (96% of all).
Analyzed 92 cells (97% of all).
Analyzed 93 cells (98% of all).
Analyzed 94 cells (99% of all).
Analyzed 95 cells (100% of all).
Analyzed 95 cells (100% of all).
Analyzed 95 cells (100% of all).
Analyzed 95 cells (100% of all).
Analyzed 95 cells (100% of all).
Analyzed 95 cells (100% of all).

Thanks!!

@k3yavi
Member

k3yavi commented Dec 12, 2018

Hi @rbenel ,
Interestingly, I observe similar behavior. Unlike the last issue, this time the threads are not in the sleep state and the processors are working. My guess is that this is due to the relatively complex UMI network of celseq2 data. I'd say let it run for a while and see what the overall turnaround is. I'll work on the dataset you provided earlier to analyze the UMI network and will let you know once we make some progress, although a thorough analysis of this one might take a bit longer.
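
To tell a stalled run from a busy one, the scheduler state of each thread can be inspected. A sketch for Linux that reads `/proc`; `thread_states` is a hypothetical helper, and the PID of the running process is supplied by the caller:

```shell
#!/bin/bash
# Sketch (Linux only): count the threads of a process by scheduler state.
# Common states: R = running, S = sleeping, D = uninterruptible sleep.
# thread_states is a hypothetical helper: thread_states PID
thread_states() {
    local pid=$1
    # Each thread has a status file with a line such as "State: S (sleeping)".
    awk '/^State:/ { count[$2]++ }
         END { for (s in count) print s, count[s] }' \
        "/proc/${pid}/task"/*/status
}
```

If all threads stay in `S` over repeated checks the process is probably waiting on something, whereas several threads in `R` suggest it is still computing.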

@rbenel
Author

rbenel commented Dec 13, 2018

It finished running after a few hours!

Thanks for all of the help!

@k3yavi
Member

k3yavi commented Dec 16, 2018

Hi @rbenel ,
I think the issue with the long waiting time before mapping starts has been solved, so I am closing this issue. I'll open a new issue about the complex UMI network connectivity graph for celseq2 and tag you for more analysis.
