memory error while generating bus file with large index #16
Comments
The neuron 10k example dataset is from mouse, while you used an intron index from human; that could be the cause of your problem. I built a mouse intron index and used it for the neuron 10k dataset and did not get this issue. I also built a human intron index and used it for the 5k PBMC dataset (which has both RNA and protein quantification; I only used the RNA part here) and did not get this problem either. Here's the output I got:
The number of equivalence classes did not blow up.
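For reference, a minimal sketch of how such a mouse intron index can be built with BUSpaRse, assuming the mm10 BSgenome and an Ensembl mouse annotation from AnnotationHub (the annotation release and read length are assumptions, not necessarily what was used here):

```r
# Sketch only: the BUSpaRse workflow for a mouse intron index.
# Assumes BSgenome.Mmusculus.UCSC.mm10 and an Ensembl EnsDb from AnnotationHub;
# read length L = 91 assumes 10x v3 chemistry.
library(BUSpaRse)
library(BSgenome.Mmusculus.UCSC.mm10)
library(AnnotationHub)

ah <- AnnotationHub()
# Pull an Ensembl-based mouse annotation (which record to use is an assumption).
edb <- query(ah, c("EnsDb", "Mus musculus"))[[1]]

# Writes cDNA_introns.fa (cDNA plus flanked introns), the two capture lists,
# and tr2g.tsv into out_path.
get_velocity_files(
  X = edb, L = 91,
  Genome = BSgenome.Mmusculus.UCSC.mm10,
  out_path = "./velocity_mm10",
  isoform_action = "separate"
)

# Build the kallisto index from the combined cDNA + intron fasta (CLI step).
system("kallisto index -i ./velocity_mm10/cDNA_introns.idx ./velocity_mm10/cDNA_introns.fa")
```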
Thank you. Would you also mind sharing the code you used to generate the human intron index?
I used the same code as you did to generate the human intron index.
Did you also use the BSgenome.Hsapiens.UCSC.hg38 genome to make the cDNA_introns.fa file? Or did you just use an RNA version of the genome for:
Otherwise I am not sure why my index is so large (7 GB). Thank you for your help!
I used BSgenome.Hsapiens.UCSC.hg38 for the human intron index; I literally copied and pasted your code. It's not surprising that the index is large, because introns are usually much larger than exons. The kb website provides pre-computed intron indices for human and mouse, and the human one there is also about 7 GB, so you didn't do this step wrong. But pay attention to which species you are using, since the neuron 10k dataset is from mouse.
Yes, I used different code (not shown in my original post) with the human intron index and human PBMCs from one sample (not the mouse demo cells), but still ended up with the large number of equivalence classes. The four fasta files from this sample only totaled about 15 GB.
I realized that the actual kallisto index with introns for human is over 40 GB, and the number of equivalence classes did not blow up. That's the case for the index built from the outputs of BUSpaRse, and it's still the case when I built the index from scratch with the kb wrapper (the pre-computed one is probably compressed). There's probably something wrong with your index. What's the size of your
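For reference, the from-scratch kb build mentioned above looks roughly like this (a sketch: the genome and GTF filenames are placeholders, and older kb-python releases spell the option `--lamanno` rather than `--workflow lamanno`):

```r
# Sketch of building the human RNA velocity index with the kb-python wrapper.
# Filenames are placeholders; this produces the cDNA + intron fasta, the two
# capture lists, the t2g mapping, and the kallisto index in one step.
system(paste(
  "kb ref --workflow lamanno",
  "-i cDNA_introns.idx -g tr2g.txt",
  "-f1 cdna.fa -f2 introns.fa",
  "-c1 cdna_tx_to_capture.txt -c2 introns_tx_to_capture.txt",
  "GRCh38.primary_assembly.genome.fa.gz gencode.v31.annotation.gtf.gz"
))
```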
Hi,
I have been following this tutorial on generating spliced and unspliced matrices for RNA velocity analysis: https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices
where intronic sequences have to be included in the index to differentiate between spliced and unspliced transcripts. The kallisto index (7 GB in size) was made from the genome "BSgenome.Hsapiens.UCSC.hg38".
Then I get this output when I try to generate the initial bus file using this index:
Is there a way to reduce the number of equivalence classes produced and reduce memory usage while retaining the intronic sequences in the index file? I tried submitting this job to a cluster with 200 GB of memory and it still wasn't enough to complete the job.
This was the code I used to make the index:
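(A sketch along the lines of the tutorial, with the human genome swapped in; this is not the verbatim code from the post, and the annotation record and read length are assumptions:)

```r
# Sketch, not the verbatim code from the post: the tutorial's index-building
# steps with the human genome. The EnsDb record and L = 91 (10x v3) are assumptions.
library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
library(AnnotationHub)

# Pull a human Ensembl EnsDb annotation (which record to use is an assumption).
edb_hs <- query(AnnotationHub(), c("EnsDb", "Homo sapiens"))[[1]]

get_velocity_files(
  X = edb_hs, L = 91,
  Genome = BSgenome.Hsapiens.UCSC.hg38,
  out_path = "./velocity_hg38",
  isoform_action = "separate"
)

# Index the combined cDNA + flanked intron fasta (CLI step).
system("kallisto index -i ./velocity_hg38/cDNA_introns.idx ./velocity_hg38/cDNA_introns.fa")
```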
In the tutorial from https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices, they also generate the velocity files using a genome:
since the goal was to "build a kallisto index for cDNAs as reads are pseudoaligned to cDNAs. Here, for RNA velocity, as reads are pseudoaligned to the flanked intronic sequences in addition to the cDNAs, the flanked intronic sequences should also be part of the kallisto index." Was I supposed to do something else? Thank you
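For completeness, the step that runs out of memory is the initial bus generation; it looks roughly like this (a sketch: the chemistry flag and fastq names are placeholders, not the exact command used):

```r
# Sketch of the bus generation step where the equivalence classes and memory
# usage blow up; the -x chemistry flag and fastq filenames are placeholders.
system(paste(
  "kallisto bus -i ./velocity_hg38/cDNA_introns.idx",
  "-o ./bus_output -x 10xv3 -t 8",
  "sample_R1.fastq.gz sample_R2.fastq.gz"
))
```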