memory error while generating bus file with large index #16

tangybat · 2022-08-21T04:15:31Z

Hi,

I have been following this tutorial on generating spliced and unspliced matrices for RNA velocity analysis: https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices

where intronic sequences have to be included in the index to differentiate between spliced and unspliced transcripts. The kallisto index (7 Gb in size) was made from the genome "BSgenome.Hsapiens.UCSC.hg38".

Then I get this output when I try to generate the initial bus file using this index:

kallisto bus -i ../../output/mm_cDNA_introns_97.idx
-o ../../output/neuron10k_velocity -x 10xv3 -t8
neuron_10k_v3_S1_L002_R1_001.fastq.gz neuron_10k_v3_S1_L002_R2_001.fastq.gz
neuron_10k_v3_S1_L001_R1_001.fastq.gz neuron_10k_v3_S1_L001_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 4,890,909,195,324,358,656
terminate called after throwing an instance of 'std::length_error'

Is there a way to reduce the number of equivalence classes produced and reduce memory usage while retaining the intronic sequences in the index file? I tried submitting this job to a cluster with 200Gb of memory and it still wasn't enough to complete the job.

This was the code I used to make the index:

ah <- AnnotationHub() 
query(ah, pattern = c("Ensembl", "97", "Homo sapiens", "EnsDb"))

AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH73881
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based o...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
#   "Gene", "Protein", "Transcript") 
# retrieve record with 'object[["AH73881"]]' 

edb <- ah[["AH73881"]]
library(BSgenome.Hsapiens.UCSC.hg38)
library(BUSpaRse)
get_velocity_files(ah[["AH73881"]], L = 91, Genome = BSgenome.Hsapiens.UCSC.hg38, 
                   out_path = "./16_Azi_velocity", 
                   isoform_action = "separate")
kallisto index -i ./hg_cDNA_introns_97.idx ./16_Azi_velocity/cDNA_introns.fa

In the tutorial from https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices
, they also generate the velocity files using a genome:

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")

since the goal was to "build a kallisto index for cDNAs as reads are pseudoaligned to cDNAs. Here, for RNA velocity, as reads are pseudoaligned to the flanked intronic sequences in addition to the cDNAs, the flanked intronic sequences should also be part of the kallisto index." Was I supposed to do something else? Thank you

The text was updated successfully, but these errors were encountered:

lambdamoses · 2022-08-23T03:40:21Z

The neuron 10k example dataset is from mouse while you used an intron index from human, and that could be the cause of your problem. I built a mouse intron index and used it for the neuron 10k dataset and did not get this issue. I also built a human intron index and used it for the 5k pbmc dataset which simultaneously has RNA and protein quantification and I only used the RNA part here and did not get this problem either. Here's the output I got:

~/kallisto/kallisto bus -i ~/BUSpaRse/hg_cDNA_introns_97.idx -o ~/BUSpaRse/hg_pbmc -x 10xv3 -t8 5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L001_R2_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L002_R1_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L002_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 12,887,284
[quant] will process sample 1: 5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz
                               5k_pbmc_protein_v3_gex_S1_L001_R2_001.fastq.gz
[quant] will process sample 2: 5k_pbmc_protein_v3_gex_S1_L002_R1_001.fastq.gz
                               5k_pbmc_protein_v3_gex_S1_L002_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ...

The number of equivalence classes did not blow up.

tangybat · 2022-08-23T22:58:35Z

Thank you, would you also mind sharing the code you used to generate the human intron index?

lambdamoses · 2022-08-24T00:17:48Z

I used the same code as you did to generate the human intron index.

tangybat · 2022-08-24T01:35:47Z

Did you also use the BSgenome.Hsapiens.UCSC.hg38 genome to make the cDNA_introns.fa file? Or did you just use an RNA version of the genome for:

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")

Otherwise I am not sure as to why my index is so large (7 Gb). Thank you for your help!

lambdamoses · 2022-08-24T02:09:13Z

I used BSgenome.Hsapiens.UCSC.hg38 for the human intron index, literally copied and pasted your code. It's not surprising that the index is large because introns are usually much larger than exons. The kb website provides pre-computed intron indices for human and mouse, and the human one there is also like 7 GB, so you didn't do this step wrong. But pay attention to which species you are using since the neuron 10k dataset is from mouse.

tangybat · 2022-08-24T04:40:32Z

Yes, I used different code not shown in my original post that used the human intron index and human pbmcs from one sample (not the mouse demo cells) but still ended up with the large amount of equivalence classes. The four fasta files from this sample only totaled to about 15Gb.

lambdamoses · 2022-08-24T22:53:25Z

I realized that the actual kallisto index with introns for human is over 40 GB and the number of equivalence classes did not blow up. That's the case for the index built from the outputs of BUSpaRse, and it still is the case when I built the index from scratch with the kb wrapper (the pre-computed one is probably compressed). There's probably something wrong with your index. What's the size of your cDNA_introns.fa file for human? Mine is 8.5 GB.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory error while generating bus file with large index #16

memory error while generating bus file with large index #16

tangybat commented Aug 21, 2022

lambdamoses commented Aug 23, 2022

tangybat commented Aug 23, 2022

lambdamoses commented Aug 24, 2022

tangybat commented Aug 24, 2022

lambdamoses commented Aug 24, 2022

tangybat commented Aug 24, 2022

lambdamoses commented Aug 24, 2022

memory error while generating bus file with large index #16

memory error while generating bus file with large index #16

Comments

tangybat commented Aug 21, 2022

lambdamoses commented Aug 23, 2022

tangybat commented Aug 23, 2022

lambdamoses commented Aug 24, 2022

tangybat commented Aug 24, 2022

lambdamoses commented Aug 24, 2022

tangybat commented Aug 24, 2022

lambdamoses commented Aug 24, 2022