Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory error while generating bus file with large index #16

Open
tangybat opened this issue Aug 21, 2022 · 7 comments
Open

memory error while generating bus file with large index #16

tangybat opened this issue Aug 21, 2022 · 7 comments

Comments

@tangybat
Copy link

Hi,

I have been following this tutorial on generating spliced and unspliced matrices for RNA velocity analysis: https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices

where intronic sequences have to be included in the index to differentiate between spliced and unspliced transcripts. The kallisto index (7 Gb in size) was made from the genome "BSgenome.Hsapiens.UCSC.hg38".

Then I get this output when I try to generate the initial bus file using this index:

kallisto bus -i ../../output/mm_cDNA_introns_97.idx
-o ../../output/neuron10k_velocity -x 10xv3 -t8
neuron_10k_v3_S1_L002_R1_001.fastq.gz neuron_10k_v3_S1_L002_R2_001.fastq.gz
neuron_10k_v3_S1_L001_R1_001.fastq.gz neuron_10k_v3_S1_L001_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 4,890,909,195,324,358,656
terminate called after throwing an instance of 'std::length_error'

Is there a way to reduce the number of equivalence classes produced and reduce memory usage while retaining the intronic sequences in the index file? I tried submitting this job to a cluster with 200Gb of memory and it still wasn't enough to complete the job.

This was the code I used to make the index:

ah <- AnnotationHub() 
query(ah, pattern = c("Ensembl", "97", "Homo sapiens", "EnsDb"))

AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH73881
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based o...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
#   "Gene", "Protein", "Transcript") 
# retrieve record with 'object[["AH73881"]]' 

edb <- ah[["AH73881"]]
library(BSgenome.Hsapiens.UCSC.hg38)
library(BUSpaRse)
get_velocity_files(ah[["AH73881"]], L = 91, Genome = BSgenome.Hsapiens.UCSC.hg38, 
                   out_path = "./16_Azi_velocity", 
                   isoform_action = "separate")
kallisto index -i ./hg_cDNA_introns_97.idx ./16_Azi_velocity/cDNA_introns.fa

In the tutorial from https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices
, they also generate the velocity files using a genome:

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")

since the goal was to "build a kallisto index for cDNAs as reads are pseudoaligned to cDNAs. Here, for RNA velocity, as reads are pseudoaligned to the flanked intronic sequences in addition to the cDNAs, the flanked intronic sequences should also be part of the kallisto index." Was I supposed to do something else? Thank you

@lambdamoses
Copy link
Collaborator

The neuron 10k example dataset is from mouse while you used an intron index from human, and that could be the cause of your problem. I built a mouse intron index and used it for the neuron 10k dataset and did not get this issue. I also built a human intron index and used it for the 5k pbmc dataset which simultaneously has RNA and protein quantification and I only used the RNA part here and did not get this problem either. Here's the output I got:

~/kallisto/kallisto bus -i ~/BUSpaRse/hg_cDNA_introns_97.idx -o ~/BUSpaRse/hg_pbmc -x 10xv3 -t8 5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L001_R2_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L002_R1_001.fastq.gz 5k_pbmc_protein_v3_gex_S1_L002_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 12,887,284
[quant] will process sample 1: 5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz
                               5k_pbmc_protein_v3_gex_S1_L001_R2_001.fastq.gz
[quant] will process sample 2: 5k_pbmc_protein_v3_gex_S1_L002_R1_001.fastq.gz
                               5k_pbmc_protein_v3_gex_S1_L002_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ...

The number of equivalence classes did not blow up.

@tangybat
Copy link
Author

Thank you, would you also mind sharing the code you used to generate the human intron index?

@lambdamoses
Copy link
Collaborator

I used the same code as you did to generate the human intron index.

@tangybat
Copy link
Author

Did you also use the BSgenome.Hsapiens.UCSC.hg38 genome to make the cDNA_introns.fa file? Or did you just use an RNA version of the genome for:

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")

Otherwise I am not sure as to why my index is so large (7 Gb). Thank you for your help!

@lambdamoses
Copy link
Collaborator

I used BSgenome.Hsapiens.UCSC.hg38 for the human intron index, literally copied and pasted your code. It's not surprising that the index is large because introns are usually much larger than exons. The kb website provides pre-computed intron indices for human and mouse, and the human one there is also like 7 GB, so you didn't do this step wrong. But pay attention to which species you are using since the neuron 10k dataset is from mouse.

@tangybat
Copy link
Author

Yes, I used different code not shown in my original post that used the human intron index and human pbmcs from one sample (not the mouse demo cells) but still ended up with the large amount of equivalence classes. The four fasta files from this sample only totaled to about 15Gb.

@lambdamoses
Copy link
Collaborator

I realized that the actual kallisto index with introns for human is over 40 GB and the number of equivalence classes did not blow up. That's the case for the index built from the outputs of BUSpaRse, and it still is the case when I built the index from scratch with the kb wrapper (the pre-computed one is probably compressed). There's probably something wrong with your index. What's the size of your cDNA_introns.fa file for human? Mine is 8.5 GB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants