GetPileupSummaries runs out of memory even with >200 available RAM #5918

Closed
bhanugandham opened this issue May 6, 2019 · 7 comments

@bhanugandham
Contributor

bhanugandham commented May 6, 2019

A user reported that GetPileupSummaries runs out of Java heap memory when run on a gnomAD VCF. The heap limit was set with -Xmx30G and the machine has >200G RAM.

User Report: I'm trying to run the cross sample contamination check on my samples, but GetPileupSummaries (4.1.1.0) keeps running out of memory, even when running a single sample on a VM that has >200GB of RAM available.

14:35:16.874 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:35:17.116 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.117 INFO  GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.1.1.0
14:35:17.117 INFO  GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
14:35:17.118 INFO  GetPileupSummaries - Executing as root@c64bec8aea6f on Linux v4.15.0-47-generic amd64
14:35:17.118 INFO  GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
14:35:17.118 INFO  GetPileupSummaries - Start Date/Time: April 24, 2019 2:35:16 PM UTC
14:35:17.118 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.119 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.119 INFO  GetPileupSummaries - HTSJDK Version: 2.19.0
14:35:17.119 INFO  GetPileupSummaries - Picard Version: 2.19.0
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:35:17.120 INFO  GetPileupSummaries - Deflater: IntelDeflater
14:35:17.120 INFO  GetPileupSummaries - Inflater: IntelInflater
14:35:17.121 INFO  GetPileupSummaries - GCS max retries/reopens: 20
14:35:17.121 INFO  GetPileupSummaries - Requester pays: disabled
14:35:17.121 WARN  GetPileupSummaries -

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

   Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


14:35:17.121 INFO  GetPileupSummaries - Initializing engine
14:35:17.456 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
14:35:17.586 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
16:39:08.359 INFO  IntervalArgumentCollection - Processing 236373212 bp from intervals
16:41:01.520 INFO  GetPileupSummaries - Done initializing engine
16:41:01.521 INFO  ProgressMeter - Starting traversal
16:41:01.521 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
02:44:42.116 INFO  GetPileupSummaries - Shutting down engine
[April 25, 2019 2:44:42 AM UTC] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 729.42 minutes.
Runtime.totalMemory()=23243784192
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:265)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
        at java.util.ArrayList.add(ArrayList.java:462)
        at htsjdk.samtools.BinningIndexContent.getChunksOverlapping(BinningIndexContent.java:131)
        at htsjdk.samtools.CachingBAMFileIndex.getSpanOverlapping(CachingBAMFileIndex.java:75)
        at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
        at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
        at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:405)
        at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:125)
        at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
        at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:404)
        at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:330)
        at java.lang.Iterable.spliterator(Iterable.java:101)
        at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1098)
        at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:321)
        at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:159)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:984)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /gatk/gatk-package-4.1.1.0-local.jar

The samples I'm running it on are hg38-aligned, ~200GB BAM files that have been merged from multiple lanes, and sometimes two different flowcells. Other than that, there is nothing special about them. I have been able to run the contamination check successfully on other, non-merged samples.

With this particular run, I tried defining --java-options "-Xmx30G" for the GetPileupSummaries process.
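
For reference, the heap figures in the report above can be compared with a few lines of Python. This is a minimal sketch: the helper names (`parse_xmx`, `parse_total_memory`, `to_gib`) are hypothetical and not part of GATK. Note that `Runtime.totalMemory()` reports the heap currently committed by the JVM, not the `-Xmx` ceiling.

```python
import re

# Binary multipliers for the size suffixes accepted by -Xmx (k, m, g).
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(option: str) -> int:
    """Convert a JVM heap option such as '-Xmx30G' to a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", option)
    if match is None:
        raise ValueError(f"not an -Xmx option: {option!r}")
    number, suffix = match.groups()
    # No suffix means the value is already in bytes.
    return int(number) * _UNITS.get(suffix.lower(), 1)


def parse_total_memory(log_line: str) -> int:
    """Extract the byte count from a 'Runtime.totalMemory()=N' log line."""
    match = re.search(r"Runtime\.totalMemory\(\)=(\d+)", log_line)
    if match is None:
        raise ValueError("no Runtime.totalMemory() value found")
    return int(match.group(1))


def to_gib(num_bytes: int) -> float:
    """Express a byte count in GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    limit = parse_xmx("-Xmx30G")
    committed = parse_total_memory("Runtime.totalMemory()=23243784192")
    print(f"heap limit: {to_gib(limit):.2f} GiB")
    print(f"committed:  {to_gib(committed):.2f} GiB")
```

With the values from this report, the `-Xmx30G` limit is 30 GiB and the logged total is about 21.65 GiB.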

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/23931/getpileupsummaries-runs-out-of-memory/p1

@davidbenjamin
Contributor

@bhanugandham It looks like the user is not using our provided resource file for GetPileupSummaries. It seems that the user selected only biallelic sites from gnomAD by hand, without removing all the extraneous info fields and restricting to exonic variants as we do in our resource. This means that the tool needs to hold a huge amount of gnomAD in memory at any given time, which is probably causing the crash.

If the user follows our best practices I expect the problem to go away.

@davidbenjamin
Contributor

We haven't heard back from these users on the forum, so I assume the problem went away upon using the correct resource file.

@ghost

ghost commented Mar 17, 2021

I am also seeing this hang at "INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute", with 128GB of RAM on the node and assigned to -Xmx. Versions 4.2.0.0 and 3.6.

@davidbenjamin
Contributor

@bolton-lab What are your GATK commands and what resource files are you using?

@LedaKatopodi

LedaKatopodi commented May 26, 2022

hello @davidbenjamin,

Could you point me to the location of your provided resource file for GetPileupSummaries?

I have been using the af-only-gnomad.hg38.vcf.gz VCF file provided by GATK for my analyses.

For GetPileupSummaries I had to modify the af-gnomad file, following the GATK directions from here to extract only the AF field, and here to extract only the biallelic SNPs, and I further had to remove some contigs that are not present in my reference FASTA. Still, I could not get GetPileupSummaries to run properly; it remains frozen for hours at this step:

13:05:29.711 INFO  ProgressMeter - Starting traversal
13:05:29.712 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute

I found this issue and was about to test the RAM requirements, but I would also like to give your provided resource file for GetPileupSummaries a look if you could point me to it, in case this solves my problem once and for all.

Thank you very much.

Cheers,
Leda
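
The two filtering steps described in this comment (keep only the AF field, keep only biallelic SNPs) can be illustrated on plain VCF text lines. This is a minimal sketch for clarity only: the function names are hypothetical, it is not the GATK procedure the comment links to, and it assumes tab-separated, uncompressed VCF records.

```python
_BASES = set("ACGT")


def is_biallelic_snp(vcf_line: str) -> bool:
    """Return True for a data line with one single-base REF and one single-base ALT."""
    if vcf_line.startswith("#"):
        return False  # header lines are not variant records
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False
    ref = fields[3].upper()
    alts = fields[4].upper().split(",")
    # Biallelic means exactly one ALT allele; SNP means both alleles are one base.
    return len(alts) == 1 and ref in _BASES and alts[0] in _BASES


def keep_af_only(info: str) -> str:
    """Drop every INFO entry except AF; return '.' if AF is absent."""
    kept = [entry for entry in info.split(";") if entry.startswith("AF=")]
    return ";".join(kept) if kept else "."


def slim_record(vcf_line: str) -> str:
    """Rewrite a data line so that its INFO column carries only AF."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) >= 8:
        fields[7] = keep_af_only(fields[7])
    return "\t".join(fields)


if __name__ == "__main__":
    record = "chr1\t10067\t.\tT\tC\t30.35\tPASS\tAC=3;AF=7.384e-05;AN=40624"
    if is_biallelic_snp(record):
        print(slim_record(record))
```

In practice a dedicated tool would do this on the compressed, indexed file; the sketch only shows what the resulting records look like.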

@davidbenjamin
Contributor

davidbenjamin commented Jun 13, 2022

Hi @LedaKatopodi,

This is probably fixed in a recent PR, #7664. Additionally, though, CalculateContamination, which looks for its signal in homozygous alt sites, does not need to find every last hom alt in the genome to do its job. Therefore, you can get equally good results in much less time by using only common exonic variants in GetPileupSummaries. Our best practices file for this is gs://gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz, which is about 1/3000 the size of the corresponding gnomAD file.

Regards,
David

PS If this does not resolve your issue, please re-open this ticket!
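
A minimal sketch of how the suggestion above might be wired into a script. It only assembles the argument list and does not run GATK. The function name, the BAM and output paths, and the default heap size are placeholders; passing the same resource file to both `-V` and `-L` is an assumption about a typical invocation rather than something stated in this thread.

```python
from typing import List

# Resource path quoted in the comment above.
SMALL_RESOURCE = (
    "gs://gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz"
)


def pileup_summaries_command(
    bam: str,
    output_table: str,
    resource: str = SMALL_RESOURCE,
    heap: str = "30G",
) -> List[str]:
    """Build (but do not execute) a GetPileupSummaries argument list."""
    return [
        "gatk",
        "--java-options", f"-Xmx{heap}",  # JVM heap limit, as in the original report
        "GetPileupSummaries",
        "-I", bam,            # input reads (placeholder path)
        "-V", resource,       # common-variant sites with allele frequencies
        "-L", resource,       # assumed: restrict traversal to those same sites
        "-O", output_table,   # pileup summary table (placeholder path)
    ]


if __name__ == "__main__":
    # Print the command for inspection; hand it to subprocess.run to execute.
    print(" ".join(pileup_summaries_command("sample.bam", "sample.pileups.table")))
```

Building the list separately from running it makes it easy to log or review the exact command before starting a long traversal.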

@LedaKatopodi

Hi @davidbenjamin,

Thank you very much for the reply, the additional information on how to best use CalculateContamination and GetPileupSummaries, and for confirming the location of the best practices file!

Best,
Leda

4 participants