GermlineCNVCaller -XL gives error "Intervals for read-count file do not contain all specified intervals" #5388

Closed
sooheelee opened this issue Nov 2, 2018 · 7 comments

@sooheelee
Contributor

Bug Report

Affected tool(s) or class(es)

GermlineCNVCaller

Affected version(s)

v4.0.4.0 and v4.0.11.0 tested with same result

Description

[Screenshot 2018-11-02 14 50 17]

java.lang.IllegalArgumentException: Intervals for read-count file /home/shlee/gcnv/cvg/HG00096_chr20XY.hdf5 do not contain all specified intervals.
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
        at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.writeIntervalSubsetReadCountFiles(GermlineCNVCaller.java:390)
        at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:285)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Command runs fine sans -XL parameter. The contents of -XL are simply:

[Screenshot 2018-11-02 14 51 58: contents of the -XL file]

Expected behavior

It would be great to be able to iterate GermlineCNVCaller on coverage data while excluding various regions, e.g. centromeric regions, to test the impact of such regions on the denoising. Currently, the hypothetical workaround would be either to collect coverage while excluding the regions or to manually remove the corresponding intervals from the coverage data. Collecting coverage once over all of the data is preferable to recollecting it for each slightly different set of regions.

@sooheelee
Contributor Author

@samuelklee, I tested using v4.0.11.0 and see the same error with -XL so here is the bug report as promised.

@samuelklee
Contributor

samuelklee commented Nov 2, 2018

Can you describe the inputs you are passing to -L and -XL? (I'm guessing the latter is just a BED file of centromeric regions, but how about the former?)

Is there any issue when you use FilterIntervals as is done in the WDL, i.e., running FilterIntervals with -L = the interval list of bins output by PreprocessIntervals and -XL = your blacklist, and then passing the corresponding interval list to the -L argument of GermlineCNVCaller? This allows one to mask intervals at the modeling steps without recollecting coverage as you describe.

@samuelklee
Contributor

Just to be clear, the gCNV code expects that the intervals obtained after resolving the inputs of -L and -XL via the engine should specify the bins that the user wishes to retain from the coverage files.

@sooheelee
Contributor Author

Yes, I see the same error when using FilterIntervals in the context of the WDL when supplying the BED file with the -XL param.

@sooheelee
Contributor Author

sooheelee commented Nov 8, 2018

Thanks for clarifying that the -XL intervals should match the bins! That is the problem: the BED file lists centromeric regions that do not align with the bin boundaries. I will think of a way to solve this for the tutorial.

@samuelklee
Contributor

samuelklee commented Nov 8, 2018

Hmm, actually, I think there might be more to it. The current FilterIntervals code also assumes that the annotated bins match the bins resolved from -L/-XL (this is a bug). So if you pass the same -L/-XL upstream to AnnotateIntervals, then you should be OK. However, you might still get into trouble if your coverage bins don't match exactly (which might happen if -XL splits some bins).
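To make the bin-splitting problem concrete, here is a minimal Python sketch. It is not GATK code: the `subtract` helper, the toy bins, and the blacklist region are all hypothetical, and intervals are assumed to be 1-based and inclusive. It mimics subtracting an -XL region from each bin and shows that the leftover pieces no longer match any bin in the count file, which is the condition behind the "do not contain all specified intervals" error.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive (assumed)


def subtract(bin_: Interval, excluded: Interval) -> List[Interval]:
    """Return the parts of bin_ that are not covered by excluded."""
    contig, start, end = bin_
    x_contig, x_start, x_end = excluded
    # No overlap: the bin survives unchanged.
    if contig != x_contig or x_end < start or x_start > end:
        return [bin_]
    pieces = []
    if x_start > start:  # piece to the left of the excluded region
        pieces.append((contig, start, x_start - 1))
    if x_end < end:      # piece to the right of the excluded region
        pieces.append((contig, x_end + 1, end))
    return pieces


# Hypothetical bins as they would appear in a read-count file.
count_bins = [("chr20", 1, 1000), ("chr20", 1001, 2000), ("chr20", 2001, 3000)]
# Hypothetical blacklist region that does not align with bin boundaries.
blacklist = ("chr20", 1500, 2500)

# Subtraction splits the second and third bins into partial bins.
resolved = [piece for b in count_bins for piece in subtract(b, blacklist)]

# Partial bins are absent from the count file, so an exact-match check fails.
available = set(count_bins)
missing = [iv for iv in resolved if iv not in available]
print(resolved)
print(missing)
```

With these toy inputs only the first bin survives intact; the two partial bins are what an exact-match validation would reject.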

Let me try and clean some of this up in a future PR. (I think the underlying issue is that we are trying to leverage some of the -L/-XL machinery provided by the engine, which should hopefully be familiar to users, but some of the assumptions the engine makes aren't really in line with what we need for CNV. This is also reflected in the awkward need for -imr OVERLAPPING_ONLY in all of the CNV tools.) I think for the tutorial you can just go ahead and blacklist at the coverage collection step.

@samuelklee
Contributor

Did some more thinking about this issue. Ideally, we'd drop all -L bins that overlap at all with any -XL regions, then check that the remaining bins are a subset of the annotated intervals and/or count files, if available. This seems most natural, in that -L/-XL would specify the desired set of intervals for filtering, and we'd fail if all of these are not available in the other inputs.
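A rough Python sketch of this first strategy, under stated assumptions: the function names are hypothetical, intervals are taken to be 1-based and inclusive, and nothing here reflects how GATK actually implements interval resolution. Bins that touch any -XL region are dropped whole rather than split, and the remaining bins must all be present in the other inputs.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive (assumed)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b share at least one base on the same contig."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def drop_overlapping_bins(bins: Iterable[Interval],
                          excluded: Iterable[Interval]) -> List[Interval]:
    """Keep only bins that do not touch any excluded region; bins are never split."""
    excluded = list(excluded)
    return [b for b in bins if not any(overlaps(b, x) for x in excluded)]


def require_subset(kept: Iterable[Interval],
                   available: Iterable[Interval],
                   label: str) -> None:
    """Fail if any retained bin is absent from another input (annotations or counts)."""
    available_set = set(available)
    missing = [b for b in kept if b not in available_set]
    if missing:
        raise ValueError(f"{label} do not contain all specified intervals: {missing}")


# Hypothetical -L bins and -XL blacklist.
bins = [("chr20", 1, 1000), ("chr20", 1001, 2000), ("chr20", 2001, 3000)]
blacklist = [("chr20", 1500, 2500)]

kept = drop_overlapping_bins(bins, blacklist)
# Passes, because every retained bin exists in the (hypothetical) count file.
require_subset(kept, bins, "Intervals for read-count file")
print(kept)
```

Because whole bins are dropped, the retained intervals always coincide with original bin boundaries, so the subset check remains a meaningful way to catch mismatched inputs.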

However, due to the way intervals are resolved by the engine, I don't think it's easy to identify which bins overlap with -XL regions---the engine will instead split bins and retain the parts that don't overlap. So alternatively, if we assume that in typical use the annotated intervals and count files will contain the desired intervals as a subset, we can simply take the intersection of all intervals to drop these partial bins. However, if a user screws up and provides annotated intervals or count files with bins that don't match those specified via -L, then we don't really have a good strategy for failing---probably the only fair check we can do is fail if no bins remain after intersection.
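The second strategy could look roughly like the following Python sketch. The `intersect_all` name and the toy inputs are hypothetical, intervals are assumed to be 1-based and inclusive, and this is an illustration rather than the eventual GATK behavior. Taking the exact-match intersection across the resolved intervals, the annotated intervals, and the count-file intervals silently drops partial bins, and the only failure condition is an empty result.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive (assumed)


def intersect_all(*interval_sets: Iterable[Interval]) -> List[Interval]:
    """Return intervals present, as exact matches, in every input set."""
    if not interval_sets:
        raise ValueError("At least one interval set is required.")
    sets = [set(s) for s in interval_sets]
    common = sets[0].intersection(*sets[1:])
    if not common:
        raise ValueError("No intervals remain after intersection.")
    # Sorted by tuple order for reproducibility; a real tool would presumably
    # order contigs by the sequence dictionary instead.
    return sorted(common)


# Hypothetical engine output after -XL splits two of three bins.
resolved = [("chr20", 1, 1000), ("chr20", 1001, 1499), ("chr20", 2501, 3000)]
# Hypothetical annotated intervals and count-file intervals (the original bins).
annotated = [("chr20", 1, 1000), ("chr20", 1001, 2000), ("chr20", 2001, 3000)]
counts = [("chr20", 1, 1000), ("chr20", 1001, 2000), ("chr20", 2001, 3000)]

retained = intersect_all(resolved, annotated, counts)
print(retained)
```

The trade-off is the one described above: inputs whose bins only partly match those given via -L are narrowed without complaint, and an error is raised only when nothing at all is left.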

If we assume that users will typically be using or following the WDL, I think I'm OK with the second strategy. Any objections or thoughts, @sooheelee?
