
gCNV nan errors #4824

Closed
mwalker174 opened this issue May 29, 2018 · 14 comments · Fixed by #6245

Comments

@mwalker174 (Contributor)

@samuelklee @asmirnov239 @mbabadi I tried to run a 30-sample cohort through gCNV on all canonical chromosomes with 250bp bins sharded in 10k-interval blocks, but PostprocessGermlineCNVCalls gave the following error:

19:26:14.967 INFO  PostprocessGermlineCNVCalls - Analyzing shard 223...
19:26:15.107 INFO  PostprocessGermlineCNVCalls - Analyzing shard 224...
19:26:15.259 INFO  PostprocessGermlineCNVCalls - Analyzing shard 225...
19:26:15.260 INFO  PostprocessGermlineCNVCalls - Shutting down engine
[May 29, 2018 7:26:15 PM UTC] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 3.34 minutes.
Runtime.totalMemory()=39753089024
***********************************************************************

A USER ERROR has occurred: Bad input: Validation error occurred on line %d of the posterior file: Posterior probabilities for at at least one posterior record do not sum up to one.

After inspecting the output from shard 225, it seems that the model starts producing NaN values after ~1600 warmup iterations (judging from the ELBO log). This shard corresponds to the pericentromeric region chr3:91540501-94090250.
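(A quick way to locate where the trace goes bad is to grep the shard's inference log for NaNs; the path below is just a placeholder for wherever the GermlineCNVCaller ELBO trace was captured.)

    # Hypothetical log path; point this at the file holding the shard's ELBO trace.
    grep -n -i -m 5 'nan' shard-225/gcnv-inference.log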

It would be nice to have the option to bypass this error in PostprocessGermlineCNVCalls.

Here is the model config for the shard:

 "p_active": 0.01,
 "cnv_coherence_length": 10000.0,
 "class_coherence_length": 10000.0,
 "max_copy_number": 5,
 "num_calling_processes": 1,
 "num_copy_number_states": 6,
 "num_copy_number_classes": 2
 "max_bias_factors": 5,
 "mapping_error_rate": 0.01,
 "psi_t_scale": 0.001,
 "psi_s_scale": 0.0001,
 "depth_correction_tau": 10000.0,
 "log_mean_bias_std": 0.1,
 "init_ard_rel_unexplained_variance": 0.1,
 "num_gc_bins": 20,
 "gc_curve_sd": 1.0,
 "q_c_expectation_mode": "hybrid",
 "active_class_padding_hybrid_mode": 50000,
 "enable_bias_factors": false,
 "enable_explicit_gc_bias_modeling": false,
 "disable_bias_factors_in_active_class": false
 "version": "0.7"
@ldgauthier (Contributor)

How did you guys generate the target list for hg38? I've been having some problems with regions near the centromeres for SNPs and indels as well. The centromeres.bed for hg38 from UCSC seems to include the computationally generated centromeres, but not the additional gross regions nearby that we excluded from b37. Laurent was excluding a fair amount of territory beyond the "official" centromeres for his QC based on the density of multi-allelic variant calls.

@samuelklee (Contributor)

@mbabadi We should look into ways to be more robust against NaNs, but I think we should just go ahead and blacklist these regions. This can be done from the outset of the pipeline via the -XL argument to PreprocessIntervals. Does the SV team have a canonical list we can start recommending? It looks like http://cf.10xgenomics.com/supp/genome/GRCh38/sv_blacklist.bed may also be a good option. Perhaps we can add some padding if necessary.
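For example, something along these lines (paths are placeholders; -XL/--exclude-intervals, --bin-length, and --padding are standard PreprocessIntervals arguments):

    # Sketch: exclude a blacklist BED up front when generating 250bp bins.
    gatk PreprocessIntervals \
        -R hg38.fasta \
        --bin-length 250 \
        --padding 0 \
        -XL sv_blacklist.bed \
        -O cohort.preprocessed.interval_list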

@mwalker174 (Contributor, Author)

For SVs, we are not blacklisting any regions except sometimes gaps and centromeres. Unfortunately many of the events occur in messy areas like this and I think it’s going to be a major issue if we can’t guarantee that the model will be robust in such regions.

@samuelklee (Contributor)

@mwalker174 the region you found above is included in the 10X SV blacklist. What list is the SV team currently using?

If the read-depth data is not reliable in these regions, I would not expect the model fit to be very good, even if we made the model more robust against NaNs. So I wouldn't think the results would be very useful for SV integration. Is there a way to make CNV-SV integration "Bayesian" in the sense that we could fall back on a prior in the case of missing CNV data?

@mwalker174 (Contributor, Author)

@samuelklee We aren't using any blacklists currently. I am less concerned about noisy calls because they usually don't line up well with read-pair evidence. That said, I did not get NaN errors with 1 kbp bins; perhaps we could bump up the bin sizes in regions where this happens?

@cwhelan (Member) commented May 30, 2018

@samuelklee In my experience the 10x blacklist is extremely conservative and will likely exclude many regions of common copy number variation. I'd recommend something less restrictive for general use.

@mwalker174 (Contributor, Author)

There were more NaNs in these other chunks, all of which overlap the UCSC centromere regions:

chr9:43318001-60923750
chr10:39050001-41716500
chr11:50778501-53535250
chr17:22749251-25299000
chr18:17216251-19716250

I'm going to try again with centromeres blacklisted.
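(For reference, one way to pull those intervals into a BED usable with -XL is the UCSC public MySQL server; the host name and the hg38 "centromeres" table below are from memory and worth double-checking.)

    # Sketch: dump the hg38 centromere intervals to a BED file for -XL.
    mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -D hg38 \
        -e 'SELECT chrom, chromStart, chromEnd FROM centromeres;' > centromeres.hg38.bed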

@mwalker174 (Contributor, Author)

Blacklisting centromeres resolved the NaN errors except for one block on chrY that roughly corresponds to region q11.23. I blacklisted that block and got no more errors.

@samuelklee (Contributor)

My guess is that these regions have unusually high coverage, which is probably yielding NaN likelihoods. @mwalker174 any way you can check this from your previous runs?
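(A rough way to check this from a CollectReadCounts TSV, assuming the usual CONTIG/START/END/COUNT columns and a placeholder file name:)

    # Sketch: mean raw count over the failing pericentromeric shard chr3:91540501-94090250.
    awk -F'\t' '$1 == "chr3" && $2 >= 91540501 && $3 <= 94090250 { n++; s += $4 }
        END { if (n) printf "intervals=%d mean_count=%.1f\n", n, s / n }' sample.counts.tsv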

Our philosophy so far has been to keep the tools relatively agnostic by allowing generic blacklisting via -XL and pushing the responsibility to the users.

@samuelklee (Contributor)

@mwalker174 can you go back and check whether high coverage was causing the NaNs?

samuelklee removed their assignment on Feb 1, 2019
@samuelklee (Contributor)

@mwalker174 does this need to be addressed?

@mwalker174 (Contributor, Author)

@samuelklee Yes. Looking forward, we will want to reduce the extent of our blacklist and interval filtering, which are currently needed to prevent these errors.

@samuelklee (Contributor) commented Oct 30, 2019

@mwalker174 let's check whether these NaNs were caused by high coverage. Perhaps we can address them along with those due to vanishing overdispersion (which were caused by large values of interval-psi-scale).
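(For reference, interval-psi-scale is exposed directly on the GermlineCNVCaller command line, so an overly large value can be dialed back without editing the config; the invocation below is only a sketch with placeholder paths, and 1e-4 is an illustrative value, not a recommendation.)

    # Sketch: rerun a shard with a reduced interval-psi-scale.
    gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L shard_225.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        --contig-ploidy-calls ploidy-calls \
        --interval-psi-scale 1e-4 \
        -I sample_1.counts.hdf5 -I sample_2.counts.hdf5 \
        --output gcnv-cohort --output-prefix shard_225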

@samuelklee (Contributor)

This is at least partially addressed in #6245; we can reopen if there are other NaNs that we have to patch.
