Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. #7261

Merged
merged 1 commit into from
Jun 11, 2021

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented May 19, 2021

@fleharty this is a rebased version of the 17cadfa branch used to generate the Pf7 CNV call set. There are two minor changes: a) one to remove spurious negative dCR estimates reported by gCNV, which were negatively affecting genotyping of HRP2/3 deletions, and b) updating sklearn to the version used for clustering, so that we can reproduce everything exactly using just the GATK Docker. The latter change probably isn't absolutely necessary, but it doesn't seem to break anything so I'm going to go ahead with it. We might want to update to an even more recent version later on (especially if we make any breaking/non-refactoring improvements to the malaria genotyping code after the initial PR), but unfortunately this slightly changes the clustering assignment for a few samples.

@mwalker174 @asmirnov239 we discussed the first change some time ago, but just a heads up.

@samuelklee samuelklee force-pushed the sl_gcnv_dcr_error_rate branch from 17cadfa to cb2ab69 Compare May 19, 2021 17:37
@samuelklee samuelklee requested a review from fleharty May 19, 2021 17:42
@samuelklee
Copy link
Contributor Author

@fleharty want to get this in before release?

Copy link
Contributor

@fleharty fleharty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, though I would prefer to have more tests in the python code.

@@ -786,8 +786,7 @@ def __init__(self,
# the expected number of erroneously mapped reads
mean_mapping_error_correction_s = eps_mapping * read_depth_s * shared_workspace.average_ploidy_s

denoised_copy_ratio_st = ((shared_workspace.n_st - mean_mapping_error_correction_s.dimshuffle(0, 'x'))
/ ((1.0 - eps_mapping) * read_depth_s.dimshuffle(0, 'x') * bias_st))
denoised_copy_ratio_st = shared_workspace.n_st / (read_depth_s.dimshuffle(0, 'x') * bias_st)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this change should have broken a test, but it looks like the only tests test the Python environment. Should we add tests (maybe not for this PR)?

@samuelklee
Copy link
Contributor Author

samuelklee commented Jun 11, 2021

Yup, as you might recall from some discussions with @bhanugandham and @mwalker174, getting automated pipeline-level CNV evaluations up and running was highest on my list before I handed over the role and went on paternity leave. I think these tests would be more useful than unit/integration tests for correctness, but they would almost certainly have to run on CARROT.

That said, the current level of unit/integration test coverage is a bit different from that for the somatic tools, because 1) it's difficult to run gCNV in any useful way on Travis infrastructure, and 2) we hadn't decided on a framework/convention for python unit tests at the time the production code went in (although I think @ldgauthier has added some python unit tests by now).

So we currently only have plumbing WDL/integration tests on very small data for gCNV---and these only test that the tools run, not for correctness. For somatic CNV, we have unit tests for correctness on small simulated data (e.g., for things like segmentation and modeling classes), but integration tests don't cover correctness (and it would be pretty redundant to use the same simulated data for integration, so I'd rather put effort towards pipeline-level tests on real data).

It might be good for you and @mwalker174 to review the current level of testing coverage and understand where things need to be shored up---happy to discuss more.

@samuelklee samuelklee merged commit ce9dccb into master Jun 11, 2021
@samuelklee samuelklee deleted the sl_gcnv_dcr_error_rate branch June 11, 2021 12:58
koncheto-broad pushed a commit that referenced this pull request Jan 23, 2023
koncheto-broad added a commit that referenced this pull request Feb 2, 2023
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <[email protected]>

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)

* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)

* Small changes to JointVCFFiltering WDL

* making default for use_allele_specific_annotations

* addressing comments

* first stab

* wire through WDL changes

* fixed typo

* set model_backend input value

* add gatk_override to JointVcfFiltering call

* typo in indel_annotations

* make model_backend optional

* tabs and spaces

* make all model_backends optional

* use gatk 4.3.0

* no point in changing the table names as this is a POC

* adding new branch to dockstore

* adding in branching logic for classic VQSR vs VQSR-Lite

* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files

* passing classic flag to indel run of CreateFilteringFiles

* Update GvsCreateFilterSet.wdl

cleaning up verbiage

* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)

* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <[email protected]>
Co-authored-by: meganshand <[email protected]>
Co-authored-by: Rebecca Asch <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants