SVConcordance workflows update #540

Merged
merged 9 commits into from
Jun 19, 2023

mwalker174
Collaborator

Streamlines workflows for SVConcordance. The new ordering is as follows:

CleanVcf -> FormatVcfForGatk -> JoinRawCalls -> SVConcordance -> RecalibrateGq

  • New workflow FormatVcfForGatk, which is needed for formatting ClusterBatch and CleanVcf vcfs for consumption by SVCluster/SVConcordance
  • JoinRawCalls and FormatVcfForGatk now generate new ploidy tables from a ped file. The ploidy table generated in ClusterBatch currently has chrY ploidy fixed to 1 for all samples due to an issue with RDTest. However, at this stage chrY should have ploidy 0 for females, so we generate new ploidy tables. Since JoinRawCalls and FormatVcfForGatk can be run in parallel, it is convenient to have each generate the ploidy table separately. They do repeat the same work here, but it is a very fast job.
  • Updated GATK docker with many fixes to SVConcordance (see gatk#8257, "Size similarity linkage and bug fixes for SV matching tools")
  • Adds HGDP testing batch, with resources for ClusterBatch, Vapor, and FormatVcfForGatk-onward.
  • Updates and simplifies svtk-to-gatk and gatk-to-svtk formatting scripts.
  • Scalability improvements for JoinRawCalls (reduces number of jobs by first concatenating ClusterBatch vcf within batches).
  • Simplifies and updates the SVConcordance workflow, removing formatting steps.
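
The ploidy-table point above can be illustrated with a minimal sketch. It assumes a standard six-column PED file (family, sample, father, mother, sex, phenotype) with sex coded 1 for male and 2 for female. The function name `build_ploidy_table` and its return shape are hypothetical and are not taken from the repository's actual script; only the `retain_female_chr_y` flag name appears in this PR, and its exact behavior here is an assumption.

```python
def build_ploidy_table(ped_lines, contigs, chr_x="chrX", chr_y="chrY",
                       retain_female_chr_y=False):
    """Return one row per sample: [sample_id, ploidy for each contig in order].

    Hypothetical illustration only. Females get chrY ploidy 0 unless
    retain_female_chr_y is set, in which case chrY ploidy is fixed to 1
    (the ClusterBatch behavior described in the PR text).
    """
    rows = []
    for line in ped_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        sample, sex = fields[1], fields[4]
        if sex not in ("1", "2"):
            # Assumption: this sketch does not define ploidy for unknown sex.
            raise ValueError(f"Unsupported sex code {sex!r} for sample {sample}")
        is_male = sex == "1"
        ploidies = []
        for contig in contigs:
            if contig == chr_x:
                ploidies.append(1 if is_male else 2)
            elif contig == chr_y:
                ploidies.append(1 if (is_male or retain_female_chr_y) else 0)
            else:
                ploidies.append(2)  # autosomes are diploid
        rows.append([sample] + ploidies)
    return rows


ped = ["fam1\tS1\t0\t0\t1\t0", "fam1\tS2\t0\t0\t2\t0"]
table = build_ploidy_table(ped, ["chr1", "chrX", "chrY"])
```

Because the table depends only on the PED file and the contig list, generating it separately in each workflow is cheap, which is consistent with the note that the duplicated work is a very fast job.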

@mwalker174 mwalker174 changed the title from "Mw sv concordance update" to "SVConcordance workflows update" May 25, 2023
@mwalker174 mwalker174 requested a review from epiercehoffman May 26, 2023 15:41
Member

@VJalili left a comment

Thank you, @mwalker174! In general, this reads excellent! My only comment is on the docker images.

inputs/values/dockers.json (resolved)
Collaborator

@epiercehoffman left a comment

It's exciting to see the new filtering workflows get closer to being ready for widespread use! The simplification/reorganization of SVConcordance makes sense and the scaling improvements to JoinRawCalls sound good. Thanks for adding Vapor test data as well.

What would you think of integrating FormatVcfForGatk into CleanVcf to reduce the number of workflows users are required to run? It could replace FixEndsRescaleGq as well to reduce redundancy.

I have left some additional comments throughout. Some are really just questions for my own understanding, but those may also indicate places where further documentation could be helpful.

new_genotype['GT'] = (0, 1)
else:
new_genotype['GT'] = (0, 0)
if _cache_gt_sum(genotype.get('GT', None)) > 0:
Collaborator

What is the problem this is addressing? Why is it not possible to return (1,1)? Does it just not matter because this script is only run during ClusterBatch, and then all the genotypes will be reassigned during GenotypeBatch anyway? If that's the case, it may be worth documenting, since it would make this script more specific rather than for general use (similarly with the END2/END swap).

Collaborator Author

Exactly, this is meant only to be used in ClusterBatch. I've updated the arguments description.
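
To make the behavior under discussion concrete, here is a minimal sketch of collapsing any genotype to a het or hom-ref call, as the diff lines above suggest. The helper name `collapse_to_presence_call` is hypothetical, and the sketch works on plain tuples rather than pysam genotype objects. It reflects the reply's point that this is only sensible in ClusterBatch, where genotypes are reassigned later.

```python
def collapse_to_presence_call(gt):
    """Map a genotype tuple to (0, 1) if it carries any non-ref allele, else (0, 0).

    Hypothetical illustration: missing alleles (None) and a missing GT are
    treated as reference, and homozygous-alt calls such as (1, 1) are reported
    as (0, 1), so zygosity is deliberately discarded.
    """
    alt_allele_sum = sum(allele for allele in (gt or ()) if allele is not None)
    return (0, 1) if alt_allele_sum > 0 else (0, 0)


calls = [collapse_to_presence_call(gt) for gt in [(1, 1), (0, 1), (0, 0), (None, None), None]]
```

In this sketch the output never contains (1, 1), which is the property the reviewer asked about.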

@@ -97,6 +97,7 @@ workflow ClusterBatch {
ped_file=ped_file,
script=ploidy_table_script,
contig_list=contig_list,
retain_female_chr_y=false,
Collaborator

Your PR note states that female chrY ploidy needs to be 1 during ClusterBatch but it looks like this would cause it to be 0. Should this be set to true?

Collaborator Author

Yes, good catch.

String? chr_x
String? chr_y

File? svtk_to_gatk_script # For debugging
Collaborator

Do you want to remove this in production or does it not matter?

Collaborator Author

I'd like to leave it in - this can be useful in case someone wants to substitute a different script for slightly different vcfs.

genotype['ECN'] = ploidy_dict[sample][contig]
if scale_down_gq:
rescale_gq(record)
return record


def _cache_gt_sum(gt):
Collaborator

This function doesn't appear to be used anymore in this script

Collaborator Author

Thanks, deleted this and the one below it too.

  bnd_end_dict: Optional[Dict[Text, int]],
- ploidy_dict: Dict[Text, Dict[Text, int]]) -> pysam.VariantRecord:
+ ploidy_dict: Dict[Text, Dict[Text, int]],
+ scale_down_gq: bool) -> pysam.VariantRecord:
  """
  Converts a record from svtk to gatk style. This includes updating all GT fields with proper ploidy, and adding
Collaborator

It doesn't look like there are updates to GT in this script anymore

Collaborator Author

Updated
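
For readers following the `ECN` and `scale_down_gq` lines quoted in the threads above, the per-sample annotation step might look roughly like the sketch below. It uses plain dictionaries instead of pysam records, the function name `annotate_ecn_and_rescale` is hypothetical, and the divide-by-10 rescaling is an assumption about what `rescale_gq` does that this PR does not confirm.

```python
import math


def annotate_ecn_and_rescale(genotypes, ploidy_dict, contig, scale_down_gq=False):
    """Return a copy of per-sample FORMAT fields with ECN set from the ploidy table.

    Hypothetical illustration: when scale_down_gq is set, GQ is assumed to be
    rescaled by floor-dividing by 10; the real rescale_gq may differ.
    """
    annotated = {}
    for sample, fields in genotypes.items():
        new_fields = dict(fields)  # leave the input untouched
        new_fields["ECN"] = ploidy_dict[sample][contig]
        if scale_down_gq and new_fields.get("GQ") is not None:
            new_fields["GQ"] = int(math.floor(new_fields["GQ"] / 10))
        annotated[sample] = new_fields
    return annotated


ploidy = {"S1": {"chrY": 1}, "S2": {"chrY": 0}}
formats = {"S1": {"GQ": 999}, "S2": {"GQ": None}}
result = annotate_ecn_and_rescale(formats, ploidy, "chrY", scale_down_gq=True)
```

Note that GT is not modified in this sketch, which matches the reviewer's observation that the script no longer updates GT.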

if svtype == 'DEL':
new_genotype['CN'] = 1
record.ref = 'N'
if svtype == 'BND':
Collaborator

The GATK->SVTK script also looks for END2 for CTX, should this match?

Collaborator Author

Good catch. I don't think it would affect anything where these scripts are currently used (there are no CTX records in ClusterBatch), but I've added the CTX case here just in case.
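
As a rough illustration of treating CTX the same way as BND, the sketch below moves the second-breakend coordinate into `END2` for both types. It operates on plain dictionaries, the name `to_gatk_style_ends` is hypothetical, and the specific convention shown (copy `END` to `END2`, then set `END` to `POS`) is an assumption about the formatting script rather than something stated in this thread.

```python
INTERCHROMOSOMAL_TYPES = {"BND", "CTX"}  # both types are assumed to carry END2


def to_gatk_style_ends(record):
    """Return a copy of the record with END2 populated for BND and CTX records.

    Hypothetical illustration: an existing END2 is kept; otherwise END is
    assumed to hold the second-breakend coordinate and is copied to END2.
    """
    new_record = dict(record)
    if new_record["SVTYPE"] in INTERCHROMOSOMAL_TYPES:
        new_record["END2"] = new_record.get("END2", new_record["END"])
        new_record["END"] = new_record["POS"]
    return new_record


bnd = to_gatk_style_ends({"SVTYPE": "BND", "POS": 100, "END": 5000})
ctx = to_gatk_style_ends({"SVTYPE": "CTX", "POS": 200, "END": 7000})
deletion = to_gatk_style_ends({"SVTYPE": "DEL", "POS": 300, "END": 900})
```

Keeping the set of affected types in one constant makes it easier for the svtk-to-gatk and gatk-to-svtk directions to stay in agreement.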

@@ -1168,6 +1168,7 @@
],
"clean_vcf": "gs://gatk-sv-ref-panel-1kg/outputs/MakeCohortVcf/8a209488-c928-449d-92cd-0a5131e92b7c/call-CleanVcf/CleanVcf/277f3f25-bb99-4fe4-a48b-567fd3f344f9/call-ConcatCleanedVcfs/ref_panel_1kg.cleaned.vcf.gz",
"clean_vcf_index": "gs://gatk-sv-ref-panel-1kg/outputs/MakeCohortVcf/8a209488-c928-449d-92cd-0a5131e92b7c/call-CleanVcf/CleanVcf/277f3f25-bb99-4fe4-a48b-567fd3f344f9/call-ConcatCleanedVcfs/ref_panel_1kg.cleaned.vcf.gz.tbi",
"clean_vcf_gatk_formatter_args": "--use-end2",
Collaborator

I believe this clean VCF is from later than v0.22-beta, so based on your filtering Google Doc should this be "" instead?

Collaborator Author

Yes, that's a mistake in the doc, which I've now updated. I've also changed this argument to its opposite, --fix-end, which is a little clearer and is not required for current VCFs.

@mwalker174
Collaborator Author

@VJalili @epiercehoffman Thanks for your reviews. I've responded to each comment individually.

@epiercehoffman I also added gatk formatting to the end of CleanVcfChromosome to make future processing easier. I will still leave the FormatVcfForGatk wdl in place for use with old cleaned vcfs.

Update gatk docker

More updates

Add hgdp resources

Update hgdp resources to gs://gatk-sv-hgdp

GATK nightly docker

Set default records_per_shard in FormatVcfForGatk

Update JoinRawCalls

Add gatk_formatted_vcf to 1kg ref panel

Remove remove_infos and remove_formats from PESRCluster preprocess task

Add indexes to JoinRawCalls

Update dockers

Update 1kgp joined_raw_calls_vcf

Fix PreparePESRVcfs

Fix SVLEN filter in PreparePESRVcfs

Bump default SVConcordance memory to 16GB
@mwalker174 mwalker174 force-pushed the mw_sv_concordance_update branch from bb26cff to 4ff8aab Compare June 15, 2023 17:41
Member

@VJalili left a comment

LGTM, thanks for this great work, and thank you for addressing the comments!

@mwalker174 mwalker174 merged commit 7e40807 into main Jun 19, 2023
@mwalker174 mwalker174 deleted the mw_sv_concordance_update branch June 19, 2023 15:12
gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023
gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023
3 participants