-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SVConcordance workflows update #540
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @mwalker174! In general, this reads excellent! My only comment is on the docker images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exciting to see the new filtering workflows get closer to being ready for widespread use! The simplification/reorganization of SVConcordance makes sense and the scaling improvements to JoinRawCalls sound good. Thanks for adding Vapor test data as well.
What would you think of integrating FormatVcfForGatk into CleanVcf to reduce the number of workflows users are required to run? It could replace FixEndsRescaleGq as well to reduce redundancy.
I have left some additional comments throughout. Some are really just questions for my own understanding, but those may also indicate places where further documentation could be helpful.
new_genotype['GT'] = (0, 1) | ||
else: | ||
new_genotype['GT'] = (0, 0) | ||
if _cache_gt_sum(genotype.get('GT', None)) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the problem this is addressing? Why is not possible to return (1,1)? Does it just not matter because this script is only run during ClusterBatch, and then all the genotypes will be reassigned during GenotypeBatch anyway? If that's the case, it may be worth documenting, since it would make this script more specific rather than for general use (similarly with the END2/END swap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, this is meant only to be used in ClusterBatch. I've updated the arguments description.
wdl/ClusterBatch.wdl
Outdated
@@ -97,6 +97,7 @@ workflow ClusterBatch { | |||
ped_file=ped_file, | |||
script=ploidy_table_script, | |||
contig_list=contig_list, | |||
retain_female_chr_y=false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your PR note states that female chrY ploidy needs to be 1 during ClusterBatch but it looks like this would cause it to be 0. Should this be set to true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good catch
String? chr_x | ||
String? chr_y | ||
|
||
File? svtk_to_gatk_script # For debugging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to remove this in production or does it not matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to leave it in - this can be useful in case someone wants to substitute a different script for slightly different vcfs.
genotype['ECN'] = ploidy_dict[sample][contig] | ||
if scale_down_gq: | ||
rescale_gq(record) | ||
return record | ||
|
||
|
||
def _cache_gt_sum(gt): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't appear to be used anymore in this script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks deleted this and the one below it too
bnd_end_dict: Optional[Dict[Text, int]], | ||
ploidy_dict: Dict[Text, Dict[Text, int]]) -> pysam.VariantRecord: | ||
ploidy_dict: Dict[Text, Dict[Text, int]], | ||
scale_down_gq: bool) -> pysam.VariantRecord: | ||
""" | ||
Converts a record from svtk to gatk style. This includes updating all GT fields with proper ploidy, and adding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like there are updates to GT in this script anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
if svtype == 'DEL': | ||
new_genotype['CN'] = 1 | ||
record.ref = 'N' | ||
if svtype == 'BND': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GATK->SVTK script also looks for END2 for CTX, should this match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I don't think it would affect anything with where these scripts are used currently (no CTX in ClusterBatch), but I've added the CTX case here in case
inputs/values/ref_panel_1kg.json
Outdated
@@ -1168,6 +1168,7 @@ | |||
], | |||
"clean_vcf": "gs://gatk-sv-ref-panel-1kg/outputs/MakeCohortVcf/8a209488-c928-449d-92cd-0a5131e92b7c/call-CleanVcf/CleanVcf/277f3f25-bb99-4fe4-a48b-567fd3f344f9/call-ConcatCleanedVcfs/ref_panel_1kg.cleaned.vcf.gz", | |||
"clean_vcf_index": "gs://gatk-sv-ref-panel-1kg/outputs/MakeCohortVcf/8a209488-c928-449d-92cd-0a5131e92b7c/call-CleanVcf/CleanVcf/277f3f25-bb99-4fe4-a48b-567fd3f344f9/call-ConcatCleanedVcfs/ref_panel_1kg.cleaned.vcf.gz.tbi", | |||
"clean_vcf_gatk_formatter_args": "--use-end2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this clean VCF is from later than v0.22-beta, so based on your filtering Google Doc should this be ""
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a mistake in the doc, which I've now updated. I've also changed this argument to the opposite --fix-end
which is a little clearer and not required for current VCFs.
@VJalili @epiercehoffman Thanks for your reviews. I've responded to each comment individually. @epiercehoffman I also added gatk formatting to the end of CleanVcfChromosome to make future processing easier. I will still leave the FormatVcfForGatk wdl in place for use with old cleaned vcfs. |
Update gatk docker More updates Add hgdp resources Update hgdp resources to gs://gatk-sv-hgdp GATK nightly docker Set default records_per_shard in FormatVcfForGatk Update JoinRawCalls Add gatk_formatted_vcf to 1kg ref panel Remove remove_infos and remove_formats from PESRCluster preprocess task Add indexes to JoinRawCalls Update dockers Update 1kgp joined_raw_calls_vcf Fix PreparePESRVcfs Fix SVLEN filter in PreparePESRVcfs Bump default SVConcordance memory to 16GB
bb26cff
to
4ff8aab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for this great work, and thank you for addressing the comments!
Streamlines workflows for SVConcordance. The new ordering is as follows:
CleanVcf -> FormatVcfForGatk -> JoinRawCalls -> SVConcordance -> RecalibrateGq