Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SA support and consistent naming for all GVS WDLs #7205

Merged
merged 38 commits into from
Apr 16, 2021

Conversation

mmorgantaylor
Copy link
Member

@mmorgantaylor mmorgantaylor commented Apr 15, 2021

  • GvsImportGenomes.wdl - renamed WDL, fixed issue with poorly returned bq load string
  • GvsCreateFilterSet.wdl - renamed WDL, added SA support, fixed a lot of issues in ExtractFeatures tool surrounding permissions - BQ projectIDs are now being properly passed through, freq_table UDF defined in repo rather than in BQ
  • GvsPrepareCallset.wdl (done in previous PR)
  • GvsExtractCallset.wdl - renamed WDL, added SA support
  • SA testing README added

all 4 tested with SA in this Terra workspace: https://app.terra.bio/#workspaces/broad-dsp-spec-ops-fc/gvs_sa_testing
testing also without SA in this workspace: https://app.terra.bio/#workspaces/broad-dsp-spec-ops-fc/gvs_testing_no_sa

filters:
branches:
- master
- ah_var_store
- name: funcotator
- mmt_SA_support_and_wdl_renaming
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove branch references before merging

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually will wait for a future PR so it doesn't break for anyone using these WDLs

@gatk-bot
Copy link

gatk-bot commented Apr 15, 2021

Travis reported job failures from build 33793
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33793.1 logs
cloud openjdk11 33793.14 logs

Copy link
Contributor

@kcibul kcibul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- very thorough

scripts/variantstore/wdl/GvsCreateFilterSet.wdl Outdated Show resolved Hide resolved
--local-sort-max-records-in-ram ~{local_sort_max_records_in_ram} \
--sample-table ~{fq_sample_table} \
--cohort-extract-table ~{fq_cohort_extract_table} \
-L ~{intervals} \
~{"-XL " + excluded_intervals} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@mmorgantaylor mmorgantaylor merged commit 3ca3772 into ah_var_store Apr 16, 2021
@mmorgantaylor mmorgantaylor deleted the mmt_SA_support_and_wdl_renaming branch April 16, 2021 18:57
Copy link
Contributor

@ahaessly ahaessly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
just a few questions and items for discussion

df -h

gatk --java-options "-Xmx9g" \
ExtractCohort \
--mode GENOMES --ref-version 38 --query-mode LOCAL_SORT \
-R "~{reference}" \
-O local.vcf.gz \
-O ~{output_file} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious about the difference between using double quotes in the line above and this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it doesn't matter - the same way you can enter command line string arguments either with or without quotes. but being consistent would be good!

"GvsExtractCallset.reference_index": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"GvsExtractCallset.reference_dict": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
"GvsExtractCallset.wgs_intervals": "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list",
"GvsExtractCallset.scatter_count": 50,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the filter extract uses a scatter count of 20. is the difference on purpose?

Copy link
Member Author

@mmorgantaylor mmorgantaylor Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the difference was not on purpose! 20 and 50 were defaults from what i found and left unchanged. would love to think deliberately about what these should be.

@@ -0,0 +1,17 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name of the wdl is not plural. we should probably rename to GvsExtractCallset.example.inputs.json

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch! going to make a separate PR to fix this.

@@ -348,7 +349,9 @@ task CreateImportTsvs {
~{for_testing_only}

if [ ~{has_service_account_file} = 'true' ]; then
export GOOGLE_APPLICATION_CREDENTIALS=~{service_account_json}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i briefly saw something about adding this on one of the slack channels. I'd like to have a brief discussion so we can set up some conventions. Doing this has an impact on being able to stream inputs. Briefly if the tool runs as the service account, it should be able to stream files in service account buckets, but it can't stream files from the workspace bucket. when we don't set the environment variable, we specifically set localization_optional to true for the files in the service account buckets and then copy them to the vm after auth-ing as the service account. if we are going to set the env var, we should test that it can stream from the service account bucket and then not do the explicit copy. (and we also need to make sure that it is not trying to stream anything from the workspace bucket)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for bringing this up, I wondered about this as well - I did test this WDL with this change and using a SA and it worked fine - https://app.terra.bio/#workspaces/broad-dsp-spec-ops-fc/gvs_sa_testing/job_history/78d0610c-256a-46e4-9f81-fbc10cc2be06 - that had the gvcfs in a separate (non-workspace) bucket, but I did keep your gsutil cp step in there, which is likely why we didn't encounter a streaming issue?

This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants