-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PED file validation #587
PED file validation #587
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this available! I just have some minor comments.
wdl/Utils.wdl
Outdated
|
||
set -euo pipefail | ||
python opt/sv-pipeline/scripts/validate_ped.py -p ~{ped_file} -s {sample_list} | ||
# no outputs - task will either succeed or fail with details in log file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware you could do this, but I think it would be better to have either a dummy empty output or to just output the validated ped file. That way we can force Cromwell to wait until the validation is complete before moving on to tasks that consume the ped file.
# validate IDs | ||
# don't check for duplicates here: | ||
# family and parent IDs may appear multiple times, and sample IDs checked elsewhere | ||
for identifier, id_type in zip(fields[:FIELD_NUMBER_SEX], ["family", "sample", "parent", "parent"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but you could define the id_type strings as constants at the top of the script (e.g. as you've done with FIELD_NUMBER_SEX
)
elif not sex.isdigit(): | ||
raise ValueError(f"Sample {sample_id} has an invalid value for sex: {sex}. " + | ||
"PED file must use integer values for sex: 1=Male, 2=Female, 0=Unknown/Other.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be more strict and require that it be 0, 1, or 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way is fine with me. I did it this way because it will only fail in CleanVcfPart1 if it's not an integer, and I wasn't sure if some groups might want to encode different categories of "other" in the sex column. But it does seem simpler to just enforce 0,1,2
wdl/GatherBatchEvidence.wdl
Outdated
call util.ValidatePedFile { | ||
input: | ||
ped_file = ped_file, | ||
sample_list = write_lines(samples_batch), | ||
sv_pipeline_docker = sv_pipeline_docker, | ||
runtime_attr_override = runtime_attr_validate_ped | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would good to require this passes before attempting to run cnmops, otherwise you would get two errors with a bad ped file (1 from validation and 1 from cnmops, which could be confusing). If you let the validated ped file be the output, you can just wire that up to the ped file subset task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
Show resolved
Hide resolved
Updated based on review, re-tested locally, built docker and tested in GatherBatchEvidence with correct & incorrect PED file. Note that an incorrect PED file does not result in immediate workflow failure because many steps do not take the PED file as input. |
GATK-SV requires a slightly stricter PED file format than the one described in the linked PED file format documentation. Users are often confused about the PED file format, and error messages can be difficult to interpret. In this PR, I seek to clarify the PED file format requirements and add a script for validation to be used both in the pipeline and by users.
Updates
Testing