Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PED file validation #587

Merged
merged 14 commits into from
Aug 29, 2023
Merged

PED file validation #587

merged 14 commits into from
Aug 29, 2023

Conversation

epiercehoffman
Copy link
Collaborator

GATK-SV requires a slightly stricter PED file format than the one described in the linked PED file format documentation. Users are often confused about the PED file format, and error messages can be difficult to interpret. In this PR, I seek to clarify the PED file format requirements and add a script for validation to be used both in the pipeline and by users.

Updates

  • Add script to validate a PED file for use with GATK-SV
  • Add task in GatherBatchEvidence to run the PED file validation script
  • Update PED file documentation to clarify additional format requirements and link to validation script

Testing

  • Validated GatherBatchEvidence WDL & JSON with womtool
  • Tested validate_ped.py locally with a variety of correct and incorrect inputs including with/without a header line (ok), duplicate sample IDs (in PED file or sample list), missing samples, extra samples (ok), invalid sample IDs (in PED file or sample list), missing columns, non-integer sex values, space delimited file, and invalid family/parent IDs.
  • Did not yet rebuild docker & test in WDL - I will do so after feedback on the script to reduce the number of times the docker needs to be rebuilt

Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this available! I just have some minor comments.

wdl/Utils.wdl Outdated

set -euo pipefail
python opt/sv-pipeline/scripts/validate_ped.py -p ~{ped_file} -s {sample_list}
# no outputs - task will either succeed or fail with details in log file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware you could do this, but I think it would be better to have either a dummy empty output or to just output the validated ped file. That way we can force Cromwell to wait until the validation is complete before moving on to tasks that consume the ped file.

# validate IDs
# don't check for duplicates here:
# family and parent IDs may appear multiple times, and sample IDs checked elsewhere
for identifier, id_type in zip(fields[:FIELD_NUMBER_SEX], ["family", "sample", "parent", "parent"]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but you could define the id_type strings as constants at the top of the script (e.g. as you've done with FIELD_NUMBER_SEX)

Comment on lines 98 to 100
elif not sex.isdigit():
raise ValueError(f"Sample {sample_id} has an invalid value for sex: {sex}. " +
"PED file must use integer values for sex: 1=Male, 2=Female, 0=Unknown/Other.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be more strict and require that it be 0, 1, or 2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way is fine with me. I did it this way because it will only fail in CleanVcfPart1 if it's not an integer, and I wasn't sure if some groups might want to encode different categories of "other" in the sex column. But it does seem simpler to just enforce 0,1,2

Comment on lines 218 to 224
call util.ValidatePedFile {
input:
ped_file = ped_file,
sample_list = write_lines(samples_batch),
sv_pipeline_docker = sv_pipeline_docker,
runtime_attr_override = runtime_attr_validate_ped
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would good to require this passes before attempting to run cnmops, otherwise you would get two errors with a bad ped file (1 from validation and 1 from cnmops, which could be confusing). If you let the validated ped file be the output, you can just wire that up to the ped file subset task.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@epiercehoffman
Copy link
Collaborator Author

Updated based on review, re-tested locally, built docker and tested in GatherBatchEvidence with correct & incorrect PED file. Note that an incorrect PED file does not result in immediate workflow failure because many steps do not take the PED file as input.

@epiercehoffman epiercehoffman merged commit b421e41 into main Aug 29, 2023
7 checks passed
@epiercehoffman epiercehoffman deleted the eph_validate_ped branch August 29, 2023 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants