Skip to content

Commit

Permalink
PED file validation (#587)
Browse files Browse the repository at this point in the history
  • Loading branch information
epiercehoffman authored Aug 29, 2023
1 parent 3346bd1 commit b421e41
Show file tree
Hide file tree
Showing 6 changed files with 215 additions and 6 deletions.
11 changes: 10 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,16 @@ We still encourage members of the community to adapt GATK-SV for non-GCP backend

### Data:
* Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with [bwa-mem](https://github.com/lh3/bwa). BAMs must also be indexed.
* Family structure definitions file in [PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Sex aneuploidies (detected in [EvidenceQC](#evidence-qc)) should be entered as sex = 0.
* Family structure definitions file in [PED format](#ped-format).

#### <a name="ped-format">PED file format</a>
The PED file format is described [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Note that GATK-SV imposes additional requirements:
* The file must be tab-delimited.
* The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown. Sex chromosome aneuploidies (detected in [EvidenceQC](#evidence-qc)) should be entered as sex = 0.
* All family, individual, and parental IDs must conform to the [sample ID requirements](#sampleids).
* Missing parental IDs should be entered as 0.
* Header lines are allowed if they begin with a # character.
To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.

#### <a name="sample-exclusion">Sample Exclusion</a>
We recommend filtering out samples with a high percentage of improperly paired reads (>10% or an outlier for your data) as technical outliers prior to running [GatherSampleEvidence](#gather-sample-evidence). A high percentage of improperly paired reads may indicate issues with library prep, degradation, or contamination. Artifactual improperly paired reads could cause incorrect SV calls, and these samples have been observed to have longer runtimes and higher compute costs for [GatherSampleEvidence](#gather-sample-evidence).
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ The following cohort-level or batch-level inputs are also required:
|---------|--------|--------------|
|`String`|`sample_set_id`|Batch identifier|
|`String`|`sample_set_set_id`|Cohort identifier|
|`File`|`cohort_ped_file`|Path to the GCS location of a family structure definitions file in [PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Sex aneuploidies (detected in `02-EvidenceQC`) should be entered as sex = 0.|
|`File`|`cohort_ped_file`|Path to the GCS location of a family structure definitions file in [PED format](https://github.com/broadinstitute/gatk-sv#ped-format).|

### Pipeline outputs

Expand Down
142 changes: 142 additions & 0 deletions src/sv-pipeline/scripts/validate_ped.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
#!/bin/python

import argparse
import re


"""
Summary: Validates a PED file for use with GATK-SV. Performs some sample ID validations as well.
Usage: python validate_ped.py -p pedigree.ped -s samples.list
Outputs: The script will write to stdout "PED file passes validation!" if successful, and
otherwise it will print an error message to stderr describing the PED file format violation.
"""


FIELD_NUMBER_ID = 1
FIELD_NUMBER_SEX = 4
ILLEGAL_ID_SUBSTRINGS = ["chr", "name", "DEL", "DUP", "CPX", "CHROM"]
ID_TYPE_SAMPLE, ID_TYPE_FAMILY, ID_TYPE_PARENT = "sample", "family", "parent"


def validate_id(identifier, id_set, id_type, source_file):
# check for empty IDs
if identifier is None or identifier == "":
raise ValueError(f"Empty {id_type} ID in {source_file}")

# check all characters are alphanumeric or underscore
if not re.match(r'^\w+$', identifier):
raise ValueError(f"Invalid {id_type} ID in {source_file}: '{identifier}'. " +
"IDs should only contain alphanumeric and underscore characters.")

# check for all-numeric IDs
# except: allow maternal and paternal ID to be 0
# except: allow all-numeric family IDs (as in current ref panel PED file)
if id_type != ID_TYPE_FAMILY and not (id_type == ID_TYPE_PARENT and identifier == "0") and identifier.isdigit():
raise ValueError(f"Invalid {id_type} ID in {source_file}: {identifier}. " +
"IDs should not contain only numeric characters.")

# check for illegal substrings
for sub in ILLEGAL_ID_SUBSTRINGS:
if sub in identifier:
raise ValueError(f"Invalid {id_type} ID in {source_file}: {identifier}. " +
f"IDs cannot contain the following substrings: {', '.join(ILLEGAL_ID_SUBSTRINGS)}.")

# check for duplicate IDs
if id_set is not None:
if identifier in id_set:
raise ValueError(f"Duplicate {id_type} ID in {source_file}: {identifier}")


def get_samples(samples_file):
# Note that this sample ID validation is incomplete:
# * It does not check if sample IDs are substrings of other sample IDs
# * It only checks the provided sample list for duplicates, which is not the full cohort in GatherBatchEvidence
samples = set()
with open(samples_file, 'r') as samp:
for line in samp:
sample = line.strip()
validate_id(sample, samples, ID_TYPE_SAMPLE, "sample list")
samples.add(sample)

if len(samples) < 1:
raise ValueError("Empty samples list provided")

return samples


def validate_ped(ped_file, samples):
seen_sex_1 = False
seen_sex_2 = False
samples_found = set()
with open(ped_file, 'r') as ped:
for line in ped:
# specification allows commented lines, which should be removed by SubsetPedFile and CleanVcfPart1
if line.startswith("#"):
continue

# we require tab-delimited PED files although the specification does not
fields = line.strip().split("\t")
if len(fields) != 6:
raise ValueError("Invalid PED file. PED file must be tab-delimited and have 6 columns: " +
"family_ID, sample_ID, paternal_ID, maternal_ID, sex, phenotype.")

# validate IDs
# don't check for duplicates here:
# family and parent IDs may appear multiple times, and sample IDs checked elsewhere
for identifier, id_type in zip(fields[:FIELD_NUMBER_SEX],
[ID_TYPE_FAMILY, ID_TYPE_SAMPLE, ID_TYPE_PARENT, ID_TYPE_PARENT]):
validate_id(identifier, None, id_type, "PED file")

# check for at least one appearance each of 1 and 2 in sex column
# this check is to ensure M/F are coded as 1 and 2 (not 0 and 1)
# and both M and F are present (for sex-specific steps)
sample_id = fields[FIELD_NUMBER_ID]
sex = fields[FIELD_NUMBER_SEX]
if sex == "1":
seen_sex_1 = True
elif sex == "2":
seen_sex_2 = True

# require sex = 0, 1, or 2
elif sex != "0":
raise ValueError(f"Sample {sample_id} has an invalid value for sex: {sex}. " +
"PED file must use the following values for sex: 1=Male, 2=Female, 0=Unknown/Other.")

# check all samples in samples list are present in PED file exactly once
# no duplicate sample IDs
if sample_id in samples_found:
raise ValueError(f"Invalid PED file. PED file has duplicate entries for sample {sample_id}.")
elif sample_id in samples:
samples_found.add(sample_id)
# if not in samples list, ignore: ok for PED file to contain samples not in sample list

# check if all samples in sample list are present in PED file
if len(samples_found) < len(samples):
missing = samples - samples_found
raise ValueError(f"Invalid PED file. PED file is missing sample(s): {','.join(missing)}.")

if not (seen_sex_2 and seen_sex_1):
raise ValueError("Invalid PED file. PED file must use the following values for sex: " +
"1=Male, 2=Female, 0=Unknown/Other. PED file must contain at least " +
"one sample with sex=1 and one with sex=2.")

# passed validation tests
print("PED file passes validation!")


def main():
parser = argparse.ArgumentParser()

parser.add_argument("-p", "--ped-file", required=True, help="PED file to validate")
parser.add_argument("-s", "--samples-file", required=True,
help="File containing samples (one per line) that should be in PED file")
args = parser.parse_args()

samples = get_samples(args.samples_file)
validate_ped(args.ped_file, samples)


if __name__ == "__main__":
main()
11 changes: 10 additions & 1 deletion wdl/GatherBatchEvidence.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -161,6 +161,7 @@ workflow GatherBatchEvidence {
RuntimeAttr? ploidy_score_runtime_attr
RuntimeAttr? ploidy_build_runtime_attr
RuntimeAttr? runtime_attr_subset_ped
RuntimeAttr? runtime_attr_validate_ped
RuntimeAttr? add_sample_to_ped_runtime_attr
RuntimeAttr? condense_counts_runtime_attr
RuntimeAttr? preprocess_calls_runtime_attr
Expand Down Expand Up @@ -214,10 +215,18 @@ workflow GatherBatchEvidence {
}
Array[String] samples_batch = select_first([ref_panel_samples, samples])
call util.SubsetPedFile {
call util.ValidatePedFile {
input:
ped_file = ped_file,
sample_list = write_lines(samples_batch),
sv_pipeline_docker = sv_pipeline_docker,
runtime_attr_override = runtime_attr_validate_ped
}
call util.SubsetPedFile {
input:
ped_file = ValidatePedFile.output_ped,
sample_list = write_lines(samples_batch),
subset_name = batch,
sv_base_mini_docker = sv_base_mini_docker,
runtime_attr_override = runtime_attr_subset_ped
Expand Down
41 changes: 41 additions & 0 deletions wdl/Utils.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -400,6 +400,47 @@ task SubsetPedFile {
}
task ValidatePedFile {
input {
File ped_file
File sample_list
String sv_pipeline_docker
RuntimeAttr? runtime_attr_override
}
RuntimeAttr default_attr = object {
cpu_cores: 1,
mem_gb: 3.75,
disk_gb: 10,
boot_disk_gb: 10,
preemptible_tries: 3,
max_retries: 1
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
command <<<
set -euo pipefail
python /opt/sv-pipeline/scripts/validate_ped.py -p ~{ped_file} -s ~{sample_list}
>>>
output {
File output_ped = ped_file
}

runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
docker: sv_pipeline_docker
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
}
}
task LocalizeCloudFileWithCredentials {
input {
String cloud_file_path
Expand Down
14 changes: 11 additions & 3 deletions website/docs/gs/input_files.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,16 @@ GATK-SV requires the following input data:
BAMs must also be indexed.

- Family structure definitions file in
[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format).
**Note**: Sex aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
[PED format](/docs/gs/inputs#ped-format).

### PED file format {#ped-format}
The PED file format is described [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Note that GATK-SV imposes additional requirements:
* The file must be tab-delimited.
* The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown. Sex chromosome aneuploidies (detected in [EvidenceQC](/docs/modules/eqc)) should be entered as sex = 0.
* All family, individual, and parental IDs must conform to the [sample ID requirements](/docs/gs/inputs#sampleids).
* Missing parental IDs should be entered as 0.
* Header lines are allowed if they begin with a # character.
To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.

### Sample Exclusion
We recommend filtering out samples with a high percentage
Expand All @@ -25,7 +33,7 @@ these samples have been observed to have longer runtimes and
higher compute costs for [GatherSampleEvidence](/docs/modules/gse).


### Sample ID requirements
### Sample ID requirements {#sampleids}
#### Sample IDs must

- Be unique within the cohort
Expand Down

0 comments on commit b421e41

Please sign in to comment.