Skip to content

Commit

Permalink
Create SNP workflow (#53)
Browse files Browse the repository at this point in the history
  • Loading branch information
rnmitchell authored Jul 21, 2023
1 parent 6730754 commit d0cd93d
Show file tree
Hide file tree
Showing 25 changed files with 71,487 additions and 156 deletions.
1 change: 1 addition & 0 deletions MANIFEST.in
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,4 @@ include lusSTR/tests/data/UAS_bulk_input/*
include lusSTR/tests/data/snps/*
include lusSTR/tests/data/RU_stutter_test/*
include lusSTR/tests/data/NGS_stutter_test/*
include lusSTR/tests/data/kinsnps/*
59 changes: 55 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,16 +27,18 @@ make devenv
## Usage

lusSTR accomodates three different input formats:
(1) UAS Sample Details Report and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)
(1) UAS Sample Details Report, UAS Sample Report, and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)
(2) STRait Razor output with one sample per file (a single file or directory containing multiple files)
(3) Sample(s) sequences in CSV format; first four columns must be Locus, NumReads, Sequence, SampleID; Optional last two columns can be Project and Analysis IDs.

*These individual sample files or directory of files must be specified in the config file (see below).*


lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow. *The SNP workflow is currently under construction.*
lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow.

The ```lusstr snps``` command invokes the SNP analysis workflow. Please see below for further information on processing SNP data.
___
### Creating the config file
### Creating the SNP config file

Running ```lusstr config``` creates a config file containing the default settings for the lusSTR STR analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:

Expand All @@ -57,6 +59,7 @@ data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)
info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)
separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)
nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)
strand: ```uas``` (uas/forward); indicates the strand orientation in which to report the sequence in the final output table for STRmix NGS only (indicate using ```--strand```)

One additional argument can be provided with ```lusstr config```:
```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.
Expand Down Expand Up @@ -175,8 +178,56 @@ ___

## SNP Data Processing

Currently under construction
lusSTR is able to process SNPs derived from the ForenSeq Signature Prep assay and the ForenSeq Kintelligence assay. SNPs from the ForenSeq Signature Prep assay could be analyzed using either the Verogen UAS or STRait Razor. SNPs from the ForenSeq Kintelligence assay must first be analyzed using the UAS.

___
### Creating the STR config file

Running ```lusstr config --snps``` creates a config file containing the default settings for the lusSTR SNP analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:


### general settings
uas: ```True``` (True/False); if ran through UAS (invoke ```--straitrazor``` flag if STRait Razor was used)
samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir```)
output: ```lusstr_output``` output file/directory name; (indicate using ```--out dir/sampleid e.g. --out test_030923```)
kit: ```sigprep``` (sigprep/kintelligence) (invoke using the ```--kintelligence``` flag if using Kintelligence data)

### format settings
types: ```all``` choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination (indicate using the ```--snp-type e.g. --snp-type i, p```)
nofilter: ```False``` (True/False); if no filtering is desired at the format step; if False, will remove any allele designated as Not Typed (invoke using the ```--nofiltering```)

### convert settings
strand: ```forward``` (forward/uas); indicates which orientation to report the alleles for the SigPrep SNPs; uas indicates the orientation as reported by the UAS or the forward strand
references: ## list IDs of the samples to be run as references in EFM; default is no reference samples
separate: false ## True/False; if want to separate samples into individual files for use in EFM
thresh: 0.03 ## Analytical threshold value


One additional argument can be provided with ```lusstr config```:
```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.



### general settings:
uas: ```True``` (True/False); if ran through UAS (invoke ```--straitrazor``` flag if STRait Razor was used)
sex: ```False``` (True/False); include sex-chromosome STRs (invoke ```--sex``` flag)
samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir``` )
output: ```lusstr_output``` output file/directory name (indicate using ```--out dir/sampleid e.g. --out test_030923```)

### convert settings
kit: ```forenseq``` (forenseq/powerseq) (invoke the ```--powerseq``` flag if using PowerSeq data)
nocombine: ```False``` (True/False); do not combine identical sequences during the ```convert``` step, if using STRait Razor data. (invoke the ```--nocombine``` flag)

### filter settings
output_type: ```strmix``` (strmix/efm) (invoke ```--efm``` flag if creating output for EuroForMix)
profile_type: ```evidence``` (evidence/reference) (invoke ```--reference``` flag if creating a reference output file)
data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)
info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)
separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)
nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)
strand: ```forward``` (uas/forward); indicates the strand orientation in which to report the alleles in the final output table (indicate using ```--strand```)

----


lusSTR is still under development and any suggestions/issues found are welcome!
71 changes: 65 additions & 6 deletions lusSTR/cli/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,45 @@

def main(args):
Path(args.workdir).mkdir(parents=True, exist_ok=True)
final_dest = f"{args.workdir}/config.yaml"
config = resource_filename("lusSTR", "data/config.yaml")
final_config = edit_config(config, args)
if args.snps:
final_dest = f"{args.workdir}/snp_config.yaml"
config = resource_filename("lusSTR", "data/snp_config.yaml")
final_config = edit_snp_config(config, args)
else:
final_dest = f"{args.workdir}/config.yaml"
config = resource_filename("lusSTR", "data/config.yaml")
final_config = edit_str_config(config, args)
with open(final_dest, "w") as file:
yaml.dump(final_config, file)

def edit_config(config, args):

def edit_snp_config(config, args):
with open(config, "r") as file:
data = yaml.safe_load(file)
if args.straitrazor:
data["uas"] = False
if args.input:
data["samp_input"] = args.input
if args.out:
data["output"] = args.out
if args.snptype:
data["types"] = args.snptype
if args.kintelligence:
data["kit"] = "kintelligence"
if args.separate:
data["separate"] = True
if args.nofiltering:
data["nofilter"] = True
if args.ref:
data["references"] = args.ref
else:
data["references"] = None
if args.strand:
data["strand"] = args.strand
return data


def edit_str_config(config, args):
with open(config, "r") as file:
data = yaml.safe_load(file)
if args.straitrazor:
Expand Down Expand Up @@ -55,6 +87,8 @@ def edit_config(config, args):
data["data_type"] = "ce"
if args.efm:
data["output_type"] = "efm"
if args.strand:
data["strand"] = args.strand
return data


Expand Down Expand Up @@ -86,7 +120,7 @@ def subparser(subparsers):
)
p.add_argument(
"--reference", action="store_true",
help="Use for creating Reference profiles"
help="Use for creating Reference profiles for STR workflow"
)
p.add_argument("--efm", action="store_true",help="Use to create EuroForMix profiles")
p.add_argument("--ce", action="store_true", help="Use for CE data")
Expand All @@ -100,5 +134,30 @@ def subparser(subparsers):
)
p.add_argument(
"--nofiltering", action="store_true",
help="Use to perform no filtering during the 'filter' step"
help="For STRs, use to perform no filtering during the 'filter' step. For SNPs, "
"only alleles specified as 'Typed' by the UAS will be included at the 'format' step."
)
p.add_argument(
"--snps", action="store_true",
help="Use to create a config file for the SNP workflow"
)
p.add_argument(
"--snp-type", default="all", dest="snptype",
help="Specify the type of SNPs to include in the final report. 'p' will include only the "
"Phenotype SNPs; 'a' will include only the Ancestry SNPs; 'i' will include only the "
"Identity SNPs; and 'all' will include all SNPs. More than one type can be specified (e.g. "
" 'p, a'). Default is all."
)
p.add_argument(
"--kintelligence", action="store_true",
help="Use if processing Kintelligence SNPs within a Kintellience Report(s)"
)
p.add_argument(
"--snp-reference", dest="ref",
help="Specify any references for SNP data for use in EFM."
)
p.add_argument(
"--strand", choices=["uas", "forward"],
help="Specify the strand orientation for the final output files. UAS orientation is "
"default for STRs; forward strand is default for SNPs."
)
27 changes: 19 additions & 8 deletions lusSTR/cli/snps.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2020, DHS.
# Copyright (c) 2023, DHS.
#
# This file is part of lusSTR (http://github.com/bioforensics/lusSTR) and is licensed under
# the BSD license: see LICENSE.txt.
Expand All @@ -10,16 +10,27 @@
# Development Center.
# -------------------------------------------------------------------------------------------------

import argparse
import lusSTR
import snakemake
from snakemake import snakemake

## placeholder until I update this

def main(args):
raise NotImplementedError('SNP workflow implementation pending')
pretarget = args.target if args.target != "all" else "convert"
workdir = args.workdir
result = snakemake(
lusSTR.snakefile(workflow="snps"), targets=[pretarget], workdir=workdir, verbose=True
)
if result is not True:
raise SystemError('Snakemake failed')

def subparser(subparsers):
p = subparsers.add_parser("snps", description="Running the entire STR pipeline (format, annotate and filter)")
p.add_argument("--config", default="config.yaml", help="config file used to identify settings.")
p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
p.add_argument("--skip-filter", dest="filter", action = "store_true", help="Skip filtering step")
p = subparsers.add_parser(
"snps", description="Running the SNP pipeline"
)
p.add_argument(
"target", choices=["format", "all"],
help="Steps to run. Specifying 'format' will run only 'format'. Specifying "
"'all' will run all steps of the SNP workflow ('format' and 'convert')."
)
p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
3 changes: 2 additions & 1 deletion lusSTR/data/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,5 @@ profile_type: "evidence" ## evidence/reference
data_type: "ngs" ## ce/ngs
info: True ## True/False; create allele information file
separate: False ##True/False; for EFM only, if True will create individual files for samples; if False, will create one file with all samples
nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
strand: uas ##uas/forward; strand orientation to report
18 changes: 18 additions & 0 deletions lusSTR/data/snp_config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
%YAML 1.2
---

## general settings
uas: True ## True/False; if ran through UAS
samp_input: "/path/to/input/directory/or/samples" ## input directory or sample; if not provided, will be cwd
output: "lusstr_output" ## output file/directory name; Example: "test_030923"
kit: "sigprep" ## sigprep/kintelligence

## format settings
types: "all" ## choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination
nofilter: False ## True/False if no filtering is desired; if False, will remove any allele designated as Not Typed

## convert settings
strand: "forward" ## forward/uas; indicates which oritentation to report the alleles for the ForenSeq SNPs; uas indicates the orientation as reported by the UAS or the forward strand
references: "" ## list IDs of the samples to be run as references in EFM
separate: false ## True/False; if want to separate samples into individual files for use in EFM
thresh: 0.03 ## Analytical threshold value
35 changes: 24 additions & 11 deletions lusSTR/scripts/filter_settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,17 @@ def get_filter_metadata_file():
filter_marker_data = json.load(fh)


def filters(locus_allele_info, locus, locus_reads, datatype):
def filters(locus_allele_info, locus, locus_reads, datatype, brack_col):
metadata = filter_marker_data[locus]
if len(locus_allele_info) == 1:
locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info)
else:
locus_allele_info, locus_reads = multiple_allele_thresholds(
metadata, locus_reads, locus_allele_info
)
locus_allele_info = ce_filtering(locus_allele_info, locus_reads, metadata, datatype)
locus_allele_info = ce_filtering(
locus_allele_info, locus_reads, metadata, datatype, brack_col
)
if datatype == "ngs":
locus_allele_info = same_size_filter(locus_allele_info, metadata)
return locus_allele_info
Expand Down Expand Up @@ -98,7 +100,7 @@ def thresholds(filter, metadata, locus_reads, quest_al_reads):
return locus_reads, True


def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
def ce_filtering(locus_allele_info, locus_reads, metadata, datatype, brack_col):
for i in range(len(locus_allele_info)): # check for stutter alleles
if locus_allele_info.loc[i, "allele_type"] != "real_allele":
continue
Expand All @@ -111,7 +113,14 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
if init_type_all == "BelowAT":
continue
locus_allele_info = allele_ident(
locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype
locus_allele_info,
init_type_all,
metadata,
ref_allele_reads,
i,
j,
datatype,
brack_col,
)
for j in range(len(locus_allele_info)):
type_all = locus_allele_info.loc[j, "allele_type"]
Expand All @@ -133,13 +142,15 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
return locus_allele_info


def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype):
def allele_ident(
locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype, brack_col
):
quest_al_reads = locus_allele_info.loc[j, "Reads"]
ref_allele = float(locus_allele_info.loc[i, "CE_Allele"])
question_allele = float(locus_allele_info.loc[j, "CE_Allele"])
if datatype == "ngs":
ref_bracket = locus_allele_info.loc[i, "UAS_Output_Bracketed_Notation"]
question_bracket = locus_allele_info.loc[j, "UAS_Output_Bracketed_Notation"]
ref_bracket = locus_allele_info.loc[i, brack_col]
question_bracket = locus_allele_info.loc[j, brack_col]
else:
ref_bracket = None
question_bracket = None
Expand All @@ -155,6 +166,7 @@ def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i
ref_bracket,
question_bracket,
datatype,
brack_col,
)
if "stutter" in locus_allele_info.loc[j, "allele_type"]:
if "/" in locus_allele_info.loc[j, "allele_type"] and pd.isnull(
Expand Down Expand Up @@ -254,6 +266,7 @@ def allele_type(
ref_bracket,
question_bracket,
datatype,
brack_col,
):
stutter_thresh = metadata["StutterThresholdDynamicPercent"]
forward_thresh = metadata["StutterForwardThresholdDynamicPercent"]
Expand All @@ -275,11 +288,11 @@ def allele_type(
)
elif allele_diff == 2 and ref_reads > quest_al_reads: # -2 stutter
allele = ce if datatype == "ce" else question_bracket
if check_2stutter(all_type_df, datatype, allele)[0] is True:
if check_2stutter(all_type_df, datatype, allele, brack_col)[0] is True:
if (
datatype == "ngs" and bracketed_stutter_id(ref_bracket, question_bracket, -2) == -2
) or datatype == "ce":
ref_reads = check_2stutter(all_type_df, datatype, allele)[1]
ref_reads = check_2stutter(all_type_df, datatype, allele, brack_col)[1]
stutter_thresh_reads = stutter_thresh * ref_reads
all_type, stut_perc = minus2_stutter(
all_type,
Expand Down Expand Up @@ -318,7 +331,7 @@ def forward_stut_thresh(perc, perc_stut, reads):
return forward_thresh


def check_2stutter(stutter_df, allele_des, allele):
def check_2stutter(stutter_df, allele_des, allele, brack_col):
is_true, reads = False, None
if "-1_stutter" in stutter_df.loc[:, "allele_type"].values:
if allele_des == "ce":
Expand All @@ -329,7 +342,7 @@ def check_2stutter(stutter_df, allele_des, allele):
break
else:
for k, row in stutter_df.iterrows():
bracket_test = stutter_df.loc[k, "UAS_Output_Bracketed_Notation"]
bracket_test = stutter_df.loc[k, brack_col]
if bracketed_stutter_id(bracket_test, allele, -1) == -1:
is_true, reads = True, stutter_df.loc[k, "Reads"]
return is_true, reads
Expand Down
Loading

0 comments on commit d0cd93d

Please sign in to comment.