Create SNP workflow (#53)

bioforensics · Jul 21, 2023 · d0cd93d · d0cd93d
1 parent 6730754
commit d0cd93d
Show file tree

Hide file tree

Showing 25 changed files with 71,487 additions and 156 deletions.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -12,3 +12,4 @@ include lusSTR/tests/data/UAS_bulk_input/*
 include lusSTR/tests/data/snps/*
 include lusSTR/tests/data/RU_stutter_test/*
 include lusSTR/tests/data/NGS_stutter_test/*
+include lusSTR/tests/data/kinsnps/*
diff --git a/README.md b/README.md
@@ -27,16 +27,18 @@ make devenv
 ## Usage
 
 lusSTR accomodates three different input formats:  
-(1) UAS Sample Details Report and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)  
+(1) UAS Sample Details Report, UAS Sample Report, and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)  
 (2) STRait Razor output with one sample per file (a single file or directory containing multiple files)  
 (3) Sample(s) sequences in CSV format; first four columns must be Locus, NumReads, Sequence, SampleID; Optional last two columns can be Project and Analysis IDs.  
 
 *These individual sample files or directory of files must be specified in the config file (see below).*
 
 
-lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow. *The SNP workflow is currently under construction.*
+lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow.  
+
+The ```lusstr snps``` command invokes the SNP analysis workflow. Please see below for further information on processing SNP data.
 ___
-### Creating the config file
+### Creating the SNP config file
 
 Running ```lusstr config``` creates a config file containing the default settings for the lusSTR STR analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:
 
@@ -57,6 +59,7 @@ data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)
 info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)  
 separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)  
 nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)  
+strand: ```uas``` (uas/forward); indicates the strand orientation in which to report the sequence in the final output table for STRmix NGS only (indicate using ```--strand```)
 
 One additional argument can be provided with ```lusstr config```:  
 ```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.
@@ -175,8 +178,56 @@ ___
 
  ## SNP Data Processing
 
-Currently under construction
+lusSTR is able to process SNPs derived from the ForenSeq Signature Prep assay and the ForenSeq Kintelligence assay. SNPs from the ForenSeq Signature Prep assay could be analyzed using either the Verogen UAS or STRait Razor. SNPs from the ForenSeq Kintelligence assay must first be analyzed using the UAS.
+
+___
+### Creating the STR config file
+
+Running ```lusstr config --snps``` creates a config file containing the default settings for the lusSTR SNP analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:  
+
+
+### general settings  
+uas: ```True```  (True/False); if ran through UAS  (invoke ```--straitrazor``` flag if STRait Razor was used)  
+samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir```)  
+output: ```lusstr_output``` output file/directory name; (indicate using ```--out dir/sampleid e.g. --out test_030923```)   
+kit: ```sigprep``` (sigprep/kintelligence) (invoke using the ```--kintelligence``` flag if using Kintelligence data)  
+
+### format settings  
+types: ```all``` choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination (indicate using the ```--snp-type e.g. --snp-type i, p```)  
+nofilter: ```False``` (True/False); if no filtering is desired at the format step; if False, will remove any allele designated as Not Typed (invoke using the ```--nofiltering```)  
+
+### convert settings  
+strand: ```forward``` (forward/uas); indicates which orientation to report the alleles for the SigPrep SNPs; uas indicates the orientation as reported by the UAS or the forward strand
+references:  ## list IDs of the samples to be run as references in EFM; default is no reference samples  
+separate: false ## True/False; if want to separate samples into individual files for use in EFM  
+thresh: 0.03 ## Analytical threshold value    
+
+
+One additional argument can be provided with ```lusstr config```:  
+```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.
+
+
+
+### general settings:  
+uas: ```True``` (True/False); if ran through UAS (invoke ```--straitrazor``` flag if STRait Razor was used)  
+sex: ```False``` (True/False); include sex-chromosome STRs (invoke ```--sex``` flag)  
+samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir``` )  
+output: ```lusstr_output``` output file/directory name (indicate using ```--out dir/sampleid e.g. --out test_030923```)
+
+### convert settings  
+kit: ```forenseq``` (forenseq/powerseq) (invoke the ```--powerseq``` flag if using PowerSeq data)  
+nocombine: ```False``` (True/False); do not combine identical sequences during the ```convert``` step, if using STRait Razor data. (invoke the ```--nocombine``` flag)  
+
+### filter settings  
+output_type: ```strmix``` (strmix/efm) (invoke ```--efm``` flag if creating output for EuroForMix)  
+profile_type: ```evidence``` (evidence/reference) (invoke ```--reference``` flag if creating a reference output file)  
+data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)  
+info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)  
+separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)  
+nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)  
+strand: ```forward``` (uas/forward); indicates the strand orientation in which to report the alleles in the final output table (indicate using ```--strand```)
 
 ----
 
+
 lusSTR is still under development and any suggestions/issues found are welcome!
diff --git a/lusSTR/cli/config.py b/lusSTR/cli/config.py
@@ -20,13 +20,45 @@
 
 def main(args):
     Path(args.workdir).mkdir(parents=True, exist_ok=True)
-    final_dest = f"{args.workdir}/config.yaml"
-    config = resource_filename("lusSTR", "data/config.yaml")
-    final_config = edit_config(config, args)
+    if args.snps:
+        final_dest = f"{args.workdir}/snp_config.yaml"
+        config = resource_filename("lusSTR", "data/snp_config.yaml")
+        final_config = edit_snp_config(config, args)
+    else:
+        final_dest = f"{args.workdir}/config.yaml"
+        config = resource_filename("lusSTR", "data/config.yaml")
+        final_config = edit_str_config(config, args)
     with open(final_dest, "w") as file:
         yaml.dump(final_config, file)
 
-def edit_config(config, args):
+
+def edit_snp_config(config, args):
+    with open(config, "r") as file:
+        data = yaml.safe_load(file)
+        if args.straitrazor:
+            data["uas"] = False
+        if args.input:
+            data["samp_input"] = args.input
+        if args.out:
+            data["output"] = args.out
+        if args.snptype:
+            data["types"] = args.snptype
+        if args.kintelligence:
+            data["kit"] = "kintelligence"
+        if args.separate:
+            data["separate"] = True
+        if args.nofiltering:
+            data["nofilter"] = True
+        if args.ref:
+            data["references"] = args.ref
+        else:
+            data["references"] = None
+        if args.strand:
+            data["strand"] = args.strand
+        return data
+
+
+def edit_str_config(config, args):
     with open(config, "r") as file:
         data = yaml.safe_load(file)
     if args.straitrazor:
@@ -55,6 +87,8 @@ def edit_config(config, args):
         data["data_type"] = "ce"
     if args.efm:
         data["output_type"] = "efm"
+    if args.strand:
+        data["strand"] = args.strand
     return data
 
 
@@ -86,7 +120,7 @@ def subparser(subparsers):
     )
     p.add_argument(
         "--reference", action="store_true", 
-        help="Use for creating Reference profiles"
+        help="Use for creating Reference profiles for STR workflow"
     )
     p.add_argument("--efm", action="store_true",help="Use to create EuroForMix profiles")
     p.add_argument("--ce", action="store_true", help="Use for CE data")
@@ -100,5 +134,30 @@ def subparser(subparsers):
     )
     p.add_argument(
         "--nofiltering", action="store_true", 
-        help="Use to perform no filtering during the 'filter' step"
+        help="For STRs, use to perform no filtering during the 'filter' step. For SNPs, "
+        "only alleles specified as 'Typed' by the UAS will be included at the 'format' step."
+    )
+    p.add_argument(
+        "--snps", action="store_true",
+        help="Use to create a config file for the SNP workflow"
+    )
+    p.add_argument(
+        "--snp-type", default="all", dest="snptype",
+        help="Specify the type of SNPs to include in the final report. 'p' will include only the "
+        "Phenotype SNPs; 'a' will include only the Ancestry SNPs; 'i' will include only the "
+        "Identity SNPs; and 'all' will include all SNPs. More than one type can be specified (e.g. "
+        " 'p, a'). Default is all."
+    )
+    p.add_argument(
+        "--kintelligence", action="store_true",
+        help="Use if processing Kintelligence SNPs within a Kintellience Report(s)"
+    )
+    p.add_argument(
+        "--snp-reference", dest="ref",
+        help="Specify any references for SNP data for use in EFM."
+    )
+    p.add_argument(
+        "--strand", choices=["uas", "forward"],
+        help="Specify the strand orientation for the final output files. UAS orientation is "
+        "default for STRs; forward strand is default for SNPs."
     )
diff --git a/lusSTR/cli/snps.py b/lusSTR/cli/snps.py
@@ -1,5 +1,5 @@
 # -------------------------------------------------------------------------------------------------
-# Copyright (c) 2020, DHS.
+# Copyright (c) 2023, DHS.
 #
 # This file is part of lusSTR (http://github.com/bioforensics/lusSTR) and is licensed under
 # the BSD license: see LICENSE.txt.
@@ -10,16 +10,27 @@
 # Development Center.
 # -------------------------------------------------------------------------------------------------
 
+import argparse
 import lusSTR
-import snakemake
+from snakemake import snakemake
 
-## placeholder until I update this
 
 def main(args):
-    raise NotImplementedError('SNP workflow implementation pending')
+    pretarget = args.target if args.target != "all" else "convert"
+    workdir = args.workdir
+    result = snakemake(
+        lusSTR.snakefile(workflow="snps"), targets=[pretarget], workdir=workdir, verbose=True
+    )
+    if result is not True:
+        raise SystemError('Snakemake failed')
 
 def subparser(subparsers):
-    p = subparsers.add_parser("snps", description="Running the entire STR pipeline (format, annotate and filter)")
-    p.add_argument("--config", default="config.yaml", help="config file used to identify settings.")
-    p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
-    p.add_argument("--skip-filter", dest="filter", action = "store_true", help="Skip filtering step")
+    p = subparsers.add_parser(
+        "snps", description="Running the SNP pipeline"
+    )
+    p.add_argument(
+        "target", choices=["format", "all"], 
+        help="Steps to run. Specifying 'format' will run only 'format'. Specifying "
+        "'all' will run all steps of the SNP workflow ('format' and 'convert')."
+    )
+    p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
diff --git a/lusSTR/data/config.yaml b/lusSTR/data/config.yaml
@@ -17,4 +17,5 @@ profile_type: "evidence" ## evidence/reference
 data_type: "ngs" ## ce/ngs
 info: True ## True/False; create allele information file
 separate: False ##True/False; for EFM only, if True will create individual files for samples; if False, will create one file with all samples
-nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
+nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
+strand: uas ##uas/forward; strand orientation to report
diff --git a/lusSTR/data/snp_config.yaml b/lusSTR/data/snp_config.yaml
@@ -0,0 +1,18 @@
+%YAML 1.2
+---
+
+## general settings
+uas: True  ## True/False; if ran through UAS
+samp_input: "/path/to/input/directory/or/samples" ## input directory or sample; if not provided, will be cwd
+output: "lusstr_output" ## output file/directory name; Example: "test_030923"
+kit: "sigprep" ## sigprep/kintelligence 
+
+## format settings
+types: "all" ## choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination
+nofilter: False ## True/False if no filtering is desired; if False, will remove any allele designated as Not Typed
+
+## convert settings
+strand: "forward" ## forward/uas; indicates which oritentation to report the alleles for the ForenSeq SNPs; uas indicates the orientation as reported by the UAS or the forward strand
+references: "" ## list IDs of the samples to be run as references in EFM
+separate: false ## True/False; if want to separate samples into individual files for use in EFM
+thresh: 0.03 ## Analytical threshold value
diff --git a/lusSTR/scripts/filter_settings.py b/lusSTR/scripts/filter_settings.py
@@ -26,15 +26,17 @@ def get_filter_metadata_file():
     filter_marker_data = json.load(fh)
 
 
-def filters(locus_allele_info, locus, locus_reads, datatype):
+def filters(locus_allele_info, locus, locus_reads, datatype, brack_col):
     metadata = filter_marker_data[locus]
     if len(locus_allele_info) == 1:
         locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info)
     else:
         locus_allele_info, locus_reads = multiple_allele_thresholds(
             metadata, locus_reads, locus_allele_info
         )
-        locus_allele_info = ce_filtering(locus_allele_info, locus_reads, metadata, datatype)
+        locus_allele_info = ce_filtering(
+            locus_allele_info, locus_reads, metadata, datatype, brack_col
+        )
         if datatype == "ngs":
             locus_allele_info = same_size_filter(locus_allele_info, metadata)
     return locus_allele_info
@@ -98,7 +100,7 @@ def thresholds(filter, metadata, locus_reads, quest_al_reads):
         return locus_reads, True
 
 
-def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
+def ce_filtering(locus_allele_info, locus_reads, metadata, datatype, brack_col):
     for i in range(len(locus_allele_info)):  # check for stutter alleles
         if locus_allele_info.loc[i, "allele_type"] != "real_allele":
             continue
@@ -111,7 +113,14 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
                 if init_type_all == "BelowAT":
                     continue
                 locus_allele_info = allele_ident(
-                    locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype
+                    locus_allele_info,
+                    init_type_all,
+                    metadata,
+                    ref_allele_reads,
+                    i,
+                    j,
+                    datatype,
+                    brack_col,
                 )
         for j in range(len(locus_allele_info)):
             type_all = locus_allele_info.loc[j, "allele_type"]
@@ -133,13 +142,15 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
     return locus_allele_info
 
 
-def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype):
+def allele_ident(
+    locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype, brack_col
+):
     quest_al_reads = locus_allele_info.loc[j, "Reads"]
     ref_allele = float(locus_allele_info.loc[i, "CE_Allele"])
     question_allele = float(locus_allele_info.loc[j, "CE_Allele"])
     if datatype == "ngs":
-        ref_bracket = locus_allele_info.loc[i, "UAS_Output_Bracketed_Notation"]
-        question_bracket = locus_allele_info.loc[j, "UAS_Output_Bracketed_Notation"]
+        ref_bracket = locus_allele_info.loc[i, brack_col]
+        question_bracket = locus_allele_info.loc[j, brack_col]
     else:
         ref_bracket = None
         question_bracket = None
@@ -155,6 +166,7 @@ def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i
         ref_bracket,
         question_bracket,
         datatype,
+        brack_col,
     )
     if "stutter" in locus_allele_info.loc[j, "allele_type"]:
         if "/" in locus_allele_info.loc[j, "allele_type"] and pd.isnull(
@@ -254,6 +266,7 @@ def allele_type(
     ref_bracket,
     question_bracket,
     datatype,
+    brack_col,
 ):
     stutter_thresh = metadata["StutterThresholdDynamicPercent"]
     forward_thresh = metadata["StutterForwardThresholdDynamicPercent"]
@@ -275,11 +288,11 @@ def allele_type(
             )
     elif allele_diff == 2 and ref_reads > quest_al_reads:  # -2 stutter
         allele = ce if datatype == "ce" else question_bracket
-        if check_2stutter(all_type_df, datatype, allele)[0] is True:
+        if check_2stutter(all_type_df, datatype, allele, brack_col)[0] is True:
             if (
                 datatype == "ngs" and bracketed_stutter_id(ref_bracket, question_bracket, -2) == -2
             ) or datatype == "ce":
-                ref_reads = check_2stutter(all_type_df, datatype, allele)[1]
+                ref_reads = check_2stutter(all_type_df, datatype, allele, brack_col)[1]
                 stutter_thresh_reads = stutter_thresh * ref_reads
                 all_type, stut_perc = minus2_stutter(
                     all_type,
@@ -318,7 +331,7 @@ def forward_stut_thresh(perc, perc_stut, reads):
     return forward_thresh
 
 
-def check_2stutter(stutter_df, allele_des, allele):
+def check_2stutter(stutter_df, allele_des, allele, brack_col):
     is_true, reads = False, None
     if "-1_stutter" in stutter_df.loc[:, "allele_type"].values:
         if allele_des == "ce":
@@ -329,7 +342,7 @@ def check_2stutter(stutter_df, allele_des, allele):
                     break
         else:
             for k, row in stutter_df.iterrows():
-                bracket_test = stutter_df.loc[k, "UAS_Output_Bracketed_Notation"]
+                bracket_test = stutter_df.loc[k, brack_col]
                 if bracketed_stutter_id(bracket_test, allele, -1) == -1:
                     is_true, reads = True, stutter_df.loc[k, "Reads"]
     return is_true, reads