bioforensics · rnmitchell · Jan 27, 2021 · Jan 12, 2021 · Jan 12, 2021 · Jan 13, 2021
diff --git a/README.md b/README.md
@@ -4,6 +4,8 @@ lusSTR is a tool written in Python to convert NGS sequence data of forensic STR
 
 This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accomodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for annotation.
 
+lusSTR also processes SNP data from the Verogen ForenSeq panel. The panel includes 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 SNPs used for both phenotype and ancestry. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.
+
 
 ## Installation
 
@@ -22,11 +24,11 @@ make devenv
 ## Usage
 
 lusSTR accomodates three different input formats:
-(1) UAS Sample Details Report in .xlsx format
+(1) UAS Sample Details Report and UAS Phenotype Report (for SNP processing) in .xlsx format
 (2) STRait Razor output with one sample per file
 (3) Sample(s) sequences in CSV format; first four columns must be Locus, NumReads, Sequence, SampleID; Optional last two columns can be Project and Analysis IDs.
 
-### Formatting input
+### Formatting input for STR loci sequences
 
 If inputting data from either the UAS Sample Details Report or STRait Razor output, the user must first invoke the ```format``` command to extract necessary information and format for the ```annotate``` command.
 
@@ -87,7 +89,7 @@ lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv --include-se
 With this, two tables will be produced: ```STRaitRazor_test_file.csv``` and ```STRaitRazor_test_file_sex_loci.csv```.
 
 
-### Annotation
+### Annotation of STR loci sequences
 
 The ```annotate``` command produces a tab-delineated table with the following columns:
 *  Sample ID
@@ -152,6 +154,50 @@ lusstr annotate STRaitRazor_test_file.csv -o STRaitRazor_powerseq_final.txt --ki
 ```
  Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for annotation of the sex chromosome loci and their flanking regions.
 
+ ## SNP Data Processing
+
+ The ```snps``` command produces a tab-delimited table with the following columns:
+ * Sample ID
+ * Project ID
+ * Analysis ID (same as Project ID)
+ * SNP (rsID)
+ * Reads: number of reads observed for the specified allele
+ * Forward Strand Allele: allele call on the forward strand
+ * UAS Allele: allele call as reported from the UAS
+ * Type: SNP type (identity/phenotype/ancestry)
+ * Issues: indicates whether the called allele is one of the two expected alleles for the SNP
+
+If STRait Razor data is used as input, read counts for identical alleles within a SNP are combined in the above table. In addition, a second table (```*_full_output.txt```) is produced containing information for each individual sequence (not combined), with the following columns:
+ * Sample ID
+ * Project ID
+ * Analysis ID
+ * SNP
+ * Sequence: sequence containing the SNP of interest
+ * Reads
+ * Forward Strand Allele
+ * UAS Allele
+ * Type
+ * Potential issues: flags sequences which may contain errors, such as an unexpected allele call or a shorter than expected sequence length.
+
+ ### Usage
+
+ ```
+ lusstr snps <input_directory> -o <output file name> --type <all, i, p> --uas
+ ```
+
+The ```snps``` command requires a folder of either UAS Reports (Sample Details Report(s) and/or Phenotype Report(s)) or STRait Razor output file(s).
+The ```-o``` flag specifies the name of the output file (should end in ```.txt```).
+The ```--type``` flag specifies the type of SNPs to include in the output file(s). The options are: ```all``` (all SNPs), ```i``` (identity SNPs only), or ```p``` (ancestry and phenotype SNPs only).  The default is ```i```.
+Similar to the processing of STR loci sequences, the ```--uas``` flag indicates the input files are Reports from the UAS. Absence of this flag indicates the provided files are STRait Razor output files.
+
+**Examples**:
+```
+lusstr snps UAS_files/ -o uas_output_all.txt --type all --uas
+```
+```
+lusstr snps STRait_Razor_output/ -o strait_razor_p.txt --type p
+```
+
 ----
 
 lusSTR is still under development and any suggestions/issues found are welcome!
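
Note on the README text above: it says that, for STRait Razor input, read counts for identical alleles within a SNP are combined in the main table. The following is only a minimal sketch of that idea, not the code in `lusSTR/snps.py` (which is not shown in this diff). The row layout, the `combine_reads` helper and the rsIDs are made-up placeholders for illustration.

```python
from collections import defaultdict

# Hypothetical per-sequence rows: (sample ID, SNP rsID, allele, reads).
# The rsIDs below are placeholders, not real ForenSeq SNPs.
rows = [
    ("S1", "rs0000001", "A", 120),
    ("S1", "rs0000001", "A", 15),   # same allele seen in a second sequence
    ("S1", "rs0000001", "G", 98),
    ("S1", "rs0000002", "T", 40),
]


def combine_reads(rows):
    """Sum reads over rows that share sample, SNP and allele."""
    totals = defaultdict(int)
    for sample_id, snp, allele, reads in rows:
        totals[(sample_id, snp, allele)] += reads
    return dict(totals)


combined = combine_reads(rows)
for key, reads in sorted(combined.items()):
    print(*key, reads, sep="\t")
```

In this sketch the two `A` rows for the first SNP collapse into one entry, while the per-sequence rows would remain available for something like the `*_full_output.txt` table.
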
diff --git a/lusSTR/cli.py b/lusSTR/cli.py
@@ -9,7 +9,7 @@
 
 import argparse
 import lusSTR
-from . import format, annot
+from . import format, annot, snps
 
 
 def format_subparser(subparsers):
@@ -68,14 +68,42 @@ def annot_subparser(subparsers):
     )
 
 
+def snps_subparser(subparsers):
+    cli = subparsers.add_parser('snps')
+    cli.add_argument(
+        '-o', '--out', metavar='FILE',
+        help='file to which output will be written; default is terminal (stdout)'
+    )
+    cli.add_argument(
+        'input',
+        help='Input is a directory of either UAS output files (Sample Details Report and '
+        'Phenotype Report) or of STRait Razor output files. If input is the UAS output file(s) '
+        '(in .xlsx format), use of the --uas flag is required. If STRait Razor output is '
+        'used, the name of the provided directory will be used as the Analysis ID in the '
+        'final annotation table.'
+    )
+    cli.add_argument(
+        '--type', choices=['all', 'p', 'i'], default='i',
+        help='Specify the type of SNPs to include in the final report. "p" will include only the '
+        'Phenotype and Ancestry SNPs; "i" will include only the Identity SNPs; and "all" will '
+        'include all SNPs. Default is Identity SNPs only (i).'
+    )
+    cli.add_argument(
+        '--uas', action='store_true',
+        help='Use if sequences have been run through the ForenSeq UAS.'
+    )
+
+
 mains = {
     'format': lusSTR.format.main,
     'annotate': lusSTR.annot.main,
+    'snps': lusSTR.snps.main,
 }
 
 subparser_funcs = {
     'format': format_subparser,
     'annotate': annot_subparser,
+    'snps': snps_subparser,
 }
 
 

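To check how the new `snps` arguments interact with the README examples, the subparser can be reproduced in isolation. This is a standalone sketch: the top-level parser, `prog='lusstr'` and `dest='subcmd'` are assumptions, since the real parser setup in `cli.py` is outside this hunk, and the help strings are omitted.

```python
import argparse


def build_parser():
    # Assumed top-level parser; only the 'snps' arguments mirror the diff.
    parser = argparse.ArgumentParser(prog='lusstr')
    subparsers = parser.add_subparsers(dest='subcmd')
    cli = subparsers.add_parser('snps')
    cli.add_argument('-o', '--out', metavar='FILE')
    cli.add_argument('input')
    cli.add_argument('--type', choices=['all', 'p', 'i'], default='i')
    cli.add_argument('--uas', action='store_true')
    return parser


parser = build_parser()

# The two README examples, plus a call that relies on the defaults.
uas_args = parser.parse_args(
    ['snps', 'UAS_files/', '-o', 'uas_output_all.txt', '--type', 'all', '--uas']
)
razor_args = parser.parse_args(
    ['snps', 'STRait_Razor_output/', '-o', 'strait_razor_p.txt', '--type', 'p']
)
default_args = parser.parse_args(['snps', 'STRait_Razor_output/'])

print(uas_args.type, uas_args.uas)
print(razor_args.type, razor_args.uas)
print(default_args.type, default_args.out)
```

With no `--type`, the sketch falls back to identity SNPs (`i`), and with no `-o` the `out` attribute is `None`, which matches the help text saying output defaults to stdout.
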
diff --git a/lusSTR/format.py b/lusSTR/format.py
@@ -27,8 +27,8 @@ def uas_load(inpath, sexloci=False):
         sex_strs = pd.DataFrame() if sexloci is True else None
         files = glob.glob(os.path.join(inpath, '*.xlsx'))
         for filename in sorted(files):
-            filepath = os.path.join(inpath, filename)
-            autodata, sexdata = uas_format(filepath, sexloci)
+            print(filename)
+            autodata, sexdata = uas_format(filename, sexloci)
             auto_strs = auto_strs.append(autodata)
             if sexloci is True:
                 sex_strs = sex_strs.append(sexdata)
@@ -86,7 +86,14 @@ def strait_razor_concat(indir, sexloci=False):
             filename, sep='\t', header=None,
             names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
         )
-        table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
+        try:
+            table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
+        except ValueError:
+            print(
+                f'Error found with {filename}. Will bypass and continue. Please check file'
+                f' and rerun the command, if necessary.'
+            )
+            continue
         table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
         table['SampleID'] = name
         table['Project'] = analysisID
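
On the `uas_load` change in the first hunk of this file: as far as I can tell, the removed `os.path.join(inpath, filename)` was a problem because `glob.glob` already returns paths that include the directory it was given, so joining `inpath` again duplicates the directory for relative inputs. A small sketch of that behavior, using a temporary directory and a made-up file name (`sample1.xlsx`):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Build UAS_files/sample1.xlsx inside a scratch directory.
    os.mkdir(os.path.join(tmpdir, 'UAS_files'))
    open(os.path.join(tmpdir, 'UAS_files', 'sample1.xlsx'), 'w').close()

    cwd = os.getcwd()
    os.chdir(tmpdir)  # so that 'UAS_files' is a relative input path
    try:
        inpath = 'UAS_files'
        files = sorted(glob.glob(os.path.join(inpath, '*.xlsx')))
        globbed = files[0]                        # already contains inpath
        rejoined = os.path.join(inpath, globbed)  # directory appears twice
        globbed_exists = os.path.exists(globbed)
        rejoined_exists = os.path.exists(rejoined)
    finally:
        os.chdir(cwd)

print(globbed, globbed_exists)
print(rejoined, rejoined_exists)
```

With an absolute `inpath` the extra join is harmless, because `os.path.join` discards earlier components when a later one is absolute, which may be why the issue was not noticed earlier. Passing `filename` straight to `uas_format`, as the diff now does, covers both cases.
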