Skip to content

Commit

Permalink
Update terminology in README and CLI, remove amelogenin in annotate (#42
Browse files Browse the repository at this point in the history
)
  • Loading branch information
rnmitchell authored May 11, 2021
1 parent 83a6170 commit 4eaab0a
Show file tree
Hide file tree
Showing 16 changed files with 27,777 additions and 33,503 deletions.
33 changes: 19 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# lusSTR

lusSTR is a tool written in Python to convert NGS sequence data of forensic STR loci to different annotation types for ease in downstream analyses.
lusSTR is a tool written in Python to convert NGS sequence data of forensic STR loci to different sequence representations and allele designations for ease in downstream analyses.

This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accomodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for annotation.
This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accomodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for translation.

lusSTR also processes SNP data from the Verogen ForenSeq panel. ForenSeq consists of 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 phenotype and ancestry SNPs. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.

Expand Down Expand Up @@ -71,16 +71,19 @@ The above command will output two tables which are used in the ```annotate``` co

If using lusSTR version 0.4 or above, STRait Razor data **must** be produced using the STRait Razor config file released in January 2021 (ForenSeqv1.25.config and PowerSeqv2.1.config). These config files are available here: https://github.com/Ahhgust/STRaitRazor/tree/103ef68746f010add8f21266fa8bf8fb9f4a076e/.

If using the output from STRait Razor, the files **must** be labeled as ```SampleID.txt``` (example: ```Sample0001.txt```) and **must** be compiled in a separate folder (labeled with the project ID). The user must specify the folder name for the ```format``` command as well as an output filename (all sample files will be compiled into one file):
If using the output from STRait Razor, the files **must** be labeled as ```SampleID.txt``` (example: ```Sample0001.txt```) and can either be specified as a single file or as a folder of multiple STRait Razor output files (folder labeled with the project ID). The user must specify the file or folder name for the ```format``` command as well as an output filename (all sample files will be compiled into one file):
```
lusstr format <input> -o <output>
```

Example:
Examples:

```
lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv
```
```
lusstr format A001.txt -o A001.csv
```

Again, sex loci can be included using the ```--include-sex``` flag.
```
Expand All @@ -89,7 +92,7 @@ lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv --include-se
With this, two tables will be produced: ```STRaitRazor_test_file.csv``` and ```STRaitRazor_test_file_sex_loci.csv```.


### Annotation of STR loci sequences
### Translation of STR loci sequences

The ```annotate``` command produces a tab-delineated table with the following columns:
* Sample ID
Expand All @@ -98,15 +101,17 @@ The ```annotate``` command produces a tab-delineated table with the following co
* Locus
* UAS Output sequence: can be forward or reverse strand
* Forward strand sequence: will be same as UAS Output sequence for those loci reported on forward strand
* Traditional STR allele: common repeat unit based annotation
* Forward Strand Bracketed annotation: Bracketed annotation for forward strand sequence
* UAS Output Bracketed annotation: Bracketed annotation for the reported UAS sequence output (will be same for those loci which report the forward strand)
* RU allele: common length-based repeat unit (RU) allele designation
* Forward Strand Bracketed notation: Bracketed notation for forward strand sequence
* UAS Output Bracketed notation: Bracketed annotation for the reported UAS sequence output (will be same for those loci which report the forward strand)
* LUS: Longest uninterrupted stretch
* LUS+: annotation combining multiple annotations including traditional STR allele designation, LUS, secondary motif (if applicable) and tertiary motif (if applicable)
* LUS+: Notation combining multiple allele designations including RU, LUS, secondary motif (if applicable) and tertiary motif (if applicable)
* Reads: number of reads observed with the specified sequence

If the ```--include-sex``` flag is included, a second table with the above columns for the sex chromosome loci will be outputted as well.

**NOTE** on including the sex chromosome STR loci: in the ```annotate``` step, lusSTR requires two files for input: (1) the properly formatted file of autosomal STR loci produced from the ```format``` command (or a file with the appropriate format) with a label such as ```lusSTRinput.csv```, and (2) the properly formatted file of X- and Y-STR loci produced from the ```format``` command with the ```--include-sex``` flag (or a file with the appropriate format) labeled as ```lusSTRinput_sexloci.csv```. The file containing the X- and Y-STR loci *must* have the identical file name to the file containing the autosomal STRs but with ```_sexloci.csv``` (see above for precise examples). These two files are automatically created (and named appropriately) when using the ```--include-sex``` flag with the ```format``` command.

For the ```annotate``` command, the following must be specified:
* Input filename
* Output filename
Expand All @@ -124,7 +129,7 @@ lusstr annotate UAS_test_file.csv -o UAS_final_table.txt --kit forenseq --uas
```

If no ```--uas``` flag is provided, several additional processes occur with the ```annotate``` command:
* The full sequences are filtered to the UAS region before the annotation step. The number of bases to remove is determined based on the specified kit.
* The full sequences are filtered to the UAS region before the translation step. The number of bases to remove is determined based on the specified kit.
* Once the sequences are filtered to the UAS region, any duplicated sequences are removed and their reads are summed in with the remaining sequence ```Reads``` column. NOTE: This step can be skipped with the ```--nocombine``` flag.

Further, a second table (labeled as ```*_flanks_anno.txt```) containing information related to the flanking sequences surrounding the UAS sequence region is also produced with the following columns:
Expand All @@ -135,9 +140,9 @@ Further, a second table (labeled as ```*_flanks_anno.txt```) containing informat
* Reads: number of reads observed for the specified sequence
* Length-based Allele
* Full Sequence
* 5' Flanking Sequence Annotation
* UAS Region Sequence Annotation (same as column ```UAS Output Bracketed annotation``` in the main table)
* 3' Flanking Sequence Annotation
* 5' Flanking Sequence Bracketed Notation
* UAS Region Sequence Bracketed Notation (same as column ```UAS Output Bracketed Notation``` in the main table)
* 3' Flanking Sequence Bracketed Notation
* Potential Issues (such as: Possible indel or partial sequence)

The ```Potential Issues``` column in this report is to draw attention to potential problem sequences (due to perhaps an indel or partial sequence) and requires the attention of the user to further evaluate the sequence for it's authenticity.
Expand All @@ -152,7 +157,7 @@ If the ```--include-sex``` flag is included, as below:
```
lusstr annotate STRaitRazor_test_file.csv -o STRaitRazor_powerseq_final.txt --kit powerseq --include-sex
```
Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for annotation of the sex chromosome loci and their flanking regions.
Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for translation of the sex chromosome loci and their flanking regions.

## SNP Data Processing

Expand Down
60 changes: 46 additions & 14 deletions lusSTR/annot.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ def format_table(input, uas=False, kit='forenseq'):
'''
Function to format final output table and the flanking report (if necessary).
'''
data = pd.read_csv(input)
data = pd.read_csv(input, keep_default_na=False)
data.iloc[:, 3] = data.iloc[:, 3].astype(str)
list_of_lists = []
flanks_list = []
Expand All @@ -69,6 +69,8 @@ def format_table(input, uas=False, kit='forenseq'):
locus = 'PENTA E'
if locus == 'DYS385A-B' or locus == 'DYS385':
locus = 'DYS385A-B'
if locus == 'AMELOGENIN':
continue
metadata = str_marker_data[locus]
if kit == 'forenseq':
remove_5p = metadata['Foren_5']
Expand Down Expand Up @@ -104,41 +106,71 @@ def format_table(input, uas=False, kit='forenseq'):

columns = [
'SampleID', 'Project', 'Analysis', 'Locus', 'UAS_Output_Sequence',
'Forward_Strand_Sequence', 'Traditional_STR_Allele', 'Forward_Strand_Bracketed_form',
'UAS_Output_Bracketed_Form', 'LUS', 'LUS_Plus', 'Reads'
'Forward_Strand_Sequence', 'RU_Allele', 'Forward_Strand_Bracketed_Notation',
'UAS_Output_Bracketed_Notation', 'LUS', 'LUS_Plus', 'Reads'
]
final_output = pd.DataFrame(list_of_lists, columns=columns)
if not list_of_lists:
final_output = pd.DataFrame(list_of_lists, columns=columns)
else:
final_output = sort_table(pd.DataFrame(list_of_lists, columns=columns))
if not uas:
flanks_columns = [
'SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'Length_Allele',
'Full_Sequence', '5_Flank_Anno', 'UAS_Region_Anno', '3_Flank_Anno',
'Potential_Issues'
'SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'RU_Allele',
'Full_Sequence', '5_Flank_Bracketed_Notation', 'UAS_Region_Bracketed_Notation',
'3_Flank_Bracketed_Notation', 'Potential_Issues'
]
final_flank_output = pd.DataFrame(flanks_list, columns=flanks_columns)
if not flanks_list:
final_flank_output = pd.DataFrame(flanks_list, columns=flanks_columns)
else:
final_flank_output = sort_table(pd.DataFrame(flanks_list, columns=flanks_columns))
else:
final_flank_output = ''
return final_output, final_flank_output, columns


def combine_reads(table, columns):
comb_table = table.groupby(columns[:-1], as_index=False)['Reads'].sum()
sorted = sort_table(comb_table)
return sorted


def sort_table(table):
sorted_table = table.sort_values(
by=['SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'RU_Allele'],
ascending=False
)
return sorted_table


def main(args):
output_name = os.path.splitext(args.out)[0]
input_name = os.path.splitext(args.input)[0]
autosomal_final_table, autosomal_flank_table, columns = format_table(
args.input, args.uas, args.kit
)
if args.sex:
sex_final_table, sex_flank_table, sex_columns = format_table(
sex_final_table, sex_flank_table, columns = format_table(
f'{input_name}_sexloci.csv', args.uas, args.kit
)
sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
if not args.uas:
sex_flank_table.to_csv(f'{output_name}_sexloci_flanks_anno.txt', sep='\t', index=False)
sex_flank_table.to_csv(
f'{output_name}_sexloci_flanks_anno.txt', sep='\t', index=False
)
if args.combine:
if not sex_final_table.empty:
sex_final_table = combine_reads(sex_final_table, columns)
sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
else:
sex_final_table.to_csv(
f'{output_name}_sexloci_no_combined_reads.txt', index=False
)
else:
sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
if not args.uas:
autosomal_flank_table.to_csv(f'{output_name}_flanks_anno.txt', sep='\t', index=False)
if args.combine:
autosomal_final_table = (
autosomal_final_table.groupby(columns[:-1], as_index=False)['Reads'].sum()
)
if not autosomal_final_table.empty:
autosomal_final_table = combine_reads(autosomal_final_table, columns)
autosomal_final_table.to_csv(args.out, sep='\t', index=False)
else:
autosomal_final_table.to_csv(
Expand Down
13 changes: 7 additions & 6 deletions lusSTR/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,13 @@ def format_subparser(subparsers):
)
cli.add_argument(
'input',
help='Input is either a single file (UAS Sample Details Report, in .xlsx format) or a '
'directory of STRait Razor output files. If input is the UAS Sample Details Report '
'(in .xlsx format), use of the --uas flag is required. If STRait Razor output is '
'used, the name of the provided directory will be used as the Analysis ID in the '
'final annotation table. Output files within the directory should be named as such: '
'SampleID_STRaitRazor.txt (e.g. A001_STRaitRazor.txt).'
help='Input is either a UAS Sample Details Report (in .xlsx format) or a STRait Razor '
'output file (.txt format). Both single files and directories containing multiple UAS or '
'STRait Razor files are accepted. If input is the UAS Sample Details Report, use of the '
'--uas flag is required. If a directory of STRait Razor files is provided, the name of '
'the directory will be used as the Project and Analysis IDs in the final annotation '
'table. If a single file is provided, the Project and Analysis IDs will be NA. '
'STRaitRazor files should be named as the Sample ID, e.g. A001.txt, A002.txt, etc.'
)
cli.add_argument(
'--uas', action='store_true',
Expand Down
58 changes: 36 additions & 22 deletions lusSTR/format.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ def uas_format(infile, sexloci=False):
return auto_strs, sex_strs


def strait_razor_concat(indir, sexloci=False):
def strait_razor_concat(inpath, sexloci=False):
'''Format a directory of STRait Razor output files for use with `lusSTR annotate`.'''
locus_list = [
'CSF1PO', 'D10S1248', 'D12S391', 'D13S317', 'D16S539', 'D17S1301', 'D18S51', 'D19S433',
Expand All @@ -80,33 +80,47 @@ def strait_razor_concat(indir, sexloci=False):
]
auto_strs = pd.DataFrame()
sex_strs = pd.DataFrame() if sexloci is True else None
analysisID = os.path.basename(indir.rstrip(os.sep))
files = glob.glob(os.path.join(indir, '*.txt'))
for filename in sorted(files):
name = filename.replace('.txt', '').split(os.sep)[-1]
table = pd.read_csv(
filename, sep='\t', header=None,
names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
)
try:
table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
except ValueError:
print(
f'Error found with {filename}. Will bypass and continue. Please check file'
f' and rerun the command, if necessary.'
)
continue
table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
table['SampleID'] = name
table['Project'] = analysisID
table['Analysis'] = analysisID
table = table[['Locus', 'Total_Reads', 'Sequence', 'SampleID', 'Project', 'Analysis']]
if os.path.isdir(inpath):
analysisID = os.path.basename(inpath.rstrip(os.sep))
files = glob.glob(os.path.join(inpath, '*.txt'))
for filename in sorted(files):
try:
table = strait_razor_table(filename, analysisID, sexloci)
except ValueError:
print(
f'Error found with {filename}. Will bypass and continue. Please check file'
f' and rerun the command, if necessary.'
)
continue
auto_strs = auto_strs.append(table[table.Locus.isin(locus_list)])
if sexloci is True:
sex_strs = sex_strs.append(table[table.Locus.isin(sex_locus_list)])
else:
table = strait_razor_table(inpath, 'NA', sexloci)
auto_strs = auto_strs.append(table[table.Locus.isin(locus_list)])
if sexloci is True:
sex_strs = sex_strs.append(table[table.Locus.isin(sex_locus_list)])
return auto_strs, sex_strs


def strait_razor_table(filename, analysisID, sexloci=False):
name = filename.replace('.txt', '').split(os.sep)[-1]
table = pd.read_csv(
filename, sep='\t', header=None,
names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
)
try:
table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
except ValueError:
table = []
table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
table['SampleID'] = name
table['Project'] = analysisID
table['Analysis'] = analysisID
table = table[['Locus', 'Total_Reads', 'Sequence', 'SampleID', 'Project', 'Analysis']]
return table


def main(args):
if args.uas:
results, sex_results = uas_load(args.input, args.sex)
Expand Down
2 changes: 1 addition & 1 deletion lusSTR/tests/data/2800M_full_anno_no_combined_reads.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
SampleID Project Analysis Locus UAS_Output_Sequence Forward_Strand_Sequence Traditional_STR_Allele Forward_Strand_Bracketed_form UAS_Output_Bracketed_Form LUS LUS_Plus Reads
SampleID Project Analysis Locus UAS_Output_Sequence Forward_Strand_Sequence RU_Allele Forward_Strand_Bracketed_Notation UAS_Output_Bracketed_Notation LUS LUS_Plus Reads
A01 NA NA CSF1PO AGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAT ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT 12 [ATCT]12 [AGAT]12 12_12 12_12_0 100
A01 NA NA D10S1248 GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA 13 [GGAA]13 [GGAA]13 13_13 13_13 100
A01 NA NA D10S1248 GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA 15 [GGAA]15 [GGAA]15 15_15 15_15 100
Expand Down
Loading

0 comments on commit 4eaab0a

Please sign in to comment.