Incorporating SNP data #40

Merged
merged 32 commits into from
Jan 27, 2021
Commits
00fa8ed
Inital commit
Jan 12, 2021
3f4069e
Added safeguard to exclude non-STRait Razor .txt files in provided di…
Jan 12, 2021
203b5a1
Added script for formatting SNP data from STRait Razor output
Jan 13, 2021
d983685
updated SNP .json file
Jan 13, 2021
701b9a0
Added formatting for UAS output files
Jan 14, 2021
edd785b
Bypass any .xlsx file in a UAS specified directory that is not a Samp…
Jan 14, 2021
dd7ebce
Updated UAS bulk code in format.py and UAS bulk test datasets
Jan 14, 2021
bf55323
Updated UAS bulk test data files
Jan 14, 2021
f79e183
upated snp_data.json file
Jan 14, 2021
e5f6252
Handling multiple SNPs within one reported sequence in STRait Razor d…
Jan 15, 2021
48043d5
Added combine reads fuction for strait razor data; fixed other bugs
Jan 19, 2021
96f2200
Fixed bugs in UAS file processing
Jan 19, 2021
af77715
Added reads for total reads in sr data
Jan 20, 2021
d4400d4
Updated error in snp dict
Jan 20, 2021
6fbdad2
changed cli command from "format_snps" to "snps"
Jan 20, 2021
54dcd9c
changed name of command to "snps"; removed option of "a" SNPs (both "…
Jan 20, 2021
8ecc28d
Reformatted UAS code processing to process p/a snps together
Jan 21, 2021
4ebaa79
Rearranged final output tables
Jan 21, 2021
04480fa
Updated README
Jan 21, 2021
72dd9ae
updated setup.py
Jan 21, 2021
16ca20d
added expected length exception for rs2402130
Jan 22, 2021
0d1b397
added exception for rs1821380
Jan 22, 2021
abeb248
fixed bug in uas processing
Jan 22, 2021
c0cbaf3
Added flags for unexpected alleles in main reports
Jan 22, 2021
646af0b
Updated README
Jan 22, 2021
de67414
Added test for UAS SNP output
Jan 25, 2021
2e7e1e3
Added test for SR data for all SNPs
Jan 25, 2021
0da89bd
Updated docstrings for functions
Jan 25, 2021
e13692b
fixed bug with dropping index from subsetted dataframe
Jan 26, 2021
7942886
Added tests to check # of lines for type specified output
Jan 26, 2021
3effb20
Reorganized code
Jan 26, 2021
9f51c6b
Updated README
Jan 26, 2021
52 changes: 49 additions & 3 deletions README.md
@@ -4,6 +4,8 @@ lusSTR is a tool written in Python to convert NGS sequence data of forensic STR

This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accommodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for annotation.

lusSTR also processes SNP data from the Verogen ForenSeq panel. ForenSeq consists of 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 phenotype and ancestry SNPs. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.


## Installation

@@ -22,11 +24,11 @@ make devenv
## Usage

lusSTR accommodates three different input formats:
-(1) UAS Sample Details Report in .xlsx format
+(1) UAS Sample Details Report and UAS Phenotype Report (for SNP processing) in .xlsx format
(2) STRait Razor output with one sample per file
(3) Sample(s) sequences in CSV format; first four columns must be Locus, NumReads, Sequence, SampleID; Optional last two columns can be Project and Analysis IDs.
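
The column layout described for input format (3) can be checked before a file is handed to the tool. The following is a minimal sketch, not part of lusSTR: `check_header` is a hypothetical helper, and the exact header spellings `Project` and `Analysis` for the optional columns are assumptions, since the README only names them as "Project and Analysis IDs".

```python
import csv
import io

# Required leading columns, in the order given in the README.
REQUIRED = ["Locus", "NumReads", "Sequence", "SampleID"]
# Assumed spellings for the two optional trailing columns.
OPTIONAL = ["Project", "Analysis"]


def check_header(header):
    """Return True if the header matches the layout described in the README."""
    if header[:4] != REQUIRED:
        return False
    extra = header[4:]
    # Either no extra columns, or exactly the two optional ones.
    return extra == [] or extra == OPTIONAL


# Illustrative in-memory CSV; the values are placeholders, not real data.
sample = io.StringIO("Locus,NumReads,Sequence,SampleID\nLOCUS1,100,ACGT,S1\n")
header = next(csv.reader(sample))
header_ok = check_header(header)
```

Running the check up front gives a clearer failure than an error raised partway through annotation.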

-### Formatting input
+### Formatting input for STR loci sequences

If inputting data from either the UAS Sample Details Report or STRait Razor output, the user must first invoke the ```format``` command to extract necessary information and format for the ```annotate``` command.

@@ -87,7 +89,7 @@ lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv --include-se
With this, two tables will be produced: ```STRaitRazor_test_file.csv``` and ```STRaitRazor_test_file_sex_loci.csv```.


-### Annotation
+### Annotation of STR loci sequences

The ```annotate``` command produces a tab-delimited table with the following columns:
* Sample ID
@@ -152,6 +154,50 @@ lusstr annotate STRaitRazor_test_file.csv -o STRaitRazor_powerseq_final.txt --ki
```
Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for annotation of the sex chromosome loci and their flanking regions.

## SNP Data Processing

The ```snps``` command produces a tab-delimited table with the following columns:
* Sample ID
* Project ID
* Analysis ID (same as Project ID)
* SNP (rsID)
* Reads: number of reads observed for the specified allele
* Forward Strand Allele: allele call on the forward strand
* UAS Allele: allele call as reported from the UAS
* Type: SNP type (identity/phenotype/ancestry)
* Issues: indicates whether the called allele is one of the two expected alleles for the SNP

If STRait Razor data is used as input, the read counts for identical alleles within a SNP are combined in the above table. In addition, a second table (```*_full_output.txt```) is produced containing information for each individual sequence (not combined) with the following columns:
* Sample ID
* Project ID
* Analysis ID
* SNP
* Sequence: sequence containing the SNP of interest
* Reads
* Forward Strand Allele
* UAS Allele
* Type
* Potential issues: flags sequences which may contain errors, such as an unexpected allele call or a shorter than expected sequence length.
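
The difference between the two tables can be illustrated with a short sketch of the combining step. This is not the lusSTR implementation: `combine_reads` is a hypothetical helper written with the standard library only, and the read counts are invented for illustration.

```python
from collections import defaultdict


def combine_reads(rows):
    """Sum reads over rows that share the same sample, SNP and allele."""
    totals = defaultdict(int)
    for sample, snp, allele, reads in rows:
        totals[(sample, snp, allele)] += reads
    return dict(totals)


# Per-sequence rows, as in the full output table (illustrative values only).
rows = [
    ("S1", "rs1821380", "C", 120),
    ("S1", "rs1821380", "C", 30),
    ("S1", "rs1821380", "G", 95),
]

# One row per allele, as in the main table.
combined = combine_reads(rows)
```

Here the two sequences carrying the `C` allele collapse into a single entry, while the `G` allele keeps its own count.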

### Usage

```
lusstr snps <input_directory> -o <output file name> --type <all, i, p> --uas
```

The ```snps``` command requires a folder of either UAS Reports (Sample Details Report(s) and/or Phenotype Report(s)) or STRait Razor output file(s).
The ```-o``` flag specifies the name of the output file (should end in ```.txt```).
The ```--type``` flag specifies the type of SNPs to include in the output file(s). The options are: ```all``` (all SNPs), ```i``` (identity SNPs only), or ```p``` (ancestry and phenotype SNPs only). The default is ```i```.
Similar to the processing of STR loci sequences, the ```--uas``` flag indicates the input files are Reports from the UAS. Absence of this flag indicates the provided files are STRait Razor output files.

**Examples**:
```
lusstr snps UAS_files/ -o uas_output_all.txt --type all --uas
```
```
lusstr snps STRait_Razor_output/ -o strait_razor_p.txt --type p
```
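
The effect of the ```--type``` option on the output can be sketched as a simple filter over the `Type` column. This is an assumption-laden illustration rather than the code in `lusSTR/snps.py`: `select_snps` is a hypothetical function, the rsIDs are placeholders, and the `Type` labels follow the identity/phenotype/ancestry wording used in the column description.

```python
def select_snps(records, snp_type="i"):
    """Keep the records matching the requested --type value."""
    if snp_type == "all":
        return list(records)
    if snp_type == "i":
        return [r for r in records if r["Type"] == "identity"]
    if snp_type == "p":
        # Everything that is not an identity SNP, i.e. phenotype and ancestry.
        return [r for r in records if r["Type"] != "identity"]
    raise ValueError(f"unknown SNP type: {snp_type}")


# Placeholder records; the rsIDs are not real identifiers.
records = [
    {"SNP": "rsA", "Type": "identity"},
    {"SNP": "rsB", "Type": "phenotype"},
    {"SNP": "rsC", "Type": "ancestry"},
]

identity_only = select_snps(records)          # default mirrors --type i
pheno_ancestry = select_snps(records, "p")
everything = select_snps(records, "all")
```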

----

lusSTR is still under development and any suggestions/issues found are welcome!
30 changes: 29 additions & 1 deletion lusSTR/cli.py
@@ -9,7 +9,7 @@

import argparse
import lusSTR
-from . import format, annot
+from . import format, annot, snps


def format_subparser(subparsers):
@@ -68,14 +68,42 @@ def annot_subparser(subparsers):
)


+def snps_subparser(subparsers):
+    cli = subparsers.add_parser('snps')
+    cli.add_argument(
+        '-o', '--out', metavar='FILE',
+        help='file to which output will be written; default is terminal (stdout)'
+    )
+    cli.add_argument(
+        'input',
+        help='Input is either a directory of either UAS output files (Sample Details Report and '
+        'Phenotype Report) or of STRait Razor output files. If input is the UAS output file(s) '
+        '(in .xlsx format), use of the --uas flag is required. If STRait Razor output is '
+        'used, the name of the provided directory will be used as the Analysis ID in the '
+        'final annotation table.'
+    )
+    cli.add_argument(
+        '--type', choices=['all', 'p', 'i'], default='i',
+        help='Specify the type of SNPs to include in the final report. "p" will include only the '
+        'Phenotype and Ancestry SNPs; "i" will include only the Identity SNPs; and "all" will '
+        'include all SNPs. Default is Identity SNPs only (i).'
+    )
+    cli.add_argument(
+        '--uas', action='store_true',
+        help='Use if sequences have been run through the ForenSeq UAS.'
+    )


 mains = {
     'format': lusSTR.format.main,
     'annotate': lusSTR.annot.main,
+    'snps': lusSTR.snps.main,
 }

 subparser_funcs = {
     'format': format_subparser,
     'annotate': annot_subparser,
+    'snps': snps_subparser,
 }
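
The two dictionaries pair each subcommand name with a parser builder and an entry point. A self-contained sketch of how such a registry can drive `argparse` follows. The rest of `cli.py` is not shown in this diff, so `get_parser`, `run`, the `dest='subcmd'` name and the stand-in `snps_main` are assumptions made for illustration.

```python
import argparse


def snps_main(args):
    """Stand-in for lusSTR.snps.main; just reports the parsed options."""
    return f"snps: input={args.input} type={args.type} uas={args.uas}"


def snps_subparser(subparsers):
    # Same arguments as in the diff, with help text omitted for brevity.
    cli = subparsers.add_parser("snps")
    cli.add_argument("-o", "--out", metavar="FILE")
    cli.add_argument("input")
    cli.add_argument("--type", choices=["all", "p", "i"], default="i")
    cli.add_argument("--uas", action="store_true")


mains = {"snps": snps_main}
subparser_funcs = {"snps": snps_subparser}


def get_parser():
    parser = argparse.ArgumentParser(prog="lusstr")
    subparsers = parser.add_subparsers(dest="subcmd")
    # Let every registered builder attach its own subcommand.
    for func in subparser_funcs.values():
        func(subparsers)
    return parser


def run(argv):
    args = get_parser().parse_args(argv)
    # Look up the entry point by the subcommand name that was parsed.
    return mains[args.subcmd](args)


result = run(["snps", "UAS_files/", "-o", "out.txt", "--uas"])
```

With this layout, adding a subcommand only requires one new builder function and one entry in each dictionary, which is the pattern the diff follows for `snps`.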


13 changes: 10 additions & 3 deletions lusSTR/format.py
@@ -27,8 +27,8 @@ def uas_load(inpath, sexloci=False):
     sex_strs = pd.DataFrame() if sexloci is True else None
     files = glob.glob(os.path.join(inpath, '*.xlsx'))
     for filename in sorted(files):
-        filepath = os.path.join(inpath, filename)
-        autodata, sexdata = uas_format(filepath, sexloci)
+        print(filename)
+        autodata, sexdata = uas_format(filename, sexloci)
         auto_strs = auto_strs.append(autodata)
         if sexloci is True:
             sex_strs = sex_strs.append(sexdata)
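
A likely reason for dropping the `os.path.join(inpath, filename)` call is that `glob.glob` already returns paths that include the directory part of the pattern, so joining again repeats the directory when `inpath` is relative. The sketch below demonstrates this behaviour in a temporary directory. `list_reports` and `demo` are hypothetical names, and the file it creates is an empty placeholder rather than a real UAS report.

```python
import glob
import os
import tempfile


def list_reports(inpath):
    """Return the .xlsx paths under inpath, as glob reports them."""
    return sorted(glob.glob(os.path.join(inpath, "*.xlsx")))


def demo():
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        os.chdir(workdir)
        try:
            os.mkdir("UAS_files")
            # Empty placeholder file standing in for a report.
            open(os.path.join("UAS_files", "sample.xlsx"), "w").close()
            found = list_reports("UAS_files")
            # Joining the directory a second time yields a path that does not exist.
            doubled = os.path.join("UAS_files", found[0])
            return found, os.path.exists(found[0]), os.path.exists(doubled)
        finally:
            # Leave the temporary directory before it is removed.
            os.chdir(start)


found, direct_ok, doubled_ok = demo()
```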
@@ -86,7 +86,14 @@ def strait_razor_concat(indir, sexloci=False):
             filename, sep='\t', header=None,
             names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
         )
-        table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
+        try:
+            table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
+        except ValueError:
+            print(
+                f'Error found with {filename}. Will bypass and continue. Please check file'
+                f' and rerun the command, if necessary.'
+            )
+            continue
         table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
         table['SampleID'] = name
         table['Project'] = analysisID
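
The skip-on-error pattern added here can be shown without pandas. The sketch below is a plain-Python analogue and not the code in `format.py`: it unpacks each tab-separated line by hand, so a line without the `Locus:Allele` layout raises `ValueError` and the whole file is bypassed. `parse_rows`, `load_all` and the sample contents are hypothetical.

```python
def parse_rows(lines):
    """Parse STRait Razor-style lines; raise ValueError on a malformed line."""
    rows = []
    for line in lines:
        # Unpacking fails with ValueError unless there are exactly five fields.
        locus_allele, length, sequence, fwd, rev = line.rstrip("\n").split("\t")
        # Unpacking fails with ValueError unless there is exactly one colon.
        locus, allele = locus_allele.split(":")
        rows.append({
            "Locus": locus,
            "Allele": allele,
            "Sequence": sequence,
            "Total_Reads": int(fwd) + int(rev),
        })
    return rows


def load_all(files):
    """Parse every file, skipping (and recording) those that do not fit."""
    tables, skipped = {}, []
    for name, lines in files.items():
        try:
            tables[name] = parse_rows(lines)
        except ValueError:
            skipped.append(name)
            continue
    return tables, skipped


# Illustrative contents: one well-formed file and one unrelated .txt file.
files = {
    "sample1.txt": ["LOCUS1:12\t4\tACGT\t5\t7\n"],
    "notes.txt": ["this is not STRait Razor output\n"],
}
tables, skipped = load_all(files)
```

Skipping a bad file keeps a bulk run going, at the cost of the user having to notice the printed warning, which is why the original message asks them to check the file and rerun.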