Merge branch 'deconv-ISS218' of github.com:LadnerLab/PepSIRF into deconv-ISS218
SeanGolez committed Jul 22, 2024
2 parents 12421b9 + b73ab78 commit e9043e3
Showing 48 changed files with 19,885 additions and 15,030 deletions.
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

- #236, added functionality to the "-i" option in Subjoin so that it accepts a regex pattern instead of the name of a file containing sample/peptide names. The sample/peptide names from the score matrix file are filtered by whether they contain a match for the regex pattern.
- #234, added "--unmapped-reads-output" option to Demux, which writes all reads that have not been mapped to a sample/peptide to the specified filename.
- #233, changed Deconv "-t" option to accept a tab-delimited file with a column for each TaxID and a column for the score threshold to use for that TaxID. The original functionality still holds: if a number is included with the option, each TaxID will use that score threshold.
- #227, Demux outputs additional information about the total number of samples, the number of samples containing a given number of replicates, and the number of samples starting with "Sblk_". The replicate information will be written to the file provided with the option "--replicate_info".
- #223, Added "--exclude" option to Subjoin that changes the output data file to contain all of the input samples/peptides except the ones specified by the user.
- #221, Demux automatically truncates sequences in the library which are longer than the length provided through the "--seq" option. If a sequence is found to be shorter than the specified length, an error is thrown.
- #218, Added "--custom_id_name_map_info" option to Deconv which accepts a filename, the key column header, and the value column header in the file to use to link TaxIDs to taxon names. This option should be used instead of "--id_name_map" if the user wishes to define a tab-delimited ID name map.
- #210, Fixes crash in Link when a species does not have an associated ID. A single warning is logged which informs the user that some species have not been considered and where to find a list of those species, which should be reviewed.
- #152, Automated tests have been added and finished to test all recently added features and fixed issues in PepSIRF.
- #131, Provides more information in Enrich's failed enrichment output. Sample replicates which do not meet either threshold are identified in the output and are marked as either not meeting the minimum or maximum threshold.
- #56, Alters behavior of Demux when run in reference independent mode. In ref-independent mode, index toggling is turned off; therefore, if an exact match at the given index is not found, the read is discarded.
- #2, Adds a system to handle logging PepSIRF's progress when running. A default file name is automatically generated with the module name, current time and date. An option '--logfile' allows the user to provide a custom name for the log file.
- #36, Standardizes the order tied species are listed in Deconv output. If species names are provided, then the tied species are sorted alphabetically by their names; otherwise, they are sorted by their species ID.

## [1.6.0]
- #169, Added an option for FASTQ-level outputs to be generated by demux. This is done with the flag "-q" followed by a directory path where files will be generated.
- #178, in the case of a sample not having enriched peptides, enrich will now add a space to the empty file. This allows for better compatibility with deconv through Qiime2.
2 changes: 1 addition & 1 deletion README.md
@@ -4,7 +4,7 @@

## GPL-3.0-or-later

### Current Version: v1.5.1
### Current Version: v1.6.0

Visit our [GitHub Pages website](https://ladnerlab.github.io/PepSIRF/)

2 changes: 1 addition & 1 deletion docs/1-index.md
@@ -7,7 +7,7 @@ permalink: /
<img src="./assets/images/PepSIRF_logo_BW.png" alt="" width="1024">


### Current Version: v1.4.0
### Current Version: v1.6.0

### Please cite:
[https://arxiv.org/abs/2007.05050](https://arxiv.org/abs/2007.05050)
91 changes: 88 additions & 3 deletions docs/5-changelog.md
@@ -6,11 +6,96 @@ permalink: /changelog/

# Changelog

## Unreleased

<strong>Subjoin: added new feature (Issue #236).</strong> Added functionality to the "-i" option in Subjoin so that it accepts a regex pattern instead of the name of a file containing sample/peptide names. The sample/peptide names from the score matrix file are filtered by whether they contain a match for the regex pattern.
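
The filtering described for Issue #236 can be illustrated with a minimal Python sketch. PepSIRF itself is written in C++, so this is not its implementation: `filter_names` is a hypothetical helper, and the exact regex dialect and matching semantics that Subjoin uses are assumptions here. The sketch keeps a name when the pattern matches anywhere inside it.

```python
import re


def filter_names(names, pattern):
    """Keep the names that contain a match for the regex pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    # re.search matches anywhere in the name, mirroring "contain the regex pattern"
    return [name for name in names if regex.search(name)]


# Invented sample names, for illustration only
samples = ["Sblk_1", "Sblk_2", "Patient_A_1", "Patient_A_2"]
print(filter_names(samples, r"^Sblk_"))
print(filter_names(samples, r"_A_"))
```

Anchoring the pattern with `^` or `$` restricts the match to the start or end of a name, which is one way to select a whole family of samples without listing them in a file.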

<strong>Demux: added new feature (Issue #234).</strong> Added "--unmapped-reads-output" option to Demux, which writes all reads that have not been mapped to a sample/peptide to the specified filename.
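
As a rough illustration of what "unmapped" means for Issue #234, the following Python sketch selects every read that does not appear under any sample and formats it as a FASTQ record. It is a sketch under stated assumptions, not the Demux implementation: reads are modelled as plain dictionaries, they are compared by name, and the helper names `unmapped_reads` and `to_fastq` are hypothetical.

```python
def unmapped_reads(sample_map, all_reads):
    """Return the reads whose names do not appear under any sample (hypothetical helper)."""
    mapped_names = {read["name"] for reads in sample_map.values() for read in reads}
    return [read for read in all_reads if read["name"] not in mapped_names]


def to_fastq(read):
    """Format one read as a four-line FASTQ record."""
    return f"@{read['name']}\n{read['seq']}\n+\n{read['qual']}\n"


# Invented reads, for illustration only
all_reads = [
    {"name": "read_1", "seq": "ACGT", "qual": "IIII"},
    {"name": "read_2", "seq": "TTGA", "qual": "IIII"},
]
sample_map = {"sample_A": [all_reads[0]]}

for read in unmapped_reads(sample_map, all_reads):
    print(to_fastq(read), end="")
```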

<strong>Deconv: added new feature (Issue #233).</strong> Changed Deconv "-t" option to accept a tab-delimited file with a column for each TaxID and a column for the score threshold to use for that TaxID. The original functionality still holds: if a number is included with the option, each TaxID will use that score threshold.
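
The two modes of the "-t" option can be sketched as follows. This is an illustrative Python sketch rather than Deconv's C++ code. It assumes the file has two tab-separated columns (TaxID, then threshold) and no header row, and `resolve_thresholds` is a hypothetical name; the actual file layout should be checked against the Deconv documentation.

```python
from pathlib import Path


def resolve_thresholds(option_value, tax_ids):
    """Return {tax_id: threshold} from a single number or a tab-delimited file.

    Assumed file layout: TaxID, tab, threshold, one pair per line, no header.
    """
    try:
        shared = float(option_value)
    except ValueError:
        shared = None

    if shared is not None:
        # Original behaviour: one number applies to every TaxID
        return {tax_id: shared for tax_id in tax_ids}

    thresholds = {}
    for line in Path(option_value).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        tax_id, score = line.split("\t")[:2]
        thresholds[tax_id] = float(score)
    return thresholds


# Invented TaxIDs, for illustration only
print(resolve_thresholds("40", ["101", "202"]))
```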

<strong>Demux: added new feature (Issue #227).</strong> Demux outputs additional information about the total number of samples, the number of samples containing a given number of replicates, and the number of samples starting with "Sblk_". The replicate information will be written to the file provided with the option "--replicate_info".

<strong>Subjoin: added new feature (Issue #223).</strong> Added "--exclude" option to Subjoin that changes the output data file to contain all of the input samples/peptides except the ones specified by the user.

<strong>Demux: added new feature (Issue #221).</strong> Demux automatically truncates sequences in the library which are longer than the length provided through the "--seq" option. If a sequence is found to be shorter than the specified length, an error is thrown.
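
The truncation rule for Issue #221 is small enough to sketch directly. The Python below is illustrative only: `truncate_library` is a hypothetical stand-in for the C++ method `trunc_lib_seqs`, the library is modelled as a name-to-sequence dictionary, and keeping the leading bases (rather than trimming from the other end) is an assumption.

```python
def truncate_library(library, seq_length):
    """Cut every library sequence down to seq_length; reject shorter ones."""
    truncated = {}
    for name, seq in library.items():
        if len(seq) < seq_length:
            # A sequence shorter than the requested length is an error
            raise ValueError(
                f"{name} has length {len(seq)}, shorter than the required {seq_length}"
            )
        truncated[name] = seq[:seq_length]  # assumed: keep the leading bases
    return truncated


# Invented sequences, for illustration only
library = {"pep_1": "ACGTACGT", "pep_2": "ACGTAC"}
print(truncate_library(library, 6))
```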

<strong>Deconv: added new feature (Issue #218).</strong> Added "--custom_id_name_map_info" option to Deconv which accepts a filename, the key column header, and the value column header in the file to use to link TaxIDs to taxon names. This option should be used instead of "--id_name_map" if the user wishes to define a tab-delimited ID name map.

<strong>Link: added new feature (Issue #210).</strong> Fixes crash in Link when a species does not have an associated ID. A single warning is logged which informs the user some species have not been considered and where to find a list of those species which should be reviewed.

<strong>Test: added new feature (Issue #152).</strong> Automated tests have been added and finished to test all recently added features and fixed issues in PepSIRF.

<strong>Enrich: added new feature (Issue #131).</strong> Provides more information in Enrich's failed enrichment output. Sample replicates which do not meet either threshold are identified in the output and are marked as either not meeting the minimum or maximum threshold.

<strong>Demux: added new feature (Issue #56).</strong> Alters behavior of Demux when run in reference independent mode. In ref-independent mode, index toggling is turned off; therefore, if an exact match at the given index is not found, the read is discarded.

<strong>Logger: added new feature (Issue #2).</strong> Adds a system to handle logging PepSIRF's progress when running. A default file name is automatically generated with the module name, current time and date. An option '--logfile' allows the user to provide a custom name for the log file.

<strong>Deconv: added new feature (Issue #36).</strong> Standardizes the order tied species are listed in Deconv output. If species names are provided, then the tied species are sorted alphabetically by their names; otherwise, they are sorted by their species ID.
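
The ordering rule for tied species can be sketched in a few lines of Python. This is a hedged illustration, not Deconv's code: `order_tied_species` is a hypothetical helper, the IDs and names are invented, and sorting IDs numerically (rather than as strings) is an assumption, since the entry does not say which is used.

```python
def order_tied_species(tied_ids, id_to_name=None):
    """Order tied species by name when names are available, otherwise by ID."""
    if id_to_name:
        # Names provided: alphabetical by species name
        return sorted(tied_ids, key=lambda species_id: id_to_name[species_id])
    # No names: fall back to the species ID (assumed numeric ordering)
    return sorted(tied_ids, key=int)


# Invented IDs and names, for illustration only
tied = ["101", "20", "3"]
names = {"101": "Alpha virus", "20": "Gamma virus", "3": "Beta virus"}

print(order_tied_species(tied))
print(order_tied_species(tied, names))
```

A fixed ordering of this kind makes Deconv output reproducible across runs, which matters when outputs are compared in automated tests.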


## 1.6.0 | 2023-04-04

Version 1.6.0 adds several new features.

### New Features:

<strong>Demux: added new feature (Issue #169).</strong> Added an option for FASTQ-level outputs to be generated by demux. This is done with the flag "-q" followed by a directory path where files will be generated.

<strong>Enrich: added new feature (Issue #178).</strong> In the case of a sample not having enriched peptides, enrich will now add a space to the empty file. This allows for better compatibility with deconv through Qiime2.

<strong>Enrich: added new feature (Issue #137).</strong> Added an option for enrich to drop replicates with low raw read counts. This is done with the flag "-l" or "--low_raw_reads". If this functionality is invoked, dropped replicates will not be considered in the enrichment process, and the dropped replicates will be reported in the enrichment failure reasons file under "Removed Replicates": each line will contain the replicates removed from a sample.

<strong>Enrich: added new feature (Issue #131).</strong> Enrich now reports which replicates caused a raw read count threshold failure; and identifies if a replicate failed the maximum or minimum threshold.

<strong>Deconv: added new feature (Issue #161).</strong> Added a flag to deconv that allows the user to specify what string is expected at the end of each file containing enriched peptides (set to "\_enriched.txt" by default). If a file does not end in the specified string, deconv skips over that file.

<strong>Info: added new feature (Issue #149).</strong> Added a feature to info that generates a matrix of average counts given replicates. Two new flags must be included in order to use this feature: --rep_names and --get_avgs. --rep_names requires an input file with the names of the replicates that the user wants to generate a matrix of average counts for. --get_avgs requires an output file name where the matrix will be stored.


## 1.5.1 | 2022-09-10

Version 1.5.1 fixes a bug and adds a feature.

### New Features:

<strong>Enrich: added new feature (Issue #154).</strong> Altered behavior of enrich to produce blank sample file output for samples that failed enrichment.

### Bug Fixes:

<strong>Demux: bug fix (Issue #168).</strong> Fixed a bug introduced in release 1.5, where amino acid level output was overwritten with peptide level output. This no longer occurs.


## 1.5.0 | 2022-06-02

Version 1.5.0 adds multiple features and removes OMP support for Clang compilation.

### New Features:

<strong>Demux: added new feature (Issue #35).</strong> If samplenames or index name sets have duplicates in samplelist file, then those duplicates will be output to the terminal.

<strong>Demux: added new feature (Issue #57).</strong> Demux now has an additional option for providing a tab-delimited file with 5 ordered columns: 1) index name, which should correspond to a header name in the sample sheet, 2) read name, which should be either "r1" or "r2" to specify whether the index is in "--input_r1" or "--input_r2", 3) index start location (0-based, inclusive), 4) index length and 5) number of mismatches to allow. Note: the last three columns correspond to the info currently provided on the command line with "--f_index" and "--r_index" (or "--index1" and "--index2", with recent changes). With this feature, the demux module can now analyze an arbitrary number of indexes to be found in r1 or r2 input sequences.

<strong>Demux: added new feature (Issue #57).</strong> Demux output diagnostics may now provide more index matches for flexibility with demux changes in #57.

<strong>Demux: added new feature (Issue #138).</strong> Demux now automatically removes reference duplicates when running in a reference dependent mode.

<strong>Zscore: added new feature (Issue #105).</strong> A check is added that verifies the bins provided to the Z score module. It is no longer possible to run the Z score module with the wrong set of bins.

<strong>CMakelists: recognized issue with clang (Issue #162).</strong> Removed threading support on MacOS.

### Bug Fixes:

<strong>Demux: added bug fix (Issue #156).</strong> Solved memory race condition in demux created during development of this release.

<strong>Demux: added bug fix (Issue #163).</strong> Solved memory race condition in demux that created incorrect counts.


## 1.4.0 | 2021-07-09

Version 1.4.0 adds multiple features and one bug fix for s_enrich, p_enrich, and link. CMakelists has been updated and a new module ‘enrich’ has been introduced.


### New Features:

<strong>Module added: enrich (Issue #114).</strong> The p_enrich module was altered to allow for flexibility in the number of replicates for each sample and renamed ‘enrich’. This new module can now provide the functionality of both s_enrich and p_enrich, and therefore, these two modules will no longer be available. Additionally, this module is able to handle >2 replicates.
@@ -19,11 +104,11 @@ Version 1.4.0 adds multiple features and one bug fix for s_enrich, p_enrich, and

<strong>CMakelists: Big Sur support (Issue #117).</strong> ‘-Xpreprocessor’ has been added to the command setting CMake C++ flags in order to support compilation on Mac OS Big Sur.


### Bug Fixes:

<strong>Link: Issue #116.</strong> A vague and system-dependent error occurred when --protein_file sequence names were not found in the --meta file. Modifications have been made to properly handle this situation and provide a clear and consistent error message.


## 1.3.7 | 2021-06-28
Version 1.3.7 adds one feature and one bug fix to norm.

@@ -33,10 +118,10 @@ Version 1.3.7 adds one feature and one bug fix to norm.

<strong>Norm: Issue #104.</strong> The norm module help message for option (--peptide_score, -p) has been updated.


## 1.3.6 | 2021-06-09
Version 1.3.6 adds several features and fixes several issues in demux, zscore, and subjoin.


### New Features:

<strong>Demux: new warning (Issue #96).</strong> The module now includes a warning for the user when index names from the (--samplelist, -s) file are not included in the index fasta file (--index, -i).
69 changes: 69 additions & 0 deletions extensions/linkageMap2GMT.py
@@ -0,0 +1,69 @@
# sample command: python linkageMap2GMT.py -i PM1_linkageMap-species-k7_2024-03-26.tsv -o output.gmt
import argparse
import pandas as pd
import csv
import sys

parser = argparse.ArgumentParser(description="Convert Linkage Map into a GMT file")
parser.add_argument("-i", "--input-dir", type=str, metavar="", required=True, help="Filepath to input linkage map. [REQUIRED]")
parser.add_argument("-m", "--min-score", type=int, metavar="", required=False, default=1, help="Minimum score to filter species by. [DEFAULT=1]")
parser.add_argument("-o", "--output-dir", type=str, metavar="", required=True, help="Filepath where to write outputted GMT file. [REQUIRED]")
args = parser.parse_args()


# convert tsv linkage map to gmt
def linkage_map_2_gmt(
    input_filename: str,
    output_filename: str,
    min_score: int = 1
) -> None:

    df = pd.read_csv(
        input_filename, sep="\t", header=0
    ).rename(columns={"Linked Species IDs with counts": "Species"}
    ).dropna()

    # create dictionary with each peptide and its set of species
    peptide_dict = dict(zip(df["Peptide Name"], df["Species"]))

    for peptide, species in peptide_dict.items():
        peptide_dict[peptide] = str_to_set(str(species), min_score)

    # build a dictionary mapping each species to its tab-separated peptides
    species_dict = dict()
    for peptide in peptide_dict.keys():
        for species in peptide_dict[peptide]:
            if species not in species_dict.keys():
                species_dict[species] = peptide
            else:
                species_dict[species] += f"\t{peptide}"

    # each GMT row: species, empty description field, then the peptides
    with open(output_filename, "w") as gmt:
        for species in species_dict.keys():
            gmt.write(f"{species}\t\t{species_dict[species]}\n")

    print(f"Converted GMT file saved to: {output_filename}")

def str_to_set(
    val: str,
    min_score: int
) -> set:
    out_species = set()

    # get list of "species:count" entries
    spec_ls = val.split(",")

    for values in spec_ls:
        # species at 0 and score at 1
        split_id = values.split(":")

        # keep the species if its score is greater than or equal to the minimum
        if int(split_id[1]) >= min_score:
            # add species to set (leave as string)
            out_species.add(split_id[0])

    return out_species


if __name__ == "__main__":
    linkage_map_2_gmt(args.input_dir, args.output_dir, args.min_score)
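
The core of the conversion above is an inversion from "peptide to scored species" into "species to peptides". The following standalone sketch shows that step on invented data, using `collections.defaultdict` instead of string concatenation. The helper name `invert_links` and the sample peptide names, species IDs and scores are hypothetical; only the `species:count` field format and the empty description column are taken from the script.

```python
from collections import defaultdict


def invert_links(peptide_to_field, min_score=1):
    """Map each species ID to the peptides linked to it with a score >= min_score."""
    species_to_peptides = defaultdict(list)
    for peptide, field in peptide_to_field.items():
        kept = set()
        for entry in field.split(","):
            species_id, score = entry.split(":")
            if int(score) >= min_score:
                kept.add(species_id)
        for species_id in sorted(kept):
            species_to_peptides[species_id].append(peptide)
    return dict(species_to_peptides)


# Invented linkage data, for illustration only
links = {"pep_1": "101:12,202:3", "pep_2": "101:7"}

for species_id, peptides in invert_links(links, min_score=5).items():
    # GMT row: set name, empty description, then tab-separated members
    print(f"{species_id}\t\t" + "\t".join(peptides))
```

Collecting peptides in a list keeps them in input order, so the resulting GMT rows are reproducible for a given linkage map.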
21 changes: 21 additions & 0 deletions include/modules/demux/module_demux.h
@@ -98,6 +98,18 @@ class module_demux : public module
*/


/**
 * Truncates library sequences to the length provided by the user
 * @param seq_length Sequence length specified with the "--seq" option
 * @param lib_seqs Reference to library sequences received from the file
 *        specified by the "--library" option
 **/
void trunc_lib_seqs(
    std::size_t seq_length,
    std::vector<sequence> &lib_seqs
);


/**
* Method to zero a vector of size_t elements.
* @param vec Pointer to the vector to zero.
@@ -364,6 +376,15 @@ class module_demux : public module
**/
std::string get_sample_info( std::vector<sample>& samplelist, std::string outfile_name );

/**
 * Creates a single output FASTQ file containing all of the reads that have not been mapped to a sample/peptide
 * @param filename file to output to
 * @param samp_map FASTQ output map
 * @param reads_dup vector of all reads
 **/
void create_unmapped_reads_file( std::string filename,
    std::map<std::string, std::vector<fastq_sequence>> samp_map, std::vector<fastq_sequence> reads_dup );

};

#endif /* MODULE_DEMUX_HH_INCLUDED */
1 change: 1 addition & 0 deletions include/modules/demux/options_demux.h
@@ -35,6 +35,7 @@ class options_demux: public options
int min_phred_score;
int num_indexes;
std::string replicate_info_fname;
std::string unmapped_reads_fname;

bool translation_aggregation;
std::string fastq_out;