Releases: akikuno/DAJIN2
0.4.4
💥 Breaking
- Update the threshold from 5 to 0.5 at `identify_dissimilar_loci` to capture 1% minor alleles. Commit Detail
- Return smaller allele clustering labels (`labels_previous`) when the adjusted Rand index is sufficiently high to reduce predicted allele numbers. Commit Detail
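The label-selection rule above can be sketched as follows. This is a minimal illustration, not the DAJIN2 implementation: the function names, the `threshold` value of 0.95, and the assumption that the index is computed between two successive clustering results are all hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same reads."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:  # fewer than two reads: nothing to compare
        return 1.0
    # Count read pairs that share a cluster in both, in a, and in b.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:  # both labelings are trivial
        return 1.0
    return (sum_both - expected) / (maximum - expected)


def choose_labels(labels_previous, labels_current, threshold=0.95):
    """Keep the previous (coarser) labels when both labelings largely agree.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    if adjusted_rand_index(labels_previous, labels_current) >= threshold:
        return labels_previous
    return labels_current
```

The index is invariant to label renaming, so two labelings that partition the reads identically score 1.0 even if the cluster IDs differ.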
🔧 Maintenance
- Add a detailed description to `identify_dissimilar_loci` to clarify the purpose of the function. Commit Detail
- Rename `utils.io.check_excel_or_csv` to `utils.io.determine_file_type` for clarity. Commit Detail
- Update examples: In tyr_c230gt_01, the point mutation of Tyr was previously 0.7%, but has been increased to 1.0% by adding point mutation reads from tyr_c230gt_50. Commit Detail
- Rename `validate_columns_of_batch_file` in test_main.py. Commit Detail
- Add tests for `strand_bias_handler`. Commit Detail
- Add type hints and comments to `return_labels`. Commit Detail
0.4.3
📝 Documentation
- Update example dataset and a description of README.md/README_JP.md Commit Detail
🐛 Bug Fixes
- Update `preprocess.genome_fetcher_fetch_seq_coordinates` to accurately verify that the entire length of the input sequence is present within the reference sequence. Previously, partial 100% matches were inadvertently accepted; this revision ensures that the input sequence aligns with the reference over its full length. Commit Detail
- Update `report.bam_exporter` to be case-sensitive and consistent with directory names. This avoids errors caused by the difference between report/bam and report/BAM on Ubuntu, which is case-sensitive to directory names. Commit Detail
🔧 Maintenance
- Change `threshold_readnumber` at `labem_merger.merge_labels` from 10 to 5 to capture 1% alleles from 500 total reads. Commit Detail
- Update the `requirements.txt` to install a newer version of the library. Commit Detail
- Update `report.report_bam` and rename it to `report.bam_exporter`: Commit Detail
  - Use UUID instead of random number for the temporary file name.
  - Rename `realign` to `recalculate_sam_coodinates_to_reference` for the readability of the function name.
  - Add `convert_pos_to_one_indexed` to convert the 0-based position to 1-based position and suppress samtools warning.
    - Warning: `[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped`
  - Add tests for the `write_sam_to_bam` function
- Move `read_sam` function from sam_handler to io module. Commit Detail
- Rename `report.report_mutation` and `report.report_files` to `report.mutation_exporter` and `report.sequence_exporter` to be more explicit. Commit Detail
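The 0-based to 1-based conversion mentioned above can be sketched as follows. The SAM format defines POS (the fourth field) as a 1-based coordinate, which is why a mapped read with POS 0 triggers the samtools warning. The function below is an illustration only: its name, its input shape (one list of string fields per SAM line), and the assumption that every POS was computed as a 0-based offset are hypothetical and not taken from the DAJIN2 source.

```python
def shift_pos_to_one_based(sam_records):
    """Return a copy of the records with POS shifted from 0-based to 1-based.

    Assumes each record is a list of SAM fields and that every alignment's
    POS (index 3) was computed as a 0-based offset.
    """
    converted = []
    for record in sam_records:
        fields = list(record)
        # Header lines start with "@" and carry no POS field.
        if not fields[0].startswith("@"):
            fields[3] = str(int(fields[3]) + 1)
        converted.append(fields)
    return converted
```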
0.4.2
🔧 Maintenance
- Remove multi-mapping reads, as multi-mapping reads are mostly reads that are locally mapped to low-complexity regions. Commit Detail
- Create `preprocess.input_formatter.py` to gather formatting functions into a single module. Commit Detail
- Refactor `directory_manager.py` Commit Detail
- Refactor `preprocess.__init__.py` Commit Detail
- To increase cohesion by grouping functions of the same category into a single module, we have migrated `preprocess.fastx_parser` to `utils.fastx_handler`. Commit Detail
- Remove the packages that are no longer in use from `requirements.txt`. Commit Detail
- Add `read_sam` in sam_handler module. Commit Detail
- Revise the docstring of `export_fasta_files`. Commit Detail
- Standardize to use `dataclass` instead of `NamedTuple`. Commit Detail
0.4.1
📝 Documentation
- Added documentation for a new feature in `README.md`: DAJIN2 can now detect complex mutations characteristic of genome editing, such as insertions occurring in regions where deletions have occurred.
🚀 New Features
- Introduced `cssplits_handler.detect_insertion_within_deletion` to extract insertion sequences within deletions. This addresses cases where minimap2 may align bases that partially match the reference through local alignment, potentially failing to detect them as insertions. This enhancement ensures the proper detection of insertion sequences. Commit Detail
- Added `report.insertion_refractor.py` to include original insertion information in the consensus for mappings made by insertion. This addition enables the listing of both insertions and deletions within the insertion allele on a single HTML file. Commit Detail
🔧 Maintenance
- Updated `insertions_to_fasta.py`. Commit Detail
  - Modified the approach to reduce randomness by replacing set or frozenset with list or tuple, and using `random.sample()` for subsetting reads.
  - Refactored `call_consensus_insertion_sequence`.
  - Fixed a bug in `extract_score_and_sequence` to ensure correct appending of scores for the insertions_merged_subset.
- Changed the function name of `report` to be more explicit. Commit Detail
- Updated `utils.report_report_generator` Commit Detail
  - Capitalized "Allele" (e.g., control) and "Allele type" (e.g., intact).
  - Changed the output format of read_all and read_summary from CSV to XLSX.
  - Corrected the order of the Legend to follow a logical sequence from control to sample, and then to specific insertions.
- Updated `utils.io.read_xlsx` to switch from using pandas to openpyxl due to the DeprecationWarning in Pandas being cumbersome. Commit Detail
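The point about reducing randomness in `insertions_to_fasta.py` can be illustrated with a short sketch. Iteration order over a set of strings can differ between interpreter runs, and since Python 3.11 `random.sample()` no longer accepts a set at all, so keeping reads in a list or tuple and sampling with a seeded generator gives a reproducible subset. The function name and the seed below are hypothetical.

```python
import random


def subset_reads(reads, n, seed=1):
    """Draw a reproducible subset of at most n reads from a list or tuple.

    `seed` is an illustrative default; a sequence (not a set) keeps the
    input order stable so the same seed always yields the same subset.
    """
    ordered = list(reads)
    rng = random.Random(seed)
    return rng.sample(ordered, min(n, len(ordered)))
```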
🐛 Bug Fixes
- Added `=` to the prefix for valid cstag recognition when there is an `n` in inversion. Commit Detail
- Modified the io.load_from_csv function to trim spaces before and after each field, addressing an error caused by spaces in batch.csv. Commit Detail
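Trimming whitespace around each CSV field, as described in the last fix above, can be sketched with the standard `csv` module. The function name `load_csv_trimmed` and the column names in the usage example are hypothetical, and this is not the body of `io.load_from_csv`.

```python
import csv
import io


def load_csv_trimmed(handle):
    """Read CSV rows from a text handle, stripping spaces around each field."""
    reader = csv.reader(handle)
    # Skip completely empty lines and strip leading/trailing whitespace.
    return [[field.strip() for field in row] for row in reader if row]


# Usage with an in-memory file that contains stray spaces:
rows = load_csv_trimmed(io.StringIO("sample, control ,name\n a.fq,b.fq , exp1\n"))
```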
⛔️ Deprecated
- Removed `reads_all.csv`. This CSV file, which showed the allele for each read, is no longer reported due to its limited usefulness and because the same information can be obtained from the BAM file. Commit Detail
0.4.0
💥 Breaking
- Changed the input from a path to a FASTQ file to a path to a directory: The output of Guppy is now stored in multiple FASTQ files under the `barcodeXX/` directory. Previously, it was necessary to combine the FASTQ files in the `barcodeXX/` directory into one and specify it as an argument. With this revision, it is now possible to directly specify the `barcodeXX` directory, allowing users to seamlessly proceed to DAJIN2 analysis after Guppy processing. Commit Detail
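Accepting a directory instead of a single file means the tool has to enumerate the FASTQ files itself. A minimal sketch of that step, assuming the files sit directly inside the directory and use one of the common suffixes listed below (both the function name and the suffix list are assumptions, not DAJIN2's actual rules):

```python
from pathlib import Path

# Hypothetical list of accepted suffixes.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def list_fastq_files(directory):
    """Return the FASTQ files directly inside `directory`, sorted by path."""
    directory = Path(directory)
    if not directory.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES)
    )
```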
📝 Documentation
- Changed `conda config --set channel_priority strict` to `conda config --set channel_priority flexible` for installation process in TROUBLESHOOTING.md. Commit Detail
🚀 New Features
- Apple Silicon (ARM64) support. Commit Detail
- Changed the definition of the minor allele from a read number of less than or equal to 10 to less than or equal to 5. This is based on the assumption that one sample contains 1000 reads, where 0.5% corresponds to 5 reads. Commit Detail
🔧 Update
- Updated `preprocess.insertion_to_fasta`: to facilitate the discrimination of insertion alleles, the reference for insertion alleles is now saved in the FASTA/HTML directory. Commit Detail
- Updated `insertions_to_fasta.extract_enriched_insertions`: Previously, it calculated the presence ratio of insertion alleles separately for samples and controls, filtering at 0.5%. However, due to a threshold issue, some control insertions were narrowly missing the threshold, resulting in them being incorrectly identified as sample-specific insertions. To rectify this, the algorithm now clusters samples and controls together, excluding clusters where both types are mixed. This modification allows for the extraction of sample-specific insertion alleles. Commit Detail
- Updated the counting method of `preprocess.insertions_to_fasta.count_insertions` to treat similar insertions as identical. Previously, the same insertion was erroneously counted as different ones due to sequencing errors. Commit Detail
- Updated `preprocess.insertions_to_fasta.merge_similar_insertions`: Previously, clustering was done using MiniBatchKMeans, but this method had an issue where it excessively clustered when only highly similar insertion sequences existed. Therefore, a strategy similar to `extract_enriched_insertions` was adopted, changing the algorithm to one that mixes with a uniform distribution of random scores before clustering. Commit Detail
- Added `preprocess.insertions_to_fasta.clustering_insertions`: Combined the clustering methods used in `extract_enriched_insertions` and `merge_similar_insertions` into a common function. Commit Detail
- Moved the `call_sequence` function to the `cssplits_handler` module. Commit Detail
🐛 Bug Fixes
- Debugged `clustering.merge_labels` so that it correctly reverts minor labels back to parent labels. Commit Detail
- Updated `utils.input_validator.validate_genome_and_fetch_urls` to obtain `available_server` more explicitly. Previously, it relied on HTTP response codes, but there were instances where the UCSC Genome Browser showed a normal (200) response while internally being in error. The function now searches for specific keywords present in the normal HTML to determine whether the server is functioning correctly. Commit Detail
- Added `config.reset_logging` to reset the logging configuration. Previously, when batch processing multiple experiment IDs (names), the log settings from previous experiments remained and the log file name was not updated. With this change, a log file is created for each experiment ID. Commit Detail
- Debugged `core.py`: Modified the specification of `paths_predefined_fasta` to accept input from user-entered ALLELE data. Previously, it accepted fasta files stored in the fasta directory. However, this approach had a bug where fasta files left over from a previously aborted run (which included newly created insertions) were treated as predefined. This resulted in new insertions being incorrectly categorized as predefined. Commit Detail
0.3.6
📝 Documentation
- Added a quick guide for installation to TROUBLESHOOTING.md. Commit Detail
🚀 Update
Preprocess
- Updated `input_validator.py`: The UCSC Blat server sometimes returns a 200 HTTP status code even when an error occurs. In such cases, "Very Early Error" is indicated in the title, so the validator now returns False in those situations. Commit Detail
- Simplified `homopolymer_handler.py` for error detection using cosine similarity. Commit Detail
- Updated `mutation_extractor.py` to use cosine similarity to filter dissimilar loci. Commit Detail
- Updated `mutation_extractor.identify_dissimilar_loci` so that it unconditionally returns True if the 'sample' shows more than 5% variation compared to the 'control'. Commit Detail
- Added `preprocess.midsv_caller.convert_consecutive_indels_to_match`: Due to alignment errors, a true match is sometimes mistakenly replaced with "insertion following a deletion"; such instances are now corrected. For example, "=C,=T" mistakenly replaced by "-C,+C|=T" is reverted back to "=C,=T". Commit Detail
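The "-C,+C|=T" to "=C,=T" example above can be reproduced with a small sketch. It is deliberately simplified and is not the DAJIN2 implementation: the function name is hypothetical, and it only handles a deletion that is immediately followed by an insertion of exactly the same sequence.

```python
def revert_indel_pairs(midsv):
    """Turn "-X,+X|rest" back into "=X,rest" in a comma-separated MIDSV string.

    Simplified sketch: only a deletion directly followed by a token whose
    first insertion carries the same sequence is reverted.
    """
    tokens = midsv.split(",")
    result = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if current.startswith("-") and "|" in following:
            inserted, _, remainder = following.partition("|")
            if inserted == "+" + current[1:]:
                # The deleted sequence was re-inserted right away: a match.
                result.append("=" + current[1:])
                result.append(remainder)
                i += 2
                continue
        result.append(current)
        i += 1
    return ",".join(result)
```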
Classification
- Added `allele_merger.merge_minor_alleles` to reclassify alleles with fewer than 10 reads to suppress excessive subdivision of alleles. Commit Detail
Clustering
- Added the function `merge_minor_cluster` to revert labels clustered with fewer than 10 reads back to the previous labels to suppress excessive subdivision of alleles. Commit Detail
- Updated `generate_mutation_kmers` to consider indices not registered in mutation_loci as mutations by replacing them with "@". For example, "=G,=C,-C" and "=G,=G,=C" become "@,@,@" in both cases, making them the same and ensuring they do not affect clustering. Commit Detail
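The masking rule above can be sketched as follows. This is an illustration under assumptions, not the real `generate_mutation_kmers`: here `mutation_loci` is simplified to a set of indices, the k-mer is centred on each index, and positions too close to either end are also masked.

```python
def generate_masked_kmers(cssplits, mutation_loci, k=3):
    """Build one k-mer per position, masking positions outside mutation_loci.

    `cssplits` is a list of per-base tokens such as "=G" or "-C";
    `mutation_loci` is assumed to be a set of indices (a simplification).
    """
    half = k // 2
    masked = ",".join(["@"] * k)
    kmers = []
    for i in range(len(cssplits)):
        if i in mutation_loci and half <= i < len(cssplits) - half:
            kmers.append(",".join(cssplits[i - half : i + half + 1]))
        else:
            # Unregistered loci collapse to the same token, so they
            # contribute nothing that distinguishes reads in clustering.
            kmers.append(masked)
    return kmers
```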
Consensus
- Implemented `LocalOutlierFactor` to filter abnormal control reads. Commit Detail
0.3.5
Last update: 2023-12-23
📝 Documentation
- Added `ROADMAP.md` to track the progress of the project Commit Detail
- Added Prerequisites section to README.md Commit Detail
🚀 Features
Preprocessing
- Updated `homopolymer_handler.get_counts_homopolymer` to count mutations in homopolymer regions considering only the control Commit Detail
Clustering
- Changed clustering algorithm from KMeans to BisectingKMeans to handle larger dataset Commit Detail
Consensus
- Added `convert_consecutive_indels_to_match` to offset the effect when the same base insertion/deletion occurs consecutively Commit Detail
- Added `similarity_searcher.py` to extract control reads resembling the consensus sequence, thereby enhancing the accuracy of detecting sample-specific mutations. Commit Detail
- Changed the method in `clust_formatter.get_thresholds` to dynamically define the thresholds for ignoring mutations, instead of using fixed values. Commit Detail
- Removed code that was previously commented out Commit Detail
🐛 Bug Fixes
- None
🔧 Maintenance
- Modified batch processing to run on a single CPU thread per process Commit Detail
- Simplified import path Commit Detail
  - `preprocess.midsv_caller.execute` to `preprocess.generate_midsv`
  - `preprocess.mapping.generate_sam` to `preprocess.generate_sam`
- Added tests to `consensus.convert_consecutive_indels_to_match` Commit Detail
⛔️ Deprecated
- None
0.3.4
📖 Documentation
- Added docs/TROUBLESHOOTING.md
- Added docs/CODE_OF_CONDUCT.md
- Added docs/CONTRIBUTING.md
✨ New Features
- None
🔧 Maintenance
Update preprocess.mutation_extractor.py
- `count_indels`:
  - Change: Method of counting indels modified to use only matches as the denominator, instead of matches + indels.
  - Reason: To specifically focus on the occurrence rate of particular mutations.
- `find_dissimilar_indices`:
  - Change: Mutation detection modified. If the p-value remains < 0.05 after removing the target base sequence, the area is not detected as a mutation, assuming the significance is due to other parts.
  - Implication: Increases mutation detection accuracy by excluding irrelevant base sequences.
- `merge_index_of_consecutive_indel`:
  - Change: Merged `merge_surrounding_index` and `merge_index_of_consecutive_insertions` into a single function.
  - Benefit: Streamlines the process and enhances efficiency in handling consecutive indels.
Update `consensus.consensus.py`:
- Addressed a precision issue in floating-point calculations where N equals 100%, leading to `100 != 100.000002`. Changed the condition to "having only one key and that key being `N`". Commit details
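The revised condition can be sketched as below. The function name and the shape of the input (a dict mapping each observed token to its percentage) are assumptions made for illustration. Checking the keys avoids comparing an accumulated float against exactly 100.

```python
def is_all_n(percentages):
    """True when "N" is the only key, regardless of rounding in its value."""
    return len(percentages) == 1 and "N" in percentages


# An equality test on the value fails because of accumulated float error,
# while the key-based test does not depend on the value at all.
observed = {"N": 100.000002}
equality_check = observed["N"] == 100
key_check = is_all_n(observed)
```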
Update `mutation_extractor.py`:
- Switched to the Wilcoxon signed-rank test due to false negatives in the t-test for data with peak-like shapes. Commit details
Others
- Modified batch processing to run on a single CPU thread per process.
- Added `clust_formatter.cache_mutation_loci`.
- Changed `mutation_extractor.merge_loci` to use union instead of intersection.
- Added a filter for minor insertion alleles in `insertions_to_fasta.py`.
- Moved `insertion_to_fasta.save_fasta` to `utils.io.save_fasta`.
0.3.3
📖 Documentation
- Added troubleshooting.md
✨ New Features
- Excluded the letter 'N' except when all bases are 'N' (which indicates reads with missing ends).
- Upon successful completion, the log file is now moved to the report directory (DAJIN_Results/{name}).
🔧 Modification
- Changed from OneClassSVM to k-means for anomaly detection (d97d32a)
🧰 Maintenance
- Set up weekly tests to run on GitHub Actions.
0.3.2
📖 Documentation
- Revisions to README.md and README_jp.md
- Added a note to the README to install gcc and zlib when encountering installation errors for mappy via pip
✨ New Features
None
🛠️ Maintenance
- Refactoring of `main.py`
  - config.set_single_threaded_blas
  - config.set_logging
  - utils.multiprocess
- Verified operation with the latest `cstag` (v1.0.5)
- Stopped generating a log file on every execution
  - It's troublesome to have an empty log every time you check for help or version
  - Ensured log files are only generated at appropriate times (like during logging.info) or in case of unexpected errors
- Added `convert_cssplits_to_cstag` to `utils.cssplits_handler`
  - Converted cssplits to cstag, ensuring to_html operates without issues
  - However, the existing CS tag doesn't represent inversion, so further consideration is needed on how to handle this
- Added tests for `convert_cssplits_to_cstag`
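One way to get the log-file behaviour described in the list above (no empty log when only help or version is requested) is the `delay=True` option of `logging.FileHandler`, which defers opening the file until the first record is emitted. Whether DAJIN2 uses this option is not stated in these notes; the function and logger names below are hypothetical.

```python
import logging


def build_lazy_file_logger(log_path, name="lazy_sketch"):
    """Create a logger whose log file appears only once something is logged."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # delay=True postpones opening (and thus creating) the file until
    # the first call that actually emits a record.
    logger.addHandler(logging.FileHandler(log_path, delay=True))
    return logger
```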