Releases: akikuno/DAJIN2
0.5.6
💥 Breaking
- Support for PacBio HiFi reads. [Commit Detail]
- Add `preprocess.sequence_error_handler` to exclude Nanopore sequence errors from the analysis. Issue: #60
  - Initial commit [Commit Detail]
  - Since most Nanopore sequencing errors occur due to read interruptions, `parse_midsv_from_csv` classifies entries as either Unknown or Other (M). [Commit Detail]
  - Instead of strategies like cosine similarity or HDBSCAN, the Jaro-Winkler distance is explicitly used as a string similarity metric. Jaro-Winkler was chosen because Levenshtein would be too time-consuming. [Commit Detail]
- Add `sr` presets to all executions in `preprocess.mapping`. Issue: #55 [Commit Detail]
- Increase the sensitivity by lowering the mutation detection threshold from 0.5% to 0.1% to detect mutations around 0.75%. [Commit Detail]
- Use `AgglomerativeClustering` instead of Constrained KMeans because AgglomerativeClustering provides a more global clustering approach, and Constrained KMeans was not very useful due to the unreliability of its `min_cluster_size`. [Commit Detail]
- Output sequence error reads as `BAM/{name}/sequence_errors.bam`. Issue: #61 [Commit Detail]
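As a rough illustration of the string metric named above, here is a minimal pure-Python sketch of the Jaro-Winkler distance (1 minus the Jaro-Winkler similarity). The function names, the prefix scale of 0.1, and the 4-character prefix cap are the conventional textbook choices and are assumptions here; this is not DAJIN2's implementation, which may rely on a library.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Distance form: 0 means identical strings."""
    return 1.0 - jaro_winkler(s1, s2)
```

The matching loop is bounded by a window around each position, which is part of why this metric is usually cheaper than a full Levenshtein dynamic-programming table on long strings.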
🚀 Performance
- Downsample the sample reads to a maximum of 10,000. Issue: #58 [Commit Detail]
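The cap can be sketched as random sampling without replacement. The function name, the fixed seed, and the choice of uniform random sampling are assumptions for illustration; the release note does not say how DAJIN2 selects the retained reads.

```python
import random


def downsample_reads(reads, max_reads=10_000, seed=0):
    """Return at most max_reads reads (hypothetical helper).

    reads must be a sequence such as a list. A fixed seed keeps the
    subset reproducible between runs.
    """
    if len(reads) <= max_reads:
        return list(reads)
    rng = random.Random(seed)
    return rng.sample(reads, max_reads)
```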
🐛 Bug Fixes
- Fix a bug where a dict element with an empty value was left behind after minor insertions were removed. [Commit Detail]
🔧 Maintenance
-
With the end of security support for Python 3.8 in October 2024, we have updated DAJIN2 to support Python 3.9 or later. [Commit Detail]
- Replace `typing.Generator` with `collections.abc.Iterator`, since `typing.Generator` is deprecated. Issue: #53 [Commit Detail]
- Automatically retrieve version information using `importlib.metadata.version`. Issue: #59 [Commit Detail]
- Move the FASTX IO processing to `utils.io`. Issue: #66 [Commit Detail]
- Add E2E tests in GitHub Actions. [Commit Detail]
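For reference, retrieving a version with `importlib.metadata.version` might look like the sketch below. The helper name `get_version` and the `"unknown"` fallback are assumptions, not DAJIN2's code; the fallback only covers the case where the package is not installed in the current environment.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(package: str = "DAJIN2") -> str:
    """Return the installed version of a package (hypothetical helper)."""
    try:
        return version(package)
    except PackageNotFoundError:
        # Package metadata is unavailable, e.g. when running from a
        # source checkout that has not been installed.
        return "unknown"
```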
0.5.5.1
This is a patch for version v0.5.5.
An unfinished inversion detection program had mistakenly been included in the production code.
Since the inversion detection program is scheduled for implementation in version v0.5.6 or later, the code in question has been removed.
0.5.5
📝 Documentation
- Add `FAQ.md` and `FAQ_JP.md` to address the question: "Why is the read count of the Control sample lower in the output BAM file?". [Commit Detail]
🔧 Maintenance
- Integrating insertion and inversion detection: Issue #31
  - Add sv_handler [Commit Detail]
  - Modify arguments of `is_insertion` to `is_sv` [Commit Detail]
  - Rename `insertions_to_fasta.generate_insertions_fasta` to `insertion_detector.detect_insertions` because the function generates not only FASTA files but also the CSV tag. [Commit Detail]
- Remove the unused dependency `networkx`: Issue #49 [Commit Detail]
0.5.4
💥 Breaking
- Use simulated annealing to optimize cluster assignments in `clustering.constrained_kmeans` [Commit Detail]
  - Since `ortools` is not installable on osx-arm64 in Bioconda, I implemented an alternative method, simulated annealing, to solve min_cost_flow.
- Change the criteria for terminating clustering. [Commit Detail]
  - The following termination criteria have been added:
    - Minimum cluster size is less than or equal to 0.5% of the sample's read number.
    - Decrease in the proportion of samples with a silhouette score of 0.25 or higher.
  - The following termination criterion has been removed:
    - Adjusted Rand Index >= 0.95, as it led to early termination when minor clusters were generated.
- The threshold for `clustering.strand bias` determination has been loosened. [Commit Detail]
  - This adjustment addresses cases like `+:13, -:2` (0.87) observed in `example_flox/flox-1nt-deletion`.
  - Since the minor allele is particularly susceptible, further adjustments may be necessary in the future.
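The 0.87 figure for `+:13, -:2` is the fraction of reads on the positive strand, 13 / (13 + 2). A minimal sketch of that arithmetic follows; the function names and the 0.9 threshold are hypothetical placeholders, since the release note does not state DAJIN2's actual threshold.

```python
def positive_strand_ratio(n_plus: int, n_minus: int) -> float:
    """Fraction of reads mapped to the positive strand."""
    total = n_plus + n_minus
    if total == 0:
        raise ValueError("no reads")
    return n_plus / total


def is_strand_biased(n_plus: int, n_minus: int,
                     threshold: float = 0.9) -> bool:
    """Flag a cluster as biased when one strand dominates.

    threshold is a placeholder value, not DAJIN2's setting.
    """
    ratio = positive_strand_ratio(n_plus, n_minus)
    return ratio > threshold or ratio < 1 - threshold


print(round(positive_strand_ratio(13, 2), 2))  # 0.87
```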
🌟 New Features
- Support for Apple Silicon (osx-arm64) in Bioconda 🍎 Issue: #46
0.5.3
💥 Breaking
- Update `clustering.clustering`: Use Constrained KMeans clustering to address the issue of cluster imbalance where extremely minor clusters were preferentially separated. Set `min_cluster_size` to 0.5% of the sample read count. [Commit Detail]
  - As a result, `clustering.label_merger.py` is no longer needed and has been removed.
- Update `consensus.call_consensus`: For mutations determined to be sequence errors, we previously replaced them with unknown (`N`), but this `N` had low interpretability. Therefore, mutations that DAJIN2 determines to be sequence errors will now be assigned the same base as the reference genome. [Commit Detail]
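The consensus change can be pictured as a per-position substitution. This is a simplified sketch on plain strings with a hypothetical function name and a hypothetical set of error positions; DAJIN2 works on its own internal representation rather than raw strings.

```python
def replace_errors_with_reference(consensus: str, reference: str,
                                  error_positions: set) -> str:
    """Use the reference base wherever a call was judged a sequence error.

    error_positions holds 0-based indices (an assumption for this sketch).
    """
    if len(consensus) != len(reference):
        raise ValueError("sequences must have the same length")
    return "".join(
        ref if i in error_positions else base
        for i, (base, ref) in enumerate(zip(consensus, reference))
    )
```

With this behavior, a position that would previously have been reported as `N` reads as the reference base, while calls outside `error_positions` are left untouched.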
🐛 Bug Fixes
- Due to a bias in `classifier.calc_match` where alleles with shorter sequences were prioritized, the operation of dividing by sequence length has been removed. [Commit Detail]
- Fix `preprocess.mapping.generate_sam` to perform alignments with `map-ont` and `splice` in addition to `sr` for sequence lengths of 500 bp or less, and select the optimal preset from these alignments. Issue: #45 [Commit Detail]
0.5.2
📝 Documentation
- Add `FAQ.md` and `FAQ_JP.md` to provide answers to questions. [Commit Detail]
🌟 New Features
- Update `mutation_extractor` [Commit Detail]
  - Simplified the logic of the `is_dissimilar_loci` if statement. Additionally, changed the threshold for determining a mutation in Consensus from 75% to 50% (to accommodate the insertion allele in Cas3 Tyr Barcode10).
  - Updated `detect_anomalies` to use MLPClassifier to detect mutations more flexibly and accurately compared to the previous threshold setting with MiniBatchKMeans.
🔧 Maintenance
-
Make DAJIN2 compatible with Python 3.11 and 3.12. Issue: #43 [Commit Detail]
- pysam and mappy builds with Python 3.11 and 3.12 are now available on Bioconda.
-
Update GitHub Actions to test with Python 3.11 and 3.12. Issue: #43 [Commit Detail]
- Resolve the B023 "Function definition does not bind loop variable `alignment_lengths`" issue. [Commit Detail]
- Add `question.yml` to the GitHub Issue templates. [Commit Detail]
🐛 Bug Fixes
- Update `cssplits_handler._get_index_of_large_deletions`: Modified to split large deletions when a match of 10 or more bases is found within the identified large deletion. Issue: #42 [Commit Detail]
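The splitting rule can be sketched on a simplified per-base string where `D` marks a deleted base and `M` a matched base. The function name, the input encoding, and the half-open coordinates are assumptions for illustration; the real function operates on DAJIN2's cssplits data.

```python
from itertools import groupby


def split_large_deletion(ops: str, min_match: int = 10):
    """Split one large-deletion candidate at long runs of matches.

    ops is assumed to start and end with 'D'. Returns half-open
    (start, end) segments. Match runs shorter than min_match stay
    inside a segment; runs of min_match or more close the segment.
    """
    segments = []
    start = None
    pos = 0
    for op, run in groupby(ops):
        length = len(list(run))
        if op == "M" and length >= min_match:
            if start is not None:
                segments.append((start, pos))
                start = None
        elif start is None:
            start = pos
        pos += length
    if start is not None:
        segments.append((start, pos))
    return segments
```

For example, a 3-base match inside a deletion is absorbed into it, whereas a 12-base match separates the span into two deletions.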
0.5.1
🚀 New Features
- Accept additional file formats as input. Issue: #37
- FASTA [Commit Detail]
- BAM [Commit Detail]
📝 Documentation
- Add a description of the procedure for accepting files generated by Dorado basecaller as input. Issue: #37 [Commit Detail]
🔧 Maintenance
-
Specify the Python version to be between 3.8 and 3.10. [Commit Detail]
- Change `mutation_exporter.report_mutations` to return `list[list[str]]`. Update the tests accordingly. [Commit Detail]
- Apply formatting with Ruff [Commit Detail]
🐛 Bug Fixes
- Add `reallocate_insertion_within_deletion` to `report.mutation_exporter` and reflect it in the mutation info. [Commit Detail]
0.5.0
📝 Documentation
- Update the issue template from md to yml and modify it to make it easier for users to fill out each item. [Commit Detail]
💥 Breaking
- Extremely low-frequency alleles (less than 0.05%) are considered Nanopore sequence errors and are not clustered. Issue: #36
  - Configure `clustering.extract_labels` so that alleles with a low number of reads (0.05% or fewer or 5 reads or fewer) are not clustered. [Commit Detail]
  - Change `clustering.clustering` to stop if the minimum value of the elements in the cluster is 0.5% or less. [Commit Detail]
  - Add `consensus.remove_minor_alleles` to remove minor alleles with fewer than 5 reads or less than 0.5% [Commit Detail]
-
Save subsetted fastq of a control sample if the read number is too large (> 10,000 reads). The control will have a maximum of 10,000 reads to avoid excessive computational load. [Commit Detail]
- If the read length is 500 bases or less, change the mappy preset to `sr`. [Commit Detail]
- Update `extract_best_preset` to prioritize `map-ont` and remove the `splice` preset if an inversion is observed. [Commit Detail]
- Update the algorithm of `cssplits_handler.reallocate_insertion_within_deletion` to automate change point detection by incorporating temporal changes. [Commit Detail]
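The minor-allele rule stated for `consensus.remove_minor_alleles` (fewer than 5 reads or less than 0.5%) can be sketched as a filter over per-label read counts. The input shape, a dict mapping label to read count, and the parameter names are assumptions; this is not the actual function body.

```python
def remove_minor_alleles(read_counts: dict, min_reads: int = 5,
                         min_fraction: float = 0.005) -> dict:
    """Keep labels with at least min_reads reads and min_fraction of reads.

    read_counts maps a cluster label to its read count (hypothetical
    input shape for this sketch).
    """
    total = sum(read_counts.values())
    return {
        label: n
        for label, n in read_counts.items()
        if n >= min_reads and n / total >= min_fraction
    }
```

A label must pass both conditions to survive: with 10,000 reads in total, a cluster of 6 reads clears the read-count floor but is still dropped because it is below 0.5%.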
🔧 Maintenance
- Update `deploy_pypi.yml` to use the latest version of Actions. Refer to the latest official YAML for guidance. [Commit Detail]
- Integrate `requirements.txt` and `MANIFEST.in` into `pyproject.toml` by replacing `setup.py` [Commit Detail]
- Record the execution command of DAJIN2 in the log file [Commit Detail]
- Add a test to check if the version in `test_version.sh` matches the version in `pyproject.toml` and `utils.config` [Commit Detail]
- Rename `consensus.subset_clust` to `consensus.downsample_by_label` to clarify the function's purpose. [Commit Detail]
- Update `extract_unique_insertions` to merge highly similar extracted insertion sequences. [Commit Detail]
  - Fix `extract_unique_insertions`: There was a bug where removing a key twice from `fasta_insertions_unique` caused the index and key to become misaligned in `enumerate(distances)` when `i != key`. Therefore, the removal of keys from `fasta_insertions_unique` is now done all at once at the end. [Commit Detail]
- Add control characters to the forbidden characters in `fastx_handler.sanitize_filename`. [Commit Detail]
- Change the naming convention for the temporary directory: `<sample_name>/<process_content>/<allele_name>/(<label_name>)/file_name`. Example: `flox/consensus/control/1/mutation_loci.pickle`. [Commit Detail]
- Move the `sanitize_name` function from `utils.fastx_handler` to `utils.io` [Commit Detail]
🐛 Bug Fixes
- Remove `sam_handler.remove_overlapped_reads` to prevent unnecessary trimming of reads. [Commit Detail]
- Fix `preprocess.insertions_to_fasta.remove_minor_groups` to delete the keys (insertion loci) when insertions are removed and result in an empty dict. This prevents errors when accessing non-existent keys in `subset_insertions`. [Commit Detail]
- Fix the bug in `cssplits_handler.convert_cssplits_to_cstag` where the insertion cs tag is not merged with the next cs tag if they have the same operator (e.g., `+A|+A|=T, =T`: before: `+aa=T=T`, after: `+aa=TT`). [Commit Detail]
- Modify the system to separate intermediate files using a directory structure instead of underscores (`_`), ensuring that no errors occur even if users use allele names containing underscores [Commit Detail]
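The cs tag merge in the example `+A|+A|=T, =T` can be reproduced with a small sketch that handles only the `=` (match) and `+` (insertion) operators. The function name and the string-based input are assumptions; the real `convert_cssplits_to_cstag` covers more operators and edge cases.

```python
def merge_cssplits_to_cstag(cssplits: str) -> str:
    """Concatenate units, merging neighbors that share an operator.

    Simplified: supports only '=' and '+' units. Inserted bases are
    lowercased, as in the example from the release note.
    """
    merged = []  # list of [operator, bases]
    for token in cssplits.split(","):
        for unit in token.strip().split("|"):
            if not unit:
                continue
            op, bases = unit[0], unit[1:]
            if op == "+":
                bases = bases.lower()
            if merged and merged[-1][0] == op:
                merged[-1][1] += bases  # same operator: extend the run
            else:
                merged.append([op, bases])
    return "".join(op + bases for op, bases in merged)


print(merge_cssplits_to_cstag("+A|+A|=T, =T"))  # +aa=TT
```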
0.4.6
💥 Breaking
- Update the log file Commit Detail
  - Add the version of DAJIN2 to the log file to track the version of the analysis.
  - Rename the log file to `DAJIN2_log_<current time>.txt` from `<current time>_DAJIN2.log` to enable opening the file in any text editor.
- Update `mutation_extractor.is_dissimilar_loci` Commit Detail
  - Rename to `is_dissimilar_loci` from `identify_dissimilar_loci` to explicitly indicate that a boolean is returned.
  - Changed to use cosine distance instead of cosine similarity to make "difference from control" more intuitive.
  - Added a condition to ensure that the cosine distance is not dependent on the specific index: Calculate the cosine distance for 10 bases starting from the neighbor of the corresponding indel, and add the condition that the cosine distances of these adjacent 10 bases should be similar.
- Update `preprocess.insertions_to_fasta.py` which detects unintended insertion alleles. Commit Detail
  - `clustering_insertions`: To accelerate MeanShift clustering, set `bin_seeding=True`. Additionally, because clustering decoys without variation becomes extremely slow, we have switched to using decoys that include slight variations.
  - `extract_unique_insertions`: Within `unintended insertion alleles`, alleles similar to the `intended allele` provided by the user are now excluded.
    - The similarity is defined as there being differences of more than 10 bases
- Update `preprocess.insertions_to_fasta.clustering_insertions` to consider the length of each insertion sequence during clustering. This allows two alleles, such as `N,(30-base Insertion)` and `(30-base Insertion),N`, to be weighted with different scores as [(1, 30), (30, 1)], enabling correct clustering. Commit Detail
- Update `preprocess.homopolymer_handler`: Scale data to [0, 1] before computing cosine similarity, normalizing the scales because mutation rates differ significantly between samples and controls. Commit Detail
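Two of the ideas above, min-max scaling to [0, 1] and cosine distance (1 minus cosine similarity), can be written in a few lines of plain Python. The function names and the list-based inputs are assumptions for illustration; DAJIN2's own code presumably works on arrays.

```python
import math


def min_max_scale(values):
    """Linearly rescale values to [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cosine_distance(x, y):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)
```

Expressed as a distance, a larger value means a larger difference from the control, which matches the stated motivation for the switch.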
📝 Documentation
-
Add a note to the README.md that the supported Python versions are 3.8 to 3.10 due to a Bioconda issue. Commit Detail
-
Enhance the descriptions in GitHub Issue templates to clarify their purpose. Commit Detail
🔧 Maintenance
- Move `DAJIN2_VERSION` to `utils.config.py` from `main.py` to make it easier to recognize its location. Commit Detail
- Update `io.read_csv` to return a `list[dict[str, str]]`, not `list[str]`, to align the output format with `read_xlsx`. Commit Detail
- Update `utils.input_validator` and `preprocess.genome_fetcher` to temporarily disable SSL certificate verification, allowing access to UCSC servers. Commit Detail
- Add an example of flox knockin design to the `examples` Commit Detail
- Update `preprocess.insertions_to_fasta.py`: The label names for the insertions were not starting from 1, so they have been revised to begin at 1. Commit Detail
- Change the installer from pip to conda to install mappy on macos-latest (macos-14-arm64) in GitHub Actions Commit Detail
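A reader that returns `list[dict[str, str]]` maps naturally onto `csv.DictReader`. The sketch below takes an open text handle rather than a path so that it can be tried without touching the filesystem; the function name and the column names in the example are hypothetical.

```python
import csv
import io


def read_csv_rows(handle) -> list:
    """Return one dict per data row, keyed by the header line."""
    return [dict(row) for row in csv.DictReader(handle)]


# Hypothetical two-column batch file held in memory.
rows = read_csv_rows(io.StringIO("sample,control\nflox,ctrl\n"))
print(rows)  # [{'sample': 'flox', 'control': 'ctrl'}]
```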
🚀 Performance
- Update `consensus.similarity_searcher` to cache one-hot encoded controls to avoid redundant computations and increase processing speed. Commit Detail
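One common way to avoid recomputing an encoding is memoization with `functools.lru_cache`. The sketch below is an assumption about the general technique, with a hypothetical 4-channel A/C/G/T encoding; the release note does not describe how DAJIN2 implements its cache.

```python
from functools import lru_cache

# Hypothetical one-hot table; unknown symbols map to all zeros.
ONEHOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


@lru_cache(maxsize=None)
def onehot_encode(sequence: str) -> tuple:
    """Encode a sequence once; repeated calls return the cached tuple."""
    return tuple(ONEHOT.get(base, (0, 0, 0, 0)) for base in sequence.upper())
```

The result is an immutable tuple so that the cached value cannot be modified by one caller and silently affect another.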
🐛 Bug Fixes
- Debug `clustering.strand_bias_handler` Commit Detail
  - For `positive_strand_counts_by_labels: dict`, there was a bug that caused an error and halted execution when accessing a non-existent key. It has been fixed to output 0 instead.
  - Created a wrapper function `annotate_strand_bias_by_labels` for outputting strand bias. Fixed a bug where the second and subsequent arguments were not being correctly passed when reallocating clusters with strand bias.
- Fix `preprocess.knockin_handler` to correctly identify the flox knock-in sites as deletions not present in the control. Commit Detail
- Bug fix of `reallocate_insertion_within_deletion` Commit Detail
  - In the script that considers the region between two deletions as an insertion sequence, the size of the other deletion was not taken into account. Even if there was a single base deletion, the entire sequence between the deletions was considered as an insertion sequence. Therefore, the region between two deletions is now defined only if the size of both deletions is equal to or greater than the specified threshold (default = 3).
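The corrected condition amounts to checking both flanking deletions against the threshold. The function and parameter names below are hypothetical; only the default of 3 and the "both deletions" requirement come from the note above.

```python
def flanked_by_large_deletions(left_size: int, right_size: int,
                               threshold: int = 3) -> bool:
    """True only when both flanking deletions reach the threshold."""
    return left_size >= threshold and right_size >= threshold
```

Under this check, a region between a 5-base deletion and a 1-base deletion is no longer treated as an insertion sequence.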
0.4.5
🐛 Bug Fixes
- In version 0.4.4 of strand_bias_handler.remove_biased_clusters, there was an error in the continuation condition for removing biased clusters, which has now been corrected. The correct condition should be 'there are alleles with and without strand bias and the iteration count is less than or equal to 1000'. Instead, it was incorrectly set to 'there are alleles with and without strand bias or the iteration count is less than or equal to 1000'.
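The difference between the two conditions is easy to see when they are written as predicates. The names below are hypothetical and the surrounding loop is omitted; the point is that with `or` the loop keeps running whenever either operand is true, so the iteration cap stops acting as a bound.

```python
def should_continue(has_biased: bool, has_unbiased: bool,
                    iteration: int, max_iterations: int = 1000) -> bool:
    """Corrected condition: mixed alleles AND within the iteration cap."""
    return (has_biased and has_unbiased) and iteration <= max_iterations


def should_continue_buggy(has_biased: bool, has_unbiased: bool,
                          iteration: int, max_iterations: int = 1000) -> bool:
    """The 0.4.4 condition as described: OR instead of AND."""
    return (has_biased and has_unbiased) or iteration <= max_iterations
```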