diff --git a/.github/ISSUE_TEMPLATE/question.yml b/.github/ISSUE_TEMPLATE/question.yml
new file mode 100644
index 0000000..5547454
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/question.yml
@@ -0,0 +1,39 @@
+name: ❓ Question
+description: Ask your question here
+labels: ['question']
+
+body:
+ - type: textarea
+ id: description
+ attributes:
+ label: '📋 Description'
+ description: A clear and concise description of the question.
+ validations:
+ required: true
+
+ - type: textarea
+ id: environment
+ attributes:
+ label: '🔍 Environment'
+ description: |
+ Optional: Information about your environment.
+ Example:
+ - OS: WSL (Ubuntu 22.04)
+ - DAJIN2 version: x.x.x
+ - Python version: x.x.x
+ value: |
+ - OS:
+ - DAJIN2 version:
+ - Python version:
+ render: markdown
+ validations:
+ required: false
+
+ - type: textarea
+ id: anything_else
+ attributes:
+ label: '📎 Anything else?'
+ description: |
Optional: Add any other context, links, or screenshots about your question here.
+ validations:
+ required: false
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
index 5c8f985..b388201 100644
--- a/.github/workflows/pytest.yml
+++ b/.github/workflows/pytest.yml
@@ -9,10 +9,10 @@ jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
- max-parallel: 6
+ max-parallel: 10
matrix:
os: [ubuntu-latest, macos-latest]
- python-version: ['3.8', '3.9', '3.10']
+ python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
name: Python ${{ matrix.python-version }} on ${{ matrix.os }}
defaults:
diff --git a/README.md b/README.md
index 7f12195..4cf29c5 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
### Prerequisites
-- Python 3.8 to 3.10
+- Python >= 3.8
- Unix-like environment (Linux, macOS, WSL2, etc.)
### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
@@ -38,9 +38,6 @@ conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
conda activate env-dajin2
```
-> [!IMPORTANT]
-> DAJIN2 supports Python versions 3.8 to 3.10, but not Python 3.11 yet due to a [Bioconda issue](https://github.com/bioconda/bioconda-recipes/issues/37805).
-
> [!NOTE]
> To Apple Silicon (ARM64) users:
@@ -314,13 +311,19 @@ The **Allele type** includes:
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
-## 📣Feedback and Support
+## 📣 Feedback and Support
+
+> [!NOTE]
+> For frequently asked questions, please refer to [this page](https://github.com/akikuno/DAJIN2/blob/main/docs/FAQ.md).
-For questions, bug reports, or other forms of feedback, we'd love to hear from you!
+
+For further questions, bug reports, or other forms of feedback, we'd love to hear from you!
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
+
+
## 🤝 Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/akikuno/DAJIN2/blob/main/docs/CODE_OF_CONDUCT.md).
diff --git a/docs/FAQ.md b/docs/FAQ.md
new file mode 100644
index 0000000..fc256ab
--- /dev/null
+++ b/docs/FAQ.md
@@ -0,0 +1,16 @@
+# Frequently Asked Questions
+
+## How many reads are necessary?
+
+**We recommend at least 1,000 reads.**
+With 1,000 reads, point-mutation alleles at a frequency of 1% can be detected, ensuring high-precision analysis. However, if the target is an indel of several tens of bases or more, or if the expected allele frequency is 5% or higher, fewer reads (~500) suffice for detection.
+
+## What is the recommended read length for analysis?
+
+**We recommend lengths below 10kb.**
+When PCR amplicons of 10 kb or shorter are sequenced with Nanopore, reads cover the target region uniformly. Amplicons of up to approximately 15 kb can be obtained, but they may result in uneven coverage of the target region, potentially reducing analysis accuracy.
+
+## Can data from platforms other than Nanopore (e.g., PacBio or NGS) be analyzed?
+
+**Yes, it is possible.**
+DAJIN2 accepts common file formats (FASTA, FASTQ, BAM) as input, allowing the analysis of data from platforms other than Nanopore. However, since we do not have experience using DAJIN2 with non-Nanopore data, please contact us [here](https://github.com/akikuno/DAJIN2/issues/new/choose) if you encounter any issues.
diff --git a/docs/FAQ_JP.md b/docs/FAQ_JP.md
new file mode 100644
index 0000000..e81d4ce
--- /dev/null
+++ b/docs/FAQ_JP.md
@@ -0,0 +1,19 @@
+# よくあるご質問
+
+
+## 必要なリード数はどれくらいですか?
+
+**1,000リード以上を推奨しています。**
+1,000リード以上あれば1%の点変異アレルを検出でき、高精度な解析が可能です。一方で、検出対象が数十塩基以上のindelであったり、予想されるアレル頻度が5%以上である場合には、より少ないリード数(~500リード程度)で検出が可能です。
+
+
+## 解析可能なリード長はどれくらいですか?
+
+**10kb以下を推奨しています。**
+10kb以下であれば、PCRアンプリコンをNanoporeで読んだ際に、標的領域に満遍なくリードが張り付きます。最大で15kb程度のPCRアンプリコンを得ることは可能ではありますが、標的領域におけるカバレッジにムラが生じてしまい、解析精度が低下する可能性があります。
+
+## Nanopore以外(PacBioやNGS)のデータの解析は可能ですか?
+
+**可能です。**
+DAJIN2は一般的なファイルフォーマット(FASTA, FASTQ, BAM)を入力として受け付けているため、Nanopore以外のデータも解析可能です。ただし、私たちのほうではNanoporeデータ以外にDAJIN2を用いた経験がないため、もしご利用に不具合が生じた場合には、お手数ですが[こちら](https://github.com/akikuno/DAJIN2/issues/new/choose)よりお問い合わせください。
+
diff --git a/docs/README_JP.md b/docs/README_JP.md
index cdfcdfe..7e281d5 100644
--- a/docs/README_JP.md
+++ b/docs/README_JP.md
@@ -25,7 +25,7 @@ DAJIN2は、ナノポアシーケンサーによるターゲットシーケンシングを用い
### 環境
-- Python 3.8 - 3.10
+- Python >= 3.8
- Unix環境 (Linux, macOS, WSL2, etc.)
### [Bioconda](https://anaconda.org/bioconda/DAJIN2) (推奨)
@@ -36,9 +36,6 @@ conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
conda activate env-dajin2
```
-> [!IMPORTANT]
-> 現状、[BiocondaがPython 3.11以上に対応していない](https://github.com/bioconda/bioconda-recipes/issues/37805)ため、DAJIN2はPython 3.8 から 3.10までをサポートしています。
-
> [!NOTE]
> Appleシリコン搭載のMacの場合:
> 現状、[BiocondaがAppleシリコンに対応していない](https://github.com/bioconda/bioconda-recipes/issues/37068#issuecomment-1257790919)ため、以下のようにRosetta 2経由でインストールを行ってください
@@ -364,12 +361,14 @@ read_plot.html および read_plot.pdf は、read_summary.xlsxを可視化した
> とくに大型欠失が存在する場合、欠失アレルが顕著に増幅されることから、実際のアレル割合を反映しない可能性が高まります。
-## 📣フィードバックと行動規範
+## 📣 フィードバックと行動規範
+
+> [!NOTE]
+> よくあるご質問については、[こちら](https://github.com/akikuno/DAJIN2/blob/main/docs/FAQ_JP.md)をご覧ください。
-質問、バグ報告、その他のフィードバックについて、皆さまからのご意見をお待ちしています。
+その他のご質問、バグ報告、フィードバックについて、皆さまからのご意見をお待ちしています。
報告には [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) をご利用ください(日本語でも大丈夫です)。
-
diff --git a/docs/RELEASE.md b/docs/RELEASE.md
index af12655..52e99c7 100644
--- a/docs/RELEASE.md
+++ b/docs/RELEASE.md
@@ -3,23 +3,64 @@
## 💥 Breaking
## 📝 Documentation
## 🚀 Performance
+## 🌟 New Features
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
[[Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxx)]
-->
-
+
+# Current Release
+
+# v0.5.2 (2024-XX-XX)
+
+## 📝 Documentation
+
++ Add `FAQ.md` and `FAQ_JP.md` to provide answers to frequently asked questions. [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/1172fddd34c382f92b6778d6f30fd733b458cc04)]
+
+## 🌟 New Features
+
++ Update `mutation_extractor` [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/9444ee701ee52adeb6271552eff70667fb49b854)]
+  + Simplified the logic of the `is_dissimilar_loci` if statement. Additionally, changed the threshold for calling a mutation in consensus mode from 75% to 50% (to accommodate the insertion allele in Cas3 Tyr Barcode10).
+  + Updated `detect_anomalies` to use MLPClassifier, which detects mutations more flexibly and accurately than the previous threshold setting with MiniBatchKMeans.
+
+## 🔧 Maintenance
+
++ Make DAJIN2 compatible with Python 3.11 and 3.12. Issue: #43 [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/8da9118f5c0f584ed1ab12541d5e410d1b9f0da8)]
+ + pysam and mappy builds with Python 3.11 and 3.12 are now available on Bioconda.
+
++ Update GitHub Actions to test with Python 3.11 and 3.12. Issue: #43 [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/54df79e60b484da429c1cbf6f12b0c19196452cc)]
+
++ Resolve the B023 issue ("Function definition does not bind loop variable `alignment_lengths`"). [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/9c85d2f0410494a9b71d9905fad2f9e4efe30ed7)]
+
++ Add `question.yml` in GitHub Issue template. [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/1172fddd34c382f92b6778d6f30fd733b458cc04)]
+
+
+## 🐛 Bug Fixes
+
++ Update `cssplits_handler._get_index_of_large_deletions`: Modified to split large deletions when a match of 10 or more bases is found within the identified large deletion. Issue: #42 [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxx)]
- -->
-# Current Release
-## v0.5.1 (2024-06-15)
-## 💥 Breaking
+-------------------------------------------------------------
+
+# Past Releases
+
+
+
+
+
+
+ v0.5.1 (2024-06-15)
+
+## 🌟 New Features
+ Enable accepting additional file formats as input. Issue: #37
+ FASTA [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/ee6d392cd51649c928bd604acafbab4b9d28feb1)]
@@ -42,16 +83,7 @@
+ Add `reallocate_insertion_within_deletion` into `report.mutation_exporter` and reflected it in the mutation info. [[Commit Detail](https://github.com/akikuno/DAJIN2/commit/ed6a96e01bb40c77df9cd3a17a4c29524684b6f1)]
-
-
-
-
-
--------------------------------------------------------------
-
-# Past Releases
-
-
+
v0.5.0 (2024-06-05)
diff --git a/pyproject.toml b/pyproject.toml
index 2654223..5dbebb2 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "DAJIN2"
-version = "0.5.1"
+version = "0.5.2"
description = "One-step genotyping tools for targeted long-read sequencing"
authors = ["Akihiro Kuno "]
readme = "README.md"
@@ -29,7 +29,7 @@ include = [
]
[tool.poetry.dependencies]
-python = ">=3.8, <3.11"
+python = "^3.8"
numpy = ">=1.24.0"
scipy = ">=1.10.0"
pandas = ">=1.0.0"
diff --git a/src/DAJIN2/core/preprocess/midsv_caller.py b/src/DAJIN2/core/preprocess/midsv_caller.py
index c83b2be..efcd75a 100644
--- a/src/DAJIN2/core/preprocess/midsv_caller.py
+++ b/src/DAJIN2/core/preprocess/midsv_caller.py
@@ -60,7 +60,7 @@ def extract_best_preset(preset_cigar_by_qname: dict[str, dict[str, str]]) -> dic
continue
# Define a custom key function to prioritize map-ont
- def custom_key(key: str) -> tuple[int, bool]:
+ def custom_key(key: str, alignment_lengths=alignment_lengths) -> tuple[int, bool]:
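+ # Binding alignment_lengths as a default argument makes the closure capture the current loop value (fixes B023).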
return (alignment_lengths[key], key == "map-ont")
max_key = max(alignment_lengths, key=custom_key)
diff --git a/src/DAJIN2/core/preprocess/mutation_extractor.py b/src/DAJIN2/core/preprocess/mutation_extractor.py
index 9ecadc1..dd5ca24 100644
--- a/src/DAJIN2/core/preprocess/mutation_extractor.py
+++ b/src/DAJIN2/core/preprocess/mutation_extractor.py
@@ -16,7 +16,7 @@
import numpy as np
-from sklearn.cluster import MiniBatchKMeans
+from sklearn.neural_network import MLPClassifier
from DAJIN2.core.preprocess.homopolymer_handler import extract_sequence_errors_in_homopolymer_loci
from DAJIN2.utils import io
@@ -94,18 +94,17 @@ def cosine_distance(x: list[float], y: list[float]) -> float:
def is_dissimilar_loci(values_sample, values_control, index: int, is_consensus: bool = False) -> bool:
# If 'sample' has more than 20% variation compared to 'control' in consensus mode, unconditionally set it to 'dissimilar loci'. This is set to counteract cases where, when evaluating cosine similarity during significant deletions, values exceedingly close to 1 can occur even if not observed in the control (e.g., control = [1,1,1,1,1], sample = [100,100,100,100,100] -> cosine similarity = 1).
if values_sample[index] - values_control[index] > 20:
- if is_consensus:
- if values_sample[index] > 75:
- return True
- else:
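+ # In consensus mode, call a mutation only when the sample frequency exceeds 50% (lowered from 75% to capture insertion alleles).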
+ if not is_consensus or values_sample[index] > 50:
return True
+ else:
+ return False
# Subset 10 bases around index.
x = values_sample[index : index + 10]
y = values_control[index : index + 10]
- x_slice = values_sample[index + 1 : index + 11]
- y_slice = values_control[index + 1 : index + 11]
+ x_slice = x[1:]
+ y_slice = y[1:]
distance = cosine_distance(x, y)
distance_slice = cosine_distance(x_slice, y_slice)
@@ -117,14 +116,37 @@ def detect_anomalies(values_sample, values_control, threshold: float, is_consens
"""
Detect anomalies and return indices of outliers.
"""
- values_subtract = values_sample - values_control
- values_subtract = np.where(values_subtract <= threshold, 0, values_subtract)
- values_subtract_reshaped = values_subtract.reshape(-1, 1)
- kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init="auto").fit(values_subtract_reshaped)
- # Set the maximum threshold to 10 to prevent missing relatively minor mutations due to the k-means centers being overly influenced by obvious mutations.
- threshold_kmeans = min(20, kmeans.cluster_centers_.mean())
- candidate_loci = {i for i, v in enumerate(values_subtract_reshaped) if v > threshold_kmeans}
+ rng = np.random.default_rng(seed=1)
+
+ random_size = 10_000
+ control_size = len(values_control)
+ total_size = random_size + control_size
+
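+ # Build synthetic (control, sample) training pairs: offsets smaller than `threshold`
+ # simulate sequencing errors; offsets of `threshold` or more simulate mutations.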
+ randoms = rng.uniform(0, 100, random_size)
+ randoms_error = np.clip(randoms + rng.uniform(0, threshold, random_size), 0, 100)
+ randoms_mutation = np.clip(randoms + rng.uniform(threshold, 100, random_size), 0, 100)
+
+ values_error = np.clip(values_control + rng.uniform(0, threshold, control_size), 0, 100)
+ values_mutation = np.clip(values_control + rng.uniform(threshold, 100, control_size), 0, 100)
+
+ matrix_error_randoms = np.array([randoms, randoms_error]).T
+ matrix_error_control = np.array([values_control, values_error]).T
+ matrix_error = np.concatenate([matrix_error_randoms, matrix_error_control], axis=0)
+
+ matrix_mutation_randoms = np.array([randoms, randoms_mutation]).T
+ matrix_mutation_control = np.array([values_control, values_mutation]).T
+ matrix_mutation = np.concatenate([matrix_mutation_randoms, matrix_mutation_control], axis=0)
+
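+ # Stack the simulated pairs: label 0 = sequencing error, label 1 = mutation.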
+ X = np.concatenate([matrix_error, matrix_mutation], axis=0)
+ y = [0] * total_size + [1] * total_size
+
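+ # Train a small MLP on the synthetic pairs to learn the error/mutation decision boundary.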
+ clf = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
+ clf.fit(X, y)
+
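+ # Classify each real locus from its (control, sample) value pair; 1 = predicted mutation.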
+ results = clf.predict(np.array([values_control, values_sample]).T)
+
+ candidate_loci = {i for i, v in enumerate(results) if v == 1}
return {i for i in candidate_loci if is_dissimilar_loci(values_sample, values_control, i, is_consensus)}
@@ -173,17 +195,17 @@ def merge_index_of_consecutive_indel(mutation_loci: dict[str, set[int]]) -> dict
"""Treat as contiguous indels if there are insertions/deletions within five bases of each other"""
mutation_loci_merged = {}
- """Reflect point mutations as they are"""
+ # Reflect point mutations as they are
mutation_loci_merged["*"] = mutation_loci["*"]
- """Merge if indels are within 10 bases"""
+ # Merge if indels are within 10 bases
for mut in ["+", "-"]:
idx_indel = sorted(mutation_loci[mut])
idx_indel_merged = set(idx_indel)
for i in range(len(idx_indel) - 1):
idx_1 = idx_indel[i]
idx_2 = idx_indel[i + 1]
- """If everything from idx_1 to idx_2 is already considered as indels, then skip it."""
+ # If everything from idx_1 to idx_2 is already considered as indels, then skip it.
if count_elements_within_range(idx_indel, idx_1 + 1, idx_2 - 1) == idx_2 - idx_1 + 1:
continue
if idx_1 + 10 > idx_2:
@@ -191,25 +213,25 @@ def merge_index_of_consecutive_indel(mutation_loci: dict[str, set[int]]) -> dict
idx_indel_merged.add(i)
mutation_loci_merged[mut] = idx_indel_merged
- """Additional logic for mutation enrichment within 10 bases on both ends"""
+ # Additional logic for mutation enrichment within 10 bases on both ends
for mut in ["+", "-"]:
idx_indel = sorted(mutation_loci_merged[mut])
idx_indel_merged = set(idx_indel)
for i in range(len(idx_indel) - 1):
idx_1 = idx_indel[i]
idx_2 = idx_indel[i + 1]
- """If everything from idx_1 to idx_2 is already considered as indels, then skip it."""
+ # If everything from idx_1 to idx_2 is already considered as indels, then skip it.
if count_elements_within_range(idx_indel, idx_1 + 1, idx_2 - 1) == idx_2 - idx_1 + 1:
continue
- """If the distance between idx_1 and idx_2 is more than 20 bases, then skip it."""
+ # If the distance between idx_1 and idx_2 is more than 20 bases, then skip it.
if idx_1 + 20 < idx_2:
continue
count_left = count_elements_within_range(idx_indel, idx_1 - 11, idx_1 - 1)
count_right = count_elements_within_range(idx_indel, idx_2 + 1, idx_2 + 11)
- """
- If 8 out of the 10 bases at both ends are indels,
- then everything from idx_1 to idx_2 will be considered as indels.
- """
+
+ # If 8 out of the 10 bases at both ends are indels,
+ # then everything from idx_1 to idx_2 will be considered as indels.
+
if count_left >= 8 and count_right >= 8:
for i in range(idx_1 + 1, idx_2):
idx_indel_merged.add(i)
@@ -295,21 +317,22 @@ def extract_mutation_loci(
if thresholds is None:
thresholds = {"*": 0.5, "-": 0.5, "+": 0.5}
indels_normalized_sample = io.load_pickle(path_indels_normalized_sample)
- indels_normalized_control = io.load_pickle(path_indels_normalized_control)
- """Extract candidate mutation loci"""
- indels_normalized_minimize_control = minimize_mutation_counts(indels_normalized_control, indels_normalized_sample)
+ # Extract candidate mutation loci
+ indels_normalized_control = minimize_mutation_counts(
+ io.load_pickle(path_indels_normalized_control), indels_normalized_sample
+ )
anomal_loci: dict[str, set[int]] = extract_anomal_loci(
- indels_normalized_sample, indels_normalized_minimize_control, thresholds, is_consensus
+ indels_normalized_sample, indels_normalized_control, thresholds, is_consensus
)
- """Extract error loci in homopolymer regions"""
+ # Extract error loci in homopolymer regions
errors_in_homopolymer = extract_sequence_errors_in_homopolymer_loci(
sequence, indels_normalized_sample, indels_normalized_control, anomal_loci
)
mutation_loci = discard_errors_in_homopolymer(anomal_loci, errors_in_homopolymer)
- """Merge all mutations and knockin loci"""
+ # Merge all mutations and knockin loci
if path_knockin.exists():
knockin_loci = io.load_pickle(path_knockin)
mutation_loci = add_knockin_loci(mutation_loci, knockin_loci)
@@ -323,7 +346,7 @@ def cache_mutation_loci(ARGS, is_control: bool = False) -> None:
cache_indels_count(ARGS, is_control)
if is_control:
- return
+ return None
for allele, sequence in ARGS.fasta_alleles.items():
path_mutation_sample = Path(ARGS.tempdir, ARGS.sample_name, "mutation_loci", allele)
diff --git a/src/DAJIN2/utils/config.py b/src/DAJIN2/utils/config.py
index f146939..202e413 100644
--- a/src/DAJIN2/utils/config.py
+++ b/src/DAJIN2/utils/config.py
@@ -8,7 +8,7 @@
from sklearn.exceptions import ConvergenceWarning
-DAJIN_VERSION = "0.5.1"
+DAJIN_VERSION = "0.5.2"
DAJIN_RESULTS_DIR = Path("DAJIN_Results")
TEMP_ROOT_DIR = Path(DAJIN_RESULTS_DIR, ".tempdir")
diff --git a/src/DAJIN2/utils/cssplits_handler.py b/src/DAJIN2/utils/cssplits_handler.py
index c7fbdd5..6de7cb4 100644
--- a/src/DAJIN2/utils/cssplits_handler.py
+++ b/src/DAJIN2/utils/cssplits_handler.py
@@ -242,21 +242,47 @@ def _extract_break_points_of_large_deletions(
return break_points
-def _convert_break_points_to_index(break_points: list[dict[str, int]]) -> set[int]:
- index_of_large_deletions = set()
+def _convert_break_points_to_index(break_points: list[dict[str, int]]) -> list[int]:
+ index_of_large_deletions = []
for break_point in break_points:
start = break_point["start"]
end = break_point["end"]
- index_of_large_deletions |= set(range(start, end + 1))
+ index_of_large_deletions += list(range(start, end + 1))
return index_of_large_deletions
+def _find_matched_indexes(cssplits: list[str], index_of_large_deletions: list[int]) -> list[int]:
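+ """Return indices within large deletions that form runs of 10 or more consecutive matches, so the deletion can be split at those runs."""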
+ matched_index = []
+ count_matches = 0
+ start_match = -1
+
+ index_of_large_deletions.sort()
+ for i in index_of_large_deletions:
+ if cssplits[i].startswith("="):
+ if start_match == -1:
+ start_match = i
+ count_matches += 1
+ else:
+ if count_matches >= 10:
+ matched_index += list(range(start_match, i))
+ count_matches = 0
+ start_match = -1
+
+ return matched_index
+
+
+def _remove_matched_indexes(index_of_large_deletions: list[int], matched_index: list[int]) -> set[int]:
+ return set(index_of_large_deletions) - set(matched_index)
+
+
def _get_index_of_large_deletions(cssplits: list[str], bin_size: int = 500, percentage: int = 50) -> set[int]:
range_of_large_deletions = _extract_candidate_index_of_large_deletions(cssplits, bin_size, percentage)
break_points = _extract_break_points_of_large_deletions(cssplits, range_of_large_deletions, bin_size)
- return _convert_break_points_to_index(break_points)
+ index_of_large_deletions = _convert_break_points_to_index(break_points)
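+ # Split a large deletion when a run of 10+ matched bases is found inside it (Issue #42).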
+ matched_index = _find_matched_indexes(cssplits, index_of_large_deletions)
+ return _remove_matched_indexes(index_of_large_deletions, matched_index)
def _adjust_cs_insertion(cs: str) -> str:
diff --git a/tests/src/utils/test_cssplits_handler.py b/tests/src/utils/test_cssplits_handler.py
index cd05d18..97783e0 100644
--- a/tests/src/utils/test_cssplits_handler.py
+++ b/tests/src/utils/test_cssplits_handler.py
@@ -39,19 +39,6 @@ def test_add_match_operator_to_n(cssplits, expected):
assert cssplits_handler._add_match_operator_to_n(cssplits) == expected
-# @pytest.mark.parametrize(
-# "cssplits, expected",
-# [
-# ([], []),
-# (["=A", "=C", "=G"], ["=A", "=C", "=G"]),
-# (["+A|+A|*GC"], ["+A|+A|*GC"]),
-# (["+A|*GC|=C"], ["+A|=C|=C"]),
-# ],
-# )
-# def test_format_substitution_withtin_insertion(cssplits, expected):
-# assert cssplits_handler._format_substitution_withtin_insertion(cssplits) == expected
-
-
@pytest.mark.parametrize(
"input_cssplits, expected",
[
@@ -113,6 +100,10 @@ def test_call_sequence(cons_percentage, expected_sequence):
"cssplits, expected",
[
(["=T"] * 100 + ["-A"] * 300 + ["=T"] * 100, set(range(100, 400))),
+ (
+ ["=T"] * 100 + ["-A"] * 300 + ["=T"] * 10 + ["-A"] * 300 + ["=T"] * 100,
+ set(range(100, 400)) | set(range(410, 710)),
+ ),
],
)
def test_get_index_of_large_deletions(cssplits, expected):
@@ -138,24 +129,50 @@ def test_adjust_cs_insertion(cs: str, expected: str):
assert cssplits_handler._adjust_cs_insertion(cs) == expected
-# @pytest.mark.parametrize(
-# "input_str, expected_output",
-# [
-# ("-A,-A,-A,=C,=C,=C,-T,-T,-T,=G", "-A,-A,-A,-C,-C,-C,-T,-T,-T,+C|+C|+C|=G"),
-# ("-A,-A,-A,=C,=C,=C,=C,-T,-T,-T", "-A,-A,-A,=C,=C,=C,=C,-T,-T,-T"),
-# ("-A,-A,-A,N,=C,n,-T,-T,-T,=G", "-A,-A,-A,N,-C,n,-T,-T,-T,+N|+C|+n|=G"),
-# ("-A,-A,-A,=C,+T|+T|=C,=C,-T,-T,-T,=G", "-A,-A,-A,-C,-C,-C,-T,-T,-T,+C|+T|+T|+C|+C|=G"),
-# ("-A,-A,-A,=C,+T|+T|*CG,=C,-T,-T,-T,=G", "-A,-A,-A,-C,-C,-C,-T,-T,-T,+C|+T|+T|+G|+C|=G"),
-# ("-G,-G,-C,=A,=C,=C,*CA,=A,-T,-T,*AC", "-G,-G,-C,=A,=C,=C,*CA,=A,-T,-T,*AC"),
-# ],
-# ids=[
-# "insertion within deletion",
-# "4-character match",
-# "N and n",
-# "Insertion",
-# "Insertion followed by substitution",
-# "Should not be adjusted",
-# ],
-# )
-# def test_reallocate_insertion_within_deletion(input_str: str, expected_output: str):
-# assert reallocate_insertion_within_deletion(input_str, del_range=3, distance=3) == expected_output
+@pytest.mark.parametrize(
+ "cssplits, expected",
+ [
+ (
+ ["=T"] * 100 + ["-A"] * 300 + ["*TA"] * 10 + ["-A"] * 300 + ["=T"] * 100,
+ ["=T"] * 100
+ + ["-A"] * 300
+ + ["-T"] * 10
+ + ["-A"] * 300
+ + ["+A|+A|+A|+A|+A|+A|+A|+A|+A|+A|=T"]
+ + ["=T"] * 99,
+ ),
+ (
+ ["=T"] * 100 + ["-A"] * 150 + ["=T"] * 10 + ["-A"] * 150 + ["=T"] * 100,
+ ["=T"] * 100 + ["-A"] * 150 + ["=T"] * 10 + ["-A"] * 150 + ["=T"] * 100,
+ ),
+ (
+ ["=T"] * 100
+ + ["-A"] * 100
+ + ["*TA"] * 10
+ + ["-A"] * 100
+ + ["=T"] * 10
+ + ["-A"] * 100
+ + ["*TA"] * 10
+ + ["-A"] * 100
+ + ["=T"] * 100,
+ ["=T"] * 100
+ + ["-A"] * 100
+ + ["-T"] * 10
+ + ["-A"] * 100
+ + ["+A|+A|+A|+A|+A|+A|+A|+A|+A|+A|=T"]
+ + ["=T"] * 9
+ + ["-A"] * 100
+ + ["-T"] * 10
+ + ["-A"] * 100
+ + ["+A|+A|+A|+A|+A|+A|+A|+A|+A|+A|=T"]
+ + ["=T"] * 99,
+ ),
+ ],
+ ids=[
+ "insertion within deletion",
+ "matched region within deletion",
+ "insertions within deletion and matched region",
+ ],
+)
+def test_reallocate_insertion_within_deletion(cssplits: list[str], expected: list[str]):
+ assert cssplits_handler.reallocate_insertion_within_deletion(cssplits) == expected