Develop 0.3.6 #19

Merged · 27 commits · Jan 9, 2024

Commits
728d59a
bump v0.3.6
akikuno Dec 24, 2023
a9399f5
Added `merge_minor_alleles` to reclassify alleles with less than 10 r…
akikuno Dec 24, 2023
4bd9f7d
Added the function `merge_minor_cluster` to revert labels clustered w…
akikuno Dec 24, 2023
cefed0f
Added a short form of installation guide to TROUBLESHOOTING.md
akikuno Dec 25, 2023
a6bdf50
specify python=3.10
akikuno Dec 25, 2023
93c2111
Add line breaks to make the separation of the code more understandable
akikuno Dec 26, 2023
c91baab
update ROADMAP
akikuno Dec 26, 2023
3f66f0b
Remove the `is_insertion` arguments at `preprocess.cache_mutation_loc…
akikuno Dec 28, 2023
6b83646
move `consensus.cache_mutation_loci` to `consensus.mutation_extractor`
akikuno Dec 29, 2023
21c2596
Simplify the error detection using cosine similarity
akikuno Dec 29, 2023
94c48da
Use `LocalOutlierFactor` to filter abnormal control reads
akikuno Dec 29, 2023
f30fabf
Add consensus directory
akikuno Dec 29, 2023
c9f5aa7
Use cosine similarity to filter dissimilar loci
akikuno Dec 29, 2023
8b552c2
Update commit links
akikuno Dec 29, 2023
4ad9c9e
The UCSC Blat server sometimes returns a 200 HTTP status code even wh…
akikuno Dec 29, 2023
69c56fa
Added `preprocess.midsv_caller.convert_consecutive_indels_to_match`:
akikuno Jan 2, 2024
217355d
remove underscore of the functions
akikuno Jan 2, 2024
9eefaaa
Update `generate_mutation_kmers` to consider indices not registere…
akikuno Jan 2, 2024
e50bd57
Change `coverage_match` from `coverage_not_N` to consider matched seq…
akikuno Jan 2, 2024
0cbec52
Update the `identify_dissimilar_loci` function so that it uncondition…
akikuno Jan 2, 2024
5b16f1a
Update `generate_mutation_kmers` to consider indices not registere…
akikuno Jan 2, 2024
71d4303
rename the function `cstag_to_base` to `cssplit_to_base`
akikuno Jan 5, 2024
a57862b
fix the function name of `fetch_seq_coordinates`
akikuno Jan 5, 2024
b075296
Add `allele_merger.py` to merge minor alleles
akikuno Jan 5, 2024
a704118
update ROADMAP.md
akikuno Jan 5, 2024
7bca459
Add `is_consensus` argument: When it comes to consensus, if the diffe…
akikuno Jan 7, 2024
40588a1
update commit log
akikuno Jan 7, 2024
97 changes: 81 additions & 16 deletions docs/ROADMAP.md
@@ -1,8 +1,83 @@
# Development Logs of DAJIN2

<!-- TEMPLATE
# v0.0.0 (yyyy-mm-dd)
## 📝 Documentation
## 🚀 Features
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
+ [ ] XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
-->

<!-- memo ToDo
- `N` in barcode09 allele1
- The large deletion in barcode11 allele2 is not reflected
- `N` in barcode28 allele1
- Output FASTQ and VCF files
-->
# v0.3.6 (yyyy-mm-dd)

## 📝 Documentation

+ Added a quick guide to installation in TROUBLESHOOTING.md [Commit Detail](https://github.com/akikuno/DAJIN2/commit/cefed0ff4d04282b9915486be07de85b2b77b657)

## 🚀 Features

### Preprocess

+ Update `input_validator.py`: The UCSC Blat server sometimes returns a 200 HTTP status code even when an error occurs; in such cases, "Very Early Error" appears in the page title. The validator therefore returns False in those situations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4ad9c9ef8bd963a6e20c1721480aed0fe7922760)
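
  A minimal sketch of such a check, assuming a `requests`-based fetch; the function name and parameters here are hypothetical, not DAJIN2's actual API:

  ```python
  import requests


  def is_blat_response_ok(url: str, params: dict) -> bool:
      """Return False when the Blat server signals an error, even under HTTP 200."""
      response = requests.get(url, params=params, timeout=30)
      if response.status_code != 200:
          return False
      # The server may report "Very Early Error" in the HTML <title>
      # while still returning a 200 status code.
      if "Very Early Error" in response.text:
          return False
      return True
  ```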

+ Simplify the cosine-similarity-based error detection in `homopolymer_handler.py` [Commit Detail](https://github.com/akikuno/DAJIN2/commit/21c2596805c36074f360285600e60ee76b948908)

+ Update `mutation_extractor.py` to use cosine similarity to filter dissimilar loci [Commit Detail](https://github.com/akikuno/DAJIN2/commit/c9f5aa7b48581e58d99fe8c31275c422756aa9f1)

+ Update `mutation_extractor.identify_dissimilar_loci` so that it unconditionally returns True if the 'sample' shows more than 5% variation compared to the 'control'. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/0cbec5217fdfba6886979eb86cf970b587e83e5f)
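
  A sketch of the combined decision rule these two updates describe. The 5% cutoff comes from the changelog; the function names, window size, and 0.95 similarity threshold are assumptions for illustration:

  ```python
  import numpy as np


  def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
      return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


  def identify_dissimilar_locus(
      sample_rates: np.ndarray, control_rates: np.ndarray, locus: int, flank: int = 5
  ) -> bool:
      # Unconditionally flag the locus when the sample deviates from the
      # control by more than 5 percentage points.
      if sample_rates[locus] - control_rates[locus] > 0.05:
          return True
      # Otherwise flag it only when the local mutation-rate profiles diverge.
      window = slice(max(0, locus - flank), locus + flank + 1)
      return cosine_similarity(sample_rates[window], control_rates[window]) < 0.95
  ```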

+ Add `preprocess.midsv_caller.convert_consecutive_indels_to_match`: Due to alignment errors, there can be instances where a true match is mistakenly replaced with "insertion following a deletion". For example, although it should be "=C,=T", it gets replaced by "-C,+C|=T". In such cases, a process is performed to revert it back to "=C,=T". [Commit Detail](https://github.com/akikuno/DAJIN2/commit/69c56fa904ef847dc5b0e2dcdb90303409412d0f)
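
  A simplified sketch of this reversion, covering only the single-base case in the changelog's example (real MIDSV tags can encode multi-base insertions, which this sketch ignores):

  ```python
  def convert_consecutive_indels_to_match(cssplit: str) -> str:
      """Collapse a deletion immediately re-inserted as the same base
      (e.g. '-C,+C|=T') back into matches ('=C,=T')."""
      fixed: list[str] = []
      for op in cssplit.split(","):
          prev = fixed[-1] if fixed else ""
          if prev.startswith("-") and op.startswith("+") and "|" in op:
              inserted, _, rest = op.partition("|")
              if inserted[1:] == prev[1:]:  # same base deleted, then inserted
                  fixed[-1] = "=" + prev[1:]  # '-C' -> '=C'
                  fixed.append(rest)          # '+C|=T' -> '=T'
                  continue
          fixed.append(op)
      return ",".join(fixed)


  # The changelog's example: "-C,+C|=T" reverts to "=C,=T"
  assert convert_consecutive_indels_to_match("-C,+C|=T") == "=C,=T"
  ```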

### Classification

+ Added `allele_merger.merge_minor_alleles` to reclassify alleles with fewer than 10 reads, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/b0752960def313e237ccf7d44542f9810cad0c00)

### Clustering

+ Added the function `merge_minor_cluster` to revert labels clustered with fewer than 10 reads back to the previous labels, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4bd9f7dd806d192475d8d4f20c1e50c37281d64e)
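
  One way this reversion could look, assuming per-read label lists from the current and previous clustering rounds (the signature and names are assumed):

  ```python
  from collections import Counter


  def merge_minor_cluster(
      labels_current: list[int], labels_previous: list[int], threshold_reads: int = 10
  ) -> list[int]:
      """Revert reads in clusters with fewer than `threshold_reads` reads
      back to the labels they carried before this clustering round."""
      counts = Counter(labels_current)
      return [
          previous if counts[current] < threshold_reads else current
          for current, previous in zip(labels_current, labels_previous)
      ]
  ```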

+ Update `generate_mutation_kmers` to consider indices not registered in mutation_loci as mutations by replacing them with "@". For example, if there are no mutations in mutation_loci, "=G,=C,-C" and "~G,=G,=C" both become "@,@,@", making them identical and ensuring they do not affect clustering. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/9eefaaa1a9be3922b60655292c0a310e0f5fc76d)
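
  The masking step could look like this sketch, which reproduces the changelog example (the `mutation_loci` layout and the helper name are assumptions):

  ```python
  def mask_unregistered_loci(cssplit: str, mutation_loci: list[set[str]]) -> list[str]:
      """Replace operations at loci without registered mutations with '@'
      so they cannot influence clustering."""
      masked = []
      for i, op in enumerate(cssplit.split(",")):
          # op[0] is the operation symbol: '=', '-', '+', '*', ...
          masked.append(op if op[0] in mutation_loci[i] else "@")
      return masked


  # With no registered mutations, different reads collapse to identical k-mers:
  loci: list[set[str]] = [set(), set(), set()]
  assert mask_unregistered_loci("=G,=C,-C", loci) == ["@", "@", "@"]
  assert mask_unregistered_loci("~G,=G,=C", loci) == ["@", "@", "@"]
  ```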

### Consensus

+ Use `LocalOutlierFactor` to filter abnormal control reads [Commit Detail](https://github.com/akikuno/DAJIN2/commit/94c48da01fc039902c97a23ddea47dd5f2b42ab4)
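
  A minimal scikit-learn sketch, assuming each control read is summarized as a numeric feature vector (the hyperparameters are illustrative, not DAJIN2's):

  ```python
  import numpy as np
  from sklearn.neighbors import LocalOutlierFactor


  def filter_abnormal_control_reads(X_control: np.ndarray) -> np.ndarray:
      """Keep only the control reads that LOF labels as inliers."""
      lof = LocalOutlierFactor(n_neighbors=20)
      predictions = lof.fit_predict(X_control)  # 1 = inlier, -1 = outlier
      return X_control[predictions == 1]
  ```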


## 🐛 Bug Fixes

### Consensus

+ Fixed a bug in which deletions inside a large deletion were not reflected [Commit Detail](https://github.com/akikuno/DAJIN2/commit/XXX)

## 🔧 Maintenance


## ⛔️ Deprecated

---

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability
+ Save the reference alleles to the FASTA/HTML directories

-------------

# v0.3.5 (2023-12-23)
# Past Logs

<details>
<summary> v0.3.5 (2023-12-23) </summary>

## 📝 Documentation

@@ -26,8 +101,12 @@
+ [x] Added `similarity_searcher.py` to extract control reads resembling the consensus sequence, thereby enhancing the accuracy of detecting sample-specific mutations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/98a8a45e13835502f7dea2622274da81bbbc3ba3)

+ [x] Changed the method in `clust_formatter.get_thresholds` to dynamically define the thresholds for ignoring mutations, instead of using fixed values. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Removed code that was previously commented out [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Add `is_consensus` argument: When it comes to consensus, if the difference between sample and control is more than 20%, it is unconditionally considered a mutation. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7bca4590f97e1858304c3e9fb66c54a279dfcdf0)
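
  Stated as a sketch, with only the 20% rule taken from the changelog (the names and surrounding logic are illustrative):

  ```python
  def exceeds_consensus_threshold(
      sample_rate: float, control_rate: float, is_consensus: bool
  ) -> bool:
      # During consensus calling, a >20% sample-control difference is
      # unconditionally treated as a mutation.
      return is_consensus and (sample_rate - control_rate) > 0.20
  ```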


## 🐛 Bug Fixes

+ None
@@ -46,18 +125,4 @@

+ None

-------------

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability

<!-- TEMPLATE

+ [ ] Added tests to `preprocess.mutation_extractor`

+ [ ] [Commit Detail](https://github.com/akikuno/DAJIN2/commit/)

-->
</details>
40 changes: 34 additions & 6 deletions docs/TROUBLESHOOTING.md
@@ -2,13 +2,31 @@

## Installation

### Long Story Short

```bash
# Update conda
conda update -n base conda -y

# Setup of Bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

# Install DAJIN2 to a virtual environment
conda create -n env-dajin2 python=3.10 -y
conda activate env-dajin2
conda install -c bioconda DAJIN2 -y
```

### Prerequisites

Before installing `DAJIN2`, please ensure that your system meets the following requirements:

- Python >=3.7
- Unix-like environment (Linux, macOS, WSL, etc.)
- [conda](https://docs.conda.io/en/latest/) or [mamba](https://mamba.readthedocs.io/en/latest/) is highly recommended for managing dependencies
- [Conda](https://docs.conda.io/en/latest/) is highly recommended for managing dependencies
- If using pip, access to administrative privileges or the ability to install packages globally

### Dependencies
@@ -24,15 +42,25 @@ Before installing `DAJIN2`, please ensure that your system meets the following r

We strongly recommend using Conda or Mamba for installation, as they efficiently manage the complex dependencies of `pysam` and `htslib`:

1. Install Conda or Mamba if you haven't already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).
1. Install the latest Conda if you have not already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).

2. Setup the [Bioconda](https://bioconda.github.io/) channel:

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

3. Create a new virtual environment (optional but recommended):

2. Create a new environment (optional but recommended):
```bash
conda create -n DAJIN2-env
conda activate DAJIN2-env
conda create -n env-dajin2
conda activate env-dajin2
```

3. Install `DAJIN2`:
4. Install `DAJIN2` in the virtual environment:
```bash
conda install -c bioconda DAJIN2
```
2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@

setuptools.setup(
name="DAJIN2",
version="0.3.5",
version="0.3.6",
author="Akihiro Kuno",
author_email="[email protected]",
description="One-step genotyping tools for targeted long-read sequencing",
97 changes: 97 additions & 0 deletions src/DAJIN2/core/classification/allele_merger.py
@@ -0,0 +1,97 @@
from __future__ import annotations

from itertools import groupby
from collections import defaultdict


##########################################################
# merge minor alleles
##########################################################


def count_alleles(score_of_each_alleles: list[dict]) -> dict[str, int]:
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
allele_counts = defaultdict(int)
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
allele_counts[allele_predicted] += 1

return allele_counts


def extract_minor_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> dict[str, int]:
return {allele: value for allele, value in count_allele.items() if value < threshold_readnumber}


def extract_major_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> set[str]:
return {allele for allele, value in count_allele.items() if value >= threshold_readnumber}


def sum_dictionaries(dict1: dict[str, int], dict2: dict[str, int]) -> dict[str, int]:
"""Sum the values of two dictionaries key by key."""
return {key: dict1.get(key, 0) + dict2.get(key, 0) for key in set(dict1) | set(dict2)}


def split_major_minor_alleles(
score_of_each_alleles: list[dict], minor_alleles: dict[str, int]
) -> tuple[list[dict], list[dict]]:
score_on_minor_alleles = []
score_on_major_alleles = []
score_of_each_alleles.sort(key=lambda x: [x["QNAME"]])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted in minor_alleles:
score_on_minor_alleles.extend(group)
else:
score_on_major_alleles.extend(group)

return score_on_major_alleles, score_on_minor_alleles


def extract_most_major_allele(major_alleles: dict[str, int]) -> str:
return max(major_alleles, key=lambda x: major_alleles[x])


def replace_negative_inf_with_most_major_allele(
score_of_minor_alleles: list[dict], all_allele_counts: dict[str, int]
) -> list[dict]:
most_major_allele = extract_most_major_allele(all_allele_counts)
replaced_scores = []
for s in score_of_minor_alleles:
# Reads scored down to -inf no longer match any minor allele,
# so reassign them to the most major allele.
if s["SCORE"] == float("-inf"):
s["ALLELE"] = most_major_allele
replaced_scores.append(s)
return replaced_scores


def merge_minor_alleles(score_of_each_alleles, threshold_readnumber: int = 10) -> list[dict]:
score_of_each_alleles.sort(key=lambda x: [x["QNAME"]])

all_allele_counts = count_alleles(score_of_each_alleles)
minor_allele_counts = extract_minor_alleles(all_allele_counts, threshold_readnumber)

score_of_alleles_merged, score_of_minor_alleles = split_major_minor_alleles(
score_of_each_alleles, minor_allele_counts
)

score_of_minor_alleles.sort(key=lambda x: x["QNAME"])
new_major_allele = set()
for minor_allele in sorted(minor_allele_counts, key=minor_allele_counts.get):
if minor_allele in new_major_allele:
continue
for _, group in groupby(score_of_minor_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted != minor_allele:
continue
for g in group:
if g["ALLELE"] in minor_allele:
g["SCORE"] = float("-inf")

allele_counts = count_alleles(score_of_minor_alleles)
new_major_allele |= extract_major_alleles(allele_counts, threshold_readnumber)

score_of_minor_alleles = replace_negative_inf_with_most_major_allele(score_of_minor_alleles, all_allele_counts)

return score_of_alleles_merged + score_of_minor_alleles
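
A toy usage sketch of the merger above; the records follow the `QNAME`/`ALLELE`/`SCORE` schema used in this file, but the data are invented:

```python
scores = [
    {"QNAME": "read1", "ALLELE": "control", "SCORE": 0.99},
    {"QNAME": "read1", "ALLELE": "insertion1", "SCORE": 0.80},
    {"QNAME": "read2", "ALLELE": "control", "SCORE": 0.70},
    {"QNAME": "read2", "ALLELE": "insertion1", "SCORE": 0.95},
]
# Both alleles have fewer than 10 reads, so after downstream max-score
# selection every read lands on the most major allele ("control").
merged = merge_minor_alleles(scores, threshold_readnumber=10)
```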
32 changes: 18 additions & 14 deletions src/DAJIN2/core/classification/classifier.py
@@ -1,38 +1,39 @@
from __future__ import annotations

import midsv
from pathlib import Path
from itertools import groupby

from DAJIN2.utils import io
from DAJIN2.core.classification.allele_merger import merge_minor_alleles

def _calc_match(CSSPLIT: str) -> float:
match_score = CSSPLIT.count("=")
match_score -= CSSPLIT.count("+") # insertion
match_score -= sum(cs.islower() for cs in CSSPLIT) # inversion
cssplit = CSSPLIT.split(",")

return match_score / len(cssplit)
def calc_match(cssplit: str) -> float:
match_score = cssplit.count("=")
match_score -= cssplit.count("+") # insertion
match_score -= sum(cs.islower() for cs in cssplit) # inversion

return match_score / len(cssplit.split(","))

def _score_allele(TEMPDIR: Path, allele: str, SAMPLE_NAME: str) -> list[dict]:
midsv_sample = midsv.read_jsonl(Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json"))
scored_alleles = []

def score_allele(path_midsv: Path, allele: str) -> list[dict]:
midsv_sample = io.read_jsonl(path_midsv)
scored_alleles = []
for dict_midsv in midsv_sample:
score = _calc_match(dict_midsv["CSSPLIT"])
score = calc_match(dict_midsv["CSSPLIT"])
dict_midsv.update({"SCORE": score, "ALLELE": allele})
scored_alleles.append(dict_midsv)

return scored_alleles


def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
def extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
alleles_with_max_score = []
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
max_read = max(group, key=lambda x: x["SCORE"])
del max_read["SCORE"]
alleles_with_max_score.append(max_read)

return alleles_with_max_score


@@ -44,6 +45,9 @@ def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[d
def classify_alleles(TEMPDIR: Path, FASTA_ALLELES: dict, SAMPLE_NAME: str) -> list[dict]:
score_of_each_alleles = []
for allele in FASTA_ALLELES:
score_of_each_alleles.extend(_score_allele(TEMPDIR, allele, SAMPLE_NAME))
path_midsv = Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json")
score_of_each_alleles.extend(score_allele(path_midsv, allele))

score_of_each_alleles_merged = merge_minor_alleles(score_of_each_alleles)

return _extract_alleles_with_max_score(score_of_each_alleles)
return extract_alleles_with_max_score(score_of_each_alleles_merged)
22 changes: 16 additions & 6 deletions src/DAJIN2/core/clustering/clustering.py
@@ -22,20 +22,30 @@
###############################################################################


def optimize_labels(X: spmatrix, coverage_sample, coverage_control) -> list[int]:
def count_number_of_clusters(labels_control: list[int], coverage_control: int) -> int:
"""Reads < 1% in the control are considered clustering errors and are not counted"""
return sum(1 for reads in Counter(labels_control).values() if reads / coverage_control > 0.01)


def optimize_labels(X: spmatrix, coverage_sample: int, coverage_control: int) -> list[int]:
labels_previous = list(range(coverage_sample))
for i in range(1, coverage_sample):
np.random.seed(seed=1)
labels_all = BisectingKMeans(n_clusters=i, random_state=1).fit_predict(X).tolist()

labels_sample = labels_all[:coverage_sample]
labels_control = labels_all[coverage_sample:]
labels_current = merge_labels(labels_control, labels_sample)
labels_current = merge_labels(labels_control, labels_sample, labels_previous)
# print(i, Counter(labels_sample), Counter(labels_control), Counter(labels_current)) # ! DEBUG
# Reads < 1% in the control are considered clustering errors and are not counted
count_control = Counter(labels_control)
num_labels_control = sum(1 for reads in count_control.values() if reads / coverage_control > 0.01)

num_labels_control = count_number_of_clusters(labels_control, coverage_control)
mutual_info = metrics.adjusted_rand_score(labels_previous, labels_current)
# Report the number of clusters in SAMPLE when the number of clusters in CONTROL is split into more than one.

"""
Return the number of clusters when:
- the number of clusters in control is split into more than one.
- the mutual information between the current and previous labels is high enough (= similar).
"""
if num_labels_control >= 2:
return labels_previous
if 0.95 <= mutual_info <= 1.0: