Merge pull request #19 from akikuno/develop-0.3.6

Develop 0.3.6

akikuno authored Jan 9, 2024
2 parents f98197e + 40588a1 commit f3cd582
Showing 30 changed files with 726 additions and 571 deletions.
97 changes: 81 additions & 16 deletions docs/ROADMAP.md
@@ -1,8 +1,83 @@
# Development Logs of DAJIN2

<!-- TEMPLATE
# v0.0.0 (yyyy-mm-dd)
## 📝 Documentation
## 🚀 Features
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
+ [ ] XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
-->

<!-- memo ToDo
- barcode09 allele1: `N`
- barcode11 allele2: the large deletion is not reflected
- barcode28 allele1: `N`
- Output FASTQ and VCF
-->
# v0.3.6 (yyyy-mm-dd)

## 📝 Documentation

+ Added a quick guide for installation to TROUBLESHOOTING.md [Commit Detail](https://github.com/akikuno/DAJIN2/commit/cefed0ff4d04282b9915486be07de85b2b77b657)

## 🚀 Features

### Preprocess

+ Updated `input_validator.py`: The UCSC Blat server sometimes returns an HTTP 200 status code even when an error occurs. In such cases, "Very Early Error" appears in the page title, so the validator now returns False in those situations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4ad9c9ef8bd963a6e20c1721480aed0fe7922760)
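
  The check described above can be sketched as follows. This is a minimal illustration, not DAJIN2's actual code; the function names (`title_reports_error`, `is_valid_blat_response`) are assumptions:

  ```python
  import re

  def title_reports_error(html: str) -> bool:
      """True when the page title contains UCSC Blat's 'Very Early Error' marker."""
      match = re.search(r"<title>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
      return bool(match and "Very Early Error" in match.group(1))

  def is_valid_blat_response(status_code: int, html: str) -> bool:
      """A 200 status alone is not enough: the body must also be error-free."""
      return status_code == 200 and not title_reports_error(html)
  ```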

+ Simplified `homopolymer_handler.py` for error detection using cosine similarity [Commit Detail](https://github.com/akikuno/DAJIN2/commit/21c2596805c36074f360285600e60ee76b948908)

+ Updated `mutation_extractor.py` to use cosine similarity to filter out dissimilar loci [Commit Detail](https://github.com/akikuno/DAJIN2/commit/c9f5aa7b48581e58d99fe8c31275c422756aa9f1)

+ Updated `mutation_extractor.identify_dissimilar_loci` so that it unconditionally returns True when the sample shows more than 5% variation relative to the control. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/0cbec5217fdfba6886979eb86cf970b587e83e5f)
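
  The two rules above can be sketched together. This is an illustrative simplification: the real `identify_dissimilar_loci` operates on midsv-derived mutation profiles, and the exact thresholds and inputs here are assumptions:

  ```python
  import numpy as np

  def cosine_similarity(x, y) -> float:
      x = np.asarray(x, dtype=float)
      y = np.asarray(y, dtype=float)
      denom = float(np.linalg.norm(x) * np.linalg.norm(y))
      return float(np.dot(x, y)) / denom if denom else 1.0

  def identify_dissimilar_loci(sample_rates, control_rates,
                               similarity_threshold: float = 0.95,
                               difference_threshold: float = 0.05) -> bool:
      """Return True when the locus should be kept as a candidate mutation."""
      sample = np.asarray(sample_rates, dtype=float)
      control = np.asarray(control_rates, dtype=float)
      # Unconditional keep: the sample varies more than 5% beyond the control.
      if sample.mean() - control.mean() > difference_threshold:
          return True
      # Otherwise keep the locus only if its profile is dissimilar to the control.
      return cosine_similarity(sample, control) < similarity_threshold
  ```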

+ Added `preprocess.midsv_caller.convert_consecutive_indels_to_match`: Due to alignment errors, a true match can be mistakenly replaced with an insertion following a deletion. For example, "=C,=T" may be replaced by "-C,+C|=T"; such cases are now reverted to "=C,=T". [Commit Detail](https://github.com/akikuno/DAJIN2/commit/69c56fa904ef847dc5b0e2dcdb90303409412d0f)
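
  A minimal sketch of the reversion logic, assuming midsv-style CSSPLIT tags ("=C" match, "-C" deletion, "+C|=T" insertion anchored to the following operation). It handles only the single-base case described above, not DAJIN2's full implementation:

  ```python
  def convert_consecutive_indels_to_match(cssplit: str) -> str:
      """Revert a deletion immediately followed by an insertion of the same base."""
      converted = []
      for op in cssplit.split(","):
          if converted and op.startswith("+"):
              insertion, _, anchor = op.partition("|")
              prev = converted[-1]
              # "-C,+C|=T" is an alignment artifact: restore it to "=C,=T"
              if prev.startswith("-") and insertion[1:] == prev[1:]:
                  converted[-1] = "=" + prev[1:]
                  converted.append(anchor)
                  continue
          converted.append(op)
      return ",".join(converted)
  ```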

### Classification

+ Added `allele_merger.merge_minor_alleles` to reclassify alleles with fewer than 10 reads, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/b0752960def313e237ccf7d44542f9810cad0c00)

### Clustering

+ Added `merge_minor_cluster` to revert clusters with fewer than 10 reads back to their previous labels, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4bd9f7dd806d192475d8d4f20c1e50c37281d64e)

+ Updated `generate_mutation_kmers` to replace indices not registered in mutation_loci with "@". For example, if there are no mutations in mutation_loci, "=G,=C,-C" and "~G,=G,=C" both become "@,@,@", making them identical and ensuring they do not affect clustering. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/9eefaaa1a9be3922b60655292c0a310e0f5fc76d)
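
  The masking idea can be sketched as follows. The signature and the per-locus representation (`mutation_loci` as one set of operation symbols per position) are assumptions for illustration, not DAJIN2's actual data structures:

  ```python
  def mask_non_mutation_loci(cssplit: str, mutation_loci: list[set[str]]) -> list[str]:
      """Replace positions absent from mutation_loci with '@' so that
      alignment noise at non-mutated loci cannot separate reads."""
      masked = []
      for i, op in enumerate(cssplit.split(",")):
          # op[0] is the operation symbol: "=", "+", "-", "*", ...
          if op[0] in mutation_loci[i]:
              masked.append(op)
          else:
              masked.append("@")
      return masked
  ```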

### Consensus

+ Used `LocalOutlierFactor` to filter abnormal control reads [Commit Detail](https://github.com/akikuno/DAJIN2/commit/94c48da01fc039902c97a23ddea47dd5f2b42ab4)
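
  A sketch of such filtering with scikit-learn's `LocalOutlierFactor`. The feature matrix here is a stand-in (e.g. per-read summary statistics), not DAJIN2's actual read representation:

  ```python
  import numpy as np
  from sklearn.neighbors import LocalOutlierFactor

  def filter_abnormal_control_reads(features: np.ndarray) -> np.ndarray:
      """Keep only reads labelled as inliers (+1) by LocalOutlierFactor."""
      lof = LocalOutlierFactor(n_neighbors=20)
      labels = lof.fit_predict(features)  # +1 = inlier, -1 = outlier
      return features[labels == 1]
  ```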


## 🐛 Bug Fixes

### Consensus

+ Fixed a bug where deletions inside a large deletion were not reflected [Commit Detail](https://github.com/akikuno/DAJIN2/commit/XXX)

## 🔧 Maintenance


## ⛔️ Deprecated

---

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability
+ Save the Reference alleles to the FASTA/HTML directories

-------------

# Past Logs

<details>
<summary> v0.3.5 (2023-12-23) </summary>

## 📝 Documentation

@@ -26,8 +101,12 @@
+ [x] Added `similarity_searcher.py` to extract control reads resembling the consensus sequence, thereby enhancing the accuracy of detecting sample-specific mutations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/98a8a45e13835502f7dea2622274da81bbbc3ba3)

+ [x] Changed the method in `clust_formatter.get_thresholds` to dynamically define the thresholds for ignoring mutations, instead of using fixed values. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Removed code that was previously commented out [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Added `is_consensus` argument: during consensus calling, if the difference between sample and control exceeds 20%, it is unconditionally considered a mutation. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7bca4590f97e1858304c3e9fb66c54a279dfcdf0)


## 🐛 Bug Fixes

+ None
@@ -46,18 +125,4 @@

+ None

-------------

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability

<!-- TEMPLATE
+ [ ] Added tests to `preprocess.mutation_extractor`
+ [ ] [Commit Detail](https://github.com/akikuno/DAJIN2/commit/)
-->
</details>
40 changes: 34 additions & 6 deletions docs/TROUBLESHOOTING.md
@@ -2,13 +2,31 @@

## Installation

### Long Story Short

```bash
# Update conda
conda update -n base conda -y

# Setup of Bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

# Install DAJIN2 to a virtual environment
conda create -n env-dajin2 python=3.10 -y
conda activate env-dajin2
conda install -c bioconda DAJIN2 -y
```

### Prerequisites

Before installing `DAJIN2`, please ensure that your system meets the following requirements:

- Python >=3.7
- Unix-like environment (Linux, macOS, WSL, etc.)
- [conda](https://docs.conda.io/en/latest/) or [mamba](https://mamba.readthedocs.io/en/latest/) is highly recommended for managing dependencies
- [Conda](https://docs.conda.io/en/latest/) is highly recommended for managing dependencies
- If using pip, access to administrative privileges or the ability to install packages globally

### Dependencies
@@ -24,15 +42,25 @@ Before installing `DAJIN2`, please ensure that your system meets the following r

We strongly recommend using Conda or Mamba for installation, as they efficiently manage the complex dependencies of `pysam` and `htslib`:

1. Install Conda or Mamba if you haven't already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).
1. Install the latest Conda if you have not already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).

2. Setup the [Bioconda](https://bioconda.github.io/) channel:

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

3. Create a new virtual environment (optional but recommended):

2. Create a new environment (optional but recommended):
```bash
conda create -n DAJIN2-env
conda activate DAJIN2-env
conda create -n env-dajin2
conda activate env-dajin2
```

3. Install `DAJIN2`:
4. Install `DAJIN2` in the virtual environment:
```bash
conda install -c bioconda DAJIN2
```
2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@

setuptools.setup(
name="DAJIN2",
version="0.3.5",
version="0.3.6",
author="Akihiro Kuno",
author_email="[email protected]",
description="One-step genotyping tools for targeted long-read sequencing",
97 changes: 97 additions & 0 deletions src/DAJIN2/core/classification/allele_merger.py
@@ -0,0 +1,97 @@
from __future__ import annotations

from itertools import groupby
from collections import defaultdict


##########################################################
# merge minor alleles
##########################################################


def count_alleles(score_of_each_alleles: list[dict]) -> dict[str, int]:
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
allele_counts = defaultdict(int)
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
allele_counts[allele_predicted] += 1

return allele_counts


def extract_minor_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> dict[str, int]:
return {allele: value for allele, value in count_allele.items() if value < threshold_readnumber}


def extract_major_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> set[str]:
return {allele for allele, value in count_allele.items() if value >= threshold_readnumber}


def sum_dictionaries(dict1, dict2):
"""Function to sum the values of two dictionaries based on their keys."""
return {key: dict1.get(key, 0) + dict2.get(key, 0) for key in set(dict1) | set(dict2)}


def split_major_minor_alleles(
score_of_each_alleles: list[dict], minor_alleles: dict[str, int]
) -> tuple[list[dict], list[dict]]:
score_on_minor_alleles = []
score_on_major_alleles = []
    score_of_each_alleles.sort(key=lambda x: x["QNAME"])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted in minor_alleles:
score_on_minor_alleles.extend(group)
else:
score_on_major_alleles.extend(group)

return score_on_major_alleles, score_on_minor_alleles


def extract_most_major_allele(major_alleles: dict[str, int]) -> str:
return max(major_alleles, key=lambda x: major_alleles[x])


def replace_negative_inf_with_most_major_allele(
    score_of_minor_alleles: list[dict], all_allele_counts: dict[str, int]
) -> list[dict]:
    most_major_allele = extract_most_major_allele(all_allele_counts)
    replaced = []
    for s in score_of_minor_alleles:
        if s["SCORE"] == float("-inf"):
            s["ALLELE"] = most_major_allele
        replaced.append(s)
    return replaced


def merge_minor_alleles(score_of_each_alleles, threshold_readnumber: int = 10) -> list[dict]:
    score_of_each_alleles.sort(key=lambda x: x["QNAME"])

all_allele_counts = count_alleles(score_of_each_alleles)
minor_allele_counts = extract_minor_alleles(all_allele_counts, threshold_readnumber)

score_of_alleles_merged, score_of_minor_alleles = split_major_minor_alleles(
score_of_each_alleles, minor_allele_counts
)

score_of_minor_alleles.sort(key=lambda x: x["QNAME"])
new_major_allele = set()
for minor_allele in sorted(minor_allele_counts, key=minor_allele_counts.get):
if minor_allele in new_major_allele:
continue
for _, group in groupby(score_of_minor_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted != minor_allele:
continue
for g in group:
                if g["ALLELE"] == minor_allele:
g["SCORE"] = float("-inf")

allele_counts = count_alleles(score_of_minor_alleles)
new_major_allele |= extract_major_alleles(allele_counts, threshold_readnumber)

score_of_minor_alleles = replace_negative_inf_with_most_major_allele(score_of_minor_alleles, all_allele_counts)

return score_of_alleles_merged + score_of_minor_alleles
32 changes: 18 additions & 14 deletions src/DAJIN2/core/classification/classifier.py
@@ -1,38 +1,39 @@
from __future__ import annotations

import midsv
from pathlib import Path
from itertools import groupby

from DAJIN2.utils import io
from DAJIN2.core.classification.allele_merger import merge_minor_alleles

def _calc_match(CSSPLIT: str) -> float:
match_score = CSSPLIT.count("=")
match_score -= CSSPLIT.count("+") # insertion
match_score -= sum(cs.islower() for cs in CSSPLIT) # inversion
cssplit = CSSPLIT.split(",")

return match_score / len(cssplit)
def calc_match(cssplit: str) -> float:
match_score = cssplit.count("=")
match_score -= cssplit.count("+") # insertion
match_score -= sum(cs.islower() for cs in cssplit) # inversion

return match_score / len(cssplit.split(","))

def _score_allele(TEMPDIR: Path, allele: str, SAMPLE_NAME: str) -> list[dict]:
midsv_sample = midsv.read_jsonl(Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json"))
scored_alleles = []

def score_allele(path_midsv: Path, allele: str) -> list[dict]:
midsv_sample = io.read_jsonl(path_midsv)
scored_alleles = []
for dict_midsv in midsv_sample:
score = _calc_match(dict_midsv["CSSPLIT"])
score = calc_match(dict_midsv["CSSPLIT"])
dict_midsv.update({"SCORE": score, "ALLELE": allele})
scored_alleles.append(dict_midsv)

return scored_alleles


def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
def extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
alleles_with_max_score = []
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
max_read = max(group, key=lambda x: x["SCORE"])
del max_read["SCORE"]
alleles_with_max_score.append(max_read)

return alleles_with_max_score


@@ -44,6 +45,9 @@ def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[d
def classify_alleles(TEMPDIR: Path, FASTA_ALLELES: dict, SAMPLE_NAME: str) -> list[dict]:
score_of_each_alleles = []
for allele in FASTA_ALLELES:
score_of_each_alleles.extend(_score_allele(TEMPDIR, allele, SAMPLE_NAME))
path_midsv = Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json")
score_of_each_alleles.extend(score_allele(path_midsv, allele))

score_of_each_alleles_merged = merge_minor_alleles(score_of_each_alleles)

return _extract_alleles_with_max_score(score_of_each_alleles)
return extract_alleles_with_max_score(score_of_each_alleles_merged)
22 changes: 16 additions & 6 deletions src/DAJIN2/core/clustering/clustering.py
@@ -22,20 +22,30 @@
###############################################################################


def optimize_labels(X: spmatrix, coverage_sample, coverage_control) -> list[int]:
def count_number_of_clusters(labels_control: list[int], coverage_control: int) -> int:
"""Reads < 1% in the control are considered clustering errors and are not counted"""
return sum(1 for reads in Counter(labels_control).values() if reads / coverage_control > 0.01)


def optimize_labels(X: spmatrix, coverage_sample: int, coverage_control: int) -> list[int]:
labels_previous = list(range(coverage_sample))
for i in range(1, coverage_sample):
np.random.seed(seed=1)
labels_all = BisectingKMeans(n_clusters=i, random_state=1).fit_predict(X).tolist()

labels_sample = labels_all[:coverage_sample]
labels_control = labels_all[coverage_sample:]
labels_current = merge_labels(labels_control, labels_sample)
labels_current = merge_labels(labels_control, labels_sample, labels_previous)
# print(i, Counter(labels_sample), Counter(labels_control), Counter(labels_current)) # ! DEBUG
# Reads < 1% in the control are considered clustering errors and are not counted
count_control = Counter(labels_control)
num_labels_control = sum(1 for reads in count_control.values() if reads / coverage_control > 0.01)

num_labels_control = count_number_of_clusters(labels_control, coverage_control)
mutual_info = metrics.adjusted_rand_score(labels_previous, labels_current)
# Report the number of clusters in SAMPLE when the number of clusters in CONTROL is split into more than one.

"""
Return the number of clusters when:
- the number of clusters in control is split into more than one.
- the mutual information between the current and previous labels is high enough (= similar).
"""
if num_labels_control >= 2:
return labels_previous
if 0.95 <= mutual_info <= 1.0: