Develop 0.3.6 #19

Merged · 27 commits · Jan 9, 2024

Commits
728d59a
bump v0.3.6
akikuno Dec 24, 2023
a9399f5
Added `merge_minor_alleles` to reclassify alleles with less than 10 r…
akikuno Dec 24, 2023
4bd9f7d
Added the function `merge_minor_cluster` to revert labels clustered w…
akikuno Dec 24, 2023
cefed0f
Added a short form of installation guide to TROUBLESHOOTING.md
akikuno Dec 25, 2023
a6bdf50
specify python=3.10
akikuno Dec 25, 2023
93c2111
Add line breaks to make the separation of the code more understandable
akikuno Dec 26, 2023
c91baab
update ROADMAP
akikuno Dec 26, 2023
3f66f0b
Remove the `is_insertion` arguments at `preprocess.cache_mutation_loc…
akikuno Dec 28, 2023
6b83646
move `consensus.cache_mutation_loci` to `consensus.mutation_extractor`
akikuno Dec 29, 2023
21c2596
Simplify the error detection using cosine similarity
akikuno Dec 29, 2023
94c48da
Use `LocalOutlierFactor` to filter abnormal control reads
akikuno Dec 29, 2023
f30fabf
Add consensus directory
akikuno Dec 29, 2023
c9f5aa7
Use cosine similarity to filter dissimilar loci
akikuno Dec 29, 2023
8b552c2
Update commit links
akikuno Dec 29, 2023
4ad9c9e
The UCSC Blat server sometimes returns a 200 HTTP status code even wh…
akikuno Dec 29, 2023
69c56fa
Added `preprocess.midsv_caller.convert_consecutive_indels_to_match`:
akikuno Jan 2, 2024
217355d
remove underscore of the functions
akikuno Jan 2, 2024
9eefaaa
Update `generate_mutation_kmers` to consider indices not registere…
akikuno Jan 2, 2024
e50bd57
Change `coverage_match` from `coverage_not_N` to consider matched seq…
akikuno Jan 2, 2024
0cbec52
Update the `identify_dissimilar_loci` function so that it uncondition…
akikuno Jan 2, 2024
5b16f1a
Update `generate_mutation_kmers` to consider indices not registere…
akikuno Jan 2, 2024
71d4303
rename the function `cstag_to_base` to `cssplit_to_base`
akikuno Jan 5, 2024
a57862b
fix the function name of `fetch_seq_coordinates`
akikuno Jan 5, 2024
b075296
Add `allele_merger.py` to merge minor alleles
akikuno Jan 5, 2024
a704118
update ROADMAP.md
akikuno Jan 5, 2024
7bca459
Add `is_consensus` argument: When it comes to consensus, if the diffe…
akikuno Jan 7, 2024
40588a1
update commit log
akikuno Jan 7, 2024
97 changes: 81 additions & 16 deletions docs/ROADMAP.md
@@ -1,8 +1,83 @@
# Development Logs of DAJIN2

<!-- TEMPLATE
# v0.0.0 (yyyy-mm-dd)
## 📝 Documentation
## 🚀 Features
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
+ [ ] XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
-->

<!-- memo ToDo
- `N` in barcode09 allele1
- The large deletion in barcode11 allele2 is not reflected
- `N` in barcode28 allele1
- Output FASTQ and VCF files
-->
# v0.3.6 (yyyy-mm-dd)

## 📝 Documentation

+ Added a quick guide to installation in TROUBLESHOOTING.md [Commit Detail](https://github.com/akikuno/DAJIN2/commit/cefed0ff4d04282b9915486be07de85b2b77b657)

## 🚀 Features

### Preprocess

+ Update `input_validator.py`: The UCSC Blat server sometimes returns a 200 HTTP status code even when an error occurs; in such cases, "Very Early Error" appears in the page title. The validator therefore returns False in those situations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4ad9c9ef8bd963a6e20c1721480aed0fe7922760)
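
  A minimal sketch of such a check, assuming a `requests`-based fetch; the function name and parameters here are hypothetical, not DAJIN2's actual API:

  ```python
  import requests


  def is_blat_response_ok(url: str, params: dict) -> bool:
      """Return False when the Blat server signals an error, even under HTTP 200."""
      response = requests.get(url, params=params, timeout=30)
      if response.status_code != 200:
          return False
      # The server may report "Very Early Error" in the HTML <title>
      # while still returning a 200 status code.
      if "Very Early Error" in response.text:
          return False
      return True
  ```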

+ Simplify the cosine-similarity-based error detection in `homopolymer_handler.py` [Commit Detail](https://github.com/akikuno/DAJIN2/commit/21c2596805c36074f360285600e60ee76b948908)

+ Update `mutation_extractor.py` to use cosine similarity to filter dissimilar loci [Commit Detail](https://github.com/akikuno/DAJIN2/commit/c9f5aa7b48581e58d99fe8c31275c422756aa9f1)

+ Update `mutation_extractor.identify_dissimilar_loci` so that it unconditionally returns True if the 'sample' shows more than 5% variation compared to the 'control'. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/0cbec5217fdfba6886979eb86cf970b587e83e5f)
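
  A sketch of the combined decision rule these two updates describe. The 5% cutoff comes from the changelog; the function names, window size, and 0.95 similarity threshold are assumptions for illustration:

  ```python
  import numpy as np


  def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
      return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


  def identify_dissimilar_locus(
      sample_rates: np.ndarray, control_rates: np.ndarray, locus: int, flank: int = 5
  ) -> bool:
      # Unconditionally flag the locus when the sample deviates from the
      # control by more than 5 percentage points.
      if sample_rates[locus] - control_rates[locus] > 0.05:
          return True
      # Otherwise flag it only when the local mutation-rate profiles diverge.
      window = slice(max(0, locus - flank), locus + flank + 1)
      return cosine_similarity(sample_rates[window], control_rates[window]) < 0.95
  ```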

+ Add `preprocess.midsv_caller.convert_consecutive_indels_to_match`: Due to alignment errors, there can be instances where a true match is mistakenly replaced with "insertion following a deletion". For example, although it should be "=C,=T", it gets replaced by "-C,+C|=T". In such cases, a process is performed to revert it back to "=C,=T". [Commit Detail](https://github.com/akikuno/DAJIN2/commit/69c56fa904ef847dc5b0e2dcdb90303409412d0f)
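
  A simplified sketch of this reversion, covering only the single-base case in the changelog's example (real MIDSV tags can encode multi-base insertions, which this sketch ignores):

  ```python
  def convert_consecutive_indels_to_match(cssplit: str) -> str:
      """Collapse a deletion immediately re-inserted as the same base
      (e.g. '-C,+C|=T') back into matches ('=C,=T')."""
      fixed: list[str] = []
      for op in cssplit.split(","):
          prev = fixed[-1] if fixed else ""
          if prev.startswith("-") and op.startswith("+") and "|" in op:
              inserted, _, rest = op.partition("|")
              if inserted[1:] == prev[1:]:  # same base deleted, then inserted
                  fixed[-1] = "=" + prev[1:]  # '-C' -> '=C'
                  fixed.append(rest)          # '+C|=T' -> '=T'
                  continue
          fixed.append(op)
      return ",".join(fixed)


  # The changelog's example: "-C,+C|=T" reverts to "=C,=T"
  assert convert_consecutive_indels_to_match("-C,+C|=T") == "=C,=T"
  ```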

### Classification

+ Added `allele_merger.merge_minor_alleles` to reclassify alleles with fewer than 10 reads, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/b0752960def313e237ccf7d44542f9810cad0c00)

### Clustering

+ Added the function `merge_minor_cluster` to revert labels clustered with fewer than 10 reads back to the previous labels, suppressing excessive subdivision of alleles. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/4bd9f7dd806d192475d8d4f20c1e50c37281d64e)
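
  One way this reversion could look, assuming per-read label lists from the current and previous clustering rounds (the signature and names are assumed):

  ```python
  from collections import Counter


  def merge_minor_cluster(
      labels_current: list[int], labels_previous: list[int], threshold_reads: int = 10
  ) -> list[int]:
      """Revert reads in clusters with fewer than `threshold_reads` reads
      back to the labels they carried before this clustering round."""
      counts = Counter(labels_current)
      return [
          previous if counts[current] < threshold_reads else current
          for current, previous in zip(labels_current, labels_previous)
      ]
  ```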

+ Update `generate_mutation_kmers` to consider indices not registered in mutation_loci as mutations by replacing them with "@". For example, if there are no mutations in mutation_loci, "=G,=C,-C" and "~G,=G,=C" both become "@,@,@", making them identical and ensuring they do not affect clustering. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/9eefaaa1a9be3922b60655292c0a310e0f5fc76d)
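
  The masking step could look like this sketch, which reproduces the changelog example (the `mutation_loci` layout and the helper name are assumptions):

  ```python
  def mask_unregistered_loci(cssplit: str, mutation_loci: list[set[str]]) -> list[str]:
      """Replace operations at loci without registered mutations with '@'
      so they cannot influence clustering."""
      masked = []
      for i, op in enumerate(cssplit.split(",")):
          # op[0] is the operation symbol: '=', '-', '+', '*', ...
          masked.append(op if op[0] in mutation_loci[i] else "@")
      return masked


  # With no registered mutations, different reads collapse to identical k-mers:
  loci: list[set[str]] = [set(), set(), set()]
  assert mask_unregistered_loci("=G,=C,-C", loci) == ["@", "@", "@"]
  assert mask_unregistered_loci("~G,=G,=C", loci) == ["@", "@", "@"]
  ```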

### Consensus

+ Use `LocalOutlierFactor` to filter abnormal control reads [Commit Detail](https://github.com/akikuno/DAJIN2/commit/94c48da01fc039902c97a23ddea47dd5f2b42ab4)
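
  A minimal scikit-learn sketch, assuming each control read is summarized as a numeric feature vector (the hyperparameters are illustrative, not DAJIN2's):

  ```python
  import numpy as np
  from sklearn.neighbors import LocalOutlierFactor


  def filter_abnormal_control_reads(X_control: np.ndarray) -> np.ndarray:
      """Keep only the control reads that LOF labels as inliers."""
      lof = LocalOutlierFactor(n_neighbors=20)
      predictions = lof.fit_predict(X_control)  # 1 = inlier, -1 = outlier
      return X_control[predictions == 1]
  ```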


## 🐛 Bug Fixes

### Consensus

+ Fixed a bug in which deletions inside a large deletion were not reflected [Commit Detail](https://github.com/akikuno/DAJIN2/commit/XXX)

## 🔧 Maintenance


## ⛔️ Deprecated

---

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability
+ Save the reference alleles to the FASTA/HTML directories

-------------

# v0.3.5 (2023-12-23)
# Past Logs

<details>
<summary> v0.3.5 (2023-12-23) </summary>

## 📝 Documentation

@@ -26,8 +101,12 @@
+ [x] Added `similarity_searcher.py` to extract control reads resembling the consensus sequence, thereby enhancing the accuracy of detecting sample-specific mutations. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/98a8a45e13835502f7dea2622274da81bbbc3ba3)

+ [x] Changed the method in `clust_formatter.get_thresholds` to dynamically define the thresholds for ignoring mutations, instead of using fixed values. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Removed code that was previously commented out [Commit Detail](https://github.com/akikuno/DAJIN2/commit/2249d1601ad619a7db0fcc9ebf79d63f8dcf164b)

+ [x] Add `is_consensus` argument: When it comes to consensus, if the difference between sample and control is more than 20%, it is unconditionally considered a mutation. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7bca4590f97e1858304c3e9fb66c54a279dfcdf0)
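
  Stated as a sketch, with only the 20% rule taken from the changelog (the names and surrounding logic are illustrative):

  ```python
  def exceeds_consensus_threshold(
      sample_rate: float, control_rate: float, is_consensus: bool
  ) -> bool:
      # During consensus calling, a >20% sample-control difference is
      # unconditionally treated as a mutation.
      return is_consensus and (sample_rate - control_rate) > 0.20
  ```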


## 🐛 Bug Fixes

+ None
@@ -46,18 +125,4 @@

+ None

-------------

# 💡 Future Tasks

+ Remove minor alleles with predicted insertion
+ Enhance the Clarity of Insertion Allele Identification.
+ Develop and Integrate Inversion Detection Capability

<!-- TEMPLATE

+ [ ] Added tests to `preprocess.mutation_extractor`

+ [ ] [Commit Detail](https://github.com/akikuno/DAJIN2/commit/)

-->
</details>
40 changes: 34 additions & 6 deletions docs/TROUBLESHOOTING.md
@@ -2,13 +2,31 @@

## Installation

### Long Story Short

```bash
# Update conda
conda update -n base conda -y

# Setup of Bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

# Install DAJIN2 to a virtual environment
conda create -n env-dajin2 python=3.10 -y
conda activate env-dajin2
conda install -c bioconda DAJIN2 -y
```

### Prerequisites

Before installing `DAJIN2`, please ensure that your system meets the following requirements:

- Python >=3.7
- Unix-like environment (Linux, macOS, WSL, etc.)
- [conda](https://docs.conda.io/en/latest/) or [mamba](https://mamba.readthedocs.io/en/latest/) is highly recommended for managing dependencies
- [Conda](https://docs.conda.io/en/latest/) is highly recommended for managing dependencies
- If using pip, access to administrative privileges or the ability to install packages globally

### Dependencies
@@ -24,15 +42,25 @@ Before installing `DAJIN2`, please ensure that your system meets the following r

We strongly recommend using Conda or Mamba for installation, as they efficiently manage the complex dependencies of `pysam` and `htslib`:

1. Install Conda or Mamba if you haven't already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).
1. Install the latest Conda if you have not already. You can download Conda from [here](https://docs.conda.io/en/latest/miniconda.html).

2. Setup the [Bioconda](https://bioconda.github.io/) channel:

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

3. Create a new virtual environment (optional but recommended):

2. Create a new environment (optional but recommended):
```bash
conda create -n DAJIN2-env
conda activate DAJIN2-env
conda create -n env-dajin2
conda activate env-dajin2
```

3. Install `DAJIN2`:
4. Install `DAJIN2` in the virtual environment:
```bash
conda install -c bioconda DAJIN2
```
2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@

setuptools.setup(
name="DAJIN2",
version="0.3.5",
version="0.3.6",
author="Akihiro Kuno",
author_email="[email protected]",
description="One-step genotyping tools for targeted long-read sequencing",
97 changes: 97 additions & 0 deletions src/DAJIN2/core/classification/allele_merger.py
@@ -0,0 +1,97 @@
from __future__ import annotations

from itertools import groupby
from collections import defaultdict


##########################################################
# merge minor alleles
##########################################################


def count_alleles(score_of_each_alleles: list[dict]) -> dict[str, int]:
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
allele_counts = defaultdict(int)
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
allele_counts[allele_predicted] += 1

return allele_counts


def extract_minor_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> dict[str, int]:
return {allele: value for allele, value in count_allele.items() if value < threshold_readnumber}


def extract_major_alleles(count_allele: dict[str, int], threshold_readnumber: int = 10) -> set[str]:
return {allele for allele, value in count_allele.items() if value >= threshold_readnumber}


def sum_dictionaries(dict1: dict[str, int], dict2: dict[str, int]) -> dict[str, int]:
"""Sum the values of two dictionaries key by key."""
return {key: dict1.get(key, 0) + dict2.get(key, 0) for key in set(dict1) | set(dict2)}


def split_major_minor_alleles(
score_of_each_alleles: list[dict], minor_alleles: dict[str, int]
) -> tuple[list[dict], list[dict]]:
score_on_minor_alleles = []
score_on_major_alleles = []
score_of_each_alleles.sort(key=lambda x: [x["QNAME"]])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted in minor_alleles:
score_on_minor_alleles.extend(group)
else:
score_on_major_alleles.extend(group)

return score_on_major_alleles, score_on_minor_alleles


def extract_most_major_allele(major_alleles: dict[str, int]) -> str:
return max(major_alleles, key=lambda x: major_alleles[x])


def replace_negative_inf_with_most_major_allele(
score_of_minor_alleles: list[dict], all_allele_counts: dict[str, int]
) -> list[dict]:
most_major_allele = extract_most_major_allele(all_allele_counts)
replaced_scores = []
for s in score_of_minor_alleles:
# Reads scored down to -inf no longer match any minor allele,
# so reassign them to the most major allele.
if s["SCORE"] == float("-inf"):
s["ALLELE"] = most_major_allele
replaced_scores.append(s)
return replaced_scores


def merge_minor_alleles(score_of_each_alleles, threshold_readnumber: int = 10) -> list[dict]:
score_of_each_alleles.sort(key=lambda x: [x["QNAME"]])

all_allele_counts = count_alleles(score_of_each_alleles)
minor_allele_counts = extract_minor_alleles(all_allele_counts, threshold_readnumber)

score_of_alleles_merged, score_of_minor_alleles = split_major_minor_alleles(
score_of_each_alleles, minor_allele_counts
)

score_of_minor_alleles.sort(key=lambda x: x["QNAME"])
new_major_allele = set()
for minor_allele in sorted(minor_allele_counts, key=minor_allele_counts.get):
if minor_allele in new_major_allele:
continue
for _, group in groupby(score_of_minor_alleles, key=lambda x: x["QNAME"]):
group = list(group)
allele_predicted = max(group, key=lambda x: x["SCORE"])["ALLELE"]
if allele_predicted != minor_allele:
continue
for g in group:
if g["ALLELE"] in minor_allele:
g["SCORE"] = float("-inf")

allele_counts = count_alleles(score_of_minor_alleles)
new_major_allele |= extract_major_alleles(allele_counts, threshold_readnumber)

score_of_minor_alleles = replace_negative_inf_with_most_major_allele(score_of_minor_alleles, all_allele_counts)

return score_of_alleles_merged + score_of_minor_alleles
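
A toy usage sketch of the merger above; the records follow the `QNAME`/`ALLELE`/`SCORE` schema used in this file, but the data are invented:

```python
scores = [
    {"QNAME": "read1", "ALLELE": "control", "SCORE": 0.99},
    {"QNAME": "read1", "ALLELE": "insertion1", "SCORE": 0.80},
    {"QNAME": "read2", "ALLELE": "control", "SCORE": 0.70},
    {"QNAME": "read2", "ALLELE": "insertion1", "SCORE": 0.95},
]
# Both alleles have fewer than 10 reads, so after downstream max-score
# selection every read lands on the most major allele ("control").
merged = merge_minor_alleles(scores, threshold_readnumber=10)
```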
32 changes: 18 additions & 14 deletions src/DAJIN2/core/classification/classifier.py
@@ -1,38 +1,39 @@
from __future__ import annotations

import midsv
from pathlib import Path
from itertools import groupby

from DAJIN2.utils import io
from DAJIN2.core.classification.allele_merger import merge_minor_alleles

def _calc_match(CSSPLIT: str) -> float:
match_score = CSSPLIT.count("=")
match_score -= CSSPLIT.count("+") # insertion
match_score -= sum(cs.islower() for cs in CSSPLIT) # inversion
cssplit = CSSPLIT.split(",")

return match_score / len(cssplit)
def calc_match(cssplit: str) -> float:
match_score = cssplit.count("=")
match_score -= cssplit.count("+") # insertion
match_score -= sum(cs.islower() for cs in cssplit) # inversion

return match_score / len(cssplit.split(","))

def _score_allele(TEMPDIR: Path, allele: str, SAMPLE_NAME: str) -> list[dict]:
midsv_sample = midsv.read_jsonl(Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json"))
scored_alleles = []

def score_allele(path_midsv: Path, allele: str) -> list[dict]:
midsv_sample = io.read_jsonl(path_midsv)
scored_alleles = []
for dict_midsv in midsv_sample:
score = _calc_match(dict_midsv["CSSPLIT"])
score = calc_match(dict_midsv["CSSPLIT"])
dict_midsv.update({"SCORE": score, "ALLELE": allele})
scored_alleles.append(dict_midsv)

return scored_alleles


def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
def extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
alleles_with_max_score = []
score_of_each_alleles.sort(key=lambda x: x["QNAME"])
for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
max_read = max(group, key=lambda x: x["SCORE"])
del max_read["SCORE"]
alleles_with_max_score.append(max_read)

return alleles_with_max_score


@@ -44,6 +45,9 @@ def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[d
def classify_alleles(TEMPDIR: Path, FASTA_ALLELES: dict, SAMPLE_NAME: str) -> list[dict]:
score_of_each_alleles = []
for allele in FASTA_ALLELES:
score_of_each_alleles.extend(_score_allele(TEMPDIR, allele, SAMPLE_NAME))
path_midsv = Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json")
score_of_each_alleles.extend(score_allele(path_midsv, allele))

score_of_each_alleles_merged = merge_minor_alleles(score_of_each_alleles)

return _extract_alleles_with_max_score(score_of_each_alleles)
return extract_alleles_with_max_score(score_of_each_alleles_merged)
22 changes: 16 additions & 6 deletions src/DAJIN2/core/clustering/clustering.py
@@ -22,20 +22,30 @@
###############################################################################


def optimize_labels(X: spmatrix, coverage_sample, coverage_control) -> list[int]:
def count_number_of_clusters(labels_control: list[int], coverage_control: int) -> int:
"""Reads < 1% in the control are considered clustering errors and are not counted"""
return sum(1 for reads in Counter(labels_control).values() if reads / coverage_control > 0.01)


def optimize_labels(X: spmatrix, coverage_sample: int, coverage_control: int) -> list[int]:
labels_previous = list(range(coverage_sample))
for i in range(1, coverage_sample):
np.random.seed(seed=1)
labels_all = BisectingKMeans(n_clusters=i, random_state=1).fit_predict(X).tolist()

labels_sample = labels_all[:coverage_sample]
labels_control = labels_all[coverage_sample:]
labels_current = merge_labels(labels_control, labels_sample)
labels_current = merge_labels(labels_control, labels_sample, labels_previous)
# print(i, Counter(labels_sample), Counter(labels_control), Counter(labels_current)) # ! DEBUG
# Reads < 1% in the control are considered clustering errors and are not counted
count_control = Counter(labels_control)
num_labels_control = sum(1 for reads in count_control.values() if reads / coverage_control > 0.01)

num_labels_control = count_number_of_clusters(labels_control, coverage_control)
mutual_info = metrics.adjusted_rand_score(labels_previous, labels_current)
# Report the number of clusters in SAMPLE when the number of clusters in CONTROL is split into more than one.

"""
Return the number of clusters when:
- the number of clusters in control is split into more than one.
- the mutual information between the current and previous labels is high enough (= similar).
"""
if num_labels_control >= 2:
return labels_previous
if 0.95 <= mutual_info <= 1.0: