Merge pull request #14 from akikuno/develop-v0.3.1

Develop v0.3.1
akikuno · Aug 27, 2023 · 29a2e09
2 parents a1515bb + 4a962ea
commit 29a2e09
Showing 76 changed files with 2,614 additions and 2,032 deletions.
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -37,4 +37,4 @@ jobs:
       - name: Test with pytest
         run: |
           export PYTHONPATH=./src
-          python -m pytest tests/ -p no:warnings
+          python -m pytest tests/ -p no:warnings -vv
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,3 +1,6 @@
+include requirements.txt
 include src/DAJIN2/template_igvjs.html
+
 graft src/DAJIN2/templates
 graft src/DAJIN2/static
+graft src/DAJIN2/utils
diff --git a/README.md b/README.md
@@ -1,47 +1,65 @@
 [![License](https://img.shields.io/badge/License-MIT-9cf.svg?style=flat-square)](https://choosealicense.com/licenses/mit/)
-[![PyPI](https://img.shields.io/pypi/v/DAJIN2.svg?label=PyPI&color=orange&style=flat-square)](https://pypi.org/project/DAJIN2/)
+[![Test](https://img.shields.io/github/actions/workflow/status/akikuno/dajin2/pytest.yml?branch=main&label=Test&color=brightgreen&style=flat-square)](https://github.com/akikuno/dajin2/actions)
 [![Python](https://img.shields.io/pypi/pyversions/DAJIN2.svg?label=Python&color=blue&style=flat-square)](https://pypi.org/project/DAJIN2/)
+[![PyPI](https://img.shields.io/pypi/v/DAJIN2.svg?label=PyPI&color=orange&style=flat-square)](https://pypi.org/project/DAJIN2/)
+[![Bioconda](https://img.shields.io/conda/v/bioconda/dajin2?label=Bioconda&color=orange&style=flat-square)](https://anaconda.org/bioconda/dajin2)
+
+
+<p align="center">
+<img src="https://user-images.githubusercontent.com/15861316/261833016-7f356960-88cf-4574-87e2-36162b174340.png" width="90%">
+</p>
+
+DAJIN2 is a genotyping software designed for organisms that have undergone genome editing, utilizing nanopore sequencing technology.  
+
+The name DAJIN is inspired by the term 一網**打尽** (Ichimou **DAJIN** or Yīwǎng **Dǎjìn**), which signifies capturing everything in a single net.  
+
+## Disclaimer
+
+DAJIN2 is still in the development phase.  
+Basic tests covering point mutations, deletions, and insertion designs have been conducted.  
+If you encounter any bugs or issues, please report them via [Issues](https://github.com/akikuno/DAJIN2/issues).  
+
 
-⚠️ DAJIN2 is currently under development ⚠️
 
-Expected to be available the stable version in August 2023 🤞
+## Installation
 
-## Installation (alpha-version)
+From [PyPI](https://pypi.org/project/DAJIN2/):
 
 ```bash
 pip install DAJIN2
 ```
 
-## Usage
+From [Bioconda](https://anaconda.org/bioconda/DAJIN2):
+
+```bash
+conda install -c bioconda DAJIN2
+```
 
-### Basics
 
-You can run DAJIN2 for a single sample (one sample vs one control)
+## Usage
+
+### Single Sample Analysis
 
+DAJIN2 allows for the analysis of single samples (one sample vs one control).
 
 ```bash
-DAJIN2 [-h] [-s SAMPLE] [-c CONTROL] [-a ALLELE] [-n NAME] [-g GENOME] [-t THREADS] [-v]
+DAJIN2 <-s|--sample> <-c|--control> <-a|--allele> <-n|--name> [-g|--genome] [-t|--threads] [-h|--help] [-v|--version]
 
 options:
-  -h, --help            show this help message and exit
-  -s SAMPLE, --sample SAMPLE
-                        Full path to a sample FASTQ file
-  -c CONTROL, --control CONTROL
-                        Full path to a control FASTQ file
-  -a ALLELE, --allele ALLELE
-                        Full path to a FASTA file
-  -n NAME, --name NAME  Output directory name
-  -g GENOME, --genome GENOME
-                        Reference genome ID (e.g hg38, mm10) [default: '']
-  -t THREADS, --threads THREADS
-                        Number of threads [default: 1]
-  -v, --version         show program's version number and exit
+  -s, --sample              Path to a sample FASTQ file
+  -c, --control             Path to a control FASTQ file
+  -a, --allele              Path to a FASTA file
+  -n, --name                Output directory name
+  -g, --genome (Optional)   Reference genome ID (e.g., hg38, mm39) [default: '']
+  -t, --threads (Optional)  Number of threads [default: 1]
+  -h, --help                show this help message and exit
+  -v, --version             show the version number and exit
 ```
 
 #### Example
 
 ```bash
-# Donwload example dataset
+# Download the example dataset
 wget https://github.com/akikuno/DAJIN2/raw/main/examples/example-single.tar.gz
 tar -xf example-single.tar.gz
 
@@ -68,24 +86,24 @@ DAJIN2 \
 # 🎉 Finished! Open DAJINResults/stx2-deletion to see the report.
 ```
 
-### Batch handling
+### Batch Processing
+
+DAJIN2 can also handle multiple FASTQ files using the `batch` subcommand.
 
-DAJIN2 can handle many FASTQ files using the `batch' subcommand.
 
 ```bash
-DAJIN2 batch [-h] -f FILE [-t THREADS]
+DAJIN2 batch <-f|--file> [-t|--threads] [-h]
 
 options:
-  -h, --help            Show this help message and exit
-  -f FILE, --file FILE  CSV or Excel file
-  -t THREADS, --threads THREADS
-                        Number of threads [default: 1]
+  -f, --file                Path to a CSV or Excel file
+  -t, --threads (Optional)  Number of threads [default: 1]
+  -h, --help                Show this help message and exit
 ```
 
 #### Example
 
 ```bash
-# Donwload example dataset
+# Download the example dataset
 wget https://github.com/akikuno/DAJIN2/raw/main/examples/example-batch.tar.gz
 tar -xf example-batch.tar.gz
 
@@ -122,4 +140,6 @@ DAJIN2 batch --file example-batch/batch.csv --threads 3
 
 ## References
 
+For more information, please refer to the following publication:
+
 [Kuno A, et al. (2022) DAJIN enables multiplex genotyping to simultaneously validate intended and unintended target genome editing outcomes. *PLoS Biology* 20(1): e3001507.](https://doi.org/10.1371/journal.pbio.3001507)
diff --git a/requirements.txt b/requirements.txt
@@ -1,17 +1,21 @@
 numpy >= 1.20.0
 scipy >=  1.6.0
 pandas >= 1.0.0
+openpyxl >= 3.0.0
+rapidfuzz >=3.0.0
 statsmodels >= 0.13.5
 scikit-learn >= 1.0.0
+
 mappy >= 2.24
 pysam >= 0.19.0
-openpyxl >= 3.0.0
+
 Flask >= 2.2.0
 waitress >= 2.1.0
 Jinja2 >= 3.1.0
+
 plotly >= 5.0.0
 kaleido >= 0.2.0
+
 cstag == 0.4.1
 midsv >= 0.10.1
 wslPath >=0.3.0
-rapidfuzz >=3.0.0
diff --git a/setup.py b/setup.py
@@ -9,7 +9,7 @@
 
 setuptools.setup(
     name="DAJIN2",
-    version="0.3.0",
+    version="0.3.1b4",
     author="Akihiro Kuno",
     author_email="[email protected]",
     description="One-step genotyping tools for targeted long-read sequencing",
@@ -24,7 +24,8 @@
     entry_points={"console_scripts": ["DAJIN2=DAJIN2.main:execute"]},
     include_package_data=True,
     classifiers=[
-        "Development Status :: 3 - Alpha",
+        "Development Status :: 4 - Beta",
+        "Environment :: Console",
         "Programming Language :: Python :: 3",
         "License :: OSI Approved :: MIT License",
         "Operating System :: POSIX",

diff --git a/src/DAJIN2/core/classification/__init__.py b/src/DAJIN2/core/classification/__init__.py
@@ -1 +1 @@
-from DAJIN2.core.classification.classify import classify_alleles
+from DAJIN2.core.classification.classifier import classify_alleles
diff --git a/src/DAJIN2/core/classification/classifier.py b/src/DAJIN2/core/classification/classifier.py
@@ -0,0 +1,49 @@
+from __future__ import annotations
+
+import midsv
+from pathlib import Path
+from itertools import groupby
+
+
+def _calc_match(CSSPLIT: str) -> float:
+    match_score = CSSPLIT.count("=")
+    match_score -= CSSPLIT.count("+")  # insertion
+    match_score -= sum(cs.islower() for cs in CSSPLIT)  # inversion
+    cssplit = CSSPLIT.split(",")
+
+    return match_score / len(cssplit)
+
+
+def _score_allele(TEMPDIR: Path, allele: str, SAMPLE_NAME: str) -> list[dict]:
+    midsv_sample = midsv.read_jsonl(Path(TEMPDIR, SAMPLE_NAME, "midsv", f"{allele}.json"))
+    scored_alleles = []
+
+    for dict_midsv in midsv_sample:
+        score = _calc_match(dict_midsv["CSSPLIT"])
+        dict_midsv.update({"SCORE": score, "ALLELE": allele})
+        scored_alleles.append(dict_midsv)
+
+    return scored_alleles
+
+
+def _extract_alleles_with_max_score(score_of_each_alleles: list[dict]) -> list[dict]:
+    alleles_with_max_score = []
+    score_of_each_alleles.sort(key=lambda x: x["QNAME"])
+    for _, group in groupby(score_of_each_alleles, key=lambda x: x["QNAME"]):
+        max_read = max(group, key=lambda x: x["SCORE"])
+        del max_read["SCORE"]
+        alleles_with_max_score.append(max_read)
+    return alleles_with_max_score
+
+
+##########################################################
+# main
+##########################################################
+
+
+def classify_alleles(TEMPDIR: Path, FASTA_ALLELES: dict, SAMPLE_NAME: str) -> list[dict]:
+    score_of_each_alleles = []
+    for allele in FASTA_ALLELES:
+        score_of_each_alleles.extend(_score_allele(TEMPDIR, allele, SAMPLE_NAME))
+
+    return _extract_alleles_with_max_score(score_of_each_alleles)
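
Review note: the scoring in `classifier.py` above can be traced by hand on a few records. The following is a minimal sketch, not the project's own test: the `CSSPLIT` strings and read names are made up for illustration, and `calc_match` simply repeats the arithmetic of `_calc_match` so the block runs without `midsv` or any files on disk.

```python
from itertools import groupby


def calc_match(cssplit: str) -> float:
    """Repeat the arithmetic of _calc_match from the diff above."""
    score = cssplit.count("=")                    # count matched bases
    score -= cssplit.count("+")                   # penalize insertions
    score -= sum(ch.islower() for ch in cssplit)  # penalize lowercase (inversion) bases
    return score / len(cssplit.split(","))        # normalize by number of positions


# Hypothetical records: two reads, each aligned against two candidate alleles.
records = [
    {"QNAME": "read1", "ALLELE": "control", "CSSPLIT": "=A,=C,=G,=T"},
    {"QNAME": "read1", "ALLELE": "deletion", "CSSPLIT": "=A,-C,-G,=T"},
    {"QNAME": "read2", "ALLELE": "control", "CSSPLIT": "=A,+T|=C,=G,=T"},
    {"QNAME": "read2", "ALLELE": "deletion", "CSSPLIT": "=A,=C,=G,=T"},
]

for record in records:
    record["SCORE"] = calc_match(record["CSSPLIT"])

# Same pattern as _extract_alleles_with_max_score: sort, group by read, keep the best.
records.sort(key=lambda r: r["QNAME"])
best = {
    qname: max(group, key=lambda r: r["SCORE"])["ALLELE"]
    for qname, group in groupby(records, key=lambda r: r["QNAME"])
}
print(best)  # {'read1': 'control', 'read2': 'deletion'}
```

Because `list.sort` is stable and `max` returns the first maximal item, a read that scores equally against two alleles is assigned to whichever allele was scored first, which in `classify_alleles` is the iteration order of `FASTA_ALLELES`.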
diff --git a/src/DAJIN2/core/classification/classify.py b/src/DAJIN2/core/classification/classify.py
diff --git a/src/DAJIN2/core/clustering/clustering.py b/src/DAJIN2/core/clustering/clustering.py
@@ -1,6 +1,5 @@
 from __future__ import annotations
 
-import json
 import pickle
 import midsv
 import random
@@ -12,6 +11,7 @@
 from DAJIN2.core.clustering.make_kmer import generate_mutation_kmers
 from DAJIN2.core.clustering.make_score import make_score
 from DAJIN2.core.clustering.return_labels import return_labels
+from DAJIN2.utils import io
 
 
 def annotate_score(path_sample, mutation_score, mutation_loci, is_control=False) -> Generator[list[float]]:
@@ -41,12 +41,6 @@ def reorder_labels(labels: list[int], start: int = 0) -> list[int]:
     return labels_ordered
 
 
-def write_json(filepath: Path | str, data: Generator) -> None:
-    with open(filepath, "w") as f:
-        for line in data:
-            f.write(json.dumps(line) + "\n")
-
-
 ###########################################################
 # main
 ###########################################################
@@ -63,7 +57,7 @@ def is_strand_bias(path_control) -> bool:
         return True
 
 
-def add_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME, THREADS: int = 1) -> list[dict[str]]:
+def add_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME) -> list[dict[str]]:
     labels_all = []
     max_label = 0
     strand_bias = is_strand_bias(Path(TEMPDIR, CONTROL_NAME, "midsv", "control.json"))
@@ -86,14 +80,14 @@ def add_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME, THREADS: int
             continue
         path_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_{RANDOM_NUM}.json")
         path_control = Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}.json")
-        write_json(path_sample, group)
+        io.write_jsonl(data=group, path=path_sample)
         mutation_score: list[dict[str, float]] = make_score(path_sample, path_control, mutation_loci, knockin_loci)
         scores_sample = annotate_score(path_sample, mutation_score, mutation_loci)
         scores_control = annotate_score(path_control, mutation_score, mutation_loci, is_control=True)
         path_score_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_score_{RANDOM_NUM}.json")
         path_score_control = Path(TEMPDIR, CONTROL_NAME, "clustering", f"{allele}_score_{RANDOM_NUM}.json")
-        write_json(path_score_sample, scores_sample)
-        write_json(path_score_control, scores_control)
+        io.write_jsonl(data=scores_sample, path=path_score_sample)
+        io.write_jsonl(data=scores_control, path=path_score_control)
         labels = return_labels(path_score_sample, path_score_control, path_sample, strand_bias)
         labels_reorder = reorder_labels(labels, start=max_label)
         max_label = max(labels_reorder)

diff --git a/src/DAJIN2/core/consensus/__init__.py b/src/DAJIN2/core/consensus/__init__.py
@@ -1,6 +1,5 @@
 from DAJIN2.core.consensus.consensus import call_consensus
-from DAJIN2.core.consensus.consensus import call_allele_name
-from DAJIN2.core.consensus.consensus import update_key_by_allele_name
-from DAJIN2.core.consensus.consensus import add_key_by_allele_name
-from DAJIN2.core.consensus.subset import subset_clust
-from DAJIN2.core.consensus.extract_mutation_loci_by_labels import extract_mutation_loci_by_labels
+from DAJIN2.core.consensus.name_handler import call_allele_name
+from DAJIN2.core.consensus.name_handler import update_key_by_allele_name
+from DAJIN2.core.consensus.name_handler import add_key_by_allele_name
+from DAJIN2.core.consensus.clust_subsetter import subset_clust
diff --git a/src/DAJIN2/core/consensus/subset.py b/src/DAJIN2/core/consensus/clust_subsetter.py