Merge pull request #21 from akikuno/develop-0.4.1
Develop 0.4.1
akikuno authored Feb 13, 2024
2 parents 369a2d4 + 608e5be commit 777bb74
Showing 34 changed files with 904 additions and 333 deletions.
5 changes: 2 additions & 3 deletions .github/workflows/pytest.yml
@@ -16,9 +16,8 @@ jobs:
name: Python ${{ matrix.python-version }} on ${{ matrix.os }}

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

17 changes: 8 additions & 9 deletions README.md
@@ -19,6 +19,7 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
## 🌟 Features

+ **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
+ DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
+ **Multi-Sample Compatibility**: Accommodates a variety of samples, enabling simultaneous processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.

@@ -253,10 +254,9 @@ DAJIN_Results/tyr-substitution
│ ├── tyr_c230gt_01%.csv
│ ├── tyr_c230gt_10%.csv
│ └── tyr_c230gt_50%.csv
├── read_all.csv
├── read_plot.html
├── read_plot.pdf
└── read_summary.csv
└── read_summary.xlsx
```

### 1. BAM
@@ -285,23 +285,22 @@ An example of a Tyr point mutation is described by its position on the chromosom
### 4. read_plot.html and read_plot.pdf

Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
The chart's **Allele type** indicates the type of allele, and **% of reads** shows the proportion of reads for that allele.
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.

Additionally, the **Allele type** categories include:
- **intact**: Alleles that perfectly match the input FASTA allele.
- **indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **sv**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
- **Intact**: Alleles that perfectly match the input FASTA allele.
- **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
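
A minimal, hypothetical sketch that restates these thresholds in code (this helper is not part of DAJIN2; it is only meant to make the 50-base cutoff concrete):

```python
def classify_allele_type(is_intact: bool, largest_mutation_bp: int) -> str:
    """Restate the Allele type categories: Intact, Indels (<= 50 bases), SV (> 50 bases)."""
    if is_intact:
        return "Intact"
    return "Indels" if largest_mutation_bp <= 50 else "SV"
```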

<img src="https://user-images.githubusercontent.com/15861316/274521067-4d217251-4c62-4dc9-9c05-7f5377dd3025.png" width="75%">

> [!WARNING]
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
### 5. read_all.csv and read_summary.csv
### 5. read_summary.xlsx

- read_all.csv: Records which allele each read is classified under.
- read_summary.csv: Describes the number of reads and presence proportion for each allele.
- read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
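
For example, the summary can be loaded programmatically (a minimal sketch; the path follows the example directory tree above, and pandas plus openpyxl are assumed to be installed, as listed in requirements.txt):

```python
import pandas as pd

# Path taken from the example output tree above; adjust it to your own run.
summary = pd.read_excel("DAJIN_Results/tyr-substitution/read_summary.xlsx")
print(summary.head())  # read counts and proportions per allele
```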

## 📣Feedback and Support

15 changes: 7 additions & 8 deletions docs/README_JP.md
@@ -15,6 +15,7 @@ DAJIN2 uses nanopore targeted sequencing
## 🌟 Features

+ **Comprehensive Mutation Detection**: Capable of detecting genome editing events over a wide range, identifying a broad spectrum of mutations, from small changes to large structural variations.
+ It can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing mutations to be identified and analyzed quickly and easily.
+ **Multi-Sample Compatibility**: Supports a variety of samples and can process multiple samples simultaneously, enabling efficient progress on large-scale experiments and comparative studies.

@@ -256,10 +257,9 @@ DAJIN_Results/tyr-substitution
│ ├── tyr_c230gt_01%.csv
│ ├── tyr_c230gt_10%.csv
│ └── tyr_c230gt_50%.csv
├── read_all.csv
├── read_plot.html
├── read_plot.pdf
└── read_summary.csv
└── read_summary.xlsx
```

### 1. BAM
@@ -293,13 +293,13 @@ An example of a Tyr point mutation is shown below:
### 4. read_plot.html / read_plot.pdf

Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
In the figures, **Allele type** indicates the type of allele, and **% of reads** shows the proportion of reads for that allele.
In the figures, **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.

Additionally, the **Allele type** categories are as follows:

- **intact**: Alleles that perfectly match the input FASTA allele.
- **indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **sv**: Substitutions, deletions, insertions, or inversions of 50 bases or more.
- **Intact**: Alleles that perfectly match the input FASTA allele.
- **Indels**: Alleles containing substitutions, deletions, insertions, or inversions within 50 bases.
- **SV**: Alleles containing substitutions, deletions, insertions, or inversions of 50 bases or more.


<img src="https://user-images.githubusercontent.com/15861316/274521067-4d217251-4c62-4dc9-9c05-7f5377dd3025.png" width="75%">
@@ -308,9 +308,8 @@ Both read_plot.html and read_plot.pdf illustrate the proportions of each allele
> In targeted sequencing with PCR amplicons, **% of reads** may not match the actual allele proportions due to amplification bias.
> In particular, when large deletions are present, the deletion alleles may be markedly amplified, making it more likely that the values do not reflect the actual allele proportions.
### 5. read_all.csv / read_summary.csv
### 5. read_summary.xlsx

- read_all.csv: Records which allele each read was classified into.
- read_summary.csv: Describes the read count and presence proportion of each allele.


40 changes: 39 additions & 1 deletion docs/ROADMAP.md → docs/RELEASE.md
@@ -8,7 +8,7 @@
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
+ [ ] XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
- XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
-->

<!-- 💡 ToDo
@@ -17,6 +17,44 @@
- Would like nCATS…
-->

# v0.4.1 (2024-02-13)

## 📝 Documentation

- Added documentation for a new feature in `README.md`: DAJIN2 can now detect complex mutations characteristic of genome editing, such as insertions occurring in regions where deletions have occurred.

## 🚀 New Features

- Introduced `cssplits_handler.detect_insertion_within_deletion` to extract insertion sequences that lie within deletions. This addresses cases where minimap2's local alignment aligns bases that partially match the reference, so that they fail to be detected as insertions; the new step ensures such insertion sequences are detected properly (a simplified sketch of the idea follows after this list). [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7651e20852b94ed4d5bb38539bb56229dcc8b763)

- Added `report.insertion_refractor.py` to include the original insertion information in the consensus for alignments made against the insertion allele. This makes it possible to list both insertions and deletions within the insertion allele in a single HTML file. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/e6c3b636bb2ba537d1341d1042341afd6583dd0b)
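
A simplified sketch of the idea behind `detect_insertion_within_deletion` (not the actual implementation; it assumes a midsv-like per-base tag list where `=A` marks a match and `-A` a deletion, and it merely reports match islands flanked by deletions as candidate inserted sequences):

```python
from __future__ import annotations


def find_insertions_within_deletions(cssplits: list[str], min_flank: int = 3) -> list[tuple[int, str]]:
    """Report (index, sequence) of match runs that are flanked by deletions on both sides."""
    candidates = []
    i, n = 0, len(cssplits)
    while i < n:
        if not cssplits[i].startswith("="):
            i += 1
            continue
        j = i
        while j < n and cssplits[j].startswith("="):
            j += 1
        left = cssplits[max(0, i - min_flank): i]
        right = cssplits[j: j + min_flank]
        if (len(left) == min_flank and all(tag.startswith("-") for tag in left)
                and len(right) == min_flank and all(tag.startswith("-") for tag in right)):
            candidates.append((i, "".join(tag[1:] for tag in cssplits[i:j])))
        i = j
    return candidates


# A two-base "match" island inside a deletion is reported as a candidate insertion:
# find_insertions_within_deletions(["-A", "-C", "-G", "=T", "=T", "-A", "-C", "-G"])
# -> [(3, "TT")]
```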

## 🔧 Maintenance

- Updated `insertions_to_fasta.py`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7927feb0bb4f3091537aaebabd60a441456a3413)
- Modified the approach to reduce randomness by replacing set or frozenset with list or tuple and by using `random.sample()` to subset reads (see the sketch after this list).
- Refactored `call_consensus_insertion_sequence`.
- Fixed a bug in `extract_score_and_sequence` to ensure correct appending of scores for the insertions_merged_subset.

- Renamed the `report` functions to be more explicit (e.g., `output_bam` → `export_to_bam`). [Commit Detail](https://github.com/akikuno/DAJIN2/commit/93132c5beba17278c7d67b76817bb13dfaae57a3)

- Updated `utils.report_report_generator`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/821f06f05b5ed2f4ba2d7baad6159d774d2e5db0)
- Capitalized "Allele" (e.g., control) and "Allele type" (e.g., intact).
- Changed the output format of read_all and read_summary from CSV to XLSX.
- Corrected the order of the Legend to follow a logical sequence from control to sample, and then to specific insertions.

- Updated `utils.io.read_xlsx` to use openpyxl instead of pandas, as the DeprecationWarning raised by pandas was cumbersome. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/5d942bace8417bb973441b360a0ec31d77d81e24)
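
A minimal sketch of reading an .xlsx file with openpyxl instead of pandas (illustrative only, not the actual `utils.io.read_xlsx` implementation):

```python
from __future__ import annotations

from pathlib import Path

from openpyxl import load_workbook


def read_xlsx(path: str | Path) -> list[dict]:
    """Read the first worksheet into a list of dicts keyed by the header row."""
    worksheet = load_workbook(path, read_only=True).active
    rows = worksheet.iter_rows(values_only=True)
    header = next(rows)
    return [dict(zip(header, values)) for values in rows]
```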
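And a minimal sketch of the read-subsetting approach mentioned for `insertions_to_fasta.py` above (illustrative only; the fixed seed is an assumption for the sake of the example). Keeping reads in a list avoids the hash-order nondeterminism of sets, and `random.sample()` no longer accepts sets in recent Python versions anyway.

```python
from __future__ import annotations

import random


def subset_reads(reads: list[dict], size: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible subset of reads; return all reads if there are fewer than `size`."""
    if len(reads) <= size:
        return list(reads)
    random.seed(seed)
    return random.sample(reads, size)
```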

## 🐛 Bug Fixes

- Added an `=` prefix so that the cs tag is recognized as valid by cstag when an inversion contains an `n`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/747ff3ece221a8c1e4f1ba1b696c4751618b4992)

- Modified `io.load_from_csv` to trim leading and trailing spaces from each field, fixing an error caused by spaces in batch.csv. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/f5d49230f8ebd37061a27d6767d3c1954b8f8576)
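
A minimal sketch of the trimming behaviour (not the actual `io.load_from_csv` implementation; it makes no assumptions about the columns of batch.csv):

```python
from __future__ import annotations

import csv
from pathlib import Path


def load_from_csv(path: str | Path) -> list[dict[str, str]]:
    """Load a CSV and strip leading/trailing spaces from every header and field."""
    with open(path, newline="") as handle:
        return [
            {key.strip(): (value or "").strip() for key, value in record.items()}
            for record in csv.DictReader(handle)
        ]
```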

## ⛔️ Deprecated

- Removed `read_all.csv`. This CSV file, which showed the allele assigned to each read, is no longer reported because it was of limited use and the same information can be obtained from the BAM files. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/76e3eaee320deb79cbf3cf97cc6aed69c5bbc3ef)
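
Per-read allele assignments can still be recovered from the per-allele BAM files; a minimal sketch with pysam (the path below is a placeholder, since the exact BAM layout is not shown here):

```python
import pysam

# Placeholder path: point this at one of the per-allele BAM files that DAJIN2 outputs.
path_bam = "DAJIN_Results/<name>/BAM/<sample>/<allele>.bam"

with pysam.AlignmentFile(path_bam, "rb") as bam:
    read_names = {record.query_name for record in bam}

print(f"{len(read_names)} reads are assigned to this allele")
```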

-------------

# Past Logs
4 changes: 3 additions & 1 deletion requirements.txt
@@ -1,11 +1,13 @@
numpy >= 1.20.0
scipy >= 1.6.0
scipy >= 1.6.0
pandas >= 1.0.0
openpyxl >= 3.0.0
rapidfuzz >=3.0.0
statsmodels >= 0.13.5
scikit-learn >= 1.0.0

openpyxl >= 3.0.0

mappy >= 2.24
pysam >= 0.19.0

2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@

setuptools.setup(
name="DAJIN2",
version="0.4.0",
version="0.4.1",
author="Akihiro Kuno",
author_email="[email protected]",
description="One-step genotyping tools for targeted long-read sequencing",
12 changes: 7 additions & 5 deletions src/DAJIN2/core/clustering/label_extractor.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import random
import uuid

from pathlib import Path
from itertools import groupby

@@ -18,12 +19,13 @@ def extract_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME) -> list[d
classif_sample.sort(key=lambda x: x["ALLELE"])
for allele, group in groupby(classif_sample, key=lambda x: x["ALLELE"]):
"""Cache data to temporary files"""
RANDOM_INT = random.randint(0, 10**10)
if Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}.json").exists():
path_control = Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}.json")
else:
path_control = Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}_{SAMPLE_NAME}.json")
path_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_{RANDOM_INT}.json")

unique_id = str(uuid.uuid4())
path_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_{unique_id}.jsonl")
io.write_jsonl(data=group, file_path=path_sample)

"""Load mutation_loci and knockin_loci."""
@@ -46,8 +48,8 @@ def extract_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME) -> list[d
scores_sample = annotate_score(path_sample, mutation_score, mutation_loci)
scores_control = annotate_score(path_control, mutation_score, mutation_loci, is_control=True)

path_score_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_score_{RANDOM_INT}.json")
path_score_control = Path(TEMPDIR, CONTROL_NAME, "clustering", f"{allele}_score_{RANDOM_INT}.json")
path_score_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_score_sample_{unique_id}.jsonl")
path_score_control = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_score_control_{unique_id}.jsonl")
io.write_jsonl(data=scores_sample, file_path=path_score_sample)
io.write_jsonl(data=scores_control, file_path=path_score_control)

2 changes: 1 addition & 1 deletion src/DAJIN2/core/consensus/consensus.py
@@ -115,4 +115,4 @@ def call_consensus(tempdir: Path, sample_name: str, clust_sample: list[dict]) ->
key = ConsensusKey(allele, label, clust[0]["PERCENT"])
cons_percentages[key] = cons_percentage
cons_sequences[key] = call_sequence(cons_percentage)
return dict(cons_percentages), dict(cons_sequences)
return cons_percentages, cons_sequences
14 changes: 8 additions & 6 deletions src/DAJIN2/core/core.py
@@ -173,7 +173,7 @@ def execute_control(arguments: dict):
# Output BAM files
###########################################################
logger.info(f"Output BAM files of {arguments['control']}...")
report.report_bam.output_bam(
report.report_bam.export_to_bam(
ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, is_control=True
)
###########################################################
@@ -307,14 +307,16 @@ def execute_sample(arguments: dict):
# RESULT
io.write_jsonl(RESULT_SAMPLE, Path(ARGS.tempdir, "result", f"{ARGS.sample_name}.jsonl"))
# FASTA
report.report_files.to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
report.report_files.to_fasta_reference(ARGS.tempdir, ARGS.sample_name)
report.report_files.export_to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
report.report_files.export_reference_to_fasta(ARGS.tempdir, ARGS.sample_name)
# HTML
report.report_files.to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
report.report_files.export_to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
# CSV (Allele Info)
report.report_mutation.to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
report.report_mutation.export_to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
# BAM
report.report_bam.output_bam(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE)
report.report_bam.export_to_bam(
ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE
)
for path_bam_igvjs in Path(ARGS.tempdir, "cache", ".igvjs").glob(f"{ARGS.control_name}_control.bam*"):
shutil.copy(path_bam_igvjs, Path(ARGS.tempdir, "report", ".igvjs", ARGS.sample_name))
# VCF
2 changes: 1 addition & 1 deletion src/DAJIN2/core/preprocess/directories.py
@@ -7,7 +7,7 @@ def create_temporal_directories(TEMPDIR: Path, NAME: str, is_control=False) -> N
Path(TEMPDIR, "result").mkdir(parents=True, exist_ok=True)
SUBDIRS = ["fasta", "fastq", "sam", "midsv", "mutation_loci", "clustering", "consensus"]
if is_control is False:
SUBDIRS.extend(["knockin_loci", "classification"])
SUBDIRS.extend(["cstag", "knockin_loci", "classification"])
for subdir in SUBDIRS:
Path(TEMPDIR, NAME, subdir).mkdir(parents=True, exist_ok=True)

