Merge pull request #21 from akikuno/develop-0.4.1
Develop 0.4.1
akikuno authored Feb 13, 2024
2 parents 369a2d4 + 608e5be commit 777bb74
Showing 34 changed files with 904 additions and 333 deletions.
5 changes: 2 additions & 3 deletions .github/workflows/pytest.yml
@@ -16,9 +16,8 @@ jobs:
name: Python ${{ matrix.python-version }} on ${{ matrix.os }}

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

17 changes: 8 additions & 9 deletions README.md
@@ -19,6 +19,7 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
## 🌟 Features

+ **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
+ DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
+ **Multi-Sample Compatibility**: Accommodates a variety of samples, enabling simultaneous processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.

@@ -253,10 +254,9 @@ DAJIN_Results/tyr-substitution
│ ├── tyr_c230gt_01%.csv
│ ├── tyr_c230gt_10%.csv
│ └── tyr_c230gt_50%.csv
├── read_all.csv
├── read_plot.html
├── read_plot.pdf
└── read_summary.csv
└── read_summary.xlsx
```

### 1. BAM
@@ -285,23 +285,22 @@ An example of a Tyr point mutation is described by its position on the chromosom
### 4. read_plot.html and read_plot.pdf

Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
The chart's **Allele type** indicates the type of allele, and **% of reads** shows the proportion of reads for that allele.
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.

Additionally, the **Allele type** categories include:
- **intact**: Alleles that perfectly match the input FASTA allele.
- **indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **sv**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
- **Intact**: Alleles that perfectly match the input FASTA allele.
- **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
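
A minimal, hypothetical sketch that restates these thresholds in code (this helper is not part of DAJIN2; it is only meant to make the 50-base cutoff concrete):

```python
def classify_allele_type(is_intact: bool, largest_mutation_bp: int) -> str:
    """Restate the Allele type categories: Intact, Indels (<= 50 bases), SV (> 50 bases)."""
    if is_intact:
        return "Intact"
    return "Indels" if largest_mutation_bp <= 50 else "SV"
```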

<img src="https://user-images.githubusercontent.com/15861316/274521067-4d217251-4c62-4dc9-9c05-7f5377dd3025.png" width="75%">

> [!WARNING]
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
### 5. read_all.csv and read_summary.csv
### 5. read_summary.xlsx

- read_all.csv: Records which allele each read is classified under.
- read_summary.csv: Describes the number of reads and presence proportion for each allele.
- read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
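
For example, the summary can be loaded programmatically (a minimal sketch; the path follows the example directory tree above, and pandas plus openpyxl are assumed to be installed, as listed in requirements.txt):

```python
import pandas as pd

# Path taken from the example output tree above; adjust it to your own run.
summary = pd.read_excel("DAJIN_Results/tyr-substitution/read_summary.xlsx")
print(summary.head())  # read counts and proportions per allele
```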

## 📣Feedback and Support

15 changes: 7 additions & 8 deletions docs/README_JP.md
@@ -15,6 +15,7 @@ DAJIN2 uses nanopore targeted sequencing
## 🌟 Features

+ **Comprehensive Mutation Detection**: Capable of detecting genome editing events over a wide range, identifying a broad spectrum of mutations, from small changes to large structural variations.
+ It can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing mutations to be identified and analyzed quickly and easily.
+ **Multi-Sample Compatibility**: Supports a variety of samples and can process multiple samples simultaneously, enabling efficient progress on large-scale experiments and comparative studies.

@@ -256,10 +257,9 @@ DAJIN_Results/tyr-substitution
│ ├── tyr_c230gt_01%.csv
│ ├── tyr_c230gt_10%.csv
│ └── tyr_c230gt_50%.csv
├── read_all.csv
├── read_plot.html
├── read_plot.pdf
└── read_summary.csv
└── read_summary.xlsx
```

### 1. BAM
@@ -293,13 +293,13 @@ An example of a Tyr point mutation is shown below:
### 4. read_plot.html / read_plot.pdf

Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
In the figures, **Allele type** indicates the type of allele, and **% of reads** shows the proportion of reads for that allele.
In the figures, **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.

Additionally, the **Allele type** categories are as follows:

- **intact**: Alleles that perfectly match the input FASTA allele.
- **indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
- **sv**: Substitutions, deletions, insertions, or inversions of 50 bases or more.
- **Intact**: Alleles that perfectly match the input FASTA allele.
- **Indels**: Alleles containing substitutions, deletions, insertions, or inversions within 50 bases.
- **SV**: Alleles containing substitutions, deletions, insertions, or inversions of 50 bases or more.


<img src="https://user-images.githubusercontent.com/15861316/274521067-4d217251-4c62-4dc9-9c05-7f5377dd3025.png" width="75%">
@@ -308,9 +308,8 @@ Both read_plot.html and read_plot.pdf illustrate the proportions of each allele
> In targeted sequencing with PCR amplicons, **% of reads** may not match the actual allele proportions due to amplification bias.
> In particular, when large deletions are present, the deletion alleles may be markedly amplified, making it more likely that the values do not reflect the actual allele proportions.
### 5. read_all.csv / read_summary.csv
### 5. read_summary.xlsx

- read_all.csv: Records which allele each read was classified into.
- read_summary.csv: Describes the read count and presence proportion of each allele.


40 changes: 39 additions & 1 deletion docs/ROADMAP.md → docs/RELEASE.md
@@ -8,7 +8,7 @@
## 🐛 Bug Fixes
## 🔧 Maintenance
## ⛔️ Deprecated
+ [ ] XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
- XXX [Commit Detail](https://github.com/akikuno/DAJIN2/commit/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
-->

<!-- 💡 ToDo
@@ -17,6 +17,44 @@
- Would like nCATS…
-->

# v0.4.1 (2024-02-13)

## 📝 Documentation

- Added documentation for a new feature in `README.md`: DAJIN2 can now detect complex mutations characteristic of genome editing, such as insertions occurring in regions where deletions have occurred.

## 🚀 New Features

- Introduced `cssplits_handler.detect_insertion_within_deletion` to extract insertion sequences that lie within deletions. This addresses cases where minimap2's local alignment aligns bases that partially match the reference, so that they fail to be detected as insertions; the new step ensures such insertion sequences are detected properly (a simplified sketch of the idea follows after this list). [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7651e20852b94ed4d5bb38539bb56229dcc8b763)

- Added `report.insertion_refractor.py` to include the original insertion information in the consensus for alignments made against the insertion allele. This makes it possible to list both insertions and deletions within the insertion allele in a single HTML file. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/e6c3b636bb2ba537d1341d1042341afd6583dd0b)
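
A simplified sketch of the idea behind `detect_insertion_within_deletion` (not the actual implementation; it assumes a midsv-like per-base tag list where `=A` marks a match and `-A` a deletion, and it merely reports match islands flanked by deletions as candidate inserted sequences):

```python
from __future__ import annotations


def find_insertions_within_deletions(cssplits: list[str], min_flank: int = 3) -> list[tuple[int, str]]:
    """Report (index, sequence) of match runs that are flanked by deletions on both sides."""
    candidates = []
    i, n = 0, len(cssplits)
    while i < n:
        if not cssplits[i].startswith("="):
            i += 1
            continue
        j = i
        while j < n and cssplits[j].startswith("="):
            j += 1
        left = cssplits[max(0, i - min_flank): i]
        right = cssplits[j: j + min_flank]
        if (len(left) == min_flank and all(tag.startswith("-") for tag in left)
                and len(right) == min_flank and all(tag.startswith("-") for tag in right)):
            candidates.append((i, "".join(tag[1:] for tag in cssplits[i:j])))
        i = j
    return candidates


# A two-base "match" island inside a deletion is reported as a candidate insertion:
# find_insertions_within_deletions(["-A", "-C", "-G", "=T", "=T", "-A", "-C", "-G"])
# -> [(3, "TT")]
```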

## 🔧 Maintenance

- Updated `insertions_to_fasta.py`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/7927feb0bb4f3091537aaebabd60a441456a3413)
- Modified the approach to reduce randomness by replacing set or frozenset with list or tuple and by using `random.sample()` to subset reads (see the sketch after this list).
- Refactored `call_consensus_insertion_sequence`.
- Fixed a bug in `extract_score_and_sequence` to ensure correct appending of scores for the insertions_merged_subset.

- Renamed the `report` functions to be more explicit (e.g., `output_bam` → `export_to_bam`). [Commit Detail](https://github.com/akikuno/DAJIN2/commit/93132c5beba17278c7d67b76817bb13dfaae57a3)

- Updated `utils.report_report_generator`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/821f06f05b5ed2f4ba2d7baad6159d774d2e5db0)
- Capitalized "Allele" (e.g., control) and "Allele type" (e.g., intact).
- Changed the output format of read_all and read_summary from CSV to XLSX.
- Corrected the order of the Legend to follow a logical sequence from control to sample, and then to specific insertions.

- Updated `utils.io.read_xlsx` to use openpyxl instead of pandas, as the DeprecationWarning raised by pandas was cumbersome. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/5d942bace8417bb973441b360a0ec31d77d81e24)
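
A minimal sketch of reading an .xlsx file with openpyxl instead of pandas (illustrative only, not the actual `utils.io.read_xlsx` implementation):

```python
from __future__ import annotations

from pathlib import Path

from openpyxl import load_workbook


def read_xlsx(path: str | Path) -> list[dict]:
    """Read the first worksheet into a list of dicts keyed by the header row."""
    worksheet = load_workbook(path, read_only=True).active
    rows = worksheet.iter_rows(values_only=True)
    header = next(rows)
    return [dict(zip(header, values)) for values in rows]
```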
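And a minimal sketch of the read-subsetting approach mentioned for `insertions_to_fasta.py` above (illustrative only; the fixed seed is an assumption for the sake of the example). Keeping reads in a list avoids the hash-order nondeterminism of sets, and `random.sample()` no longer accepts sets in recent Python versions anyway.

```python
from __future__ import annotations

import random


def subset_reads(reads: list[dict], size: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible subset of reads; return all reads if there are fewer than `size`."""
    if len(reads) <= size:
        return list(reads)
    random.seed(seed)
    return random.sample(reads, size)
```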

## 🐛 Bug Fixes

- Added an `=` prefix so that the cs tag is recognized as valid by cstag when an inversion contains an `n`. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/747ff3ece221a8c1e4f1ba1b696c4751618b4992)

- Modified `io.load_from_csv` to trim leading and trailing spaces from each field, fixing an error caused by spaces in batch.csv. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/f5d49230f8ebd37061a27d6767d3c1954b8f8576)
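
A minimal sketch of the trimming behaviour (not the actual `io.load_from_csv` implementation; it makes no assumptions about the columns of batch.csv):

```python
from __future__ import annotations

import csv
from pathlib import Path


def load_from_csv(path: str | Path) -> list[dict[str, str]]:
    """Load a CSV and strip leading/trailing spaces from every header and field."""
    with open(path, newline="") as handle:
        return [
            {key.strip(): (value or "").strip() for key, value in record.items()}
            for record in csv.DictReader(handle)
        ]
```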

## ⛔️ Deprecated

- Removed `read_all.csv`. This CSV file, which showed the allele assigned to each read, is no longer reported because it was of limited use and the same information can be obtained from the BAM files. [Commit Detail](https://github.com/akikuno/DAJIN2/commit/76e3eaee320deb79cbf3cf97cc6aed69c5bbc3ef)
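
Per-read allele assignments can still be recovered from the per-allele BAM files; a minimal sketch with pysam (the path below is a placeholder, since the exact BAM layout is not shown here):

```python
import pysam

# Placeholder path: point this at one of the per-allele BAM files that DAJIN2 outputs.
path_bam = "DAJIN_Results/<name>/BAM/<sample>/<allele>.bam"

with pysam.AlignmentFile(path_bam, "rb") as bam:
    read_names = {record.query_name for record in bam}

print(f"{len(read_names)} reads are assigned to this allele")
```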

-------------

# Past Logs
4 changes: 3 additions & 1 deletion requirements.txt
@@ -1,11 +1,13 @@
numpy >= 1.20.0
scipy >= 1.6.0
scipy >= 1.6.0
pandas >= 1.0.0
openpyxl >= 3.0.0
rapidfuzz >=3.0.0
statsmodels >= 0.13.5
scikit-learn >= 1.0.0

openpyxl >= 3.0.0

mappy >= 2.24
pysam >= 0.19.0

2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@

setuptools.setup(
name="DAJIN2",
version="0.4.0",
version="0.4.1",
author="Akihiro Kuno",
author_email="[email protected]",
description="One-step genotyping tools for targeted long-read sequencing",
12 changes: 7 additions & 5 deletions src/DAJIN2/core/clustering/label_extractor.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import random
import uuid

from pathlib import Path
from itertools import groupby

@@ -18,12 +19,13 @@ def extract_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME) -> list[d
classif_sample.sort(key=lambda x: x["ALLELE"])
for allele, group in groupby(classif_sample, key=lambda x: x["ALLELE"]):
"""Cache data to temporary files"""
RANDOM_INT = random.randint(0, 10**10)
if Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}.json").exists():
path_control = Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}.json")
else:
path_control = Path(TEMPDIR, CONTROL_NAME, "midsv", f"{allele}_{SAMPLE_NAME}.json")
path_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_{RANDOM_INT}.json")

unique_id = str(uuid.uuid4())
path_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_{unique_id}.jsonl")
io.write_jsonl(data=group, file_path=path_sample)

"""Load mutation_loci and knockin_loci."""
@@ -46,8 +48,8 @@ def extract_labels(classif_sample, TEMPDIR, SAMPLE_NAME, CONTROL_NAME) -> list[d
scores_sample = annotate_score(path_sample, mutation_score, mutation_loci)
scores_control = annotate_score(path_control, mutation_score, mutation_loci, is_control=True)

path_score_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"{allele}_score_{RANDOM_INT}.json")
path_score_control = Path(TEMPDIR, CONTROL_NAME, "clustering", f"{allele}_score_{RANDOM_INT}.json")
path_score_sample = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_score_sample_{unique_id}.jsonl")
path_score_control = Path(TEMPDIR, SAMPLE_NAME, "clustering", f"tmp_{allele}_score_control_{unique_id}.jsonl")
io.write_jsonl(data=scores_sample, file_path=path_score_sample)
io.write_jsonl(data=scores_control, file_path=path_score_control)

2 changes: 1 addition & 1 deletion src/DAJIN2/core/consensus/consensus.py
@@ -115,4 +115,4 @@ def call_consensus(tempdir: Path, sample_name: str, clust_sample: list[dict]) ->
key = ConsensusKey(allele, label, clust[0]["PERCENT"])
cons_percentages[key] = cons_percentage
cons_sequences[key] = call_sequence(cons_percentage)
return dict(cons_percentages), dict(cons_sequences)
return cons_percentages, cons_sequences
14 changes: 8 additions & 6 deletions src/DAJIN2/core/core.py
@@ -173,7 +173,7 @@ def execute_control(arguments: dict):
# Output BAM files
###########################################################
logger.info(f"Output BAM files of {arguments['control']}...")
report.report_bam.output_bam(
report.report_bam.export_to_bam(
ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, is_control=True
)
###########################################################
@@ -307,14 +307,16 @@ def execute_sample(arguments: dict):
# RESULT
io.write_jsonl(RESULT_SAMPLE, Path(ARGS.tempdir, "result", f"{ARGS.sample_name}.jsonl"))
# FASTA
report.report_files.to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
report.report_files.to_fasta_reference(ARGS.tempdir, ARGS.sample_name)
report.report_files.export_to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
report.report_files.export_reference_to_fasta(ARGS.tempdir, ARGS.sample_name)
# HTML
report.report_files.to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
report.report_files.export_to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
# CSV (Allele Info)
report.report_mutation.to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
report.report_mutation.export_to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
# BAM
report.report_bam.output_bam(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE)
report.report_bam.export_to_bam(
ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE
)
for path_bam_igvjs in Path(ARGS.tempdir, "cache", ".igvjs").glob(f"{ARGS.control_name}_control.bam*"):
shutil.copy(path_bam_igvjs, Path(ARGS.tempdir, "report", ".igvjs", ARGS.sample_name))
# VCF
2 changes: 1 addition & 1 deletion src/DAJIN2/core/preprocess/directories.py
@@ -7,7 +7,7 @@ def create_temporal_directories(TEMPDIR: Path, NAME: str, is_control=False) -> N
Path(TEMPDIR, "result").mkdir(parents=True, exist_ok=True)
SUBDIRS = ["fasta", "fastq", "sam", "midsv", "mutation_loci", "clustering", "consensus"]
if is_control is False:
SUBDIRS.extend(["knockin_loci", "classification"])
SUBDIRS.extend(["cstag", "knockin_loci", "classification"])
for subdir in SUBDIRS:
Path(TEMPDIR, NAME, subdir).mkdir(parents=True, exist_ok=True)

