Commit

Merge pull request #4 from anuradhawick/dev
Dev
anuradhawick authored Sep 7, 2024
2 parents b44079d + 0d45c60 commit 34c0387
Showing 8 changed files with 162 additions and 22 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

8 changes: 7 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "rsbio-seq"
version = "0.1.2"
version = "0.1.3"
edition = "2021"
authors = [
"Anuradha Wickramarachchi <[email protected]>",
@@ -22,3 +22,9 @@ pyo3 = { version = "0.22.0", features = ["abi3-py39"] }

[features]
extension-module = ["pyo3/extension-module"]

[profile.release]
opt-level = 3
lto = true
codegen-units = 1
panic = "abort"
60 changes: 48 additions & 12 deletions README.md
@@ -4,40 +4,57 @@
[![Downloads](https://static.pepy.tech/badge/rsbio-seq)](https://pepy.tech/project/rsbio-seq)
[![PyPI - Version](https://img.shields.io/pypi/v/rsbio-seq)](https://pypi.org/project/rsbio-seq/)
[![Upload to PyPI](https://github.com/anuradhawick/rsbio-seq/actions/workflows/pypi.yml/badge.svg)](https://github.com/anuradhawick/rsbio-seq/actions/workflows/pypi.yml)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

RSBio-Seq intends to provide just reading facility on common sequence formats (FASTA/FASTQ) in both raw and compressed formats.
<div align="center">
<pre>
██████ ███████ ██████ ██ ██████ ███████ ███████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██████ ███████ ██████ ██ ██ ██ █████ ███████ █████ ██ ██
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ▄▄ ██
██ ██ ███████ ██████ ██ ██████ ███████ ███████ ██████
▀▀
</pre>
</div>

## Installation
RSBio-Seq provides reading and writing of common sequence formats (FASTA/FASTQ), both raw (`fasta`, `fa`, `fna`, `fastq`, `fq`) and gzip-compressed (`.gz`).

## Installation

### 1. From PyPI (Recommended)

Simple use the following command
Use the following command to install from PyPI.

```bash
pip install rsbio-seq
```

### 2. Build and install from source

To build you need to have the following installed.
To build from source, make sure you have the following programs installed.

- Rust - https://www.rust-lang.org/tools/install
- Maturin - https://www.maturin.rs/installation
- Python environment with Python >=3.9
- Python environment with Python >=3.9 - https://www.python.org/downloads/

To build and install the development version of the wheel.

```bash
maturin develop # this installs the development version in the env
maturin develop --release # this installs a release version in the env
```

To build a wheel for installation
To build a release mode wheel for installation, use this command.

```bash
maturin build --release
```

You will find the `whl` file inside the `target/wheels` directory. Your `whl` file will have a name depicting your python environment and CPU architecture.
You will find the `whl` file inside the `target/wheels` directory. Its name reflects your Python environment and CPU architecture. Install the built wheel with this command.

```bash
pip install target/wheels/*.whl
```

## Usage

@@ -46,29 +63,34 @@ Once installed you can import the library and use as follows.
### Reading

```python
from rsbio_seq import SeqReader, SeqWriter, Sequence
from rsbio_seq import SeqReader, Sequence, ascii_to_phred

# each seq entry is of type Sequence
seq: Sequence

# reading
for seq in SeqReader("path/to/seq.fasta.gz"):
print(seq.id)
print(seq.seq)
# for fastq quality line
print(seq.qual)
print(seq.qual) # prints IIII
print(ascii_to_phred(seq.qual)) # prints [40, 40, 40, 40]
# optional description attribute
print(seq.desc)
```

### Writing

```python
from rsbio_seq import SeqWriter, Sequence, phred_to_ascii

# writing fasta
seq = Sequence("id", "desc", "ACGT") # id, description, sequence
writer = SeqWriter("out.fasta")
writer.write(seq)
writer.close()

# writing fastq
seq = Sequence("id", "desc", "ACGT") # id, description, sequence
seq = Sequence("id", "desc", "ACGT", "IIII") # id, description, sequence, quality
writer = SeqWriter("out.fastq")
writer.write(seq)
writer.close()
@@ -78,9 +100,23 @@ seq = Sequence("id", "desc", "ACGT", "IIII") # id, description, sequence, qualit
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()

# writing gzipped with phred score translation
qual = phred_to_ascii([40, 40, 40, 40])
seq = Sequence("id", "desc", "ACGT", qual) # id, description, sequence, quality
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()
```

Note: `close()` is needed if you want to read within the same program scope. Otherwise, rust will automatically do this for you.
Note: `close()` is only required if you want to read the file again in the same function/code scope. Closing opened files is a good practice either way.

We provide two utility functions for your convenience.

* `phred_to_ascii` - convert a list of Phred scores (integers) to an ASCII quality string
* `ascii_to_phred` - convert an ASCII quality string to a list of Phred scores (integers)

RSBio-Seq reads and writes quality strings in ASCII format only. Use these helper functions to translate between the ASCII string and numeric Phred scores.
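
The Phred+33 translation described above can be illustrated without the library. The following is a minimal pure-Python sketch of the same arithmetic; the names `phred_to_ascii_py` and `ascii_to_phred_py` are hypothetical and are not part of RSBio-Seq, and the sketch assumes the input is already valid Phred+33 data.

```python
from typing import List

PHRED_OFFSET = 33  # Phred+33 encoding: score 0 maps to "!" (ASCII 33)


def phred_to_ascii_py(scores: List[int]) -> str:
    """Hypothetical pure-Python equivalent: add 33 to each score, take the character."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


def ascii_to_phred_py(qual: str) -> List[int]:
    """Hypothetical pure-Python equivalent: subtract 33 from each character code."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


quality = phred_to_ascii_py([40, 40, 40, 40])
print(quality)                     # IIII
print(ascii_to_phred_py(quality))  # [40, 40, 40, 40]
```

The two functions are inverses of each other for scores in the printable range, which is why a quality string read from a FASTQ file can be converted to numbers and written back unchanged.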

## Authors

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -10,8 +10,9 @@ classifiers = [
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dynamic = ["version", "readme", "description", "license", "authors"]
dynamic = ["version", "readme", "description", "authors"]
keywords = ["bioinformatics", "genomics"]
license = { file = "LICENSE" }

[project.urls]
Documentation = "https://github.com/anuradhawick/rsbio-seq/"
34 changes: 32 additions & 2 deletions rsbio_seq.pyi
@@ -1,4 +1,4 @@
from typing import Iterator, Union
from typing import Iterator, Union, List

class Sequence:
"""
@@ -83,4 +83,34 @@ class SeqWriter:
"""
...

__all__ = ["Sequence", "SeqReader", "SeqWriter"]
def phred_to_ascii(scores: List[int]) -> str:
"""
Convert a list of Phred scores to a string of ASCII quality values.
Each Phred score is converted by adding 33 to it (Phred+33 encoding),
and then transforming it into the corresponding ASCII character.
Args:
scores (List[int]): A list of integers where each integer represents a Phred score.
Returns:
str: A string where each character represents an ASCII quality value corresponding to the input Phred scores.
"""
...

def ascii_to_phred(qual: str) -> List[int]:
"""
Convert a string of ASCII quality values back to a list of Phred scores.
Each ASCII character is converted back to a Phred score by subtracting 33
from its ASCII value. This function assumes standard Phred+33 encoding is used.
Args:
qual (str): A string of ASCII characters representing quality scores.
Returns:
List[int]: A list of integers where each integer is a Phred score derived from the input ASCII characters.
"""
...

__all__ = ["Sequence", "SeqReader", "SeqWriter", "phred_to_ascii", "ascii_to_phred"]
18 changes: 18 additions & 0 deletions src/lib.rs
@@ -70,11 +70,29 @@ impl SeqWriter {
}
}

#[pyfunction]
pub fn phred_to_ascii(scores: Vec<u8>) -> PyResult<String> {
Ok(scores
.iter()
.map(|&score| (score + 33) as char) // Convert Phred score to ASCII
.collect())
}

#[pyfunction]
pub fn ascii_to_phred(qual: String) -> PyResult<Vec<u8>> {
Ok(qual
.chars()
.map(|c| (c as u8).saturating_sub(33)) // Convert ASCII to Phred score
.collect())
}

/// Sequence reader for rust
#[pymodule]
fn rsbio_seq(m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<Sequence>()?;
m.add_class::<SeqReader>()?;
m.add_class::<SeqWriter>()?;
m.add_function(wrap_pyfunction!(phred_to_ascii, m)?)?;
m.add_function(wrap_pyfunction!(ascii_to_phred, m)?)?;
Ok(())
}
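
Review note: the two Rust helpers in this hunk do no range validation. In `phred_to_ascii`, `score + 33` on a `u8` overflows for scores above 222 (a panic in debug builds, a silent wrap in release builds, where overflow checks are off by default), and scores above 93 produce characters beyond `~`. In `ascii_to_phred`, `saturating_sub(33)` maps every character below `!` to 0, and `c as u8` keeps only the low byte of a non-ASCII character. The sketch below shows, in Python, what a stricter contract could look like. The function names and the choice to raise `ValueError` are assumptions for illustration only, not behavior of the library.

```python
from typing import List

PHRED_OFFSET = 33
MAX_PHRED = 93  # 93 + 33 = 126, i.e. "~", the last printable ASCII character


def strict_phred_to_ascii(scores: List[int]) -> str:
    """Hypothetical strict variant: reject scores outside 0..93."""
    for score in scores:
        if not 0 <= score <= MAX_PHRED:
            raise ValueError(f"Phred score out of range 0..{MAX_PHRED}: {score}")
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


def strict_ascii_to_phred(qual: str) -> List[int]:
    """Hypothetical strict variant: reject characters outside "!".."~"."""
    scores = []
    for ch in qual:
        code = ord(ch)
        if not PHRED_OFFSET <= code <= PHRED_OFFSET + MAX_PHRED:
            raise ValueError(f"Not a Phred+33 quality character: {ch!r}")
        scores.append(code - PHRED_OFFSET)
    return scores
```

Whether rejecting bad input is preferable to the current lenient behavior is a design decision for the maintainers; the lenient version is faster and never fails, while the strict version surfaces corrupted quality lines early.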
12 changes: 7 additions & 5 deletions tests/perf.py
@@ -1,8 +1,10 @@
from tqdm import tqdm
from Bio import SeqIO as BioSeqIO, SeqRecord, Seq
import sys
import gzip
from rsbio_seq import SeqReader, SeqWriter, Sequence

from Bio import Seq
from Bio import SeqIO as BioSeqIO
from Bio import SeqRecord
from rsbio_seq import SeqReader, Sequence, SeqWriter, phred_to_ascii
from tqdm import tqdm


# Generates a specified number of Sequence objects with a repeated "ACGT" sequence.
@@ -26,7 +28,7 @@ def test_fq_seqs_rs(count: int):
seq = "ACGT" * 1_000_000
qual_phred = [40, 40, 40, 40] * 1_000_000
for c in range(count):
qual = "".join([chr(x + 33) for x in qual_phred])
qual = phred_to_ascii(qual_phred)
yield Sequence(f"rec_id_{c+1}", f"description_{c+1}", seq, qual)
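
Review note: this hunk swaps the Python-level `"".join(chr(x + 33) ...)` conversion for the library's `phred_to_ascii`. To put a number on the pure-Python baseline that was removed, a small `timeit` sketch such as the following could be used. The helper names `join_chr` and `bytes_decode` are hypothetical, the second is just one alternative pure-Python approach included for comparison, and the input size is deliberately smaller than in `perf.py` so the snippet runs quickly.

```python
import timeit

# Smaller than the 4,000,000 scores used in perf.py, to keep the run short.
qual_phred = [40, 40, 40, 40] * 10_000


def join_chr(scores):
    """The conversion removed by this hunk: one chr() call per score."""
    return "".join([chr(x + 33) for x in scores])


def bytes_decode(scores):
    """Alternative baseline: build a bytes object, then decode it as ASCII."""
    return bytes(x + 33 for x in scores).decode("ascii")


# Both baselines must agree before their timings are worth comparing.
assert join_chr(qual_phred) == bytes_decode(qual_phred)

for fn in (join_chr, bytes_decode):
    elapsed = timeit.timeit(lambda: fn(qual_phred), number=20)
    print(f"{fn.__name__}: {elapsed:.4f}s for 20 runs")
```

Timing `phred_to_ascii` from `rsbio_seq` the same way, on the same list, would show how much of the conversion cost the Rust helper removes on a given machine.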


47 changes: 47 additions & 0 deletions tests/phred_test.py
@@ -0,0 +1,47 @@
from rsbio_seq import ascii_to_phred, phred_to_ascii


def test_to_phred():
# Test with typical input
ascii_values = '!"#$%'
expected_phred = [0, 1, 2, 3, 4]
assert (
ascii_to_phred(ascii_values) == expected_phred
), "Failed: ASCII to Phred conversion with typical input"

# Test with empty string
ascii_values_empty = ""
expected_phred_empty = []
assert (
ascii_to_phred(ascii_values_empty) == expected_phred_empty
), "Failed: ASCII to Phred conversion with empty input"

# Test with lower boundary edge case
ascii_values_low = chr(33)
expected_phred_low = [0]
assert (
ascii_to_phred(ascii_values_low) == expected_phred_low
), "Failed: ASCII to Phred conversion with lower boundary"


def test_to_ascii():
# Test with typical input
phred_scores = [0, 1, 2, 3, 4]
expected_ascii = '!"#$%'
assert (
phred_to_ascii(phred_scores) == expected_ascii
), "Failed: Phred to ASCII conversion with typical input"

# Test with empty list
phred_scores_empty = []
expected_ascii_empty = ""
assert (
phred_to_ascii(phred_scores_empty) == expected_ascii_empty
), "Failed: Phred to ASCII conversion with empty input"

# Test with zero only
phred_scores_zero = [0]
expected_ascii_zero = "!"
assert (
phred_to_ascii(phred_scores_zero) == expected_ascii_zero
), "Failed: Phred to ASCII conversion with zero"
