Merge pull request #16 from ylab-hi/dev
docs: Polish the documentation
cauliyang authored Oct 12, 2024
2 parents 32f4d3a + 07996cd commit d3a509a
Showing 4 changed files with 156 additions and 99 deletions.
115 changes: 67 additions & 48 deletions README.md
@@ -10,47 +10,48 @@
<!--toc:start-->

- [ **DeepChopper** ](#-deepchopper-)
- [Quick Start: Try DeepChopper Online](#quick-start-try-deepchopper-online)
- [Install](#install)
- [Usage](#usage)
- [🚀 Quick Start: Try DeepChopper Online](#-quick-start-try-deepchopper-online)
- [📦 Installation](#-installation)
- [🛠️ Usage](#️-usage)
- [Command-Line Interface](#command-line-interface)
- [Library](#library)
- [Cite](#cite)
- [🤜 Contribution](#-contribution)
- [Python Library](#python-library)
- [📚 Cite](#-cite)
- [🤝 Contribution](#-contribution)
- [Build Environment](#build-environment)
- [Install Dependencies](#install-dependencies)
- [📬 Support](#-support)

<!--toc:end-->

DeepChopper leverages language model to accurately detect and chop artificial sequences which may cause chimeric reads, ensuring higher quality and more reliable sequencing results.
🧬 DeepChopper leverages a language model to accurately detect and chop artificial sequences that may cause chimeric reads, ensuring higher-quality and more reliable sequencing results.
By integrating seamlessly with existing workflows, DeepChopper provides a robust solution for researchers and bioinformaticians working with Nanopore direct-RNA sequencing data.

## Quick Start: Try DeepChopper Online
## 🚀 Quick Start: Try DeepChopper Online

Experience DeepChopper instantly through our user-friendly web interface. No installation required!

Simply click the button below to launch the web application and start exploring DeepChopper's capabilities:

[![Open in Hugging Face Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/yangliz5/deepchopper)

This online version provides a convenient way to:
**What you can do online:**

- 📤 Upload your sequencing data
- 🔬 Run DeepChopper's analysis
- 📊 Visualize results
- 🎛️ Experiment with different parameters

- Upload your sequencing data
- Run DeepChopper's analysis
- Visualize results
- Experiment with different parameters
Perfect for quick tests or demonstrations! However, for extensive analyses or custom workflows, we recommend installing DeepChopper locally.

It's perfect for quick tests or when you want to showcase DeepChopper's functionality without local setup.
However, for more extensive analyses or custom workflows, we recommend installing DeepChopper on your machine.
Because the online version is limited to one FASTQ record at a time, it may not be suitable for large-scale projects.
> ⚠️ Note: The online version is limited to one FASTQ record at a time and may not be suitable for large-scale projects.
## Install
## 📦 Installation

DeepChopper can be installed using pip, the Python package installer. Follow these steps to install:
DeepChopper can be installed using pip, the Python package installer.
Follow these steps to install:

1. Ensure you have Python 3.10 or later installed on your system.

2. It's recommended to create a virtual environment:
2. Create a virtual environment (recommended):

```bash
python -m venv deepchopper_env
@@ -69,66 +70,69 @@ DeepChopper can be installed using pip, the Python package installer. Follow the
deepchopper --help
```

Note: If you encounter any issues, please check our GitHub repository for troubleshooting guides or to report a problem.
🆘 Trouble installing? Check our [Troubleshooting Guide](./docs/troubleshooting.md) or [open an issue](https://github.com/ylab-hi/DeepChopper/issues).

## Usage
## 🛠️ Usage

We provide a [complete guide](./documentation/tutorial.md) on how to use DeepChopper for NanoPore direct-RNA sequencing data.
Below is a brief overview of the command-line interface and library usage.
For a comprehensive guide, check out our [full tutorial](./documentation/tutorial.md).
Here's a quick overview:

### Command-Line Interface

DeepChopper provides a command-line interface (CLI) for easy access to its features. In total, there are three commands: `encode`, `predict`, and `chop`.
DeepChopper can be used to encode, predict, and chop chimeric reads in direct-RNA sequencing data.
DeepChopper offers three main commands: `encode`, `predict`, and `chop`.

Firstly, we need to encode the input data using the `encode` command, which will generate a `.parquet` file.
1. **Encode** your input data:

```bash
deepchopper endcode <input.fq>
```
```bash
deepchopper encode <input.fq>
```

Next, we can use the `predict` command to predict chimeric reads in the encoded data.
2. **Predict** chimeric reads:

```bash
deepchopper predict <input.parquet> --ouput-path predictions
```
```bash
deepchopper predict <input.parquet> --output-path predictions
```

If you have GPUs, you can use the `--gpus` flag to specify the GPU device.
Using GPUs? Add the `--gpus` flag:

```bash
deepchopper predict <input.parquet> --ouput-path predictions --gpus 2
```
```bash
deepchopper predict <input.parquet> --output-path predictions --gpus 2
```

Finally, we can use the `chop` command to chop the chimeric reads in the input data.
3. **Chop** the chimeric reads:

```bash
deepchopper chop <predictions> raw.fq
```
```bash
deepchopper chop <predictions> raw.fq
```

Besides, DeepChopper provides a web-based user interface for users to interact with the tool.
However, the web-based application can only take one FASTQ record at a time.
Want a GUI? Launch the web interface (note: limited to one FASTQ record at a time):

```bash
deepchopper web
```
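
The `encode`, `predict`, and `chop` commands from this section can also be assembled from a script. The following is a minimal sketch, not part of DeepChopper itself: `pipeline_commands` is a hypothetical helper, and it assumes that `encode` writes a `.parquet` file named after the input (as the tutorial's `raw.fastq` and `raw.parquet` suggest) and that `chop` takes the `0` subfolder of the predictions directory (as in the tutorial). It only builds and prints the commands.

```python
import shlex
from pathlib import Path


def pipeline_commands(fastq, predictions="predictions", gpus=None):
    """Return the encode, predict, and chop commands as argv lists."""
    fastq = Path(fastq)
    # Assumption: `encode` writes <input stem>.parquet next to the input file.
    parquet = fastq.with_suffix(".parquet")

    predict = ["deepchopper", "predict", str(parquet), "--output-path", predictions]
    if gpus is not None:
        predict += ["--gpus", str(gpus)]

    return [
        ["deepchopper", "encode", str(fastq)],
        predict,
        # Assumption: predictions land in a `0` subfolder, as in the tutorial.
        ["deepchopper", "chop", f"{predictions}/0", str(fastq)],
    ]


# Print the commands instead of running them, so they can be reviewed first.
for argv in pipeline_commands("raw.fq", gpus=2):
    print(shlex.join(argv))
```

To execute the steps, each list can be passed to `subprocess.run(argv, check=True)` in order, which stops the pipeline at the first failing command.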

### Library
### Python Library

Integrate DeepChopper into your Python scripts:

```python
import deepchopper

model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper")
# Your analysis code here
```

## Cite
## 📚 Cite

If you use DeepChopper in your research, please cite the following paper:
If DeepChopper aids your research, please cite our paper:

```bibtex
```

## 🤜 Contribution
## 🤝 Contribution

We welcome contributions! Here's how to set up your development environment:

### Build Environment

@@ -146,3 +150,18 @@ pip install pipx
pipx install --suffix @master git+https://github.com/python-poetry/poetry.git@master
poetry@master install
```

🎉 Ready to contribute? Check out our [Contribution Guidelines](./CONTRIBUTING.md) to get started!

## 📬 Support

Need help? Have questions?

- 📖 Check our [Documentation](./docs)
- 💬 Join our [Community Forum](https://github.com/ylab-hi/DeepChopper/discussions)
- 🐛 [Report issues](https://github.com/ylab-hi/DeepChopper/issues)

---

DeepChopper is developed with ❤️ by the YLab team.
Happy sequencing! 🧬🔬
97 changes: 64 additions & 33 deletions documentation/tutorial.md
@@ -1,34 +1,43 @@
# Tutorial: Using DeepChopper for Nanopore Direct-RNA Sequencing Data Analysis

This tutorial will guide you through the process of using DeepChopper to identify and remove chimeric artificial reads in Nanopore direct-RNA sequencing data. We'll cover each step from data acquisition to the final chopping of chimeric reads.
Welcome to the DeepChopper tutorial! This guide will walk you through the process of identifying and removing chimeric artificial reads in Nanopore direct-RNA sequencing data.
Whether you're new to bioinformatics or an experienced researcher, this tutorial will help you get the most out of DeepChopper.

## Table of Contents

- [Tutorial: Using DeepChopper for Nanopore Direct-RNA Sequencing Data Analysis](#tutorial-using-deepchopper-for-nanopore-direct-rna-sequencing-data-analysis)
- [Table of Contents](#table-of-contents)
- [1. Get the Data](#1-get-the-data)
- [Prerequisites](#prerequisites)
- [1. Data Acquisition](#1-data-acquisition)
- [2. Basecall Using Dorado](#2-basecall-using-dorado)
- [3. DeepChopper Encode](#3-deepchopper-encode)
- [4. DeepChopper Predict](#4-deepchopper-predict)
- [5. DeepChopper Chop](#5-deepchopper-chop)
- [Conclusion](#conclusion)
- [3. Encoding Data with DeepChopper](#3-encoding-data-with-deepchopper)
- [4. Predicting Chimeric Reads](#4-predicting-chimeric-reads)
- [5. Chopping Artificial Sequences](#5-chopping-artificial-sequences)
- [Next Steps](#next-steps)
- [Troubleshooting](#troubleshooting)

## 1. Get the Data
## Prerequisites

First, you need to obtain your Nanopore direct-RNA sequencing data.
This data is typically in the form of POD5 files.
Before we begin, ensure you have the following installed:

- DeepChopper (latest version)
- Dorado (Oxford Nanopore's basecaller)
- Sufficient storage space for Nanopore data

## 1. Data Acquisition

Start by obtaining your Nanopore direct-RNA sequencing data (POD5 files).

```bash
# Example: Download sample data (replace with your actual data source)
wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/200cases.pod5
```

Ensure you have sufficient storage space, as Nanopore data can be quite large.
💡 **Tip**: Organize your data in a dedicated project folder for easy management.

## 2. Basecall Using Dorado

Next, we'll use Dorado, Oxford Nanopore's high-performance basecaller, to convert the raw signal data into nucleotide sequences.
It's important to run Dorado without the trimming option to preserve potential chimeric sequences for DeepChopper to analyze.
Convert raw signal data to nucleotide sequences using Dorado.

```bash
# Install Dorado (if not already installed)
@@ -41,65 +50,87 @@ dorado basecaller \
> raw.fastq
```

⚠️ **Important**: Use Dorado's `--no-trim` option to preserve potential chimeric sequences.

Replace `path/to/your/pod5/files/` with the directory containing your POD5 files.
The output will be a FASTQ file containing the basecalled sequences.

## 3. DeepChopper Encode
## 3. Encoding Data with DeepChopper

Now that we have our basecalled sequences, we'll use DeepChopper to encode the data.
This step prepares the data for the prediction model.
Prepare your data for the prediction model:

```bash
# Encode the FASTQ file
deepchopper encode raw.fastq
```

If you have a large dataset, you can use `--chunk` to encode dataset by chunk, which avoid memory issues:
For large datasets, use chunking to avoid memory issues:

```bash
deepchopper encode raw.fastq --chunk --chunk-size 100000
```

This command will generate a Parquet file (`encoded_data.parquet`) or multiple Parquets files (if encoding by chunk) containing the encoded sequences.
🔍 **Output**: Look for `encoded_data.parquet` or multiple `.parquet` files if chunking.
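
To estimate how many `.parquet` files chunked encoding will produce, divide the number of reads by the chunk size and round up. The sketch below assumes that `--chunk-size` counts FASTQ records per chunk (an assumption, not confirmed here) and that the input is an uncompressed FASTQ file with four lines per record. Both function names are hypothetical.

```python
import math


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def expected_chunks(n_records, chunk_size=100_000):
    """Number of chunk files if each chunk holds up to `chunk_size` records."""
    return math.ceil(n_records / chunk_size)


# 250,000 reads at 100,000 reads per chunk: two full chunks plus one partial.
print(expected_chunks(250_000))  # → 3
```

For a real run, `expected_chunks(count_fastq_records("raw.fastq"))` gives the estimate for your own data.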

## 4. DeepChopper Predict
## 4. Predicting Chimeric Reads

With our encoded data, we can now use DeepChopper to predict chimeric reads.
Analyze the encoded data to identify potential chimeric reads:

```bash
# Predict artificial sequences for reads
deepchopper predict raw.parquet --output-path predictions

# Predict artificial sequences for reads using GPU
deepchopper predict raw.parquet --output-path predictions --gpus 2
```

For chunked data:

# if encoded by chunk
# deepchopper predict raw_chunk1.parquet --ouput-path predictions_chunk1
# deepchopper predict raw_chunk2.parquet --ouput-path predictions_chunk2
```bash
deepchopper predict raw_chunk1.parquet --output-path predictions_chunk1
deepchopper predict raw_chunk2.parquet --output-path predictions_chunk2
```
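
With many chunks, the per-chunk `predict` calls can be generated rather than typed by hand. This is a sketch under stated assumptions: `predict_commands` is a hypothetical helper, and it assumes the chunk files follow the `raw_chunkN.parquet` naming shown above, mapping each one to a matching `predictions_chunkN` output folder.

```python
import shlex
from pathlib import Path


def predict_commands(chunk_files, gpus=None):
    """Build one `deepchopper predict` argv list per encoded chunk file."""
    commands = []
    for chunk in chunk_files:
        chunk = Path(chunk)
        # raw_chunk1.parquet -> predictions_chunk1 (naming from the example above)
        out_dir = "predictions_" + chunk.stem.removeprefix("raw_")
        argv = ["deepchopper", "predict", str(chunk), "--output-path", out_dir]
        if gpus is not None:
            argv += ["--gpus", str(gpus)]
        commands.append(argv)
    return commands


# Print the commands for review; nothing is executed here.
for argv in predict_commands(["raw_chunk1.parquet", "raw_chunk2.parquet"]):
    print(shlex.join(argv))
```

In practice the input list could come from `sorted(Path(".").glob("raw_chunk*.parquet"))`, and each command can be run with `subprocess.run(argv, check=True)`.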

📊 **Results**: Check the `predictions` folder for output files.

This step will analyze the encoded data and produce results containing predictions, indicating whether it's likely to be chimeric or not.

## 5. DeepChopper Chop
## 5. Chopping Artificial Sequences

Finally, we'll use DeepChopper to chop the identified chimeric reads, removing artificial sequences and preserving the genuine RNA sequences.
Remove identified artificial sequences:

```bash
# Chop chimeric reads
# Chop artificial sequences
deepchopper chop predictions/0 raw.fastq
```

For chunked predictions:

# if encoded by chunk
# deepchopper chop predictions_chunk1/0 prediction_chunk2/0 raw.fastq
```bash
deepchopper chop predictions_chunk1/0 predictions_chunk2/0 raw.fastq
```

🎉 **Success**: Look for the output file with the `.chop.fq.bgz` suffix.

This command takes the original FASTQ file (`raw.fastq`) and the predictions (`predictions`), and produces a new FASTQ file (with suffix `.chop.fq.bgz`) with the chimeric reads chopped.

## Conclusion
## Next Steps

- Explore advanced DeepChopper options with `deepchopper --help`
- Use your cleaned data for downstream analyses
- Check our documentation for integration with other bioinformatics tools

## Troubleshooting

- **Issue**: Out of memory errors
**Solution**: Try using the `--chunk` option in the encode step

- **Issue**: Slow processing
**Solution**: Ensure you're using GPU acceleration if available

You've now successfully processed your Nanopore direct-RNA sequencing data using DeepChopper!
The file with suffix `.chop.fq.bgz` contains your sequencing data with chimeric artificial reads identified and removed, providing you with higher quality data for your downstream analyses.
- **Issue**: Unexpected results
**Solution**: Verify input data quality and check DeepChopper version

Remember to adjust file paths and names according to your specific setup and data.
For more advanced usage and options, refer to the DeepChopper documentation or use the `--help` flag with each command.
For more help, visit our [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues) page.

Happy sequencing!
Happy sequencing, and may your data be artificial-chimera-free! 🧬🔍