diff --git a/README.md b/README.md index fd623bb..ab83f60 100644 --- a/README.md +++ b/README.md @@ -10,47 +10,48 @@ - [ **DeepChopper** ](#-deepchopper-) - - [Quick Start: Try DeepChopper Online](#quick-start-try-deepchopper-online) - - [Install](#install) - - [Usage](#usage) + - [🚀 Quick Start: Try DeepChopper Online](#-quick-start-try-deepchopper-online) + - [📦 Installation](#-installation) + - [🛠️ Usage](#️-usage) - [Command-Line Interface](#command-line-interface) - - [Library](#library) - - [Cite](#cite) - - [🤜 Contribution](#-contribution) + - [Python Library](#python-library) + - [📚 Cite](#-cite) + - [🤝 Contribution](#-contribution) - [Build Environment](#build-environment) - [Install Dependencies](#install-dependencies) + - [📬 Support](#-support) -DeepChopper leverages language model to accurately detect and chop artificial sequences which may cause chimeric reads, ensuring higher quality and more reliable sequencing results. +🧬 DeepChopper leverages language model to accurately detect and chop artificial sequences which may cause chimeric reads, ensuring higher quality and more reliable sequencing results. By integrating seamlessly with existing workflows, DeepChopper provides a robust solution for researchers and bioinformatics working with NanoPore direct-RNA sequencing data. -## Quick Start: Try DeepChopper Online +## 🚀 Quick Start: Try DeepChopper Online Experience DeepChopper instantly through our user-friendly web interface. No installation required! 
- Simply click the button below to launch the web application and start exploring DeepChopper's capabilities: [![Open in Hugging Face Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/yangliz5/deepchopper) -This online version provides a convenient way to: +**What you can do online:** + +- 📤 Upload your sequencing data +- 🔬 Run DeepChopper's analysis +- 📊 Visualize results +- 🎛️ Experiment with different parameters -- Upload your sequencing data -- Run DeepChopper's analysis -- Visualize results -- Experiment with different parameters +Perfect for quick tests or demonstrations! However, for extensive analyses or custom workflows, we recommend installing DeepChopper locally. -It's perfect for quick tests or when you want to showcase DeepChopper's functionality without local setup. -However, for more extensive analyses or custom workflows, we recommend installing DeepChopper on your machine. -Because the online version is limited to one FASTQ record at a time, it may not be suitable for large-scale projects. +> ⚠️ Note: The online version is limited to one FASTQ record at a time and may not be suitable for large-scale projects. -## Install +## 📦 Installation -DeepChopper can be installed using pip, the Python package installer. Follow these steps to install: +DeepChopper can be installed using pip, the Python package installer. +Follow these steps to install: 1. Ensure you have Python 3.10 or later installed on your system. -2. It's recommended to create a virtual environment: +2. Create a virtual environment (recommended): ```bash python -m venv deepchopper_env @@ -69,66 +70,69 @@ DeepChopper can be installed using pip, the Python package installer. Follow the deepchopper --help ``` -Note: If you encounter any issues, please check our GitHub repository for troubleshooting guides or to report a problem. +🆘 Trouble installing? 
Check our [Troubleshooting Guide](./docs/troubleshooting.md) or [open an issue](https://github.com/ylab-hi/DeepChopper/issues). -## Usage +## 🛠️ Usage -We provide a [complete guide](./documentation/tutorial.md) on how to use DeepChopper for NanoPore direct-RNA sequencing data. -Below is a brief overview of the command-line interface and library usage. +For a comprehensive guide, check out our [full tutorial](./documentation/tutorial.md). +Here's a quick overview: ### Command-Line Interface -DeepChopper provides a command-line interface (CLI) for easy access to its features. In total, there are three commands: `encode`, `predict`, and `chop`. -DeepChopper can be used to encode, predict, and chop chimeric reads in direct-RNA sequencing data. +DeepChopper offers three main commands: `encode`, `predict`, and `chop`. -Firstly, we need to encode the input data using the `encode` command, which will generate a `.parquet` file. +1. **Encode** your input data: -```bash -deepchopper endcode -``` + ```bash + deepchopper encode + ``` -Next, we can use the `predict` command to predict chimeric reads in the encoded data. +2. **Predict** chimeric reads: -```bash -deepchopper predict --ouput-path predictions -``` + ```bash + deepchopper predict --output-path predictions + ``` -If you have GPUs, you can use the `--gpus` flag to specify the GPU device. + Using GPUs? Add the `--gpus` flag: -```bash -deepchopper predict --ouput-path predictions --gpus 2 -``` + ```bash + deepchopper predict --output-path predictions --gpus 2 + ``` -Finally, we can use the `chop` command to chop the chimeric reads in the input data. +3. **Chop** the chimeric reads: -```bash -deepchopper chop raw.fq -``` + ```bash + deepchopper chop raw.fq + ``` -Besides, DeepChopper provides a web-based user interface for users to interact with the tool. -However, the web-based application can only take one FASTQ record at a time. +Want a GUI? 
Launch the web interface (note: limited to one FASTQ record at a time): ```bash deepchopper web ``` -### Library +### Python Library + +Integrate DeepChopper into your Python scripts: ```python import deepchopper model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper") +# Your analysis code here ``` -## Cite +## 📚 Cite -If you use DeepChopper in your research, please cite the following paper: +If DeepChopper aids your research, please cite our paper: ```bibtex ``` -## 🤜 Contribution +## 🤝 Contribution + +We welcome contributions! Here's how to set up your development environment: ### Build Environment @@ -146,3 +150,18 @@ pip install pipx pipx install --suffix @master git+https://github.com/python-poetry/poetry.git@master poetry@master install ``` + +🎉 Ready to contribute? Check out our [Contribution Guidelines](./CONTRIBUTING.md) to get started! + +## 📬 Support + +Need help? Have questions? + +- 📖 Check our [Documentation](./docs) +- 💬 Join our [Community Forum](https://github.com/ylab-hi/DeepChopper/discussions) +- 🐛 [Report issues](https://github.com/ylab-hi/DeepChopper/issues) + +--- + +DeepChopper is developed with ❤️ by the YLab team. +Happy sequencing! 🧬🔬 diff --git a/documentation/tutorial.md b/documentation/tutorial.md index a91ee16..512dcf5 100644 --- a/documentation/tutorial.md +++ b/documentation/tutorial.md @@ -1,34 +1,43 @@ # Tutorial: Using DeepChopper for Nanopore Direct-RNA Sequencing Data Analysis -This tutorial will guide you through the process of using DeepChopper to identify and remove chimeric artificial reads in Nanopore direct-RNA sequencing data. We'll cover each step from data acquisition to the final chopping of chimeric reads. +Welcome to the DeepChopper tutorial! This guide will walk you through the process of identifying and removing chimeric artificial reads in Nanopore direct-RNA sequencing data. 
+Whether you're new to bioinformatics or an experienced researcher, this tutorial will help you get the most out of DeepChopper. ## Table of Contents - [Tutorial: Using DeepChopper for Nanopore Direct-RNA Sequencing Data Analysis](#tutorial-using-deepchopper-for-nanopore-direct-rna-sequencing-data-analysis) - [Table of Contents](#table-of-contents) - - [1. Get the Data](#1-get-the-data) + - [Prerequisites](#prerequisites) + - [1. Data Acquisition](#1-data-acquisition) - [2. Basecall Using Dorado](#2-basecall-using-dorado) - - [3. DeepChopper Encode](#3-deepchopper-encode) - - [4. DeepChopper Predict](#4-deepchopper-predict) - - [5. DeepChopper Chop](#5-deepchopper-chop) - - [Conclusion](#conclusion) + - [3. Encoding Data with DeepChopper](#3-encoding-data-with-deepchopper) + - [4. Predicting Chimeric Reads](#4-predicting-chimeric-reads) + - [5. Chopping Artificial Sequences](#5-chopping-artificial-sequences) + - [Next Steps](#next-steps) + - [Troubleshooting](#troubleshooting) -## 1. Get the Data +## Prerequisites -First, you need to obtain your Nanopore direct-RNA sequencing data. -This data is typically in the form of POD5 files. +Before we begin, ensure you have the following installed: + +- DeepChopper (latest version) +- Dorado (Oxford Nanopore's basecaller) +- Sufficient storage space for Nanopore data + +## 1. Data Acquisition + +Start by obtaining your Nanopore direct-RNA sequencing data (POD5 files). ```bash # Example: Download sample data (replace with your actual data source) wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/200cases.pod5 ``` -Ensure you have sufficient storage space, as Nanopore data can be quite large. +💡 **Tip**: Organize your data in a dedicated project folder for easy management. ## 2. Basecall Using Dorado -Next, we'll use Dorado, Oxford Nanopore's high-performance basecaller, to convert the raw signal data into nucleotide sequences. 
-It's important to run Dorado without the trimming option to preserve potential chimeric sequences for DeepChopper to analyze. +Convert raw signal data to nucleotide sequences using Dorado. ```bash # Install Dorado (if not already installed) @@ -41,30 +50,31 @@ dorado basecaller \ > raw.fastq ``` +⚠️ **Important**: Use the `--no-trim` option to preserve potential chimeric sequences. + Replace `path/to/your/pod5/files/` with the directory containing your POD5 files. The output will be a FASTQ file containing the basecalled sequences. -## 3. DeepChopper Encode +## 3. Encoding Data with DeepChopper -Now that we have our basecalled sequences, we'll use DeepChopper to encode the data. -This step prepares the data for the prediction model. +Prepare your data for the prediction model: ```bash # Encode the FASTQ file deepchopper encode raw.fastq ``` -If you have a large dataset, you can use `--chunk` to encode dataset by chunk, which avoid memory issues: +For large datasets, use chunking to avoid memory issues: ```bash deepchopper encode raw.fastq --chunk --chunk-size 100000 ``` -This command will generate a Parquet file (`encoded_data.parquet`) or multiple Parquets files (if encoding by chunk) containing the encoded sequences. +🔍 **Output**: Look for `encoded_data.parquet` or multiple `.parquet` files if chunking. -## 4. DeepChopper Predict +## 4. Predicting Chimeric Reads -With our encoded data, we can now use DeepChopper to predict chimeric reads. 
+Analyze the encoded data to identify potential chimeric reads: ```bash # Predict artifical sequences for reads @@ -72,34 +82,55 @@ deepchopper predict raw.parquet --ouput-path predictions # Predict artifical sequences for reads using GPU deepchopper predict raw.parquet --ouput-path predictions --gpus 2 +``` + +For chunked data: -# if encoded by chunk -# deepchopper predict raw_chunk1.parquet --ouput-path predictions_chunk1 -# deepchopper predict raw_chunk2.parquet --ouput-path predictions_chunk2 +```bash +deepchopper predict raw_chunk1.parquet --output-path predictions_chunk1 +deepchopper predict raw_chunk2.parquet --output-path predictions_chunk2 ``` +📊 **Results**: Check the `predictions` folder for output files. + This step will analyze the encoded data and produce results containing predictions, indicating whether it's likely to be chimeric or not. -## 5. DeepChopper Chop +## 5. Chopping Artificial Sequences -Finally, we'll use DeepChopper to chop the identified chimeric reads, removing artificial sequences and preserving the genuine RNA sequences. +Remove identified artificial sequences: ```bash -# Chop chimeric reads +# Chop artificial sequences deepchopper chop predictions/0 raw.fastq +``` + +For chunked predictions: -# if encoded by chunk -# deepchopper chop predictions_chunk1/0 prediction_chunk2/0 raw.fastq +```bash +deepchopper chop predictions_chunk1/0 predictions_chunk2/0 raw.fastq ``` +🎉 **Success**: Look for the output file with the `.chop.fq.bgz` suffix. + This command takes the original FASTQ file (`raw.fastq`) and the predictions (`predictions`), and produces a new FASTQ file (with suffix `.chop.fq.bgz`) with the chimeric reads chopped. 
-## Conclusion +## Next Steps + +- Explore advanced DeepChopper options with `deepchopper --help` +- Use your cleaned data for downstream analyses +- Check our documentation for integration with other bioinformatics tools + +## Troubleshooting + +- **Issue**: Out of memory errors + **Solution**: Try using the `--chunk` option in the encode step + +- **Issue**: Slow processing + **Solution**: Ensure you're using GPU acceleration if available -You've now successfully processed your Nanopore direct-RNA sequencing data using DeepChopper! -The file with suffix `.chop.fq.bgz` contains your sequencing data with chimeric artificial reads identified and removed, providing you with higher quality data for your downstream analyses. +- **Issue**: Unexpected results + **Solution**: Verify input data quality and check DeepChopper version -Remember to adjust file paths and names according to your specific setup and data. -For more advanced usage and options, refer to the DeepChopper documentation or use the `--help` flag with each command. +For more help, visit our [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues) page. -Happy sequencing! +Happy sequencing, and may your data be artificial-chimera-free! 
🧬🔍 diff --git a/environment.yaml b/environment.yaml index 6189253..24b6063 100644 --- a/environment.yaml +++ b/environment.yaml @@ -21,25 +21,32 @@ channels: # compatibility is usually guaranteed dependencies: - - python=3.10 - - pytorch=2.* - - torchvision=0.* - - lightning=2.* - - torchmetrics=0.* - - hydra-core=1.* - - rich=13.* - - pre-commit=3.* - - pytest=7.* + - python>=3.10 + - pip + - pytorch>=2.1.0 + - torchvision + - torchaudio + - pytorch-lightning>=2.1.2 + - torchmetrics>=1.2.0 + - rich>=13.7.0 + - transformers>=4.37.2 + - safetensors>=0.4.2 + - datasets>=2.17.1 + - evaluate>=0.4.1 + - typer>=0.12.0 + - scikit-learn>=1.5.2 + - hydra-core>=1.3.2 + - omegaconf>=2.3.0 + - rust + - pip: + - gradio==5.0.1 + - fastapi==0.112.2 + - deepchopper-cli>=1.0.1 + - maturin>=1.2.1,<2 # --------- loggers --------- # - - wandb + # - wandb # - neptune-client # - mlflow # - comet-ml - # - aim>=3.16.2 # no lower than 3.16.2, see https://github.com/aimhubio/aim/issues/2550 - - - pip>=23 - - pip: - - hydra-optuna-sweeper - - hydra-colorlog - - rootutils + # - aim>=3.16.2 # no lower than 3.16.2, see https://github.com/aimhubio/aim/issues/2550 \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 5a40837..b0ffbd1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -48,7 +48,7 @@ dependencies = [ "scikit-learn>=1.5.2", "hydra-core>=1.3.2", "omegaconf>=2.3.0", - # "deepchopper-cli>=1.0.1", + "deepchopper-cli>=1.0.1", ] [project.urls]