This project implements a BLAST-like sequence similarity search that operates on sequence embeddings rather than raw sequence data. By searching in the embedding space produced by a machine learning model, we aim to improve both the speed and the sensitivity of similarity searches relative to traditional BLAST (Basic Local Alignment Search Tool) algorithms.
- Background
- Methods
- Project Goals
- Project Structure
- Prerequisites
- Installation
- Usage
- Building and Running with Docker
- Contributing
- License
- Contact
BLAST is a fundamental tool in bioinformatics for comparing biological sequence information, such as DNA sequences of genes or amino acid sequences of proteins. Traditional BLAST algorithms work directly on the sequence data, using heuristics to find regions of local similarity between sequences. While effective, these methods can be computationally intensive for large databases.
Recent advancements in machine learning, particularly in natural language processing, have shown that embedding techniques can capture complex relationships in sequential data. This project applies similar principles to biological sequences, hypothesizing that conducting similarity searches in embedding space could lead to faster and potentially more sensitive results.
Our approach consists of several key components:
- Sequence Embedding: We use advanced embedding techniques, primarily the ProtBERT model, to convert amino acid sequences into high-dimensional vector representations. These embeddings aim to capture the functional and structural properties of the proteins.
- Embedding-based Seeding: Instead of using k-mers as in traditional BLAST, we identify seed regions by finding similar subsequences in the embedding space using efficient nearest neighbor search algorithms (KD-Tree).
- Alignment Extension: We extend the seeds to form larger alignments, adapting traditional dynamic programming approaches to work with embedded representations.
- Scoring: We develop a scoring system that combines similarity in embedding space with biologically relevant scoring matrices.
- Database Indexing: We create an efficient index of the embedded database sequences to enable rapid searching.
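The seeding component can be illustrated with a minimal, self-contained sketch. It is not the project's implementation: `toy_embed` is a hypothetical stand-in for ProtBERT (it maps each residue to a two-dimensional vector of Kyte-Doolittle hydropathy and charge), and a brute-force scan stands in for the KD-Tree so that the example needs only the standard library. The function names, window size `k`, and threshold `max_dist` are all assumptions made for illustration.

```python
import math

# Hypothetical 2-D residue features (Kyte-Doolittle hydropathy, charge).
# A stand-in for ProtBERT embeddings, used only to keep the sketch runnable.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def toy_embed(residue):
    """Map one residue to a 2-D vector (hydropathy, charge)."""
    return (HYDROPATHY[residue], CHARGE.get(residue, 0.0))


def window_vectors(seq, k=3):
    """Return one flattened vector per length-k window of seq."""
    vectors = []
    for i in range(len(seq) - k + 1):
        vec = []
        for residue in seq[i:i + k]:
            vec.extend(toy_embed(residue))
        vectors.append(tuple(vec))
    return vectors


def find_seeds(query, subject, k=3, max_dist=1.0):
    """Return (query_pos, subject_pos, distance) for each query window
    whose nearest subject window lies within max_dist."""
    subject_vecs = window_vectors(subject, k)
    if not subject_vecs:
        return []
    seeds = []
    for qi, qv in enumerate(window_vectors(query, k)):
        # Brute-force nearest neighbour; a k-d tree would replace this scan.
        si, dist = min(
            ((j, math.dist(qv, sv)) for j, sv in enumerate(subject_vecs)),
            key=lambda pair: pair[1],
        )
        if dist <= max_dist:
            seeds.append((qi, si, dist))
    return seeds


seeds = find_seeds("MKVLA", "GGMKILAGG")
for query_pos, subject_pos, dist in seeds:
    print(query_pos, subject_pos, round(dist, 3))
```

With these toy features, every query window pairs with the subject window two positions to its right, differing only by a V/I substitution. Seeds that fall on a shared diagonal like this are the ones an extension step would try to merge into a longer alignment.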
The repository is organized as follows:
blast-embedding/
├── src/
│   ├── embedding/
│   │   └── sequence_embeddings.py
│   ├── algorithm/
│   │   └── refined_embedding_blast.py
│   └── benchmarking/
│       └── blast_comparison.py
├── data/
│   └── sample_database.fasta
├── tests/
├── docs/
├── notebooks/
├── results/
├── Dockerfile
├── requirements.txt
├── run.py
├── embedded_blast.ipynb
└── README.md
- Python 3.9+
- Docker (for containerized usage)
- 8GB+ RAM recommended for running embedding models
To set up the development environment:
- Clone the repository:
    git clone https://github.com/yourusername/blast-embedding.git
    cd blast-embedding
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install dependencies:
    pip install -r requirements.txt
To run the benchmarking script locally:
python run.py
This will run the embedding-based BLAST and compare it with NCBI BLAST using a default query sequence.
Docker provides an isolated environment to run the project, ensuring consistency across different systems. Follow these steps to build and run the project using Docker:
- Build the Docker image:
    docker build -t blast-embedding .
  This command builds a Docker image named 'blast-embedding' based on the instructions in the Dockerfile.
- Run the Docker container:
    docker run -it --rm blast-embedding
  This command starts a container from the 'blast-embedding' image, runs the benchmarking script with a default query sequence, and removes the container after execution.
- To use a custom query sequence:
    docker run -it --rm -e QUERY_SEQUENCE="YOURSEQUENCEHERE" blast-embedding
  Replace "YOURSEQUENCEHERE" with your actual protein sequence.
- To run an interactive shell in the container:
    docker run -it --rm --entrypoint /bin/bash blast-embedding
  This allows you to explore the container's file system and run commands manually.
- To mount a local directory and save results:
    docker run -it --rm -v /path/to/local/directory:/app/results blast-embedding
  Replace "/path/to/local/directory" with the actual path on your host machine.
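The custom-query command above passes `QUERY_SEQUENCE` into the container's environment. How `run.py` consumes that variable is not documented in this README, so the following is only a hypothetical sketch of reading and validating it. `DEFAULT_QUERY`, `VALID_RESIDUES`, and `load_query` are names invented for illustration, and the default sequence is a placeholder rather than the project's actual default.

```python
import os

# Placeholder default; the project's actual default query is not shown here.
DEFAULT_QUERY = "MKVLAAGIVGLLLAQ"
# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def load_query(environ=os.environ):
    """Return the query sequence from QUERY_SEQUENCE, or the default."""
    raw = environ.get("QUERY_SEQUENCE", DEFAULT_QUERY)
    # Drop whitespace and normalise case so pasted sequences are accepted.
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("QUERY_SEQUENCE is empty")
    invalid = sorted(set(seq) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"Unsupported residue codes: {', '.join(invalid)}")
    return seq


if __name__ == "__main__":
    print(load_query())
```

Validating early in this way would make a mistyped sequence fail with a clear message before any embedding model is loaded.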
To use your own database:
- Replace the `data/sample_database.fasta` file with your FASTA-format database.
- Rebuild the Docker image:
    docker build -t blast-embedding .
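Before rebuilding, it can help to confirm that the replacement file parses as FASTA. The sketch below is a minimal standard-library reader, not the project's own loader; `read_fasta` is a hypothetical helper and the two records are made-up example data.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA text lines into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 example protein
MKVLAAGI
VGLLLAQ
>seq2
GGMKILAGG
""".splitlines()

records = read_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

To check a real file, pass an open file handle instead of the example lines, for instance `read_fasta(open("data/sample_database.fasta"))`. An empty result or a `ValueError` suggests the file is not in FASTA format.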
We welcome contributions to the BLAST in Embedding Space project! Please follow these steps to contribute:
- Fork the repository
- Create a new branch (`git checkout -b feature/AmazingFeature`)
- Make your changes
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE.md file for details.
Alexander Titus - Send me a note on LinkedIn
Project Link: https://github.com/In-Vivo-Group/embedded-blast
- This project builds upon the work of many researchers in the fields of bioinformatics and machine learning.
- We thank the developers of the ProtBERT model and other open-source tools used in this project.
This is a research project, and the embedding-based BLAST is a proof of concept. It may not be as comprehensive or accurate as established BLAST implementations. The project is designed to explore new approaches to sequence similarity search and may change significantly as research progresses.