Make Opfi available on Bioconda #207

Merged
merged 9 commits into from
Sep 30, 2021
2 changes: 1 addition & 1 deletion .github/workflows/feature.yml
@@ -27,7 +27,7 @@ jobs:
sudo cp -r GenericRepeatFinder/bin /usr/local/bin/
- name: Install dependencies
run: |
cd lib/pilercr1.06 && sudo make install && cd ../..
sudo apt-get install -y pilercr
sudo apt install -y ncbi-blast+
sudo apt install -y mmseqs2
sudo apt install -y diamond-aligner
2 changes: 1 addition & 1 deletion .github/workflows/master.yml
@@ -28,7 +28,7 @@ jobs:
git clone https://github.com/bioinfolabmu/GenericRepeatFinder.git && sudo cp GenericRepeatFinder/bin/* /usr/local/bin/ && sudo chmod 755 /usr/local/bin/grf* /usr/local/bin/ltr_finder
- name: Install dependencies
run: |
cd lib/pilercr1.06 && sudo make install && cd ../..
sudo apt-get install -y pilercr
sudo apt install -y ncbi-blast+
sudo apt install -y mmseqs2
sudo apt install -y diamond-aligner
31 changes: 19 additions & 12 deletions README.md
@@ -2,34 +2,41 @@

[![Documentation Status](https://readthedocs.org/projects/opfi/badge/?version=latest)](https://opfi.readthedocs.io/en/latest/?badge=latest)
[![PyPI](http://img.shields.io/pypi/v/opfi.svg)](https://pypi.python.org/pypi/opfi/)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/opfi/badges/installer/conda.svg)](https://conda.anaconda.org/bioconda)

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics datasets.

## Requirements
## Installation

The recommended way to install Opfi is with [Bioconda](https://bioconda.github.io/), which requires the [conda](https://docs.conda.io/en/latest/) package manager. This will install Opfi and all of its dependencies (which you can read more about [here](https://opfi.readthedocs.io/en/latest/installation.html)).

Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.

### Install with conda (Linux and Mac OS only)

At a minimum, the NCBI BLAST+ software suite should be installed and on the user's PATH. BLAST+ installation instruction can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK279671/). For annotation of CRISPR arrays, Opfi uses PILER-CR, which can be downloaded from the software [home page](https://www.drive5.com/pilercr/). A modified version of PILER-CR that detects mini (two repeat) CRISPR arrays is also available, and can be built with GNU make after cloning or downloading Opfi:
First, set up conda and Bioconda following the [quickstart](https://bioconda.github.io/user/install.html) guide. Once this is done, run:

```
cd lib/pilercr1.06
sudo make install
conda install -c bioconda opfi
```

## Installation

You can install Opfi with Pip:
And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:

```
pip3 install opfi
conda create --name opfi-env -c bioconda opfi
conda activate opfi-env
```

Alternatively, you can install the latest version on Github:
### Install with pip

This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is available [here](https://opfi.readthedocs.io/en/latest/installation.html#dependencies). Once this step is complete, install Opfi with pip by running:

```
git clone https://github.com/wilkelab/Opfi.git
cd Opfi
pip3 install .
pip install opfi
```

For information about installing for development, check out the [documentation site](https://opfi.readthedocs.io/en/latest/installation.html).

## Gene Finder

Gene Finder iteratively executes homology searches to identify gene clusters of interest. Below is an example script that sets up a search for putative CRISPR-Cas systems in the Rippkaea orientalis PCC 8802 (cyanobacteria) genome. Data inputs are provided in the Opfi tutorial (`tutorials/tutorial.ipynb`).
2 changes: 1 addition & 1 deletion docs/examples.rst
@@ -10,7 +10,7 @@ In this example, we will annotate and visualize CRISPR-Cas systems in the cyanob

You can download the complete assembled genome `here <https://www.ncbi.nlm.nih.gov/assembly/GCF_000024045.1/>`_; it is also available at `<https://github.com/wilkelab/Opfi>`_ under ``tutorials``, along with the other data files necessary to run these examples, and an interactive jupyter notebook version of this tutorial.

To run the code snippets here, Opfi must be installed, along with NCBI BLAST+ **and** PILER-CR. More detailed installation instructions can be found in the :ref:`installation` section.
This tutorial assumes the user has already installed Opfi and all dependencies (if installing with conda, this is done automatically). Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required.

1. Use the makeblastdb utility to convert a Cas protein database to BLAST format
################################################################################
6 changes: 3 additions & 3 deletions docs/index.rst
@@ -6,9 +6,9 @@
Opfi
====

Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cassette identification pipelines, particularly for large genomics or metagenomics datasets.
Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cluster identification pipelines, particularly for large genomics or metagenomics datasets.

Opfi is implemented entirely in Python, and can be downloaded from the Python package index. It consists of two major modules: Gene Finder, for discovery of novel gene cassettes, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
Opfi is implemented entirely in Python, and can be installed with conda or from the Python Package Index. It consists of two major modules: Gene Finder, for discovery of novel gene clusters, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.

Contents
--------
@@ -20,4 +20,4 @@ Contents
examples
tips
modules
contributing
contributing
100 changes: 76 additions & 24 deletions docs/installation.rst
@@ -6,46 +6,98 @@ Getting Started
Installation
------------

You can install Opfi with Pip:
The recommended way to install Opfi is with `Bioconda <https://bioconda.github.io/>`_, which requires the `conda <https://docs.conda.io/en/latest/>`_ package manager. This will install Opfi and all of its dependencies (which you can read more about below, see :ref:`dependencies`).

Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.

.. _install-with-conda:

Install with conda (Linux and Mac OS only)
##########################################

First, set up conda and Bioconda following the `quickstart <https://bioconda.github.io/user/install.html>`_ guide. Once this is done, run:

.. code-block:: bash

pip3 install opfi
conda install -c bioconda opfi

Alternatively, you can install the latest version on Github:
And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:

.. code-block:: bash

git clone https://github.com/alexismhill3/Opfi.git
conda create --name opfi-env -c bioconda opfi
conda activate opfi-env

.. _install-with-pip:

Install with pip
################

This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is provided below, see :ref:`dependencies`. Once this step is complete, install Opfi with pip by running:

.. code-block:: bash

pip install opfi

Install from source
###################

Finally, the latest development build may be installed directly from Github. First, non-Python :ref:`dependencies` will need to be installed in the working environment. An easy way to do this is to first install Opfi with conda using the :ref:`install-with-conda` method (we'll re-install the development version of the Opfi package in the next step). Alternatively, dependencies can be installed individually.

Once dependencies have been installed in the working environment, run the following code to download and install the development build:

.. code-block:: bash

git clone https://github.com/wilkelab/Opfi.git
cd Opfi
pip install . # or pip install -e . for an editable version
pip install -r requirements # if conda was used, this can be skipped

Testing the build
#################

Regardless of installation method, users can download and run Opfi's suite of unit tests to confirm that the build is working as expected. First download the tests from Github:

.. code-block:: bash

git clone https://github.com/wilkelab/Opfi
cd Opfi
pip3 install .

And then run the test suite using pytest:

.. code-block:: bash

pytest --runslow --runmmseqs --rundiamond

This may take a minute or so to complete.
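
Because several tests depend on external programs, it can help to confirm that those programs are visible on the PATH before running pytest. The following is a minimal sketch, not part of Opfi: the helper name ``find_missing`` is hypothetical, and the executable names listed for each application are assumptions based on how these tools are commonly distributed.

```python
import shutil

# Executable names are assumptions based on common distributions of each tool;
# adjust them if your installation uses different names.
EXECUTABLES = {
    "NCBI BLAST+": ["blastp", "blastn", "psiblast", "makeblastdb"],
    "Diamond": ["diamond"],
    "MMseqs2": ["mmseqs"],
    "PILER-CR": ["pilercr"],
    "Generic Repeat Finder": ["grf-main"],
}


def find_missing(executables=EXECUTABLES, which=shutil.which):
    """Return a dict mapping each application to its executables not found on PATH."""
    missing = {}
    for app, names in executables.items():
        absent = [name for name in names if which(name) is None]
        if absent:
            missing[app] = absent
    return missing


if __name__ == "__main__":
    # Print one line per application that has at least one missing executable.
    for app, names in find_missing().items():
        print(f"{app}: missing {', '.join(names)}")
```

An empty result suggests that all listed executables were found; it does not verify versions or that the programs run correctly.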

.. _dependencies:

Dependencies
------------

Opfi makes use of several third-party softwares for finding and annotating genomic features. Depending on your use case, you may not need to install all of these; however, at a minimum users should have the NCBI BLAST+ application installed in their environment. The following table provides more details about required/optional dependencies, including links to application homepages.

.. csv-table::
:header: "Application", "Required", "Description", "Anaconda distribution"
Opfi uses the following bioinformatics software packages to find and annotate genomic features:

"`NCBI BLAST+ <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs>`_", "Yes", "Protein and nucleic acid homology search tool", https://anaconda.org/bioconda/blast
"`Diamond <https://github.com/bbuchfink/diamond>`_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/diamond
"`MMseqs2 <https://github.com/soedinglab/MMseqs2>`_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/mmseqs2
"`PILER-CR <https://www.drive5.com/pilercr/>`_", "No", "CRISPR repeat detection", https://anaconda.org/bioconda/piler-cr
"`GenericRepeatFinder <https://github.com/bioinfolabmu/GenericRepeatFinder>`_", "No", "Transposon-associated repeat detection", "NA"
.. csv-table:: Software dependencies
:header: "Application", "Description"

Testing your build
------------------
"`NCBI BLAST+ <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs>`_", "Protein and nucleic acid homology search tool"
"`Diamond <https://github.com/bbuchfink/diamond>`_", "Alternative to BLAST+ for fast protein homology searches"
"`MMseqs2 <https://github.com/soedinglab/MMseqs2>`_", "Alternative to BLAST+ for fast protein homology searches"
"`PILER-CR <https://www.drive5.com/pilercr/>`_", "CRISPR repeat detection"
"`Generic Repeat Finder <https://github.com/bioinfolabmu/GenericRepeatFinder>`_", "Transposon-associated repeat detection"

Users who opt to build Opfi from source can test their build by running ``pytest`` from the project root directory. The following flags will direct pytest to run specific sets of tests (in addition to the core suite):
The first three (BLAST+, Diamond, and MMseqs2) are popular homology search applications, that is, programs that look for local similarities between input sequences (either protein or nucleic acid) and a target. These are used by Opfi in :class:`gene_finder.pipeline.Pipeline` for annotation of genes or non-coding regions of interest in the input genome/contig. The user specifies which homology search tool to use during pipeline setup (see :class:`gene_finder.pipeline.Pipeline` for details). Note that the BLAST+ distribution contains multiple programs for homology searching, three of which (blastp, blastn, and PSI-BLAST) are currently supported by Opfi.

* ``--runmmseqs``: Run tests that require MMseqs2.
* ``--rundiamond``: Run tests that require Diamond.
* ``--runslow``: Run integration/end-to-end tests.
* ``--runprop``: Run very slow property tests.
The following table summarizes the main differences between the homology search programs. It may help users decide which application will best meet their needs. Note that performance tests are inherently hardware and context dependent, so this should be taken as a loose guide, rather than a definitive comparison.

For most users, running ``pytest --runslow`` is recommended.
.. csv-table:: Comparison of homology search programs supported by Opfi
:header: "Application", "Relative sensitivity", "Relative speed", "Requires a protein or nucleic acid sequence database?"

.. note::
"Diamond", `+`, `++++`, "protein"
"MMseqs2", `++`, `+++`, "protein"
"blastp", `+++`, `++`, "protein"
"PSI-BLAST", `++++`, `+`, "protein"
"blastn", "NA", "NA", "nucleic acid"

Several tests in the core suite require :program:`BLAST`, :program:`PILER-CR`, and/or :program:`GenericRepeatFinder`. Running ``pytest`` without first installing these dependencies will cause these tests to fail.
The last two software dependencies, PILER-CR and Generic Repeat Finder (GRF), deal with annotation of repetitive sequences in DNA. PILER-CR identifies CRISPR arrays, regions of alternating ~30 bp direct repeat and variable sequences that play a role in prokaryotic immunity. GRF identifies repeats associated with transposable elements, such as terminal inverted repeats (TIRs) and long terminal repeats (LTRs).
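
The comparison of homology search programs above can also be read as a simple lookup: given a minimum acceptable sensitivity, pick the fastest program that meets it. The sketch below transcribes the table's relative scores (counting the ``+`` signs); the function name ``fastest_with_sensitivity`` is hypothetical and the scores are only the loose guide the table describes, not measured benchmarks.

```python
# Relative scores transcribed from the comparison table (number of "+" signs).
# blastn is omitted because the table lists it as "NA" in both columns.
PROTEIN_SEARCH_TOOLS = {
    "Diamond": {"sensitivity": 1, "speed": 4},
    "MMseqs2": {"sensitivity": 2, "speed": 3},
    "blastp": {"sensitivity": 3, "speed": 2},
    "PSI-BLAST": {"sensitivity": 4, "speed": 1},
}


def fastest_with_sensitivity(min_sensitivity, tools=PROTEIN_SEARCH_TOOLS):
    """Return the fastest tool whose relative sensitivity is at least min_sensitivity."""
    candidates = [
        name for name, scores in tools.items()
        if scores["sensitivity"] >= min_sensitivity
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda name: tools[name]["speed"])
```

For example, a request for sensitivity of at least 3 selects blastp rather than PSI-BLAST, because both qualify and blastp has the higher relative speed.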
6 changes: 3 additions & 3 deletions docs/tips.rst
@@ -6,7 +6,7 @@ Inputs and Outputs
Building sequence databases
---------------------------

To search for gene cassettes with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target casettes (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as `Uniprot <https://www.uniprot.org/>`_ (maintained by the `European Bioinformatics Institute <https://www.ebi.ac.uk/>`_ ) or one of the `databases <https://www.ncbi.nlm.nih.gov/>`_ provided by the National Center for Biotechnology Information.
To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publicly available database such as `Uniprot <https://www.uniprot.org/>`_ (maintained by the `European Bioinformatics Institute <https://www.ebi.ac.uk/>`_ ) or one of the `databases <https://www.ncbi.nlm.nih.gov/>`_ provided by the National Center for Biotechnology Information.

Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports :program:`BLAST+`, :program:`mmseqs2`, and :program:`diamond` for homology searching:
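
As a rough illustration of this conversion step, the sketch below builds the database-creation command for each supported program as an argument list. The helper ``database_command`` and its ``tool`` keys are hypothetical names chosen for this example; the command-line flags are those of each program's own database utility (``makeblastdb``, ``diamond makedb``, ``mmseqs createdb``), and the file names in the usage note are placeholders.

```python
def database_command(tool, fasta, db_name, dbtype="prot"):
    """Build the argument list that converts a FASTA file to a search database.

    tool is one of "blast", "diamond", or "mmseqs" (labels chosen for this sketch).
    dbtype ("prot" or "nucl") is only used by makeblastdb.
    """
    if tool == "blast":
        return ["makeblastdb", "-in", fasta, "-out", db_name, "-dbtype", dbtype]
    if tool == "diamond":
        # Diamond databases are built from protein sequences.
        return ["diamond", "makedb", "--in", fasta, "--db", db_name]
    if tool == "mmseqs":
        return ["mmseqs", "createdb", fasta, db_name]
    raise ValueError(f"unsupported tool: {tool}")
```

The returned list can be passed to ``subprocess.run(..., check=True)``, for instance ``subprocess.run(database_command("blast", "cas_proteins.fasta", "cas_db"), check=True)``, assuming the corresponding program is installed.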

@@ -47,7 +47,7 @@ The sequence definition (defline) comes directly after the ``>`` character, and
Annotating sequence databases
#############################

To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene cassettes.
To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.

Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the ``>`` character. For example, the following FASTA sequence has been annotated with the label "cas1":
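
The parsing rule can be sketched in a few lines of Python. This is an illustration of the "second token" rule only, not Opfi's actual implementation: the function name ``label_from_defline`` is hypothetical, splitting on whitespace is an assumption, and the defline used in the usage note is made up.

```python
def label_from_defline(defline):
    """Return the second whitespace-separated token after '>', or None if absent."""
    # Drop the leading '>' marker, then split the rest on whitespace (assumed delimiter).
    tokens = defline.lstrip(">").split()
    return tokens[1] if len(tokens) > 1 else None
```

With the hypothetical defline ``>seq001 cas1 example protein``, the first token ``seq001`` is the sequence identifier and the function returns ``cas1`` as the label.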

@@ -147,7 +147,7 @@ Results from :class:`gene_finder.pipeline.Pipeline` searches are written to a si
:file: csv/example_output.csv
:header-rows: 0

The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cassette, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cassette. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
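
Since rows from the same candidate are grouped together and the first two columns identify the candidate, the output can be split into candidates with a single pass. The sketch below assumes only what is stated above (column order and grouping); the sample rows, the coordinate format, and the third-column feature names are placeholders invented for the example, and the function name ``group_by_candidate`` is hypothetical.

```python
import csv
import io
from itertools import groupby

# Placeholder rows: sequence ID, candidate coordinates, then a feature name.
# Real output has more columns; only the first two matter for grouping.
SAMPLE = """contig_A,100..5000,cas1
contig_A,100..5000,cas2
contig_B,250..9000,cas9
"""


def group_by_candidate(rows):
    """Group consecutive rows that share the same (sequence ID, coordinates) pair."""
    return [
        (key, list(group))
        for key, group in groupby(rows, key=lambda row: (row[0], row[1]))
    ]


candidates = group_by_candidate(csv.reader(io.StringIO(SAMPLE)))
```

``groupby`` only merges adjacent rows, which is sufficient here because features from the same candidate are always written together; a file that had been re-sorted or shuffled would need sorting by the same key first.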

Descriptions of each output field are provided below. Alignment statistic naming conventions are from the BLAST documentation, see `BLAST+ appendices <https://www.ncbi.nlm.nih.gov/books/NBK279684/>`_ (specifically "outfmt" in table C1). This `glossary <https://www.ncbi.nlm.nih.gov/books/NBK62051/>`_ of common BLAST terms may also be useful in interpreting alignment statistic meaning.

13 changes: 0 additions & 13 deletions tests/integration/test_gene_finder.py
@@ -63,13 +63,8 @@ def test_with_blast(temporary_directory):
genomic_data = "tests/integration/integration_data/contigs/v_crass_J520_whole.fasta"
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)

results = p.run(data=genomic_data, output_directory=temporary_directory.name)

assert len(results) == 1
hits = results["NZ_CCKB01000071.1"]["Loc_78093-114093"]["Hits"]
assert len(hits) == 11
assert "Array_0" in hits


@pytest.mark.slow
@@ -85,9 +80,7 @@ def test_multi_seq_fasta(temporary_directory):
tmp, genomic_data = merge_data()
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)

results = p.run(data=genomic_data, output_directory=temporary_directory.name)

assert len(results) == 2
tmp.cleanup()

@@ -103,10 +96,8 @@ def test_gzip_fasta(temporary_directory):
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)
results = p.run(data=data, gzip=True, output_directory=temporary_directory.name)

hits = results["KB405063.1"]["Loc_0-23815"]["Hits"]
assert "Cas_all_hit-0" in hits
assert "Array_0" in hits


@pytest.mark.slow
@@ -124,15 +115,11 @@ def test_record_all_hits_1(temporary_directory):
genomic_data = "tests/integration/integration_data/contigs/record_all_hits_test_1"
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)

p.run(data=genomic_data, record_all_hits=True, output_directory=temporary_directory.name)

with open(os.path.join(temporary_directory.name, "gene_finder_hits.json"), "r") as f:
hits = json.load(f)["KB405063.1"]["hits"]
assert len(hits) == 3
assert "tnsAB" in hits
assert "cas_all" in hits
assert "CRISPR" in hits


@pytest.mark.slow
2 changes: 1 addition & 1 deletion tests/integration/test_operon_analyze.py
@@ -193,7 +193,7 @@ def visualize(condition: str):
if op is None:
continue
good_operons.append(op)
plot_operons(good_operons, tempdir)
plot_operons(good_operons, tempdir, nucl_per_line=25000)
files = os.listdir(tempdir)
count = len([f for f in files if f.endswith(".png")])
except Exception as e: